Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-04 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
 On 04/03/2013 10:55:27 AM, Gleb Natapov wrote:
 On Tue, Apr 02, 2013 at 08:57:52PM -0500, Scott Wood wrote:
  Hook the MPIC code up to the KVM interfaces, add locking, etc.
 
  TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
  support
 
  Signed-off-by: Scott Wood scottw...@freescale.com
  ---
 [skip]
 
  diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
  index 20ce2d2..d8f44ef 100644
  --- a/include/uapi/linux/kvm.h
  +++ b/include/uapi/linux/kvm.h
  @@ -927,6 +927,15 @@ struct kvm_device_attr {
 __u64   addr;   /* userspace address of attr data */
   };
 
  +#define KVM_DEV_TYPE_FSL_MPIC_20  1
  +#define KVM_DEV_TYPE_FSL_MPIC_42  2
  +
  +#define KVM_DEV_MPIC_GRP_MISC 1
  +#define   KVM_DEV_MPIC_BASE_ADDR  0   /* 64-bit */
  +
  +#define KVM_DEV_MPIC_GRP_REGISTER 2   /* 32-bit */
  +#define KVM_DEV_MPIC_GRP_IRQ_ACTIVE   3   /* 32-bit */
 Why not put them in arch specific header?
 
 KVM_DEV_TYPE_* is not an arch-specific enumeration -- this was
 discussed last time around.
 
Yes, I am talking about KVM_DEV_MPIC_* only. KVM_DEV_TYPE_* is used by
common code, so it should stay here.

 KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
 the main kvm.h.  The arch header would be OK, since the non-arch
 header includes the arch header, and thus it wouldn't be visible to
 userspace where it is -- if there later is a need for MPIC (or
 whatever other device follows MPIC's example) on another
 architecture, it could be moved without breaking anything.  Or, we
 could just have a header for each device type.
 
If a device is used by more than one arch, it will move into virt/kvm
and will have its own header, like the ioapic.
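To make the suggestion concrete, here is a sketch of the arch-header placement
being discussed; the file path is the usual PowerPC uapi location and the
grouping is only illustrative, not part of the patch:

  /* arch/powerpc/include/uapi/asm/kvm.h -- possible home for the MPIC groups */
  #define KVM_DEV_MPIC_GRP_MISC         1
  #define   KVM_DEV_MPIC_BASE_ADDR      0       /* 64-bit */

  #define KVM_DEV_MPIC_GRP_REGISTER     2       /* 32-bit */
  #define KVM_DEV_MPIC_GRP_IRQ_ACTIVE   3       /* 32-bit */

  /* include/uapi/linux/kvm.h already includes the arch header, so userspace
   * sees the same definitions wherever they live. */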

--
Gleb.


Re: [PATCH] Revert KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()

2013-04-04 Thread Gleb Natapov
On Fri, Mar 29, 2013 at 02:05:26PM +0900, Takuya Yoshikawa wrote:
 With the following commit, shadow pages can be zapped at random during
 a shadow page table walk:
   KVM: MMU: Move kvm_mmu_free_some_pages() into kvm_mmu_alloc_page()
   7ddca7e43c8f28f9419da81a0e7730b66aa60fe9
 
 This patch reverts it and fixes __direct_map() and FNAME(fetch)().
 
 Signed-off-by: Takuya Yoshikawa yoshikawa_takuya...@lab.ntt.co.jp
Applied, thanks.

 ---
  arch/x86/kvm/mmu.c |   11 +++
  arch/x86/kvm/paging_tmpl.h |1 +
  2 files changed, 8 insertions(+), 4 deletions(-)
 
 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 633e30c..004cc87 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -1501,15 +1501,11 @@ static void drop_parent_pte(struct kvm_mmu_page *sp,
   mmu_spte_clear_no_track(parent_pte);
  }
  
 -static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
 -
  static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu,
  u64 *parent_pte, int direct)
  {
   struct kvm_mmu_page *sp;
  
 - make_mmu_pages_available(vcpu);
 -
   sp = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
   sp->spt = mmu_memory_cache_alloc(&vcpu->arch.mmu_page_cache);
   if (!direct)
 @@ -2806,6 +2802,7 @@ exit:
  
  static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
gva_t gva, pfn_t *pfn, bool write, bool *writable);
 +static void make_mmu_pages_available(struct kvm_vcpu *vcpu);
  
  static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, u32 error_code,
gfn_t gfn, bool prefault)
 @@ -2847,6 +2844,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t 
 v, u32 error_code,
   spin_lock(&vcpu->kvm->mmu_lock);
   if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
   goto out_unlock;
 + make_mmu_pages_available(vcpu);
   if (likely(!force_pt_level))
   transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
   r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn,
 @@ -2924,6 +2922,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
  
   if (vcpu->arch.mmu.shadow_root_level == PT64_ROOT_LEVEL) {
   spin_lock(&vcpu->kvm->mmu_lock);
 + make_mmu_pages_available(vcpu);
   sp = kvm_mmu_get_page(vcpu, 0, 0, PT64_ROOT_LEVEL,
 1, ACC_ALL, NULL);
   ++sp->root_count;
 @@ -2935,6 +2934,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
  
   ASSERT(!VALID_PAGE(root));
   spin_lock(&vcpu->kvm->mmu_lock);
 + make_mmu_pages_available(vcpu);
   sp = kvm_mmu_get_page(vcpu, i << (30 - PAGE_SHIFT),
 i << 30,
 PT32_ROOT_LEVEL, 1, ACC_ALL,
 @@ -2973,6 +2973,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
   ASSERT(!VALID_PAGE(root));
  
   spin_lock(&vcpu->kvm->mmu_lock);
 + make_mmu_pages_available(vcpu);
   sp = kvm_mmu_get_page(vcpu, root_gfn, 0, PT64_ROOT_LEVEL,
 0, ACC_ALL, NULL);
   root = __pa(sp->spt);
 @@ -3006,6 +3007,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
   return 1;
   }
   spin_lock(&vcpu->kvm->mmu_lock);
 + make_mmu_pages_available(vcpu);
   sp = kvm_mmu_get_page(vcpu, root_gfn, i << 30,
 PT32_ROOT_LEVEL, 0,
 ACC_ALL, NULL);
 @@ -3311,6 +3313,7 @@ static int tdp_page_fault(struct kvm_vcpu *vcpu, gva_t 
 gpa, u32 error_code,
   spin_lock(&vcpu->kvm->mmu_lock);
   if (mmu_notifier_retry(vcpu->kvm, mmu_seq))
   goto out_unlock;
 + make_mmu_pages_available(vcpu);
   if (likely(!force_pt_level))
   transparent_hugepage_adjust(vcpu, &gfn, &pfn, &level);
   r = __direct_map(vcpu, gpa, write, map_writable,
 diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
 index af143f0..da20860 100644
 --- a/arch/x86/kvm/paging_tmpl.h
 +++ b/arch/x86/kvm/paging_tmpl.h
 @@ -627,6 +627,7 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gva_t 
 addr, u32 error_code,
   goto out_unlock;
  
   kvm_mmu_audit(vcpu, AUDIT_PRE_PAGE_FAULT);
 + make_mmu_pages_available(vcpu);
   if (!force_pt_level)
   transparent_hugepage_adjust(vcpu, &walker.gfn, &pfn, &level);
   r = FNAME(fetch)(vcpu, addr, &walker, write_fault,
 -- 
 1.7.5.4

--
Gleb.


[PATCH] KVM: s390: virtio-ccw bugfix.

2013-04-04 Thread Cornelia Huck
Hi,

another virtio-ccw bugfix for a problem where we would not reset
old I/O errors. Please apply.

Cornelia Huck (1):
  KVM: s390: virtio_ccw: reset errors for new I/O.

 drivers/s390/kvm/virtio_ccw.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

-- 
1.7.12.4



[PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Cornelia Huck
ccw_io_helper neglected to reset vcdev->err after a new channel
program had been successfully started, resulting in stale errors
delivered after one I/O failed. Reset the error after a new
channel program has been successfully started with no old I/O
pending.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 drivers/s390/kvm/virtio_ccw.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
index 42d507c..5948f2a 100644
--- a/drivers/s390/kvm/virtio_ccw.c
+++ b/drivers/s390/kvm/virtio_ccw.c
@@ -133,8 +133,11 @@ static int ccw_io_helper(struct virtio_ccw_device *vcdev,
do {
spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
ret = ccw_device_start(vcdev->cdev, ccw, intparm, 0, 0);
-   if (!ret)
+   if (!ret) {
+   if (!vcdev->curr_io)
+   vcdev->err = 0;
vcdev->curr_io |= flag;
+   }
spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
cpu_relax();
} while (ret == -EBUSY);
-- 
1.7.12.4



Re: [PATCH] KVM: s390: virtio-ccw bugfix.

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 10:25:05AM +0200, Cornelia Huck wrote:
 Hi,
 
 another virtio-ccw bugfix for a problem where we would not reset
 old I/O errors. Please apply.
 
 Cornelia Huck (1):
   KVM: s390: virtio_ccw: reset errors for new I/O.
 
  drivers/s390/kvm/virtio_ccw.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)
 
Is this for 3.9 or 3.10?

--
Gleb.


Re: [PATCH] KVM: s390: virtio-ccw bugfix.

2013-04-04 Thread Cornelia Huck
On Thu, 4 Apr 2013 11:27:39 +0300
Gleb Natapov g...@redhat.com wrote:

 On Thu, Apr 04, 2013 at 10:25:05AM +0200, Cornelia Huck wrote:
  Hi,
  
  another virtio-ccw bugfix for a problem where we would not reset
  old I/O errors. Please apply.
  
  Cornelia Huck (1):
KVM: s390: virtio_ccw: reset errors for new I/O.
  
   drivers/s390/kvm/virtio_ccw.c | 5 -
   1 file changed, 4 insertions(+), 1 deletion(-)
  
 Is this for 3.9 or 3.10?

I created the patch on top of kvm/queue (the problem won't surface with
the current code, I only noticed it when I was mucking around with some
experimental stuff).



Re: [PATCH] KVM: s390: virtio-ccw bugfix.

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 10:35:54AM +0200, Cornelia Huck wrote:
 On Thu, 4 Apr 2013 11:27:39 +0300
 Gleb Natapov g...@redhat.com wrote:
 
  On Thu, Apr 04, 2013 at 10:25:05AM +0200, Cornelia Huck wrote:
   Hi,
   
   another virtio-ccw bugfix for a problem where we would not reset
   old I/O errors. Please apply.
   
   Cornelia Huck (1):
 KVM: s390: virtio_ccw: reset errors for new I/O.
   
drivers/s390/kvm/virtio_ccw.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)
   
  Is this for 3.9 or 3.10?
 
 I created the patch on top of kvm/queue (the problem won't surface with
 the current code, I only noticed it when I was mucking around with some
 experimental stuff).
Applied to queue. Thanks.

--
Gleb.


Re: [PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 10:25, Cornelia Huck wrote:

 ccw_io_helper neglected to reset vcdev->err after a new channel
 program had been successfully started, resulting in stale errors
 delivered after one I/O failed. Reset the error after a new
 channel program has been successfully started with no old I/O
 pending.
 
 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com

Shouldn't this be CC: stable?


Alex

 ---
 drivers/s390/kvm/virtio_ccw.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)
 
 diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
 index 42d507c..5948f2a 100644
 --- a/drivers/s390/kvm/virtio_ccw.c
 +++ b/drivers/s390/kvm/virtio_ccw.c
 @@ -133,8 +133,11 @@ static int ccw_io_helper(struct virtio_ccw_device *vcdev,
   do {
   spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
   ret = ccw_device_start(vcdev->cdev, ccw, intparm, 0, 0);
 - if (!ret)
 + if (!ret) {
 + if (!vcdev->curr_io)
 + vcdev->err = 0;
   vcdev->curr_io |= flag;
 + }
   spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
   cpu_relax();
   } while (ret == -EBUSY);
 -- 
 1.7.12.4
 



Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 00:54, Scott Wood wrote:

 On 04/03/2013 05:12:06 PM, Alexander Graf wrote:
 Am 04.04.2013 um 00:07 schrieb Scott Wood scottw...@freescale.com:
  On 04/03/2013 04:58:56 PM, Alexander Graf wrote:
  Am 03.04.2013 um 23:38 schrieb Scott Wood scottw...@freescale.com:
   On 04/03/2013 11:19:42 AM, Alexander Graf wrote:
   On 03.04.2013, at 03:57, Scott Wood wrote:
Hook the MPIC code up to the KVM interfaces, add locking, etc.
   
TODO: irqfd support, split up into multiple patches, KVM_IRQ_LINE
support
   
Signed-off-by: Scott Wood scottw...@freescale.com
---
v3: mpic_put - kvmppc_mpic_put
   
   
   [...]
+void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
+
int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 struct kvm_config_tlb *cfg);
int kvm_vcpu_ioctl_dirty_tlb(struct kvm_vcpu *vcpu,
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 63c67ec..a87139b 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -151,6 +151,11 @@ config KVM_E500MC
   
 If unsure, say N.
   
+config KVM_MPIC
 +bool "KVM in-kernel MPIC emulation"
+depends on KVM
   This should probably depend on FSL KVM for now, until someone adds 
   support for other MPIC revisions.
  
   I don't see a symbol specifically for FSL KVM.  What part of the MPIC 
   code depends on booke or any FSL-specific code?
  You support only FSL mpic device IDs :). So if someone on book3s goes 
  along and sees this, he'd think "yes, I want an in-kernel MPIC", enables 
  the option and wastes space.
 
  "Would this waste space" is not generally the criteria for kconfig 
  dependencies.  Who is the kernel to get in the way of someone that wants 
  an FSL MPIC on a 4xx VM? :-)
 
  And again, there's no symbol for FSL KVM -- I'd have to use a list that 
  could get out of date.  And it would reduce build testing in 
  allyesconfig-type configs.
 Ok, please indicate compatibility limitations in the Kconfig description at 
 least then.
 
 OK -- not really a compatibility limitation so much as what models are 
 supported.
 
 Note that mpc86xx has a 74xx-derived core, but also has an FSL MPIC...
 
 Is 74xx/e600 supported by book3s_pr?  Can't tell from the kconfig text. :-)

On book3s_pr we don't have a good compatibility check mechanism in place. 
That's really suboptimal.

What I'm saying is that Kconfig should say "In-kernel emulation of FSL MPIC 2.0 
and FSL MPIC 4.2 interrupt controllers. Say Y here if you plan to run KVM on an 
FSL system." That'd be in line with what you can actually enable using the 
ioctls and leaves the decision whether that's a good thing to the user.

Code-wise there shouldn't be any dependency on host or guest architecture of 
course.


Alex



Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 07:37, Paul Mackerras wrote:

 On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
 +/* Platform specific hcalls, used by KVM */
 +#define H_RTAS 0xf000
 
 How about you define a different hcall ID for this? Then QEMU would
 create its rtas entry blob such that KVM-routed RTAS handling goes
 to KVM directly.
 
 QEMU can still do that, and I don't see that it would change the
 kernel side if it did.  We would still have to have agreement between
 the kernel and userspace as to what the hcall number for invoking the
 in-kernel RTAS calls was, and the kernel would still have to keep a
 list of token numbers and how they correspond to the functions it
 provides.  The only thing different would be that the in-kernel RTAS
 hcall could return to the guest if it didn't recognize the token
 number, rather than pushing the problem up to userspace.  However,
 that wouldn't make the code any simpler, and it isn't a situation
 where performance is an issue.
 
 Do you see some kernel-side improvements or simplifications from your
 suggestion that I'm missing?  Remember, the guest gets the token
 numbers from the device tree (properties under the /rtas node), so
 they are under the control of userspace/QEMU.

The code flow with this patch:

  setup time

  foreach (override in overrides)
ioctl(OVERRIDE_RTAS, ...);

  runtime

  switch (hcall_id) {
  case QEMU_RTAS_ID:
foreach (override in kvm_overrides) {
  int rtas_id = ...;
  if (override.rtas_id == rtas_id) {
handle_rtas();
handled = true;
  }
}
if (!handled)
  pass_to_qemu();
break;
  default:
pass_to_qemu();
break
  }

What I'm suggesting:

  setup time

  nothing from KVM's point of view

  runtime

  switch (hcall_id) {
  case KVM_RTAS_ID:
handle_rtas();
break;
  default:
pass_to_qemu();
break;
  }


Which one looks easier and less error prone to you? :)

Speaking of which, how does user space know that the kernel actually supports a 
specific RTAS token?
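For readers following the thread: the kernel-side bookkeeping Paul refers to is
essentially a name/token/handler table. A rough sketch, with illustrative
identifiers rather than the patch's actual ones:

  struct rtas_handler {
          const char *name;       /* RTAS service name, e.g. "ibm,set-xive" */
          void (*handler)(struct kvm_vcpu *vcpu, struct rtas_args *args);
  };

  /* One entry per token that userspace has bound to an in-kernel service. */
  struct rtas_token_definition {
          struct list_head list;
          struct rtas_handler *handler;
          u64 token;              /* chosen by QEMU, advertised under /rtas */
  };

  /* Called for the agreed-upon hcall: find the token, dispatch the handler. */
  static int kvmppc_rtas_hcall(struct kvm_vcpu *vcpu, u64 token,
                               struct rtas_args *args)
  {
          struct rtas_token_definition *d;

          list_for_each_entry(d, &vcpu->kvm->arch.rtas_tokens, list) {
                  if (d->token == token) {
                          d->handler->handler(vcpu, args);
                          return 0;       /* handled in the kernel */
                  }
          }
          return -ENOENT;                 /* unknown token: push up to userspace */
  }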


Alex



Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-04 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 07:39:52PM +0200, Alexander Graf wrote:
 
 On 03.04.2013, at 19:37, Scott Wood wrote:
 
  On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
  On 03.04.2013, at 04:17, Paul Mackerras wrote:
   On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
   On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
   On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
   +4.79 KVM_CREATE_DEVICE
   +
   +Capability: KVM_CAP_DEVICE_CTRL
  
   I notice this patch doesn't add this capability;
  
   Yes, it does (see below).
  
   you add it in a later patch.
  
   Maybe you're thinking of KVM_CAP_IRQ_MPIC?
  
   No, I was referring to the addition to kvm_dev_ioctl_check_extension()
   of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
   KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
   userspace queries the KVM_CAP_DEVICE_CTRL capability.
  
   +/* ioctl for vm fd */
   +#define KVM_CREATE_DEVICE  _IOWR(KVMIO,  0xe0, struct
   kvm_create_device)
  
   This define should go with the other VM ioctls, otherwise the next
   person to add a VM ioctl will probably miss it and reuse the 0xe0
   code.
  
   That's actually why I moved it to a new section, with device control
   ioctls getting their own range, as the legacy device model and
   some other things did.  0xe0 is not the next ioctl that would be
   used for either vm or vcpu.  The ioctl numbering is actually already
   a mess, with sometimes care being taken to keep vcpu and vm ioctls
   from overlapping, but on other places overlapping does happen.  I'm
   not sure what exactly I should do here.
  
   Well, even if you are using a new range, I still think that
   KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
   ioctls.  I guess it's ultimately up to the maintainers.
  I agree. Things get confusing for VM ioctls otherwise.
  
  Things are already confusing. :-)
  
  I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what number 
  should it get?  The last VM ioctl is 0xab (which is also a VCPU ioctl).  
  Should I use 0xac (which is also a VCPU ioctl)?  Or should I try to avoid a 
  conflict, as was sometimes done in the past -- in which case, which number 
  should I use?
 
 Gleb, Marcelo?
 
 
Yes, ioctl number assignments are a little bit of a mess :( There are 14
numbers that are used twice. Some of them are used twice for the same type
of fd, but with different direction bits, and there is one, 0x9b, that is
used twice with the same IO direction and even the same parameter size
(just checked)! Good thing the second use is only on IA64, which is almost
dead.

Although ioctl numbers for different types of fds do not strictly need to be
distinct, let's try to make them unique. Scott, please put KVM_CREATE_DEVICE
near the device model ioctls and give it a unique number.
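For reference, the number only has to be unique within KVM's ioctl 'magic'
type; direction and argument size are encoded separately, which is why an
overlap between a vm and a vcpu ioctl is merely confusing rather than broken.
A minimal illustration (KVMIO and _IOWR are the real definitions, the struct
mirrors the patch):

  #include <linux/ioctl.h>
  #include <linux/types.h>

  #define KVMIO 0xAE      /* KVM's ioctl 'magic' type, from <linux/kvm.h> */

  struct kvm_create_device {
          __u32 type;     /* in: KVM_DEV_TYPE_xxx */
          __u32 fd;       /* out: device handle */
          __u32 flags;    /* in: KVM_CREATE_DEVICE_xxx */
  };

  /* _IOWR packs direction | size | type | number; the kernel dispatches on
   * the file descriptor first, so the number mainly matters to readers. */
  #define KVM_CREATE_DEVICE _IOWR(KVMIO, 0xe0, struct kvm_create_device)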

--
Gleb.


Re: [PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Cornelia Huck
On Thu, 4 Apr 2013 11:28:35 +0200
Alexander Graf ag...@suse.de wrote:

 
 On 04.04.2013, at 10:25, Cornelia Huck wrote:
 
  ccw_io_helper neglected to reset vcdev->err after a new channel
  program had been successfully started, resulting in stale errors
  delivered after one I/O failed. Reset the error after a new
  channel program has been successfully started with no old I/O
  pending.
  
  Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
 
 Shouldn't this be CC: stable?

I don't think so, current code will not continue after errors anyway.

 
 
 Alex
 
  ---
  drivers/s390/kvm/virtio_ccw.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)
  
  diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
  index 42d507c..5948f2a 100644
  --- a/drivers/s390/kvm/virtio_ccw.c
  +++ b/drivers/s390/kvm/virtio_ccw.c
  @@ -133,8 +133,11 @@ static int ccw_io_helper(struct virtio_ccw_device 
  *vcdev,
  do {
   spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
   ret = ccw_device_start(vcdev->cdev, ccw, intparm, 0, 0);
   -   if (!ret)
   +   if (!ret) {
   +   if (!vcdev->curr_io)
   +   vcdev->err = 0;
   vcdev->curr_io |= flag;
   +   }
   spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
  cpu_relax();
  } while (ret == -EBUSY);
  -- 
  1.7.12.4
  
 



Re: [RFC PATCH v3 1/6] kvm: add device control API

2013-04-04 Thread Gleb Natapov
On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
 Currently, devices that are emulated inside KVM are configured in a
 hardcoded manner based on an assumption that any given architecture
 only has one way to do it.  If there's any need to access device state,
 it is done through inflexible one-purpose-only IOCTLs (e.g.
 KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
 cumbersome and depletes a limited numberspace.
 
 This API provides a mechanism to instantiate a device of a certain
 type, returning an ID that can be used to set/get attributes of the
 device.  Attributes may include configuration parameters (e.g.
 register base address), device state, operational commands, etc.  It
 is similar to the ONE_REG API, except that it acts on devices rather
 than vcpus.
 
 Both device types and individual attributes can be tested without having
 to create the device or get/set the attribute, without the need for
 separately managing enumerated capabilities.
 
 Signed-off-by: Scott Wood scottw...@freescale.com
 ---
 v3: remove some changes that were merged into this patch by accident,
 and fix the error documentation for KVM_CREATE_DEVICE.
 
 NOTE: I had some difficulty figuring out what ioctl numbers I should
 assign...  it seems that at one point care was taken to keep vcpu and
 vm ioctls separate, but some overlap exists now (despite not exhausting
 the ioctl space).  Some of that was my fault, but not all of it. :-)
 I moved to a new ioctl range for device control -- please let me know
 if there's something else you'd prefer I do.
 ---
  Documentation/virtual/kvm/api.txt|   70 
 ++
  Documentation/virtual/kvm/devices/README |1 +
  include/uapi/linux/kvm.h |   27 
  virt/kvm/kvm_main.c  |   31 +
  4 files changed, 129 insertions(+)
  create mode 100644 Documentation/virtual/kvm/devices/README
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index 976eb65..d52f3f9 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents 
 from the data
  written, then `n_invalid' invalid entries, invalidating any previously
  valid entries found.
  
 +4.79 KVM_CREATE_DEVICE
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: vm ioctl
 +Parameters: struct kvm_create_device (in/out)
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENODEV: The device type is unknown or unsupported
 +  EEXIST: Device already created, and this type of device may not
 +  be instantiated multiple times
 +
 +  Other error conditions may be defined by individual device types or
 +  have their standard meanings.
 +
 +Creates an emulated device in the kernel.  The file descriptor returned
 +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
 +
 +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
 +device type is supported (not necessarily whether it can be created
 +in the current vm).
 +
 +Individual devices should not define flags.  Attributes should be used
 +for specifying any behavior that is not implied by the device type
 +number.
 +
 +struct kvm_create_device {
 + __u32   type;   /* in: KVM_DEV_TYPE_xxx */
 + __u32   fd; /* out: device handle */
 + __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
 +};
 +
 +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
 +
 +Capability: KVM_CAP_DEVICE_CTRL
 +Type: device ioctl
 +Parameters: struct kvm_device_attr
 +Returns: 0 on success, -1 on error
 +Errors:
 +  ENXIO:  The group or attribute is unknown/unsupported for this device
 +  EPERM:  The attribute cannot (currently) be accessed this way
 +  (e.g. read-only attribute, or attribute that only makes
 +  sense when the device is in a different state)
 +
 +  Other error conditions may be defined by individual device types.
 +
 +Gets/sets a specified piece of device configuration and/or state.  The
 +semantics are device-specific.  See individual device documentation in
 +the devices directory.  As with ONE_REG, the size of the data
 +transferred is defined by the particular attribute.
 +
 +struct kvm_device_attr {
 + __u32   flags;  /* no flags currently defined */
 + __u32   group;  /* device-defined */
 + __u64   attr;   /* group-defined */
 + __u64   addr;   /* userspace address of attr data */
 +};
 +
Now that each device has its own fd, is there an advantage to enforcing a
common interface between different devices? If we do so, though, why not
handle file creation, ioctls and file descriptor lifetime in the common
code? Common code will have struct kvm_device with struct kvm_device_arch
and struct kvm_device_ops members. Instead of kvm_mpic_ioctl there will be
kvm_device_ioctl, which will dispatch ioctls to a device using the
kvm_device->ops->(set|get|has)_attr pointers.
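A rough sketch of the common-code dispatch being proposed here; the member and
ioctl names follow the thread, everything else is an assumption rather than
merged code:

  struct kvm_device;

  struct kvm_device_ops {
          const char *name;
          int (*create)(struct kvm_device *dev, u32 type);
          int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
          int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
          int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
          void (*destroy)(struct kvm_device *dev);
  };

  struct kvm_device {
          struct kvm_device_ops *ops;
          struct kvm *kvm;
          void *private;                  /* device-specific state, e.g. the MPIC */
          struct kvm_device_arch arch;    /* per-arch extension */
  };

  /* One ioctl handler for every device fd; devices only fill in ops. */
  static long kvm_device_ioctl_attr(struct kvm_device *dev,
                                    int (*accessor)(struct kvm_device *dev,
                                                    struct kvm_device_attr *attr),
                                    unsigned long arg)
  {
          struct kvm_device_attr attr;

          if (!accessor)
                  return -EPERM;
          if (copy_from_user(&attr, (void __user *)arg, sizeof(attr)))
                  return -EFAULT;
          return accessor(dev, &attr);
  }

  static long kvm_device_ioctl(struct file *filp, unsigned int ioctl,
                               unsigned long arg)
  {
          struct kvm_device *dev = filp->private_data;

          switch (ioctl) {
          case KVM_SET_DEVICE_ATTR:
                  return kvm_device_ioctl_attr(dev, dev->ops->set_attr, arg);
          case KVM_GET_DEVICE_ATTR:
                  return kvm_device_ioctl_attr(dev, dev->ops->get_attr, arg);
          case KVM_HAS_DEVICE_ATTR:
                  return kvm_device_ioctl_attr(dev, dev->ops->has_attr, arg);
          default:
                  return -ENOTTY;
          }
  }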

--
Gleb.

Re: [PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
 On Wed, Apr 03, 2013 at 11:00:39AM +0100, Marc Zyngier wrote:
 On 03/04/13 10:50, Will Deacon wrote:
 On Tue, Apr 02, 2013 at 02:25:12PM +0100, Marc Zyngier wrote:
 We're about to move to a init procedure where we rely on the
 fact that the init code fits in a single page. Make sure we
 align the idmap text on a page boundary, and that the code is
 not bigger than a single page.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/kernel/vmlinux.lds.S | 2 +-
  arch/arm/kvm/init.S   | 7 +++
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
 index b571484..d9dd265 100644
 --- a/arch/arm/kernel/vmlinux.lds.S
 +++ b/arch/arm/kernel/vmlinux.lds.S
 @@ -20,7 +20,7 @@
VMLINUX_SYMBOL(__idmap_text_start) = .; \
*(.idmap.text)  \
VMLINUX_SYMBOL(__idmap_text_end) = .;   \
 -  ALIGN_FUNCTION();   \
 +  . = ALIGN(PAGE_SIZE);   \
VMLINUX_SYMBOL(__hyp_idmap_text_start) = .; \
*(.hyp.idmap.text)  \
VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 9f37a79..35a463f 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -111,4 +111,11 @@ __do_hyp_init:
.globl __kvm_hyp_init_end
  __kvm_hyp_init_end:
  
 +  /*
 +   * The above code *must* fit in a single page for the trampoline
 +   * madness to work. Whoever decides to change it must make sure
 +   * we map the right amount of memory for the trampoline to work.
 +   * The line below ensures any breakage will get noticed.
 
 This comment is a little funny. What's this about changing it? And why
 madness? OK, it's a little mad.
 
 I think the nice thing to explain here is why .org would complain - I
 had to look in the AS reference to figure out that it will complain if
 the location pointer is forwarded outside the section - if that is the
 right reason?
 
 +   */
 +  .org__kvm_hyp_init + PAGE_SIZE
.popsection

 What effect does this have on the size of the kernel image? I'd expect the
 idmap code to be pretty small, so aligning to a page might be overkill a lot
 of the time.

 So we're indeed wasting memory between the kernel and the HYP idmaps.
 The reason for this is to ensure that the init code is not spread across
 two pages, which would make the trampoline harder to allocate.

 If there's a way to ensure that the init.S code will fit a single page,
 then we can remove this ALIGN(PAGE_SIZE).
 
 Just allocate a temporary page and copy the bit of code into there?  It
 should be easy enough to make sure it's location independent.  We can
 even give that page back later if we want and re-do the trick when a CPU
 gets hot-plugged if we want. No?

I'd really like to avoid freeing stuff, because it gives us even more
chances to screw our TLBs (we'd have to invalidate on unmap, and deal
with potential races between CPUs coming up).

We could allocate this page *if* the HYP init code happens to cross a
page boundary, and only then. I'll have a look at implementing this as
it should solve Will's issue (which is even worse on arm64 with 64k pages).
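The conditional Marc describes can be as simple as checking whether the first
and last byte of the HYP init text share a page; a sketch, with a made-up
helper name:

  extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];

  /* True if the init code straddles a page boundary and therefore needs the
   * separately allocated bounce page; otherwise the existing mapping is fine. */
  static bool hyp_init_crosses_page(void)
  {
          unsigned long start = (unsigned long)__hyp_idmap_text_start;
          unsigned long end   = (unsigned long)__hyp_idmap_text_end;

          return (start & PAGE_MASK) != ((end - 1) & PAGE_MASK);
  }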

M.
-- 
Jazz is not dead. It just smells funny...



Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
 On Wed, Apr 03, 2013 at 11:38:30AM +0100, Marc Zyngier wrote:
 On 03/04/13 11:07, Will Deacon wrote:
 On Tue, Apr 02, 2013 at 02:25:14PM +0100, Marc Zyngier wrote:
 Our HYP init code suffers from two major design issues:
 - it cannot support CPU hotplug, as we tear down the idmap very early
 - it cannot perform a TLB invalidation when switching from init to
   runtime mappings, as pages are manipulated from PL1 exclusively

 [...]

 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 35a463f..b2c6967 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -21,6 +21,7 @@
  #include asm/asm-offsets.h
  #include asm/kvm_asm.h
  #include asm/kvm_arm.h
 +#include asm/kvm_mmu.h
  
  /
   * Hypervisor initialization
 @@ -47,6 +48,9 @@ __kvm_hyp_init:
W(b).
  
  __do_hyp_init:
 +  cmp r2, #0  @ We have a SP?
 +  bne phase2  @ Yes, second stage init
 +
@ Set the HTTBR to point to the hypervisor PGD pointer passed
mcrrp15, 4, r0, r1, c2
  
 @@ -96,14 +100,35 @@ __do_hyp_init:
orr r0, r0, r1
isb
mcr p15, 4, r0, c1, c0, 0   @ HSCR
 -  isb
  
 -  @ Set stack pointer and return to the kernel
 +  eret
 +
 +phase2:
 +  @ Set stack pointer
mov sp, r2
  
@ Set HVBAR to point to the HYP vectors
mcr p15, 4, r3, c12, c0, 0  @ HVBAR
  
 +  @ Jump to the trampoline page
 +  ldr r2, =#PAGE_MASK

 Shifting right by PAGE_SHIFT can avoid the load.

 Not really. We're masking out the top bits of target and adding them
 to the trampoline base address, so shifting doesn't help.

 But, as you suggested offline, BFI can come to the rescue and make that
 code totally fun and unreadable. How about (untested):

  ldr r2, =#TRAMPOLINE_VA
  adr r3, target
  bfi r2, r3, #0, #PAGE_SHIFT
  mov pc, r2

 I really like it! :)

 
 What kind of drugs are you on?
 
 Ok, I actually like it too.

Implemented, tested, works.


 +  adr r3, target
 +  bic r3, r3, r2
 +  ldr r2, =#TRAMPOLINE_VA
 +  add r3, r3, r2
 +  mov pc, r3
 +
 +  nop

 insert dead chicken and voodoo chant here

 ... You know I'll never sleep no more ...

 
 Seriously, what kind of drugs are you guys on?

Someone did comment last year about the quality of the water in
Cambridge. He may have been right. But in this occurrence, it's only a
mild case of Frank Zappatis (Zomby Woof variety).

M.
-- 
Jazz is not dead. It just smells funny...



[PATCH] kvm: fix MMIO/PIO collision misdetection

2013-04-04 Thread Michael S. Tsirkin
PIO and MMIO are separate address spaces, but the
ioeventfd registration code mistakenly detected
two eventfds as duplicates if they used the same address,
even if one was PIO and the other MMIO.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 virt/kvm/eventfd.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index adb17f2..93e5b05 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -577,6 +577,7 @@ struct _ioeventfd {
struct eventfd_ctx  *eventfd;
u64  datamatch;
struct kvm_io_device dev;
+   u8   bus_idx;
bool wildcard;
 };
 
@@ -669,7 +670,8 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
struct _ioeventfd *_p;
 
list_for_each_entry(_p, &kvm->ioeventfds, list)
-   if (_p->addr == p->addr && _p->length == p->length &&
+   if (_p->bus_idx == p->bus_idx &&
+   _p->addr == p->addr && _p->length == p->length &&
(_p->wildcard || p->wildcard ||
 _p->datamatch == p->datamatch))
return true;
@@ -717,6 +719,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
 
INIT_LIST_HEAD(&p->list);
p->addr    = args->addr;
+   p->bus_idx = bus_idx;
p->length  = args->len;
p->eventfd = eventfd;
 
@@ -775,7 +778,8 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
kvm_ioeventfd *args)
list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
bool wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);
 
-   if (p->eventfd != eventfd  ||
+   if (p->bus_idx != bus_idx ||
+   p->eventfd != eventfd  ||
p->addr != args->addr  ||
p->length != args->len ||
p->wildcard != wildcard)
-- 
MST


Re: [PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 12:20, Cornelia Huck wrote:

 On Thu, 4 Apr 2013 11:28:35 +0200
 Alexander Graf ag...@suse.de wrote:
 
 
 On 04.04.2013, at 10:25, Cornelia Huck wrote:
 
  ccw_io_helper neglected to reset vcdev->err after a new channel
 program had been successfully started, resulting in stale errors
 delivered after one I/O failed. Reset the error after a new
 channel program has been successfully started with no old I/O
 pending.
 
 Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
 
 Shouldn't this be CC: stable?
 
 I don't think so, current code will not continue after errors anyway.

Yeah, I got confused by the cover letter for a single patch. Usually, when you 
only send a single patch, you wouldn't also send a cover letter :).


Alex



[PATCH v2 0/6] kvm: pci PORT IO MMIO and PV MMIO speed tests

2013-04-04 Thread Michael S. Tsirkin
These patches add a test device, useful to measure speed of MMIO versus PIO, in
different configurations. As I didn't want to reserve a hardcoded range
of memory, I added a pci device for this instead.  Used together with the
kvm unittest patches I posted on the kvm mailing list.

To use, simply add the device on the pci bus.
Example test output:
vmcall 1519

outl_to_kernel 1745

mmio-no-eventfd:pci-mem 9075
mmio-wildcard-eventfd:pci-mem 3529
mmio-datamatch-eventfd:pci-mem 3509
mmio-pv-eventfd:pci-mem 1878
portio-no-eventfd:pci-io 5535
portio-wildcard-eventfd:pci-io 1846
portio-datamatch-eventfd:pci-io 1848
portio-pv-eventfd:pci-io 1842


The first interesting conclusion is that the overhead of an MMIO
exit to QEMU, as compared to PIO, is double that of
an MMIO ioeventfd as compared to a PIO eventfd. Is this a known fact?

Second is that PV MMIO is almost as fast as portio, so it
looks like a viable replacement when we have a lot of
devices such that we run out of IO space.

Third is that eventfd has a small but measurable overhead,
and that hypercalls are measurably faster than port io.
The difference is probably in hardware. Whether this justifies
considering hypercalls for e.g. virtio, is open for discussion.
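For context, this is roughly how a guest-side micro-benchmark arrives at
numbers like the ones above: read the TSC, trigger N exits, read it again.
A sketch in the spirit of kvm-unit-tests, not the actual test code:

  #include <stdint.h>

  static inline uint64_t rdtsc(void)
  {
          uint32_t lo, hi;
          asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
          return ((uint64_t)hi << 32) | lo;
  }

  static inline void outl(uint16_t port, uint32_t val)
  {
          asm volatile("outl %0, %1" : : "a"(val), "Nd"(port));
  }

  /* Average cycles per exit, e.g. the "outl_to_kernel" line above. */
  static uint64_t time_outl(uint16_t port, unsigned long iters)
  {
          uint64_t start = rdtsc();

          for (unsigned long i = 0; i < iters; i++)
                  outl(port, 0);

          return (rdtsc() - start) / iters;
  }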

I also had to extend kvm in a minor way, making all ioeventfd
options accessible through the API. This actually needs less code
than checking that users DTRT.

Patches 1-5 are useful in themselves.
Patches 5 and 6 add support for PV MMIO eventfds and
add the test for it, tested against a kernel patch
I'll send shortly. They can be applied separately when
we are sure the kernel interface is going to be there.

Michael S. Tsirkin (6):
  kvm: remove unused APIs
  kvm: support any size for pio eventfd
  kvm: support non datamatch ioeventfd
  pci: add pci test device
  kvm: add PV MMIO
  pci-testdev: add pv mmio test

 docs/specs/pci-testdev.txt |  26 
 hw/Makefile.objs   |   1 +
 hw/dataplane/hostmem.c |   1 +
 hw/ivshmem.c   |   2 +
 hw/pci-testdev.c   | 311 +
 hw/pci/pci.h   |   1 +
 hw/vhost.c |   4 +-
 hw/virtio-pci.c|   4 +-
 include/exec/memory.h  |  10 ++
 include/sysemu/kvm.h   |   4 -
 kvm-all.c  | 142 +++--
 kvm-stub.c |  10 --
 linux-headers/linux/kvm.h  |   8 ++
 memory.c   |   9 +-
 14 files changed, 448 insertions(+), 85 deletions(-)
 create mode 100644 docs/specs/pci-testdev.txt
 create mode 100644 hw/pci-testdev.c

-- 
MST


[PATCH v2 4/6] pci: add pci test device

2013-04-04 Thread Michael S. Tsirkin
This device is used for kvm unit tests;
currently it supports testing the performance of ioeventfd.
Using the updated kvm unittest, here's an example output:
mmio-no-eventfd:pci-mem 8796
mmio-wildcard-eventfd:pci-mem 3609
mmio-datamatch-eventfd:pci-mem 3685
portio-no-eventfd:pci-io 5287
portio-wildcard-eventfd:pci-io 1762
portio-datamatch-eventfd:pci-io 1777

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 docs/specs/pci-testdev.txt |  26 
 hw/Makefile.objs   |   1 +
 hw/pci-testdev.c   | 306 +
 hw/pci/pci.h   |   1 +
 4 files changed, 334 insertions(+)
 create mode 100644 docs/specs/pci-testdev.txt
 create mode 100644 hw/pci-testdev.c

diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
new file mode 100644
index 000..128ae22
--- /dev/null
+++ b/docs/specs/pci-testdev.txt
@@ -0,0 +1,26 @@
+pci-test is a device used for testing low level IO
+
+device implements up to two BARs: BAR0 and BAR1.
+Each BAR can be memory or IO. Guests must detect
+BAR type and act accordingly.
+
+Each BAR size is up to 4K bytes.
+Each BAR starts with the following header:
+
+typedef struct PCITestDevHdr {
+uint8_t test;  - write-only, starts a given test number
+uint8_t width_type; - read-only, type and width of access for a given 
test.
+   1,2,4 for byte,word or long write.
+   any other value if test not supported on this BAR
+uint8_t pad0[2];
+uint32_t offset; - read-only, offset in this BAR for a given test
+uint32_t data;- read-only, data to use for a given test
+uint32_t count;  - for debugging. number of writes detected.
+uint8_t name[]; - for debugging. 0-terminated ASCII string.
+} PCITestDevHdr;
+
+All registers are little endian.
+
+device is expected to always implement tests 0 to N on each BAR, and to add new
+tests with higher numbers.  In this way a guest can scan test numbers until it
+detects an access type that it does not support on this BAR, then stop.
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index eb7eb31..b4e3b1d 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -12,6 +12,7 @@ common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
 common-obj-$(CONFIG_VIRTIO) += virtio-bus.o
 common-obj-y += fw_cfg.o
 common-obj-$(CONFIG_PCI) += pci_bridge_dev.o
+common-obj-$(CONFIG_PCI) += pci-testdev.o
 common-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
 common-obj-$(CONFIG_PCI) += i82801b11.o
 common-obj-y += watchdog.o
diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
new file mode 100644
index 000..9486624
--- /dev/null
+++ b/hw/pci-testdev.c
@@ -0,0 +1,306 @@
+#include hw/hw.h
+#include hw/pci/pci.h
+#include qemu/event_notifier.h
+#include qemu/osdep.h
+
+typedef struct PCITestDevHdr {
+uint8_t test;
+uint8_t width;
+uint8_t pad0[2];
+uint32_t offset;
+uint8_t data;
+uint8_t pad1[3];
+uint32_t count;
+uint8_t name[];
+} PCITestDevHdr;
+
+typedef struct IOTest {
+MemoryRegion *mr;
+EventNotifier notifier;
+bool hasnotifier;
+unsigned size;
+bool match_data;
+PCITestDevHdr *hdr;
+unsigned bufsize;
+} IOTest;
+
+#define IOTEST_DATAMATCH 0xFA
+#define IOTEST_NOMATCH   0xCE
+
+#define IOTEST_IOSIZE 128
+#define IOTEST_MEMSIZE 2048
+
+static const char *iotest_test[] = {
+"no-eventfd",
+"wildcard-eventfd",
+"datamatch-eventfd"
+};
+
+static const char *iotest_type[] = {
+"mmio",
+"portio"
+};
+
+#define IOTEST_TEST(i) (iotest_test[((i) % ARRAY_SIZE(iotest_test))])
+#define IOTEST_TYPE(i) (iotest_type[((i) / ARRAY_SIZE(iotest_test))])
+#define IOTEST_MAX_TEST (ARRAY_SIZE(iotest_test))
+#define IOTEST_MAX_TYPE (ARRAY_SIZE(iotest_type))
+#define IOTEST_MAX (IOTEST_MAX_TEST * IOTEST_MAX_TYPE)
+
+enum {
+IOTEST_ACCESS_NAME,
+IOTEST_ACCESS_DATA,
+IOTEST_ACCESS_MAX,
+};
+
+#define IOTEST_ACCESS_TYPE uint8_t
+#define IOTEST_ACCESS_WIDTH (sizeof(uint8_t))
+
+typedef struct PCITestDevState {
+PCIDevice dev;
+MemoryRegion mmio;
+MemoryRegion portio;
+IOTest *tests;
+int current;
+} PCITestDevState;
+
+#define IOTEST_IS_MEM(i) (strcmp(IOTEST_TYPE(i), "portio"))
+#define IOTEST_REGION(d, i) (IOTEST_IS_MEM(i) ?  &(d)->mmio : &(d)->portio)
+#define IOTEST_SIZE(i) (IOTEST_IS_MEM(i) ? IOTEST_MEMSIZE : IOTEST_IOSIZE)
+#define IOTEST_PCI_BAR(i) (IOTEST_IS_MEM(i) ? PCI_BASE_ADDRESS_SPACE_MEMORY : \
+   PCI_BASE_ADDRESS_SPACE_IO)
+
+static int pci_testdev_start(IOTest *test)
+{
+test->hdr->count = 0;
+if (!test->hasnotifier) {
+return 0;
+}
+event_notifier_test_and_clear(&test->notifier);
+memory_region_add_eventfd(test->mr,
+  le32_to_cpu(test->hdr->offset),
+  test->size,
+  test->match_data,
+  test->hdr->data,
+

[PATCH v2 1/6] kvm: remove unused APIs

2013-04-04 Thread Michael S. Tsirkin
These are only used internally now; move them
out of the header and out of the stub.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 include/sysemu/kvm.h |   4 --
 kvm-all.c| 107 ++-
 kvm-stub.c   |  10 -
 3 files changed, 54 insertions(+), 67 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index f2d97b5..4a65d9f 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -283,10 +283,6 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void 
*ram_addr,
 #endif
 
 #endif
-int kvm_set_ioeventfd_mmio(int fd, uint32_t adr, uint32_t val, bool assign,
-   uint32_t size);
-
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool 
assign);
 
 int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg);
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg);
diff --git a/kvm-all.c b/kvm-all.c
index 9b433d3..ca9775d 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -500,6 +500,60 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 return ret;
 }
 
+static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool 
assign,
+  uint32_t size)
+{
+int ret;
+struct kvm_ioeventfd iofd;
+
+iofd.datamatch = val;
+iofd.addr = addr;
+iofd.len = size;
+iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.fd = fd;
+
+if (!kvm_enabled()) {
+return -ENOSYS;
+}
+
+if (!assign) {
+iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+}
+
+ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
+
+if (ret < 0) {
+return -errno;
+}
+
+return 0;
+}
+
+static int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val,
+  bool assign)
+{
+struct kvm_ioeventfd kick = {
+.datamatch = val,
+.addr = addr,
+.len = 2,
+.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
+.fd = fd,
+};
+int r;
+if (!kvm_enabled()) {
+return -ENOSYS;
+}
+if (!assign) {
+kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+}
+r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
+if (r < 0) {
+return r;
+}
+return 0;
+}
+
+
 static int kvm_check_many_ioeventfds(void)
 {
 /* Userspace can use ioeventfd for io notification.  This requires a host
@@ -1971,59 +2025,6 @@ int kvm_set_signal_mask(CPUArchState *env, const 
sigset_t *sigset)
 
 return r;
 }
-
-int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool assign,
-   uint32_t size)
-{
-int ret;
-struct kvm_ioeventfd iofd;
-
-iofd.datamatch = val;
-iofd.addr = addr;
-iofd.len = size;
-iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
-iofd.fd = fd;
-
-if (!kvm_enabled()) {
-return -ENOSYS;
-}
-
-if (!assign) {
-iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
-}
-
-ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
-
-if (ret < 0) {
-return -errno;
-}
-
-return 0;
-}
-
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool 
assign)
-{
-struct kvm_ioeventfd kick = {
-.datamatch = val,
-.addr = addr,
-.len = 2,
-.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
-.fd = fd,
-};
-int r;
-if (!kvm_enabled()) {
-return -ENOSYS;
-}
-if (!assign) {
-kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
-}
-r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
-if (r < 0) {
-return r;
-}
-return 0;
-}
-
 int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 {
 return kvm_arch_on_sigbus_vcpu(cpu, code, addr);
diff --git a/kvm-stub.c b/kvm-stub.c
index 760aadc..ef1f201 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -102,16 +102,6 @@ int kvm_set_signal_mask(CPUArchState *env, const sigset_t 
*sigset)
 }
 #endif
 
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool 
assign)
-{
-return -ENOSYS;
-}
-
-int kvm_set_ioeventfd_mmio(int fd, uint32_t adr, uint32_t val, bool assign, 
uint32_t len)
-{
-return -ENOSYS;
-}
-
 int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 {
 return 1;
-- 
MST



[PATCH v2 2/6] kvm: support any size for pio eventfd

2013-04-04 Thread Michael S. Tsirkin
Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 kvm-all.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index ca9775d..589e37c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -500,8 +500,8 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 return ret;
 }
 
-static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool 
assign,
-  uint32_t size)
+static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val,
+  bool assign, uint32_t size)
 {
 int ret;
 struct kvm_ioeventfd iofd;
@@ -529,13 +529,13 @@ static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, 
uint32_t val, bool assi
 return 0;
 }
 
-static int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val,
-  bool assign)
+static int kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint16_t val,
+ bool assign, uint32_t size)
 {
 struct kvm_ioeventfd kick = {
 .datamatch = val,
 .addr = addr,
-.len = 2,
+.len = size,
 .flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
 .fd = fd,
 };
@@ -571,7 +571,7 @@ static int kvm_check_many_ioeventfds(void)
 if (ioeventfds[i] < 0) {
 break;
 }
-ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2);
 if (ret < 0) {
 close(ioeventfds[i]);
 break;
@@ -582,7 +582,7 @@ static int kvm_check_many_ioeventfds(void)
 ret = i == ARRAY_SIZE(ioeventfds);
 
 while (i-- > 0) {
-kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2);
 close(ioeventfds[i]);
 }
 return ret;
@@ -834,10 +834,10 @@ static void kvm_io_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size == 2);
+assert(match_data  section-size = 8);
 
-r = kvm_set_ioeventfd_pio_word(fd, section-offset_within_address_space,
-   data, true);
+r = kvm_set_ioeventfd_pio(fd, section-offset_within_address_space,
+  data, true, section-size);
 if (r  0) {
 abort();
 }
@@ -852,8 +852,8 @@ static void kvm_io_ioeventfd_del(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-r = kvm_set_ioeventfd_pio_word(fd, section-offset_within_address_space,
-   data, false);
+r = kvm_set_ioeventfd_pio(fd, section-offset_within_address_space,
+  data, false, section-size);
 if (r  0) {
 abort();
 }
-- 
MST



[PATCH v2 3/6] kvm: support non datamatch ioeventfd

2013-04-04 Thread Michael S. Tsirkin
Adding restrictions just adds code.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 kvm-all.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 589e37c..ce823f9 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -501,21 +501,24 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 }
 
 static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val,
-  bool assign, uint32_t size)
+  bool assign, uint32_t size, bool datamatch)
 {
 int ret;
 struct kvm_ioeventfd iofd;
 
-iofd.datamatch = val;
+iofd.datamatch = datamatch ? val : 0;
 iofd.addr = addr;
 iofd.len = size;
-iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.flags = 0;
 iofd.fd = fd;
 
 if (!kvm_enabled()) {
 return -ENOSYS;
 }
 
+if (datamatch) {
+iofd.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
+}
 if (!assign) {
 iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
 }
@@ -530,19 +533,22 @@ static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, 
uint32_t val,
 }
 
 static int kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint16_t val,
- bool assign, uint32_t size)
+ bool assign, uint32_t size, bool datamatch)
 {
 struct kvm_ioeventfd kick = {
-.datamatch = val,
+.datamatch = datamatch ? val : 0,
 .addr = addr,
+.flags = KVM_IOEVENTFD_FLAG_PIO,
 .len = size,
-.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
 .fd = fd,
 };
 int r;
 if (!kvm_enabled()) {
 return -ENOSYS;
 }
+if (datamatch) {
+kick.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
+}
 if (!assign) {
 kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
 }
@@ -571,7 +577,7 @@ static int kvm_check_many_ioeventfds(void)
 if (ioeventfds[i] < 0) {
 break;
 }
-ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2);
+ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2, true);
 if (ret < 0) {
 close(ioeventfds[i]);
 break;
@@ -582,7 +588,7 @@ static int kvm_check_many_ioeventfds(void)
 ret = i == ARRAY_SIZE(ioeventfds);
 
 while (i-- > 0) {
-kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2);
+kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2, true);
 close(ioeventfds[i]);
 }
 return ret;
@@ -802,10 +808,8 @@ static void kvm_mem_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size <= 8);
-
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
-   data, true, section->size);
+   data, true, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -820,7 +824,7 @@ static void kvm_mem_ioeventfd_del(MemoryListener *listener,
 int r;
 
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
-   data, false, section->size);
+   data, false, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -834,10 +838,8 @@ static void kvm_io_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size <= 8);
-
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
-  data, true, section->size);
+  data, true, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -853,7 +855,7 @@ static void kvm_io_ioeventfd_del(MemoryListener *listener,
 int r;
 
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
-  data, false, section->size);
+  data, false, section->size, match_data);
 if (r < 0) {
 abort();
 }
-- 
MST



[PATCH v2 6/6] pci-testdev: add pv mmio test

2013-04-04 Thread Michael S. Tsirkin
Add ability to test the speed of PV MMIO.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/pci-testdev.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
index f0ebf99..9a50631 100644
--- a/hw/pci-testdev.c
+++ b/hw/pci-testdev.c
@@ -19,6 +19,7 @@ typedef struct IOTest {
 EventNotifier notifier;
 bool hasnotifier;
 unsigned size;
+bool pv;
 bool match_data;
 PCITestDevHdr *hdr;
 unsigned bufsize;
@@ -33,7 +34,8 @@ typedef struct IOTest {
 static const char *iotest_test[] = {
 "no-eventfd",
 "wildcard-eventfd",
-"datamatch-eventfd"
+"datamatch-eventfd",
+"pv-eventfd"
 };
 
 static const char *iotest_type[] = {
@@ -80,7 +82,7 @@ static int pci_testdev_start(IOTest *test)
 memory_region_add_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
-  false,
+  test->pv,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -95,7 +97,7 @@ static void pci_testdev_stop(IOTest *test)
 memory_region_del_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
-  false,
+  test->pv,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -237,7 +239,8 @@ static int pci_testdev_init(PCIDevice *pci_dev)
 g_free(name);
 test->hdr->offset = cpu_to_le32(IOTEST_SIZE(i) + i * IOTEST_ACCESS_WIDTH);
 test->size = IOTEST_ACCESS_WIDTH;
-test->match_data = strcmp(IOTEST_TEST(i), "wildcard-eventfd");
+test->pv = !strcmp(IOTEST_TEST(i), "pv-eventfd");
+test->match_data = !strcmp(IOTEST_TEST(i), "datamatch-eventfd");
 test->hdr->test = i;
 test->hdr->data = test->match_data ? IOTEST_DATAMATCH : IOTEST_NOMATCH;
 test->hdr->width = IOTEST_ACCESS_WIDTH;
-- 
MST


[PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
Add an option for users to specify PV eventfd
listeners, and utilize KVM's PV MMIO underneath.
Update all callers.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 hw/dataplane/hostmem.c|  1 +
 hw/ivshmem.c  |  2 ++
 hw/pci-testdev.c  |  2 ++
 hw/vhost.c|  4 ++--
 hw/virtio-pci.c   |  4 ++--
 include/exec/memory.h | 10 ++
 kvm-all.c | 15 ---
 linux-headers/linux/kvm.h |  8 
 memory.c  |  9 +++--
 9 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
index 380537e..41d43e7 100644
--- a/hw/dataplane/hostmem.c
+++ b/hw/dataplane/hostmem.c
@@ -126,6 +126,7 @@ static void hostmem_listener_section_dummy(MemoryListener 
*listener,
 
 static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
MemoryRegionSection *section,
+   bool pv,
bool match_data, uint64_t data,
EventNotifier *e)
 {
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 68a2cf2..28e5978 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int posn, 
int i)
 memory_region_add_eventfd(&s->ivshmem_mmio,
   DOORBELL,
   4,
+  false,
   true,
   (posn << 16) | i,
   &s->peers[posn].eventfds[i]);
@@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int posn, 
int i)
 memory_region_del_eventfd(&s->ivshmem_mmio,
   DOORBELL,
   4,
+  false,
   true,
   (posn << 16) | i,
   &s->peers[posn].eventfds[i]);
diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
index 9486624..f0ebf99 100644
--- a/hw/pci-testdev.c
+++ b/hw/pci-testdev.c
@@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
 memory_region_add_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
+  false,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
 memory_region_del_eventfd(test-mr,
   le32_to_cpu(test-hdr-offset),
   test-size,
+  false,
   test-match_data,
   test-hdr-data,
   test-notifier);
diff --git a/hw/vhost.c b/hw/vhost.c
index 4d6aee3..9ad7101 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
 }
 
 static void vhost_eventfd_add(MemoryListener *listener,
-  MemoryRegionSection *section,
+  MemoryRegionSection *section, bool pv,
   bool match_data, uint64_t data, EventNotifier *e)
 {
 }
 
 static void vhost_eventfd_del(MemoryListener *listener,
-  MemoryRegionSection *section,
+  MemoryRegionSection *section, bool pv,
   bool match_data, uint64_t data, EventNotifier *e)
 {
 }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 19965e5..f9ff994 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -189,10 +189,10 @@ static int 
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
 }
 virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
 memory_region_add_eventfd(proxy-bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
-  true, n, notifier);
+  false, true, n, notifier);
 } else {
 memory_region_del_eventfd(proxy-bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
-  true, n, notifier);
+  false, true, n, notifier);
 virtio_queue_set_host_notifier_fd_handler(vq, false, false);
 event_notifier_cleanup(notifier);
 }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2322732..3612004 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -215,8 +215,10 @@ struct MemoryListener {
 void (*log_global_start)(MemoryListener *listener);
 void (*log_global_stop)(MemoryListener *listener);
 void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
+bool pv,
 bool match_data, uint64_t data, EventNotifier *e);
 void 

[PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Add an interface for userspace to specify this per-address, we can
use this e.g. for virtio.

The implementation adds a separate bus internally. This serves two
purposes:
- minimize overhead for old userspace that does not use PV MMIO
- minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an invalid length)

At the moment, this optimization is only supported for EPT on x86 and
silently ignored for NPT and MMU, so everything works correctly but
slowly.

TODO: NPT, MMU and non x86 architectures.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin m...@redhat.com
---
 arch/x86/kvm/vmx.c   |  4 
 arch/x86/kvm/x86.c   |  1 +
 include/linux/kvm_host.h |  1 +
 include/uapi/linux/kvm.h |  9 +
 virt/kvm/eventfd.c   | 47 ++-
 virt/kvm/kvm_main.c  |  1 +
 6 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..cdaac9b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5127,6 +5127,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa_t gpa;
 
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   if (!kvm_io_bus_write(vcpu-kvm, KVM_PV_MMIO_BUS, gpa, 0, NULL)) {
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
 
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == 1))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f19ac0a..b9223d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2483,6 +2483,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_IRQFD:
case KVM_CAP_IOEVENTFD:
+   case KVM_CAP_IOEVENTFD_PV_MMIO:
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..35b74cd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -149,6 +149,7 @@ struct kvm_io_bus {
 enum kvm_bus {
KVM_MMIO_BUS,
KVM_PIO_BUS,
+   KVM_PV_MMIO_BUS,
KVM_NR_BUSES
 };
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3c56ba3..61783ee 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -449,11 +449,19 @@ enum {
kvm_ioeventfd_flag_nr_datamatch,
kvm_ioeventfd_flag_nr_pio,
kvm_ioeventfd_flag_nr_deassign,
+   kvm_ioeventfd_flag_nr_pv_mmio,
kvm_ioeventfd_flag_nr_max,
 };
 
 #define KVM_IOEVENTFD_FLAG_DATAMATCH (1  kvm_ioeventfd_flag_nr_datamatch)
 #define KVM_IOEVENTFD_FLAG_PIO   (1  kvm_ioeventfd_flag_nr_pio)
+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */
+#define KVM_IOEVENTFD_FLAG_PV_MMIO   (1  kvm_ioeventfd_flag_nr_pv_mmio)
 #define KVM_IOEVENTFD_FLAG_DEASSIGN  (1  kvm_ioeventfd_flag_nr_deassign)
 
 #define KVM_IOEVENTFD_VALID_FLAG_MASK  ((1  kvm_ioeventfd_flag_nr_max) - 1)
@@ -665,6 +673,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_IOEVENTFD_PV_MMIO 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 93e5b05..1b7619e 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -579,6 +579,7 @@ struct _ioeventfd {
struct kvm_io_device dev;
u8   bus_idx;
bool wildcard;
+   bool pvmmio;
 };
 
 static inline struct _ioeventfd *
@@ -600,7 +601,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int 
len, const void *val)
 {
u64 _val;
 
-   if (!(addr == p-addr  len == p-length))
+   if (addr != p-addr)
+   /* address must be precise for a hit */
+   return false;
+
+   if (p-pvmmio)
+   /* pvmmio only looks at the address, so always a hit */
+   return true;
+
+   if (len != p-length)
/* address-range must be precise for a hit */
return false;
 
@@ -671,9 +680,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
 
list_for_each_entry(_p, kvm-ioeventfds, list)
if (_p-bus_idx == p-bus_idx 
-   _p-addr == p-addr  _p-length == p-length 
-  

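For reference, registering such a PV MMIO eventfd from userspace would look
roughly like the existing KVM_IOEVENTFD setup with the new flag set.  A
minimal sketch (helper name and error handling are illustrative only; the
struct and ioctl are the existing ABI plus the flag added above):

#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/eventfd.h>
#include <linux/kvm.h>

/* Register a write eventfd for a PV MMIO doorbell address. */
static int add_pv_mmio_eventfd(int vm_fd, uint64_t doorbell_gpa)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	struct kvm_ioeventfd ioefd = {
		.addr  = doorbell_gpa,
		.len   = 2,	/* ignored for the PV lookup, kept valid */
		.fd    = efd,
		.flags = KVM_IOEVENTFD_FLAG_PV_MMIO,	/* no PIO, no DATAMATCH */
	};

	if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &ioefd) < 0)
		return -1;
	return efd;
}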
Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:

 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com

This still uses page fault intercepts which are orders of magnitude slower 
than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
can use to invoke MMIO accesses towards the host based on physical addresses 
with explicit length encodings?

That way you simplify and speed up all code paths, exceeding the speed of PIO 
exits even. It should also be quite easily portable, as all other platforms 
have hypercalls available as well.
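For comparison, a guest-side kick through such a hypercall might look roughly
like the sketch below; the hypercall number and argument layout are purely
illustrative assumptions, not an existing KVM ABI:

#include <linux/kvm_para.h>

/* Hypothetical doorbell hypercall: the number, argument order and macro
 * name are assumptions for illustration only.
 */
#define KVM_HC_MMIO_STORE	9

static inline void kick_via_hypercall(unsigned long doorbell_gpa,
				      unsigned long len, unsigned long val)
{
	kvm_hypercall3(KVM_HC_MMIO_STORE, doorbell_gpa, len, val);
}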


Alex



Re: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support

2013-04-04 Thread Gleb Natapov
On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote:
 From: Yang Zhang yang.z.zh...@intel.com
 
 Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 ---
  arch/x86/kvm/lapic.c |9 +
  arch/x86/kvm/lapic.h |2 ++
  virt/kvm/ioapic.c|   43 +++
  virt/kvm/ioapic.h|1 +
  4 files changed, 55 insertions(+), 0 deletions(-)
 
 diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
 index 96ab160..9c041fa 100644
 --- a/arch/x86/kvm/lapic.c
 +++ b/arch/x86/kvm/lapic.c
 @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap)
   return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
  }
  
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
 +{
 + struct kvm_lapic *apic = vcpu-arch.apic;
 +
 + return apic_test_vector(vector, apic-regs + APIC_ISR) ||
 + apic_test_vector(vector, apic-regs + APIC_IRR);
 +}
 +
  static inline void apic_set_vector(int vec, void *bitmap)
  {
   set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
 @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
   apic-highest_isr_cache = -1;
   kvm_x86_ops-hwapic_isr_update(vcpu-kvm, apic_find_highest_isr(apic));
   kvm_make_request(KVM_REQ_EVENT, vcpu);
 + kvm_rtc_irq_restore(vcpu);
  }
  
  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
 diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
 index 967519c..004d2ad 100644
 --- a/arch/x86/kvm/lapic.h
 +++ b/arch/x86/kvm/lapic.h
 @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu 
 *vcpu)
   return vcpu-arch.apic-pending_events;
  }
  
 +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 +
  #endif
 diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
 index 8664812..0b12b17 100644
 --- a/virt/kvm/ioapic.c
 +++ b/virt/kvm/ioapic.c
 @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct 
 kvm_ioapic *ioapic,
   return result;
  }
  
 +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
 +{
 + ioapic-rtc_status.pending_eoi = 0;
 + bitmap_zero(ioapic-rtc_status.dest_map, KVM_MAX_VCPUS);
 +}
 +
 +static void rtc_irq_restore(struct kvm_ioapic *ioapic)
 +{
 + struct kvm_vcpu *vcpu;
 + int vector, i, pending_eoi = 0;
 +
 + if (RTC_GSI = IOAPIC_NUM_PINS)
 + return;
 +
 + vector = ioapic-redirtbl[RTC_GSI].fields.vector;
 + kvm_for_each_vcpu(i, vcpu, ioapic-kvm) {
 + if (kvm_apic_pending_eoi(vcpu, vector)) {
 + pending_eoi++;
 + __set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map);
You should clear dest_map at the beginning to get rid of stale bits.

 + }
 + }
 + ioapic-rtc_status.pending_eoi = pending_eoi;
 +}
 +
 +void kvm_rtc_irq_restore(struct kvm_vcpu *vcpu)
 +{
 + struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic;
 + int vector;
 +
 + if (!ioapic)
 + return;
 +
Can this be called if ioapic == NULL?
Should check for if (RTC_GSI >= IOAPIC_NUM_PINS) here too.

 + spin_lock(ioapic-lock);
 + vector = ioapic-redirtbl[RTC_GSI].fields.vector;
 + if (kvm_apic_pending_eoi(vcpu, vector)) {
 + __set_bit(vcpu-vcpu_id, ioapic-rtc_status.dest_map);
 + ioapic-rtc_status.pending_eoi++;
The bit may have been set already. The logic should be:

new_val = kvm_apic_pending_eoi(vcpu, vector)
old_val = test_bit(vcpu_id, dest_map)

if (new_val == old_val)
return;

if (new_val) {
__set_bit(vcpu_id, dest_map);
pending_eoi++;
} else {
__clear_bit(vcpu_id, dest_map);
pending_eoi--;
}

The naming of above two functions are not good either. Call
them something like kvm_rtc_eoi_tracking_restore_all() and
kvm_rtc_eoi_tracking_restore_one().  And _all should call _one() for
each vcpu. Make __rtc_irq_eoi_tracking_restore_one() that does not
take ioapic lock and call it from kvm_rtc_eoi_tracking_restore_one()
surrounded by locks.
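
Put together, the suggested per-vcpu helper would look roughly like this
(a sketch of the logic described above, not the final patch):

static void __rtc_irq_eoi_tracking_restore_one(struct kvm_vcpu *vcpu)
{
	struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
	int vector = ioapic->redirtbl[RTC_GSI].fields.vector;
	bool new_val = kvm_apic_pending_eoi(vcpu, vector);
	bool old_val = test_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);

	if (new_val == old_val)
		return;

	if (new_val) {
		/* vcpu has the vector pending but is not tracked yet */
		__set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
		ioapic->rtc_status.pending_eoi++;
	} else {
		/* tracked vcpu no longer has the vector pending */
		__clear_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
		ioapic->rtc_status.pending_eoi--;
	}
}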

--
Gleb.


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
   devices on the MMIO bus only get a valid address in write, this
   way we don't need to touch all devices to teach them handle
   an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes
 slower than hypercalls.

Not really. Here's a test:
compare vmcall to portio:

vmcall 1519
...
outl_to_kernel 1745

compare portio to mmio:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So not orders of magnitude.

 Why don't you just create a PV MMIO hypercall
 that the guest can use to invoke MMIO accesses towards the host based
 on physical addresses with explicit length encodings?
 That way you simplify and speed up all code paths, exceeding the speed
 of PIO exits even. It should also be quite easily portable, as all
 other platforms have hypercalls available as well.
 
 
 Alex

I sent such a patch, but maintainers seem reluctant to add hypercalls.
Gleb, could you comment please?

A fast way to do MMIO is probably useful in any case ...

  Alex


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
   devices on the MMIO bus only get a valid address in write, this
   way we don't need to touch all devices to teach them handle
   an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes slower 
 than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
 can use to invoke MMIO accesses towards the host based on physical addresses 
 with explicit length encodings?
 
It is slower, but not an order of magnitude slower. It becomes faster
with newer HW.

 That way you simplify and speed up all code paths, exceeding the speed of PIO 
 exits even. It should also be quite easily portable, as all other platforms 
 have hypercalls available as well.
 
We are trying to avoid PV as much as possible (well this is also PV,
but not guest visible). We haven't replaced PIO with a hypercall for the
same reason. My hope is that future HW will provide us with instruction
decode for the basic mov instruction, at which point this optimisation can
be dropped. And hypercalls have their own set of problems with Windows
guests. When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
hypercalls. Mixing KVM hypercalls and Hyper-V requires some tricks. It
may also affect WHQLing Windows drivers, since the driver will talk to HW
bypassing Windows interfaces.

--
Gleb.


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:

 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes
 slower than hypercalls.
 
 Not really. Here's a test:
 compare vmcall to portio:
 
 vmcall 1519
 ...
 outl_to_kernel 1745
 
 compare portio to mmio:
 
 mmio-wildcard-eventfd:pci-mem 3529
 mmio-pv-eventfd:pci-mem 1878
 portio-wildcard-eventfd:pci-io 1846
 
 So not orders of magnitude.

https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

Check out page 41. Higher is better (number is number of loop cycles in a 
second). My test system was an AMD Istanbul based box.

 
 Why don't you just create a PV MMIO hypercall
 that the guest can use to invoke MMIO accesses towards the host based
 on physical addresses with explicit length encodings?
 That way you simplify and speed up all code paths, exceeding the speed
 of PIO exits even. It should also be quite easily portable, as all
 other platforms have hypercalls available as well.
 
 
 Alex
 
 I sent such a patch, but maintainers seem reluctant to add hypercalls.
 Gleb, could you comment please?
 
 A fast way to do MMIO is probably useful in any case ...

Yes, but at least according to my numbers optimizing anything that is not 
hcalls is a waste of time.


Alex



Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes
  slower than hypercalls.
  
  Not really. Here's a test:
  compare vmcall to portio:
  
  vmcall 1519
  ...
  outl_to_kernel 1745
  
  compare portio to mmio:
  
  mmio-wildcard-eventfd:pci-mem 3529
  mmio-pv-eventfd:pci-mem 1878
  portio-wildcard-eventfd:pci-io 1846
  
  So not orders of magnitude.
 
 https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
 
 Check out page 41. Higher is better (number is number of loop cycles in a 
 second). My test system was an AMD Istanbul based box.
 
Have you bypassed instruction emulation in your testing?

  
  Why don't you just create a PV MMIO hypercall
  that the guest can use to invoke MMIO accesses towards the host based
  on physical addresses with explicit length encodings?
  That way you simplify and speed up all code paths, exceeding the speed
  of PIO exits even. It should also be quite easily portable, as all
  other platforms have hypercalls available as well.
  
  
  Alex
  
  I sent such a patch, but maintainers seem reluctant to add hypercalls.
  Gleb, could you comment please?
  
  A fast way to do MMIO is probably useful in any case ...
 
 Yes, but at least according to my numbers optimizing anything that is not 
 hcalls is a waste of time.
 
 
 Alex

--
Gleb.


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:08, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes slower 
 than hypercalls. Why don't you just create a PV MMIO hypercall that the 
 guest can use to invoke MMIO accesses towards the host based on physical 
 addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible). We haven't replaced PIO with hypercall for the
 same reason. My hope is that future HW will provide us with instruction
 decode for basic mov instruction at which point this optimisation can be
 dropped.

The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, 
we can drop the capability we expose to the guest.

 And hypercall has its own set of problems with Windows guests.
 When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
 hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It

Can't we simply register a hypercall ID range with Microsoft?

 may also affect WHQLing Windows drivers since driver will talk to HW
 bypassing Windows interfaces.

Then the WHQL'ed driver doesn't support the PV MMIO hcall?


Alex



Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes
  slower than hypercalls.
  
  Not really. Here's a test:
  compare vmcall to portio:
  
  vmcall 1519
  ...
  outl_to_kernel 1745
  
  compare portio to mmio:
  
  mmio-wildcard-eventfd:pci-mem 3529
  mmio-pv-eventfd:pci-mem 1878
  portio-wildcard-eventfd:pci-io 1846
  
  So not orders of magnitude.
 
 https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
 
 Check out page 41. Higher is better (number is number of loop cycles in a 
 second). My test system was an AMD Istanbul based box.

Wow 2x difference.  The difference seems to be much smaller now. Newer
hardware? Newer software?

  
  Why don't you just create a PV MMIO hypercall
  that the guest can use to invoke MMIO accesses towards the host based
  on physical addresses with explicit length encodings?
  That way you simplify and speed up all code paths, exceeding the speed
  of PIO exits even. It should also be quite easily portable, as all
  other platforms have hypercalls available as well.
  
  
  Alex
  
  I sent such a patch, but maintainers seem reluctant to add hypercalls.
  Gleb, could you comment please?
  
  A fast way to do MMIO is probably useful in any case ...
 
 Yes, but at least according to my numbers optimizing anything that is not 
 hcalls is a waste of time.
 
 
 Alex

This was the implementation: 'kvm_para: add mmio word store hypercall'.
Again, I don't mind.

-- 
MST


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:19, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes
 slower than hypercalls.
 
 Not really. Here's a test:
 compare vmcall to portio:
 
 vmcall 1519
 ...
 outl_to_kernel 1745
 
 compare portio to mmio:
 
 mmio-wildcard-eventfd:pci-mem 3529
 mmio-pv-eventfd:pci-mem 1878
 portio-wildcard-eventfd:pci-io 1846
 
 So not orders of magnitude.
 
 https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
 
 Check out page 41. Higher is better (number is number of loop cycles in a 
 second). My test system was an AMD Istanbul based box.
 
 Have you bypassed instruction emulation in your testing?

PIO doesn't do instruction emulation.


Alex



Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:08, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes slower 
 than hypercalls. Why don't you just create a PV MMIO hypercall that the 
 guest can use to invoke MMIO accesses towards the host based on physical 
 addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible

Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
comment above its definition indicates that the guest does so, so it is guest 
visible.

+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */


Alex



Re: [PATCH 1/7] ARM: KVM: simplify HYP mapping population

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:13, Christoffer Dall wrote:
 On Tue, Apr 02, 2013 at 02:25:09PM +0100, Marc Zyngier wrote:
 The way we populate HYP mappings is a bit convoluted, to say the least.
 Passing a pointer around to keep track of the current PFN is quite
 odd, and we end-up having two different PTE accessors for no good
 reason.

 Simplify the whole thing by unifying the two PTE accessors, passing
 a pgprot_t around, and moving the various validity checks to the
 upper layers.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/kvm/mmu.c | 100 
 ++---
  1 file changed, 41 insertions(+), 59 deletions(-)

 diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
 index 2f12e40..24811d1 100644
 --- a/arch/arm/kvm/mmu.c
 +++ b/arch/arm/kvm/mmu.c
 @@ -125,54 +125,34 @@ void free_hyp_pmds(void)
  }
  
  static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start,
 -unsigned long end)
 +unsigned long end, unsigned long pfn,
 +pgprot_t prot)
  {
  pte_t *pte;
  unsigned long addr;
 -struct page *page;
  
 -for (addr = start  PAGE_MASK; addr  end; addr += PAGE_SIZE) {
 -unsigned long hyp_addr = KERN_TO_HYP(addr);
 -
 -pte = pte_offset_kernel(pmd, hyp_addr);
 -BUG_ON(!virt_addr_valid(addr));
 -page = virt_to_page(addr);
 -kvm_set_pte(pte, mk_pte(page, PAGE_HYP));
 -}
 -}
 -
 -static void create_hyp_io_pte_mappings(pmd_t *pmd, unsigned long start,
 -   unsigned long end,
 -   unsigned long *pfn_base)
 -{
 -pte_t *pte;
 -unsigned long addr;
 -
 -for (addr = start  PAGE_MASK; addr  end; addr += PAGE_SIZE) {
 -unsigned long hyp_addr = KERN_TO_HYP(addr);
 -
 -pte = pte_offset_kernel(pmd, hyp_addr);
 -BUG_ON(pfn_valid(*pfn_base));
 -kvm_set_pte(pte, pfn_pte(*pfn_base, PAGE_HYP_DEVICE));
 -(*pfn_base)++;
 +for (addr = start; addr  end; addr += PAGE_SIZE) {
 +pte = pte_offset_kernel(pmd, addr);
 +kvm_set_pte(pte, pfn_pte(pfn, prot));
 +pfn++;
  }
  }
  
  static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
 -   unsigned long end, unsigned long *pfn_base)
 +   unsigned long end, unsigned long pfn,
 +   pgprot_t prot)
  {
  pmd_t *pmd;
  pte_t *pte;
  unsigned long addr, next;
  
  for (addr = start; addr  end; addr = next) {
 -unsigned long hyp_addr = KERN_TO_HYP(addr);
 -pmd = pmd_offset(pud, hyp_addr);
 +pmd = pmd_offset(pud, addr);
  
  BUG_ON(pmd_sect(*pmd));
  
  if (pmd_none(*pmd)) {
 -pte = pte_alloc_one_kernel(NULL, hyp_addr);
 +pte = pte_alloc_one_kernel(NULL, addr);
  if (!pte) {
  kvm_err(Cannot allocate Hyp pte\n);
  return -ENOMEM;
 @@ -182,25 +162,17 @@ static int create_hyp_pmd_mappings(pud_t *pud, 
 unsigned long start,
  
  next = pmd_addr_end(addr, end);
  
 -/*
 - * If pfn_base is NULL, we map kernel pages into HYP with the
 - * virtual address. Otherwise, this is considered an I/O
 - * mapping and we map the physical region starting at
 - * *pfn_base to [start, end[.
 - */
 -if (!pfn_base)
 -create_hyp_pte_mappings(pmd, addr, next);
 -else
 -create_hyp_io_pte_mappings(pmd, addr, next, pfn_base);
 +create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
 +pfn += (next - addr)  PAGE_SHIFT;
 
 so this scheme always assumes a physically contiguous memory region,
 and I wasn't sure if we ever wanted to support mapping vmalloc'ed
 regions into Hyp mode, which is why I wrote the code the way it was
 (which goes to your for no good reason comment above).

So let's look at what we're actually mapping:
- kernel code/kmalloc-ed data: this is always physically contiguous
- MMIO: While this is mapped in the vmalloc region, this is actually
physically contiguous.

 I'm fine with assuming that this only works for contiguous regions, but
 I think it deserves a line in the comments on the exported functions
 (the non-IO one anyway).

Sure, I'll add that.
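
Something along these lines, for instance (wording only, assuming the
exported helper keeps its current name):

/*
 * create_hyp_mappings - map a kernel VA range into HYP mode
 *
 * Note: the range is assumed to be backed by physically contiguous
 * memory.  This holds for kernel text/data and kmalloc-ed buffers,
 * but not for arbitrary vmalloc-ed regions.
 */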

  }
  
  return 0;
  }
  
 -static int __create_hyp_mappings(void *from, void *to, unsigned long 
 *pfn_base)
 +static int __create_hyp_mappings(pgd_t *pgdp,
 + unsigned long start, unsigned long end,
 + unsigned long pfn, pgprot_t prot)
  {
 -unsigned long start = (unsigned long)from;
 -unsigned long end = (unsigned long)to;
  pgd_t 

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
  the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed of 
  PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible). We haven't replaced PIO with hypercall for the
  same reason. My hope is that future HW will provide us with instruction
  decode for basic mov instruction at which point this optimisation can be
  dropped.
 
 The same applies to an MMIO hypercall. Once the PV interface becomes 
 obsolete, we can drop the capability we expose to the guest.
 
Disabling it on newer HW is easy, but it is not that simple to get rid of the 
guest code.

  And hypercall has its own set of problems with Windows guests.
  When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
  hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
 
 Can't we simply register a hypercall ID range with Microsoft?
Doubt it. This is not only about the range though. The calling convention
is completely different.

 
  may also affect WHQLing Windows drivers since driver will talk to HW
  bypassing Windows interfaces.
 
 Then the WHQL'ed driver doesn't support the PV MMIO hcall?
 
All Windows drivers have to be WHQL'ed; saying that is like saying that
we do not care about Windows guest virtio speed.

--
Gleb.


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:34, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible). We haven't replaced PIO with hypercall for the
 same reason. My hope is that future HW will provide us with instruction
 decode for basic mov instruction at which point this optimisation can be
 dropped.
 
 The same applies to an MMIO hypercall. Once the PV interface becomes 
 obsolete, we can drop the capability we expose to the guest.
 
 Disable it on newer HW is easy, but it is not that simple to get rid of the 
 guest code.

You need to have guest code that runs with non-PV hcalls anyways, since you 
want to be able to run on older KVM versions. So what's the problem?

 
 And hypercall has its own set of problems with Windows guests.
 When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
 hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
 
 Can't we simply register a hypercall ID range with Microsoft?
 Doubt it. This is not only about the rang though. The calling convention
 is completely different.

There is no calling convention for KVM hypercalls anymore, or is there?
Either way, we would of course just adhere to the MS calling convention. We're 
only talking about register values here...

 
 
 may also affect WHQLing Windows drivers since driver will talk to HW
 bypassing Windows interfaces.
 
 Then the WHQL'ed driver doesn't support the PV MMIO hcall?
 
 All Windows drivers have to be WHQL'ed, saying that is like saying that
 we do not care about Windows guest virtio speed.

I don't mind if a Linux guest is faster than a Windows guest, if that's what 
you're asking.


Alex



Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
  the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed of 
  PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
 
 Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
 comment above its definition indicates that the guest does so, so it is guest 
 visible.
 
QEMU sets it.

 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
The virtio spec will state that access to a kick register needs to be of a
specific length. This is a reasonable thing for HW to ask.
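
A guest kick against such an MMIO doorbell would then be a single store of
the width the spec mandates, e.g. roughly (sketch only; notify_addr is an
assumed ioremap()-ed doorbell address, and 2 bytes mirrors the current PIO
queue notify):

static void kick_queue(void __iomem *notify_addr, u16 queue_index)
{
	iowrite16(queue_index, notify_addr);	/* exactly one 2-byte write */
}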

--
Gleb.


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:38, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible
 
 Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
 The comment above its definition indicates that the guest does so, so it is 
 guest visible.
 
 QEMU sets it.

How does QEMU know?

 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
 Virtio spec will state that access to a kick register needs to be of
 specific length. This is reasonable thing for HW to ask.

This is a spec change. So the guest would have to indicate that it adheres to a 
newer spec. Thus it's a guest visible change.


Alex



Re: [PATCH 2/7] ARM: KVM: fix HYP mapping limitations around zero

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:14, Christoffer Dall wrote:
 On Tue, Apr 02, 2013 at 02:25:10PM +0100, Marc Zyngier wrote:
 The current code for creating HYP mapping doesn't like to wrap
 around zero, which prevents from mapping anything into the last
 page of the virtual address space.

 It doesn't take much effort to remove this limitation, making
 the code more consistent with the rest of the kernel in the process.
 
 This is quite funny, the code used to behave like this, but
 reviewers said that this was old fashioned mm-style code that was
 unwanted and wrapping around zero should be avoided for all costs :)

My original attempt at this series was using the very last page of the
memory, and it took me a while to notice that the code couldn't map
anything in the last page.

Just for this reason, the code needed to be fixed.

 In any case, I welcome the familiarity.

Amen.

M.
-- 
Jazz is not dead. It just smells funny...



Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Paolo Bonzini
Il 04/04/2013 12:40, Michael S. Tsirkin ha scritto:
 Add an option for users to specify PV eventfd
 listeners, and utilize KVM's PV MMIO underneath.
 Upodate all callers.

I would like Avi to comment on this, because I think this is not the
memory-API approved way of doing things.  You need KVM to define its
own AddressSpace, and make KVM's listener use
memory_region_to_address_space to figure out if it is for PV MMIO.

To handle accesses from TCG, the PV AddressSpace can simply have just an
alias to the actual MemoryRegion where the doorbell is.
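
Roughly, the wiring could look like the sketch below; pv_mmio_root and the
helper name are illustrative only, not existing QEMU symbols, and the real
code would live in the KVM/virtio glue:

#include "exec/memory.h"

/* Expose the doorbell to TCG through an alias of the real MemoryRegion,
 * mapped into a KVM-private address space whose listener does the PV
 * MMIO registration.
 */
static void pv_mmio_add_doorbell(MemoryRegion *pv_mmio_root,
                                 MemoryRegion *notify_mr,
                                 hwaddr offset, uint64_t size)
{
    MemoryRegion *alias = g_new0(MemoryRegion, 1);

    memory_region_init_alias(alias, "pv-mmio-doorbell", notify_mr,
                             offset, size);
    memory_region_add_subregion(pv_mmio_root, offset, alias);
}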

Paolo

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  hw/dataplane/hostmem.c|  1 +
  hw/ivshmem.c  |  2 ++
  hw/pci-testdev.c  |  2 ++
  hw/vhost.c|  4 ++--
  hw/virtio-pci.c   |  4 ++--
  include/exec/memory.h | 10 ++
  kvm-all.c | 15 ---
  linux-headers/linux/kvm.h |  8 
  memory.c  |  9 +++--
  9 files changed, 46 insertions(+), 9 deletions(-)
 
 diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
 index 380537e..41d43e7 100644
 --- a/hw/dataplane/hostmem.c
 +++ b/hw/dataplane/hostmem.c
 @@ -126,6 +126,7 @@ static void hostmem_listener_section_dummy(MemoryListener 
 *listener,
  
  static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
 MemoryRegionSection *section,
 +   bool pv,
 bool match_data, uint64_t data,
 EventNotifier *e)
  {
 diff --git a/hw/ivshmem.c b/hw/ivshmem.c
 index 68a2cf2..28e5978 100644
 --- a/hw/ivshmem.c
 +++ b/hw/ivshmem.c
 @@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int 
 posn, int i)
  memory_region_add_eventfd(s-ivshmem_mmio,
DOORBELL,
4,
 +  false,
true,
(posn  16) | i,
s-peers[posn].eventfds[i]);
 @@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int 
 posn, int i)
  memory_region_del_eventfd(s-ivshmem_mmio,
DOORBELL,
4,
 +  false,
true,
(posn  16) | i,
s-peers[posn].eventfds[i]);
 diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
 index 9486624..f0ebf99 100644
 --- a/hw/pci-testdev.c
 +++ b/hw/pci-testdev.c
 @@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
  memory_region_add_eventfd(test-mr,
le32_to_cpu(test-hdr-offset),
test-size,
 +  false,
test-match_data,
test-hdr-data,
test-notifier);
 @@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
  memory_region_del_eventfd(test-mr,
le32_to_cpu(test-hdr-offset),
test-size,
 +  false,
test-match_data,
test-hdr-data,
test-notifier);
 diff --git a/hw/vhost.c b/hw/vhost.c
 index 4d6aee3..9ad7101 100644
 --- a/hw/vhost.c
 +++ b/hw/vhost.c
 @@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
  }
  
  static void vhost_eventfd_add(MemoryListener *listener,
 -  MemoryRegionSection *section,
 +  MemoryRegionSection *section, bool pv,
bool match_data, uint64_t data, EventNotifier 
 *e)
  {
  }
  
  static void vhost_eventfd_del(MemoryListener *listener,
 -  MemoryRegionSection *section,
 +  MemoryRegionSection *section, bool pv,
bool match_data, uint64_t data, EventNotifier 
 *e)
  {
  }
 diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
 index 19965e5..f9ff994 100644
 --- a/hw/virtio-pci.c
 +++ b/hw/virtio-pci.c
 @@ -189,10 +189,10 @@ static int 
 virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
  }
  virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
  memory_region_add_eventfd(proxy-bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
 -  true, n, notifier);
 +  false, true, n, notifier);
  } else {
  memory_region_del_eventfd(proxy-bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
 -  true, n, notifier);
 +  false, true, n, notifier);
  virtio_queue_set_host_notifier_fd_handler(vq, false, false);
 

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
  that the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed 
  of PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
  The comment above its definition indicates that the guest does so, so it 
  is guest visible.
  
  QEMU sets it.
 
 How does QEMU know?
 
Knows what? When to create such eventfd? virtio device knows.

  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this address
  + * are writes of specified length, starting at the specified address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it adheres to 
 a newer spec. Thus it's a guest visible change.
 
There is no virtio spec yet that has a kick register in MMIO. The spec is in
the works AFAIK. Actually, PIO will not be deprecated, and my suggestion
is to move to MMIO only when the PIO address space is exhausted. For PCI that
will be never; for PCI-e it will be after ~16 devices.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:45, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall 
 that the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed 
 of PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible
 
 Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
 The comment above its definition indicates that the guest does so, so it 
 is guest visible.
 
 QEMU sets it.
 
 How does QEMU know?
 
 Knows what? When to create such eventfd? virtio device knows.

Where does it know from?

 
 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
 Virtio spec will state that access to a kick register needs to be of
 specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it adheres 
 to a newer spec. Thus it's a guest visible change.
 
 There is not virtio spec that has kick register in MMIO. The spec is in
 the works AFAIK. Actually PIO will not be deprecated and my suggestion

So the guest would indicate that it supports a newer revision of the spec (in 
your case, that it supports MMIO). How is that any different from exposing that 
it supports a PV MMIO hcall?

 is to move to MMIO only when PIO address space is exhausted. For PCI it
 will be never, for PCI-e it will be after ~16 devices.

Ok, let's go back a step here. Are you actually able to measure any difference in
performance with this patch applied and without it when going through MMIO kicks?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Peter Maydell
On 4 April 2013 11:40, Michael S. Tsirkin m...@redhat.com wrote:
 Add an option for users to specify PV eventfd
 listeners, and utilize KVM's PV MMIO underneath.
 Upodate all callers.

 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 ---
  hw/dataplane/hostmem.c|  1 +
  hw/ivshmem.c  |  2 ++
  hw/pci-testdev.c  |  2 ++
  hw/vhost.c|  4 ++--
  hw/virtio-pci.c   |  4 ++--
  include/exec/memory.h | 10 ++
  kvm-all.c | 15 ---
  linux-headers/linux/kvm.h |  8 
  memory.c  |  9 +++--

Can we keep linux-headers updates in separate patches, please
(with commit messages saying what kernel commit/version the
sync is with)?

thanks
-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
 On Tue, Apr 02, 2013 at 02:25:14PM +0100, Marc Zyngier wrote:
 Our HYP init code suffers from two major design issues:
 - it cannot support CPU hotplug, as we tear down the idmap very early
 - it cannot perform a TLB invalidation when switching from init to
   runtime mappings, as pages are manipulated from PL1 exclusively

 The hotplug problem mandates that we keep two sets of page tables
 (boot and runtime). The TLB problem mandates that we're able to
 transition from one PGD to another while in HYP, invalidating the TLBs
 in the process.

 To be able to do this, we need to share a page between the two page
 tables. A page that will have the same VA in both configurations. All we
 need is a VA that has the following properties:
 - This VA can't be used to represent a kernel mapping.
 - This VA will not conflict with the physical address of the kernel text

 The vectors page seems to satisfy this requirement:
 - The kernel never maps anything else there
 - The kernel text being copied at the beginning of the physical memory,
   it is unlikely to use the last 64kB (I doubt we'll ever support KVM
   on a system with something like 4MB of RAM, but patches are very
   welcome).

 Let's call this VA the trampoline VA.

 Now, we map our init page at 3 locations:
 - idmap in the boot pgd
 - trampoline VA in the boot pgd
 - trampoline VA in the runtime pgd

 The init scenario is now the following:
 - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
   runtime stack, runtime vectors
 - Enable the MMU with the boot pgd
 - Jump to a target into the trampoline page (remember, this is the same
   physical page!)
 - Now switch to the runtime pgd (same VA, and still the same physical
   page!)
 - Invalidate TLBs
 - Set stack and vectors
 - Profit! (or eret, if you only care about the code).
 
 So I'm going to do my usual commenting routine. Would it be an idea to insert
 this commit text (which I really liked, by the way!) into init.S, where
 the current comment is a little lacking given the massive complexity
 this is turning into, madness?

Will do.

[...]

 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 35a463f..b2c6967 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -21,6 +21,7 @@
  #include asm/asm-offsets.h
  #include asm/kvm_asm.h
  #include asm/kvm_arm.h
 +#include asm/kvm_mmu.h

  /
   * Hypervisor initialization
 @@ -47,6 +48,9 @@ __kvm_hyp_init:
   W(b).

  __do_hyp_init:
 + cmp r2, #0  @ We have a SP?
 + bne phase2  @ Yes, second stage init
 +
   @ Set the HTTBR to point to the hypervisor PGD pointer passed
   mcrrp15, 4, r0, r1, c2

 @@ -96,14 +100,35 @@ __do_hyp_init:
   orr r0, r0, r1
   isb
   mcr p15, 4, r0, c1, c0, 0   @ HSCR
 - isb

 - @ Set stack pointer and return to the kernel
 + eret
 
 Could you add some comment here to indicate we're done with phase1, it
 seems like this eret should not go unnoticed by casual readers (ok, they
 shouldn't read this code casually, but anyway..., it will make me sleep
 better)

Sure, no problem.

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Alexander Graf

On 03.04.2013, at 03:57, Scott Wood wrote:

 Enabling this capability connects the vcpu to the designated in-kernel
 MPIC.  Using explicit connections between vcpus and irqchips allows
 for flexibility, but the main benefit at the moment is that it
 simplifies the code -- KVM doesn't need vm-global state to remember
 which MPIC object is associated with this vm, and it doesn't need to
 care about ordering between irqchip creation and vcpu creation.
 
 Signed-off-by: Scott Wood scottw...@freescale.com
 ---
 Documentation/virtual/kvm/api.txt   |8 ++
 arch/powerpc/include/asm/kvm_host.h |8 ++
 arch/powerpc/include/asm/kvm_ppc.h  |2 ++
 arch/powerpc/kvm/booke.c|4 ++-
 arch/powerpc/kvm/mpic.c |   49 +++
 arch/powerpc/kvm/powerpc.c  |   26 +++
 include/uapi/linux/kvm.h|1 +
 7 files changed, 92 insertions(+), 6 deletions(-)
 
 diff --git a/Documentation/virtual/kvm/api.txt 
 b/Documentation/virtual/kvm/api.txt
 index d52f3f9..4c326ae 100644
 --- a/Documentation/virtual/kvm/api.txt
 +++ b/Documentation/virtual/kvm/api.txt
 @@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
 When disabled (args[0] == 0), behavior is as if this facility is unsupported.
 
 When this capability is enabled, KVM_EXIT_EPR can occur.
 +
 +6.6 KVM_CAP_IRQ_MPIC
 +
 +Architectures: ppc
 +Parameters: args[0] is the MPIC device fd
 +args[1] is the MPIC CPU number for this vcpu
 +
 +This capability connects the vcpu to an in-kernel MPIC device.
 diff --git a/arch/powerpc/include/asm/kvm_host.h 
 b/arch/powerpc/include/asm/kvm_host.h
 index 7e7aef9..2a2e235 100644
 --- a/arch/powerpc/include/asm/kvm_host.h
 +++ b/arch/powerpc/include/asm/kvm_host.h
 @@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
   u64 dac[KVMPPC_BOOKE_MAX_DAC];
 };
 
 +#define KVMPPC_IRQ_DEFAULT   0
 +#define KVMPPC_IRQ_MPIC  1
 +
 +struct openpic;
 +
 struct kvm_vcpu_arch {
   ulong host_stack;
   u32 host_pid;
 @@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
   unsigned long magic_page_pa; /* phys addr to map the magic page to */
   unsigned long magic_page_ea; /* effect. addr to map the magic page to */
 
 + int irq_type;   /* one of KVM_IRQ_* */
 + struct openpic *mpic;   /* KVM_IRQ_MPIC */
 +
 #ifdef CONFIG_KVM_BOOK3S_64_HV
   struct kvm_vcpu_arch_shared shregs;
 
 diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
 b/arch/powerpc/include/asm/kvm_ppc.h
 index 3b63b97..f54707f 100644
 --- a/arch/powerpc/include/asm/kvm_ppc.h
 +++ b/arch/powerpc/include/asm/kvm_ppc.h
 @@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, 
 u32 epr)
 }
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
 +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
 +  u32 cpu);
 
 int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 struct kvm_config_tlb *cfg);
 diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
 index cddc6b3..7d00222 100644
 --- a/arch/powerpc/kvm/booke.c
 +++ b/arch/powerpc/kvm/booke.c
 @@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
 *vcpu,
   if (update_epr == true) {
   if (vcpu-arch.epr_flags  KVMPPC_EPR_USER)
   kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
 - else if (vcpu-arch.epr_flags  KVMPPC_EPR_KERNEL)
 + else if (vcpu-arch.epr_flags  KVMPPC_EPR_KERNEL) {
 + BUG_ON(vcpu-arch.irq_type != KVMPPC_IRQ_MPIC);
   kvmppc_mpic_set_epr(vcpu);
 + }
   }
 
   new_msr = msr_mask;
 diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
 index 8cda2fa..caffe3b 100644
 --- a/arch/powerpc/kvm/mpic.c
 +++ b/arch/powerpc/kvm/mpic.c
 @@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, 
 struct irq_dest *dst,
 
 void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
 {
 - struct openpic *opp = vcpu-arch.irqchip_priv;
 + struct openpic *opp = vcpu-arch.mpic;
   int cpu = vcpu-vcpu_id;
   unsigned long flags;
 
 @@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
 
 static void unmap_mmio(struct openpic *opp)
 {
 - BUG_ON(opp-mmio_mapped);
 - opp-mmio_mapped = false;
 -
 - kvm_io_bus_unregister_dev(opp-kvm, KVM_MMIO_BUS, opp-mmio);
 + if (opp-mmio_mapped) {
 + opp-mmio_mapped = false;
 + kvm_io_bus_unregister_dev(opp-kvm, KVM_MMIO_BUS, opp-mmio);
 + }
 }
 
 static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
 @@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
   .release = kvm_mpic_release,
 };
 
 +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
 +  u32 cpu)
 +{
 +   

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:45, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:38, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can 
  look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
  that the guest can use to invoke MMIO accesses towards the host based 
  on physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed 
  of PIO exits even. It should also be quite easily portable, as all 
  other platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets 
  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
  that the guest does so, so it is guest visible.
  
  QEMU sets it.
  
  How does QEMU know?
  
  Knows what? When to create such eventfd? virtio device knows.
 
 Where does it know from?
 
It does it always.

  
  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this 
  address
  + * are writes of specified length, starting at the specified address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
  
  This is a spec change. So the guest would have to indicate that it adheres 
  to a newer spec. Thus it's a guest visible change.
  
  There is not virtio spec that has kick register in MMIO. The spec is in
  the works AFAIK. Actually PIO will not be deprecated and my suggestion
 
 So the guest would indicate that it supports a newer revision of the spec (in 
 your case, that it supports MMIO). How is that any different from exposing 
 that it supports a PV MMIO hcall?
 
Guest will indicate nothing. New driver will use MMIO if PIO is bar is
not configured. All driver will not work for virtio devices with MMIO
bar, but not PIO bar.

  is to move to MMIO only when PIO address space is exhausted. For PCI it
  will be never, for PCI-e it will be after ~16 devices.
 
 Ok, let's go back a step here. Are you actually able to measure any speed in 
 performance with this patch applied and without when going through MMIO kicks?
 
 
That's a question for MST. I think he has only done micro-benchmarks so
far, and he already posted his results here:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So the patch speeds up MMIO by almost 100%, making it almost as fast as PIO.
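
(For context, these look like kvm-unit-tests vmexit-style numbers, i.e. average
cycles per kick. A minimal sketch of how such a measurement works; this is not
the actual test code, and the doorbell pointer is purely illustrative:)

static inline unsigned long long rdtsc_cycles(void)
{
    unsigned int lo, hi;

    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)hi << 32) | lo;
}

/* Average cost, in cycles, of one MMIO kick to a doorbell register. */
static unsigned long long measure_mmio_kick(volatile unsigned short *doorbell,
                                            unsigned long iterations)
{
    unsigned long long start = rdtsc_cycles();
    unsigned long i;

    for (i = 0; i < iterations; i++)
        *doorbell = 1;          /* each store traps to the host */

    return (rdtsc_cycles() - start) / iterations;
}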

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:56, Gleb Natapov wrote:

 On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:45, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can 
 look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall 
 that the guest can use to invoke MMIO accesses towards the host based 
 on physical addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed 
 of PIO exits even. It should also be quite easily portable, as all 
 other platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible
 
 Also, how is this not guest visible? Who sets 
 KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
 that the guest does so, so it is guest visible.
 
 QEMU sets it.
 
 How does QEMU know?
 
 Knows what? When to create such eventfd? virtio device knows.
 
 Where does it know from?
 
 It does it always.
 
 
 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this 
 address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
 Virtio spec will state that access to a kick register needs to be of
 specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it adheres 
 to a newer spec. Thus it's a guest visible change.
 
 There is not virtio spec that has kick register in MMIO. The spec is in
 the works AFAIK. Actually PIO will not be deprecated and my suggestion
 
 So the guest would indicate that it supports a newer revision of the spec 
 (in your case, that it supports MMIO). How is that any different from 
 exposing that it supports a PV MMIO hcall?
 
 Guest will indicate nothing. New driver will use MMIO if PIO is bar is
 not configured. All driver will not work for virtio devices with MMIO
 bar, but not PIO bar.

I can't parse that, sorry :).

 
 is to move to MMIO only when PIO address space is exhausted. For PCI it
 will be never, for PCI-e it will be after ~16 devices.
 
 Ok, let's go back a step here. Are you actually able to measure any speed in 
 performance with this patch applied and without when going through MMIO 
 kicks?
 
 
 That's the question for MST. I think he did only micro benchmarks till
 now and he already posted his result here:
 
 mmio-wildcard-eventfd:pci-mem 3529
 mmio-pv-eventfd:pci-mem 1878
 portio-wildcard-eventfd:pci-io 1846
 
 So the patch speedup mmio by almost 100% and it is almost the same as PIO.

Those numbers don't align at all with what I measured.

MST, could you please do a real world latency benchmark with virtio-net and

  * normal ioeventfd
  * mmio-pv eventfd
  * hcall eventfd

to give us some idea how much performance we would gain from each approach? 
Thoughput should be completely unaffected anyway, since virtio just coalesces 
kicks internally.

I'm also slightly puzzled why the wildcard eventfd mechanism is so
significantly slower here, while it was only a few percent slower on my test
system. What are the numbers you're listing above? Cycles? How many cycles do
you execute per second?


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:56, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:45, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:38, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can 
  look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 
  and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO 
  hypercall that the guest can use to invoke MMIO accesses towards the 
  host based on physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the 
  speed of PIO exits even. It should also be quite easily portable, as 
  all other platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets 
  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
  that the guest does so, so it is guest visible.
  
  QEMU sets it.
  
  How does QEMU know?
  
  Knows what? When to create such eventfd? virtio device knows.
  
  Where does it know from?
  
  It does it always.
  
  
  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this 
  address
  + * are writes of specified length, starting at the specified address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
  
  This is a spec change. So the guest would have to indicate that it 
  adheres to a newer spec. Thus it's a guest visible change.
  
  There is not virtio spec that has kick register in MMIO. The spec is in
  the works AFAIK. Actually PIO will not be deprecated and my suggestion
  
  So the guest would indicate that it supports a newer revision of the spec 
  (in your case, that it supports MMIO). How is that any different from 
  exposing that it supports a PV MMIO hcall?
  
  Guest will indicate nothing. New driver will use MMIO if PIO is bar is
  not configured. All driver will not work for virtio devices with MMIO
  bar, but not PIO bar.
 
 I can't parse that, sorry :).
 
I am sure MST can explain it better, but I'll try one more time.
The device will have two BARs with a kick register: one is PIO, the other is
MMIO. An old driver works only with PIO; a new one supports both. MMIO is used
only when PIO space is exhausted. So an old driver will not be able to drive a
new virtio device that has no PIO BAR configured.
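
(A rough sketch of the driver-side selection described above; the helpers and
the kick_* fields are made up for illustration and do not come from any posted
patch. VIRTIO_PCI_QUEUE_NOTIFY is the legacy notify offset.)

/* Prefer the legacy PIO kick BAR; fall back to the MMIO kick BAR. */
if (pio_bar_present(pci_dev)) {
    vp_dev->kick_is_mmio = false;
    vp_dev->kick_addr    = pio_bar_base(pci_dev) + VIRTIO_PCI_QUEUE_NOTIFY;
} else if (mmio_bar_present(pci_dev)) {
    /* An old driver gives up here: no PIO BAR means no kick register
     * it knows about.
     */
    vp_dev->kick_is_mmio = true;
    vp_dev->kick_addr    = mmio_bar_base(pci_dev) + VIRTIO_PCI_QUEUE_NOTIFY;
} else {
    return -ENODEV;
}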

  
  is to move to MMIO only when PIO address space is exhausted. For PCI it
  will be never, for PCI-e it will be after ~16 devices.
  
  Ok, let's go back a step here. Are you actually able to measure any speed 
  in performance with this patch applied and without when going through MMIO 
  kicks?
  
  
  That's the question for MST. I think he did only micro benchmarks till
  now and he already posted his result here:
  
  mmio-wildcard-eventfd:pci-mem 3529
  mmio-pv-eventfd:pci-mem 1878
  portio-wildcard-eventfd:pci-io 1846
  
  So the patch speedup mmio by almost 100% and it is almost the same as PIO.
 
 Those numbers don't align at all with what I measured.
I am trying to run the vmexit test on AMD now, but something does not work
there. Next week I'll fix it and see how AMD differs, but on Intel those are the
numbers.

 
 MST, could you please do a real world latency 

Re: [PATCH] bookehv: Handle debug exception on guest exit

2013-04-04 Thread Alexander Graf

On 20.03.2013, at 18:45, Bharat Bhushan wrote:

 EPCR.DUVD controls whether the debug events can come in
 hypervisor mode or not. When KVM guest is using the debug
 resource then we do not want debug events to be captured
 in guest entry/exit path. So we set EPCR.DUVD when entering
 and clears EPCR.DUVD when exiting from guest.
 
 Debug instruction complete is a post-completion debug
 exception but debug event gets posted on the basis of MSR
 before the instruction is executed. Now if the instruction
 switches the context from guest mode (MSR.GS = 1) to hypervisor
 mode (MSR.GS = 0) then the xSRR0 points to the first instruction of the
 KVM handler and xSRR1 indicates that MSR.GS is clear
 (hypervisor context). Now as xSRR1.GS is used to decide whether the
 KVM handler will be invoked to handle the exception or the host
 kernel debug handler will be invoked to handle the exception,
 this leads to the host kernel debug handler handling the exception
 which should instead be handled by KVM.
 
 This is tested on e500mc in 32 bit mode
 
 Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
 ---
 v0:
 - Do not apply this change for debug_crit, as we do not know whether those
 chips have the issue or not.
 - corrected 64bit case branching
 
 arch/powerpc/kernel/exceptions-64e.S |   29 -
 arch/powerpc/kernel/head_booke.h |   26 ++
 2 files changed, 54 insertions(+), 1 deletions(-)
 
 diff --git a/arch/powerpc/kernel/exceptions-64e.S 
 b/arch/powerpc/kernel/exceptions-64e.S
 index 4684e33..8b26294 100644
 --- a/arch/powerpc/kernel/exceptions-64e.S
 +++ b/arch/powerpc/kernel/exceptions-64e.S
 @@ -516,6 +516,33 @@ kernel_dbg_exc:
   andis.  r15,r14,DBSR_IC@h
   beq+1f
 
 +#ifdef CONFIG_KVM_BOOKE_HV
 + /*
 +  * EPCR.DUVD controls whether the debug events can come in
 +  * hypervisor mode or not. When KVM guest is using the debug
 +  * resource then we do not want debug events to be captured
 +  * in guest entry/exit path. So we set EPCR.DUVD when entering
 +  * and clears EPCR.DUVD when exiting from guest.
 +  * Debug instruction complete is a post-completion debug
 +  * exception but debug event gets posted on the basis of MSR
 +  * before the instruction is executed. Now if the instruction
 +  * switches the context from guest mode (MSR.GS = 1) to hypervisor
 +  * mode (MSR.GS = 0) then the xSRR0 points to first instruction of

Can't we just execute that code path with MSR.DE=0?


Alex

 +  * KVM handler and xSRR1 points that MSR.GS is clear
 +  * (hypervisor context). Now as xSRR1.GS is used to decide whether
 +  * KVM handler will be invoked to handle the exception or host
 +  * host kernel debug handler will be invoked to handle the exception.
 +  * This leads to host kernel debug handler handling the exception
 +  * which should either be handled by KVM.
 +  */
 + mfspr   r10, SPRN_EPCR
 + andis.  r10,r10,SPRN_EPCR_DUVD@h
 + beq+2f
 +
 + andis.  r10,r9,MSR_GS@h
 + beq+3f
 +2:
 +#endif
   LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
   LOAD_REG_IMMEDIATE(r15,interrupt_end_book3e)
   cmpld   cr0,r10,r14
 @@ -523,7 +550,7 @@ kernel_dbg_exc:
   blt+cr0,1f
   bge+cr1,1f
 
 - /* here it looks like we got an inappropriate debug exception. */
 +3:   /* here it looks like we got an inappropriate debug exception. */
   lis r14,DBSR_IC@h   /* clear the IC event */
   rlwinm  r11,r11,0,~MSR_DE   /* clear DE in the DSRR1 value */
   mtspr   SPRN_DBSR,r14
 diff --git a/arch/powerpc/kernel/head_booke.h 
 b/arch/powerpc/kernel/head_booke.h
 index 5f051ee..edc6a3b 100644
 --- a/arch/powerpc/kernel/head_booke.h
 +++ b/arch/powerpc/kernel/head_booke.h
 @@ -285,7 +285,33 @@ label:
   mfspr   r10,SPRN_DBSR;  /* check single-step/branch taken */  \
   andis.  r10,r10,(DBSR_IC|DBSR_BT)@h;  \
   beq+2f;   \
 +#ifdef CONFIG_KVM_BOOKE_HV \
 + /*\
 +  * EPCR.DUVD controls whether the debug events can come in\
 +  * hypervisor mode or not. When KVM guest is using the debug  \
 +  * resource then we do not want debug events to be captured   \
 +  * in guest entry/exit path. So we set EPCR.DUVD when entering\
 +  * and clears EPCR.DUVD when exiting from guest.  \
 +  * Debug instruction complete is a post-completion debug  \
 +  * exception but debug event gets posted on the basis of MSR  \
 +  * before the instruction is executed. Now if the instruction \
 +  * switches the context from guest mode (MSR.GS = 1) to hypervisor\
 +  * mode (MSR.GS = 0) then the xSRR0 points to first 

RE: [PATCH 1/7 v2] KVM: PPC: e500: Expose MMU registers via ONE_REG

2013-04-04 Thread Caraman Mihai Claudiu-B02008
 -Original Message-
 From: Wood Scott-B07421
 Sent: Wednesday, March 27, 2013 12:43 AM
 To: Caraman Mihai Claudiu-B02008
 Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de
 Subject: Re: [PATCH 1/7 v2] KVM: PPC: e500: Expose MMU registers via
 ONE_REG
 
 On 03/26/2013 05:05:06 PM, Mihai Caraman wrote:
  +int kvmppc_set_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
  +  union kvmppc_one_reg *val)
  +{
  +   int r = 0;
  +   long int i;
  +
  +   switch (id) {
  +   case KVM_REG_PPC_MAS0:
  +   vcpu-arch.shared-mas0 = set_reg_val(id, *val);
  +   break;
  +   case KVM_REG_PPC_MAS1:
  +   vcpu-arch.shared-mas1 = set_reg_val(id, *val);
  +   break;
  +   case KVM_REG_PPC_MAS2:
  +   vcpu-arch.shared-mas2 = set_reg_val(id, *val);
  +   break;
  +   case KVM_REG_PPC_MAS7_3:
  +   vcpu-arch.shared-mas7_3 = set_reg_val(id, *val);
  +   break;
  +   case KVM_REG_PPC_MAS4:
  +   vcpu-arch.shared-mas4 = set_reg_val(id, *val);
  +   break;
  +   case KVM_REG_PPC_MAS6:
  +   vcpu-arch.shared-mas6 = set_reg_val(id, *val);
  +   break;
  +   /* Only allow MMU registers to be set to the config supported
  by KVM */
  +   case KVM_REG_PPC_MMUCFG: {
  +   if (set_reg_val(id, *val) != vcpu-arch.mmucfg)
  +   r = -EINVAL;
  +   break;
  +   }
  +   case KVM_REG_PPC_TLB0CFG:
  +   case KVM_REG_PPC_TLB1CFG:
  +   case KVM_REG_PPC_TLB2CFG:
  +   case KVM_REG_PPC_TLB3CFG: {
  +   /* MMU geometry (N_ENTRY/ASSOC) can be set only using
  SW_TLB */
  +   i = id - KVM_REG_PPC_TLB0CFG;
  +   if (set_reg_val(id, *val) != vcpu-arch.tlbcfg[i])
  +   r = -EINVAL;
  +   break;
  +   }
 
 Am I the only one that finds the set_reg_val/get_reg_val naming
 confusing?  At first glance it looks like it sets TLBnCFG and then
 later tests whether it should have. :-P
 
 Functions should be named for what they do, not for what context you
 use them in.

I also found the existing code difficult to read. If we stick to the reg and
val variable names, what about val_to_reg/reg_to_val?
 
-Mike

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
  the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed of 
  PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible). We haven't replaced PIO with hypercall for the
  same reason. My hope is that future HW will provide us with instruction
  decode for basic mov instruction at which point this optimisation can be
  dropped.
 
 The same applies to an MMIO hypercall. Once the PV interface becomes 
 obsolete, we can drop the capability we expose to the guest.

Yes but unlike a hypercall this optimization does not need special code
in the guest. You can use standard OS interfaces for memory access.

  And hypercall has its own set of problems with Windows guests.
  When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
  hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
 
 Can't we simply register a hypercall ID range with Microsoft?
 
  may also affect WHQLing Windows drivers since driver will talk to HW
  bypassing Windows interfaces.
 
 Then the WHQL'ed driver doesn't support the PV MMIO hcall?
 
 
 Alex


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:

 On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
 way we don't need to touch all devices to teach them handle
 an dinvalid length)
 
 At the moment, this optimization is only supported for EPT on x86 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
 It is slower, but not an order of magnitude slower. It become faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible). We haven't replaced PIO with hypercall for the
 same reason. My hope is that future HW will provide us with instruction
 decode for basic mov instruction at which point this optimisation can be
 dropped.
 
 The same applies to an MMIO hypercall. Once the PV interface becomes 
 obsolete, we can drop the capability we expose to the guest.
 
 Yes but unlike a hypercall this optimization does not need special code
 in the guest. You can use standard OS interfaces for memory access.

Yes, but let's try to understand the room for optimization we're talking about
here. The normal MMIO case looks far slower in MST's benchmarks than
it did in mine. So maybe we're really just looking at a bug here.

Also, if hcalls again cost only 50% of a fast MMIO callback, it's certainly
worth checking out how much room for improvement we're giving up.


Alex

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
  the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed of 
  PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
 
 Also, how is this not guest visible?

It's visible, but it does not require special code in the guest, only in
qemu. The guest can use standard iowritel/w/b.

 Who sets
 KVM_IOEVENTFD_FLAG_PV_MMIO?

It's an ioctl so qemu does this.

 The comment above its definition indicates
 that the guest does so, so it is guest visible.
 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
 
 Alex

The requirement is to access a specific address only with a single aligned
instruction.  For example, only aligned single-word accesses are
allowed.

This is standard practice with many devices; in fact, the
virtio spec already requires this for notifications.
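
(For example, the legacy Linux virtio_pci driver already kicks with one aligned
16-bit store, roughly along these lines; the field names here are from memory
and should be treated as illustrative:)

/* One aligned 2-byte write to the notify register; never split,
 * never issued with a different access width.
 */
iowrite16(vq_index, vp_dev->ioaddr + VIRTIO_PCI_QUEUE_NOTIFY);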


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitudes 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
  that the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It become faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed 
  of PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
  The comment above its definition indicates that the guest does so, so it 
  is guest visible.
  
  QEMU sets it.
 
 How does QEMU know?
 
  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this address
  + * are writes of specified length, starting at the specified address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it adheres to 
 a newer spec. Thus it's a guest visible change.
 
 
 Alex

To use MMIO is a spec change, yes.
What Gleb refers to is that the optimization is not guest visible.
If you have hardware that can quickly decode MMIO you can
drop the PV ioctl from qemu without changing guests.


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:46:11PM +0200, Paolo Bonzini wrote:
 Il 04/04/2013 12:40, Michael S. Tsirkin ha scritto:
  Add an option for users to specify PV eventfd
  listeners, and utilize KVM's PV MMIO underneath.
  Upodate all callers.
 
 I would like Avi to comment on this, because I think this is not the
 memory-API approved way of doing things.  You need KVM to define its
 own AddressSpace, and make KVM's listener use
 memory_region_to_address_space to figure out if it is for PV MMIO.
 
 To handle accesses from TCG, the PV AddressSpace can simply have just an
 alias to the actual MemoryRegion where the doorbell is.
 
 Paolo

This is not really different from other eventfd flags like datamatch.
A separate address space is not appropriate here.
This is regular memory space; the KVM eventfd simply has a flag that says
the guest is well behaved, so please make the eventfd go faster.
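
(Concretely, on the qemu side this is just one more flag on the existing
ioeventfd registration. A sketch, with the flag name taken from the RFC kernel
patch and the doorbell address, notifier, and vm_fd made up:)

struct kvm_ioeventfd kick = {
    .addr  = doorbell_gpa,                  /* guest-physical address of the kick register */
    .len   = 2,                             /* the single access width the guest promises */
    .fd    = event_notifier_get_fd(&notifier),
    .flags = KVM_IOEVENTFD_FLAG_PV_MMIO,    /* proposed flag; no DATAMATCH, no PIO */
};

ioctl(vm_fd, KVM_IOEVENTFD, &kick);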

  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
   hw/dataplane/hostmem.c|  1 +
   hw/ivshmem.c  |  2 ++
   hw/pci-testdev.c  |  2 ++
   hw/vhost.c|  4 ++--
   hw/virtio-pci.c   |  4 ++--
   include/exec/memory.h | 10 ++
   kvm-all.c | 15 ---
   linux-headers/linux/kvm.h |  8 
   memory.c  |  9 +++--
   9 files changed, 46 insertions(+), 9 deletions(-)
  
  diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
  index 380537e..41d43e7 100644
  --- a/hw/dataplane/hostmem.c
  +++ b/hw/dataplane/hostmem.c
  @@ -126,6 +126,7 @@ static void 
  hostmem_listener_section_dummy(MemoryListener *listener,
   
   static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
  MemoryRegionSection *section,
  +   bool pv,
  bool match_data, uint64_t data,
  EventNotifier *e)
   {
  diff --git a/hw/ivshmem.c b/hw/ivshmem.c
  index 68a2cf2..28e5978 100644
  --- a/hw/ivshmem.c
  +++ b/hw/ivshmem.c
  @@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int 
  posn, int i)
   memory_region_add_eventfd(s-ivshmem_mmio,
 DOORBELL,
 4,
  +  false,
 true,
 (posn  16) | i,
 s-peers[posn].eventfds[i]);
  @@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int 
  posn, int i)
   memory_region_del_eventfd(s-ivshmem_mmio,
 DOORBELL,
 4,
  +  false,
 true,
 (posn  16) | i,
 s-peers[posn].eventfds[i]);
  diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
  index 9486624..f0ebf99 100644
  --- a/hw/pci-testdev.c
  +++ b/hw/pci-testdev.c
  @@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
   memory_region_add_eventfd(test-mr,
 le32_to_cpu(test-hdr-offset),
 test-size,
  +  false,
 test-match_data,
 test-hdr-data,
 test-notifier);
  @@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
   memory_region_del_eventfd(test-mr,
 le32_to_cpu(test-hdr-offset),
 test-size,
  +  false,
 test-match_data,
 test-hdr-data,
 test-notifier);
  diff --git a/hw/vhost.c b/hw/vhost.c
  index 4d6aee3..9ad7101 100644
  --- a/hw/vhost.c
  +++ b/hw/vhost.c
  @@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev 
  *dev,
   }
   
   static void vhost_eventfd_add(MemoryListener *listener,
  -  MemoryRegionSection *section,
  +  MemoryRegionSection *section, bool pv,
 bool match_data, uint64_t data, 
  EventNotifier *e)
   {
   }
   
   static void vhost_eventfd_del(MemoryListener *listener,
  -  MemoryRegionSection *section,
  +  MemoryRegionSection *section, bool pv,
 bool match_data, uint64_t data, 
  EventNotifier *e)
   {
   }
  diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
  index 19965e5..f9ff994 100644
  --- a/hw/virtio-pci.c
  +++ b/hw/virtio-pci.c
  @@ -189,10 +189,10 @@ static int 
  virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
   }
   virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
   memory_region_add_eventfd(proxy-bar, 

Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
 On 4 April 2013 11:40, Michael S. Tsirkin m...@redhat.com wrote:
  Add an option for users to specify PV eventfd
  listeners, and utilize KVM's PV MMIO underneath.
  Upodate all callers.
 
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  ---
   hw/dataplane/hostmem.c|  1 +
   hw/ivshmem.c  |  2 ++
   hw/pci-testdev.c  |  2 ++
   hw/vhost.c|  4 ++--
   hw/virtio-pci.c   |  4 ++--
   include/exec/memory.h | 10 ++
   kvm-all.c | 15 ---
   linux-headers/linux/kvm.h |  8 
   memory.c  |  9 +++--
 
 Can we keep linux-headers updates in separate patches, please
 (with commit messages saying what kernel commit/version the
 sync is with)?
 
 thanks
 -- PMM

OK, first it needs to get into linux though :).

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:56:49PM +0300, Gleb Natapov wrote:
 On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:45, Gleb Natapov wrote:
  
   On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 14:38, Gleb Natapov wrote:
   
   On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 14:08, Gleb Natapov wrote:
   
   On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
   
   With KVM, MMIO is much slower than PIO, due to the need to
   do page walk and emulation. But with EPT, it does not have to be: we
   know the address from the VMCS so if the address is unique, we can 
   look
   up the eventfd directly, bypassing emulation.
   
   Add an interface for userspace to specify this per-address, we can
   use this e.g. for virtio.
   
   The implementation adds a separate bus internally. This serves two
   purposes:
   - minimize overhead for old userspace that does not use PV MMIO
   - minimize disruption in other code (since we don't know the length,
   devices on the MMIO bus only get a valid address in write, this
   way we don't need to touch all devices to teach them handle
   an dinvalid length)
   
   At the moment, this optimization is only supported for EPT on x86 
   and
   silently ignored for NPT and MMU, so everything works correctly but
   slowly.
   
   TODO: NPT, MMU and non x86 architectures.
   
   The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
   pre-review and suggestions.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   
   This still uses page fault intercepts which are orders of magnitudes 
   slower than hypercalls. Why don't you just create a PV MMIO 
   hypercall that the guest can use to invoke MMIO accesses towards the 
   host based on physical addresses with explicit length encodings?
   
   It is slower, but not an order of magnitude slower. It become faster
   with newer HW.
   
   That way you simplify and speed up all code paths, exceeding the 
   speed of PIO exits even. It should also be quite easily portable, as 
   all other platforms have hypercalls available as well.
   
   We are trying to avoid PV as much as possible (well this is also PV,
   but not guest visible
   
   Also, how is this not guest visible? Who sets 
   KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
   that the guest does so, so it is guest visible.
   
   QEMU sets it.
   
   How does QEMU know?
   
   Knows what? When to create such eventfd? virtio device knows.
  
  Where does it know from?
  
 It does it always.
 
   
   
   +/*
   + * PV_MMIO - Guest can promise us that all accesses touching this 
   address
   + * are writes of specified length, starting at the specified address.
   + * If not - it's a Guest bug.
   + * Can not be used together with either PIO or DATAMATCH.
   + */
   
   Virtio spec will state that access to a kick register needs to be of
   specific length. This is reasonable thing for HW to ask.
   
   This is a spec change. So the guest would have to indicate that it 
   adheres to a newer spec. Thus it's a guest visible change.
   
   There is not virtio spec that has kick register in MMIO. The spec is in
   the works AFAIK. Actually PIO will not be deprecated and my suggestion
  
  So the guest would indicate that it supports a newer revision of the spec 
  (in your case, that it supports MMIO). How is that any different from 
  exposing that it supports a PV MMIO hcall?
  
 Guest will indicate nothing. New driver will use MMIO if PIO is bar is
 not configured. All driver will not work for virtio devices with MMIO
 bar, but not PIO bar.
 
   is to move to MMIO only when PIO address space is exhausted. For PCI it
   will be never, for PCI-e it will be after ~16 devices.
  
  Ok, let's go back a step here. Are you actually able to measure any speed 
  in performance with this patch applied and without when going through MMIO 
  kicks?
  
  
 That's the question for MST. I think he did only micro benchmarks till
 now and he already posted his result here:
 
 mmio-wildcard-eventfd:pci-mem 3529
 mmio-pv-eventfd:pci-mem 1878
 portio-wildcard-eventfd:pci-io 1846
 
  So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.

Exactly. I sent patches for kvm unittest so you can try it yourself.
At the moment you need a box with EPT to try this.

 --
   Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Peter Maydell
On 4 April 2013 14:11, Michael S. Tsirkin m...@redhat.com wrote:
 On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
 Can we keep linux-headers updates in separate patches, please
 (with commit messages saying what kernel commit/version the
 sync is with)?

 OK, first it needs to get into linux though :).

In that case you need an RFC tag on your patches :-)

-- PMM
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:23:43PM +0100, Peter Maydell wrote:
 On 4 April 2013 14:11, Michael S. Tsirkin m...@redhat.com wrote:
  On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
  Can we keep linux-headers updates in separate patches, please
  (with commit messages saying what kernel commit/version the
  sync is with)?
 
  OK, first it needs to get into linux though :).
 
 In that case you need an RFC tag on your patches :-)
 
 -- PMM

The cover letter said patches 5 and 6 are RFC,
but patches 1-4 are ready and should go in I think.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:56, Gleb Natapov wrote:
 
  On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:45, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:38, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can 
  look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
   way we don't need to touch all devices to teach them to handle
   an invalid length)
  
  At the moment, this optimization is only supported for EPT on x86 
  and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
   This still uses page fault intercepts which are orders of magnitude 
  slower than hypercalls. Why don't you just create a PV MMIO 
  hypercall that the guest can use to invoke MMIO accesses towards the 
  host based on physical addresses with explicit length encodings?
  
   It is slower, but not an order of magnitude slower. It becomes faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the 
  speed of PIO exits even. It should also be quite easily portable, as 
  all other platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets 
  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
  that the guest does so, so it is guest visible.
  
  QEMU sets it.
  
  How does QEMU know?
  
  Knows what? When to create such eventfd? virtio device knows.
  
  Where does it know from?
  
  It does it always.
  
  
  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this 
  address
  + * are writes of specified length, starting at the specified address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
  
  This is a spec change. So the guest would have to indicate that it 
  adheres to a newer spec. Thus it's a guest visible change.
  
  There is not virtio spec that has kick register in MMIO. The spec is in
  the works AFAIK. Actually PIO will not be deprecated and my suggestion
  
  So the guest would indicate that it supports a newer revision of the spec 
  (in your case, that it supports MMIO). How is that any different from 
  exposing that it supports a PV MMIO hcall?
  
  Guest will indicate nothing. New driver will use MMIO if PIO is bar is
  not configured. All driver will not work for virtio devices with MMIO
  bar, but not PIO bar.
 
 I can't parse that, sorry :).

It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
QEMU tells KVM which address driver uses, to make exits faster.  This is not
different from how eventfd works.  For example if exits to QEMU suddenly become
very cheap we can remove eventfd completely.
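
For illustration, a minimal sketch of how userspace could register such an eventfd through the existing KVM_IOEVENTFD ioctl (KVM_IOEVENTFD_FLAG_PV_MMIO is the flag proposed in this RFC, so it only exists with the patched headers; names and error handling are kept to a bare minimum):

/* Sketch only: register a "PV MMIO" ioeventfd for a device notify register.
 * Assumes vm_fd is an open KVM VM fd; KVM_IOEVENTFD_FLAG_PV_MMIO comes from
 * the headers patched by this RFC, not from mainline. */
#include <linux/kvm.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>

static int register_pv_mmio_notify(int vm_fd, __u64 notify_gpa, __u32 len)
{
        int efd = eventfd(0, EFD_CLOEXEC);       /* signalled on guest writes */
        struct kvm_ioeventfd io = {
                .addr  = notify_gpa,             /* guest-physical address of the register */
                .len   = len,                    /* access size the driver promises to use */
                .fd    = efd,
                .flags = KVM_IOEVENTFD_FLAG_PV_MMIO, /* no PIO, no DATAMATCH */
        };

        if (efd < 0 || ioctl(vm_fd, KVM_IOEVENTFD, &io) < 0)
                return -1;
        return efd;                              /* poll this fd to receive kicks */
}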

  
  is to move to MMIO only when PIO address space is exhausted. For PCI it
  will be never, for PCI-e it will be after ~16 devices.
  
  Ok, let's go back a step here. Are you actually able to measure any speed 
  in performance with this patch applied and without when going through MMIO 
  kicks?
  
  
  That's the question for MST. I think he did only micro benchmarks till
  now and he already posted his result here:
  
  mmio-wildcard-eventfd:pci-mem 3529
  mmio-pv-eventfd:pci-mem 1878
  portio-wildcard-eventfd:pci-io 1846
  
   So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
 
 Those numbers don't align at all with what I measured.

Yep. But why?
Could be a different hardware. My laptop is i7, what did you measure on?
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 42
model name  : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
stepping: 7

KVM Guest Lock up (100%) again!

2013-04-04 Thread Phil Daws
One of my KVM guests locked up again at 100% CPU!  Any thoughts on how I can 
diagnose it? We would love to put it into production but I am very concerned about 
the current stability. I have tried to redirect the console, through screen, 
to see whether there is a spinlock that is causing the problem; though all I 
got in the log file was a login prompt. What is the correct way of redirecting 
the console in KVM on a CentOS 6.4 system, please? Thanks.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:02:57PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:
 
  On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: we
  know the address from the VMCS so if the address is unique, we can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them to handle
  an invalid length)
  
  At the moment, this optimization is only supported for EPT on x86 and
  silently ignored for NPT and MMU, so everything works correctly but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
  This still uses page fault intercepts which are orders of magnitude 
  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
  that the guest can use to invoke MMIO accesses towards the host based on 
  physical addresses with explicit length encodings?
  
  It is slower, but not an order of magnitude slower. It becomes faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the speed 
  of PIO exits even. It should also be quite easily portable, as all other 
  platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible). We haven't replaced PIO with hypercall for the
  same reason. My hope is that future HW will provide us with instruction
  decode for basic mov instruction at which point this optimisation can be
  dropped.
  
  The same applies to an MMIO hypercall. Once the PV interface becomes 
  obsolete, we can drop the capability we expose to the guest.
  
  Yes but unlike a hypercall this optimization does not need special code
  in the guest. You can use standard OS interfaces for memory access.
 
 Yes, but let's try to understand the room for optimization we're
 talking about here. The normal MMIO case seems excessively slower in
 MST's benchmarks than it did in mine. So maybe we're really just
 looking at a bug here.

Could be. I posted the code (kvm,qemu and test) so please review and try
to spot a bug.

 Also, if hcalls are again only 50% of a fast MMIO callback, it's
 certainly worth checking out what room for improvement we're really
 wasting.
 
 
 Alex

Take a look at 'kvm: pci PORT IO MMIO and PV MMIO speed tests'.
Try running the test on your hardware and see what happens.
Or post the test you used, I can try it on my box if you like.

-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Paolo Bonzini
Il 04/04/2013 15:10, Michael S. Tsirkin ha scritto:
  I would like Avi to comment on this, because I think this is not the
  memory-API approved way of doing things.  You need KVM to define its
  own AddressSpace, and make KVM's listener use
  memory_region_to_address_space to figure out if it is for PV MMIO.
  
  To handle accesses from TCG, the PV AddressSpace can simply have just an
  alias to the actual MemoryRegion where the doorbell is.
 
 This is not really different from other eventfd flags like datamatch.
 Separate address space is not appropriate here.
 This is regular memory space, KVM eventfd simply has a flag that says
 guest is well behaved so please make eventfd go faster.

Having a separate address space would match what you do in the kernel
though.

I just don't like functions with a dozen arguments...  Can you just make
datamatch and pv a single flags argument, as is the case with the KVM API?

Paolo

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH] bookehv: Handle debug exception on guest exit

2013-04-04 Thread Bhushan Bharat-R65777


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Thursday, April 04, 2013 6:55 PM
 To: Bhushan Bharat-R65777
 Cc: linuxppc-...@lists.ozlabs.org; kvm@vger.kernel.org; 
 kvm-...@vger.kernel.org;
 Wood Scott-B07421; Bhushan Bharat-R65777
 Subject: Re: [PATCH] bookehv: Handle debug exception on guest exit
 
 
 On 20.03.2013, at 18:45, Bharat Bhushan wrote:
 
  EPCR.DUVD controls whether debug events can come in hypervisor
  mode or not. When a KVM guest is using the debug resources, we do not
  want debug events to be captured in the guest entry/exit path. So we
  set EPCR.DUVD when entering and clear EPCR.DUVD when exiting the guest.
 
  Debug instruction complete is a post-completion debug exception, but
  the debug event gets posted on the basis of the MSR before the
  instruction is executed. Now if the instruction switches the context
  from guest mode (MSR.GS = 1) to hypervisor mode (MSR.GS = 0), then
  xSRR0 points to the first instruction of the KVM handler and xSRR1
  indicates that MSR.GS is clear (hypervisor context). Since xSRR1.GS is
  used to decide whether the KVM handler or the host kernel debug handler
  is invoked to handle the exception, this leads to the host kernel debug
  handler handling an exception which should instead be handled by KVM.
 
  This is tested on e500mc in 32 bit mode
 
  Signed-off-by: Bharat Bhushan bharat.bhus...@freescale.com
  ---
  v0:
  - Do not apply this change for debug_crit as we do not know those chips have
 issue or not.
  - corrected 64bit case branching
 
  arch/powerpc/kernel/exceptions-64e.S |   29 -
  arch/powerpc/kernel/head_booke.h |   26 ++
  2 files changed, 54 insertions(+), 1 deletions(-)
 
  diff --git a/arch/powerpc/kernel/exceptions-64e.S
  b/arch/powerpc/kernel/exceptions-64e.S
  index 4684e33..8b26294 100644
  --- a/arch/powerpc/kernel/exceptions-64e.S
  +++ b/arch/powerpc/kernel/exceptions-64e.S
  @@ -516,6 +516,33 @@ kernel_dbg_exc:
  andis.  r15,r14,DBSR_IC@h
  beq+1f
 
  +#ifdef CONFIG_KVM_BOOKE_HV
  +   /*
  +* EPCR.DUVD controls whether the debug events can come in
  +* hypervisor mode or not. When KVM guest is using the debug
  +* resource then we do not want debug events to be captured
  +* in guest entry/exit path. So we set EPCR.DUVD when entering
   +* and clear EPCR.DUVD when exiting from the guest.
  +* Debug instruction complete is a post-completion debug
  +* exception but debug event gets posted on the basis of MSR
  +* before the instruction is executed. Now if the instruction
  +* switches the context from guest mode (MSR.GS = 1) to hypervisor
  +* mode (MSR.GS = 0) then the xSRR0 points to first instruction of
 
 Can't we just execute that code path with MSR.DE=0?

Single stepping uses DBCR0.IC (instruction complete).
Can you describe how MSR.DE = 0 will work?

 
 
 Alex
 
   +* KVM handler and xSRR1 indicates that MSR.GS is clear
   +* (hypervisor context). Since xSRR1.GS is used to decide whether
   +* the KVM handler or the host kernel debug handler will be
   +* invoked to handle the exception, this leads to the host kernel
   +* debug handler handling an exception which should instead be
   +* handled by KVM.
  +*/
  +   mfspr   r10, SPRN_EPCR
  +   andis.  r10,r10,SPRN_EPCR_DUVD@h
  +   beq+2f
  +
  +   andis.  r10,r9,MSR_GS@h
  +   beq+3f
  +2:
  +#endif
  LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
  LOAD_REG_IMMEDIATE(r15,interrupt_end_book3e)
  cmpld   cr0,r10,r14
  @@ -523,7 +550,7 @@ kernel_dbg_exc:
  blt+cr0,1f
  bge+cr1,1f
 
  -   /* here it looks like we got an inappropriate debug exception. */
  +3: /* here it looks like we got an inappropriate debug exception. */
  lis r14,DBSR_IC@h   /* clear the IC event */
  rlwinm  r11,r11,0,~MSR_DE   /* clear DE in the DSRR1 value */
  mtspr   SPRN_DBSR,r14
  diff --git a/arch/powerpc/kernel/head_booke.h
  b/arch/powerpc/kernel/head_booke.h
  index 5f051ee..edc6a3b 100644
  --- a/arch/powerpc/kernel/head_booke.h
  +++ b/arch/powerpc/kernel/head_booke.h
  @@ -285,7 +285,33 @@ label:
  mfspr   r10,SPRN_DBSR;  /* check single-step/branch taken */  \
  andis.  r10,r10,(DBSR_IC|DBSR_BT)@h;  \
  beq+2f;   \
  +#ifdef CONFIG_KVM_BOOKE_HV   \
  +   /*\
  +* EPCR.DUVD controls whether the debug events can come in\
  +* hypervisor mode or not. When KVM guest is using the debug  \
  +* resource then we do not want debug events to be captured   \
  +* in guest entry/exit path. So we set EPCR.DUVD when entering\
  +* and clears 

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
 On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:56, Gleb Natapov wrote:
  
   On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 14:45, Gleb Natapov wrote:
   
   On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 14:38, Gleb Natapov wrote:
   
   On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 14:08, Gleb Natapov wrote:
   
   On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
   
   On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
   
   With KVM, MMIO is much slower than PIO, due to the need to
   do page walk and emulation. But with EPT, it does not have to be: 
   we
   know the address from the VMCS so if the address is unique, we 
   can look
   up the eventfd directly, bypassing emulation.
   
   Add an interface for userspace to specify this per-address, we can
   use this e.g. for virtio.
   
   The implementation adds a separate bus internally. This serves two
   purposes:
   - minimize overhead for old userspace that does not use PV MMIO
   - minimize disruption in other code (since we don't know the 
   length,
   devices on the MMIO bus only get a valid address in write, this
    way we don't need to touch all devices to teach them to handle
    an invalid length)
   
   At the moment, this optimization is only supported for EPT on x86 
   and
   silently ignored for NPT and MMU, so everything works correctly 
   but
   slowly.
   
   TODO: NPT, MMU and non x86 architectures.
   
   The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
   pre-review and suggestions.
   
   Signed-off-by: Michael S. Tsirkin m...@redhat.com
   
    This still uses page fault intercepts which are orders of 
    magnitude slower than hypercalls. Why don't you just create a PV 
   MMIO hypercall that the guest can use to invoke MMIO accesses 
   towards the host based on physical addresses with explicit length 
   encodings?
   
    It is slower, but not an order of magnitude slower. It becomes faster
   with newer HW.
   
   That way you simplify and speed up all code paths, exceeding the 
   speed of PIO exits even. It should also be quite easily portable, 
   as all other platforms have hypercalls available as well.
   
   We are trying to avoid PV as much as possible (well this is also PV,
   but not guest visible
   
   Also, how is this not guest visible? Who sets 
   KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition 
   indicates that the guest does so, so it is guest visible.
   
   QEMU sets it.
   
   How does QEMU know?
   
   Knows what? When to create such eventfd? virtio device knows.
   
   Where does it know from?
   
   It does it always.
   
   
   
   +/*
   + * PV_MMIO - Guest can promise us that all accesses touching this 
   address
   + * are writes of specified length, starting at the specified 
   address.
   + * If not - it's a Guest bug.
   + * Can not be used together with either PIO or DATAMATCH.
   + */
   
   Virtio spec will state that access to a kick register needs to be of
   specific length. This is reasonable thing for HW to ask.
   
   This is a spec change. So the guest would have to indicate that it 
   adheres to a newer spec. Thus it's a guest visible change.
   
   There is not virtio spec that has kick register in MMIO. The spec is in
   the works AFAIK. Actually PIO will not be deprecated and my suggestion
   
   So the guest would indicate that it supports a newer revision of the 
   spec (in your case, that it supports MMIO). How is that any different 
   from exposing that it supports a PV MMIO hcall?
   
   Guest will indicate nothing. New driver will use MMIO if PIO is bar is
   not configured. All driver will not work for virtio devices with MMIO
   bar, but not PIO bar.
  
  I can't parse that, sorry :).
  
 I am sure MST can explain it better, but I'll try one more time.
 The device will have two BARs with the kick register: one PIO, the other MMIO.
 An old driver works only with PIO; a new one supports both. MMIO is used only
 when PIO space is exhausted. So an old driver will not be able to drive a new
 virtio device that has no PIO BAR configured.

Right, I think this was the latest proposal by Rusty.

The discussion about the new layout is taking place on the virtio mailing list.
See thread 'virtio_pci: use separate notification offsets for each vq'
started by Rusty.
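
As a rough guest-side sketch of what that split could look like (purely illustrative: the layout is still under discussion on the virtio list, and the structure, field and offset names below are made up):

/* Illustrative only -- not the actual virtio-pci code.  Assumes a Linux
 * driver context; struct my_virtio_dev and NOTIFY_OFF are made-up names. */
#include <linux/io.h>
#include <linux/types.h>

#define NOTIFY_OFF 0x10                          /* made-up offset of the kick register */

struct my_virtio_dev {
        unsigned long pio_base;                  /* 0 if no PIO BAR was configured */
        void __iomem *mmio_notify;               /* mapped MMIO kick register */
};

static void my_notify(struct my_virtio_dev *vdev, u16 queue_index)
{
        if (vdev->pio_base)                      /* legacy path: PIO BAR present */
                outw(queue_index, vdev->pio_base + NOTIFY_OFF);
        else                                     /* new device: MMIO BAR only */
                iowrite16(queue_index, vdev->mmio_notify);
}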


   
   is to move to MMIO only when PIO address space is exhausted. For PCI it
   will be never, for PCI-e it will be after ~16 devices.
   
   Ok, let's go back a step here. Are you actually able to measure any 
   speed in performance with this patch applied and without when going 
   through MMIO kicks?
   
   
   That's the question for MST. I think he did only micro benchmarks till
   now and he already posted his result here:
   
   

Re: [PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 3:47 AM, Marc Zyngier marc.zyng...@arm.com wrote:
 On 04/04/13 00:15, Christoffer Dall wrote:
 On Wed, Apr 03, 2013 at 11:00:39AM +0100, Marc Zyngier wrote:
 On 03/04/13 10:50, Will Deacon wrote:
 On Tue, Apr 02, 2013 at 02:25:12PM +0100, Marc Zyngier wrote:
 We're about to move to a init procedure where we rely on the
 fact that the init code fits in a single page. Make sure we
 align the idmap text on a page boundary, and that the code is
 not bigger than a single page.

 Signed-off-by: Marc Zyngier marc.zyng...@arm.com
 ---
  arch/arm/kernel/vmlinux.lds.S | 2 +-
  arch/arm/kvm/init.S   | 7 +++
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
 index b571484..d9dd265 100644
 --- a/arch/arm/kernel/vmlinux.lds.S
 +++ b/arch/arm/kernel/vmlinux.lds.S
 @@ -20,7 +20,7 @@
VMLINUX_SYMBOL(__idmap_text_start) = .; \
*(.idmap.text)  \
VMLINUX_SYMBOL(__idmap_text_end) = .;   \
 -  ALIGN_FUNCTION();   \
 +  . = ALIGN(PAGE_SIZE);   \
VMLINUX_SYMBOL(__hyp_idmap_text_start) = .; \
*(.hyp.idmap.text)  \
VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 9f37a79..35a463f 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -111,4 +111,11 @@ __do_hyp_init:
.globl __kvm_hyp_init_end
  __kvm_hyp_init_end:

 +  /*
 +   * The above code *must* fit in a single page for the trampoline
 +   * madness to work. Whoever decides to change it must make sure
 +   * we map the right amount of memory for the trampoline to work.
 +   * The line below ensures any breakage will get noticed.

 This comment is a little funny. What's this about changing it? And why
 madness? OK, it's a little mad.

 I think the nice thing to explain here is why .org would complain - I
 had to look in the AS reference to figure out that it will complain if
 the location pointer is forwarded outside the section - if that is the
 right reason?

 +   */
 +  .org__kvm_hyp_init + PAGE_SIZE
.popsection

 What effect does this have on the size of the kernel image? I'd expect the
 idmap code to be pretty small, so aligning to a page might be overkill a 
 lot
 of the time.

 So we're indeed wasting memory between the kernel and the HYP idmaps.
 The reason for this is to ensure that the init code is not spread across
 two pages, which would make the trampoline harder to allocate.

 If there's a way to ensure that the init.S code will fit a single page,
 then we can remove this ALIGN(PAGE_SIZE).

 Just allocate a temporary page and copy the bit of code into there?  It
 should be easy enough to make sure it's location independent.  We can
 even give that page back later if we want and re-do the trick when a CPU
 gets hot-plugged if we want. No?

 I'd really like to avoid freeing stuff, because it gives us even more
 chances to screw our TLBs (we'd have to invalidate on unmap, and deal
 with potential races between CPUs coming up).

 We could allocate this page *if* the HYP init code happens to cross a
 page boundary, and only then. I'll have a look at implementing this as
 it should solve Will's issue (which is even worse on arm64 with 64k pages).
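
A rough sketch of the allocate-only-if-crossing idea (using the __hyp_idmap_text_{start,end} symbols from the linker script above; this is illustrative, not the eventual implementation):

/* Sketch: only bounce the HYP init code into a separate page when it
 * happens to straddle a page boundary; otherwise idmap it in place.
 * kvm_hyp_init_page is a made-up variable name. */
#include <linux/gfp.h>
#include <linux/string.h>
#include <linux/errno.h>

extern char __hyp_idmap_text_start[], __hyp_idmap_text_end[];

static void *kvm_hyp_init_page;

static int prepare_hyp_init_code(void)
{
        unsigned long start = (unsigned long)__hyp_idmap_text_start;
        unsigned long end   = (unsigned long)__hyp_idmap_text_end;

        if ((start & PAGE_MASK) == ((end - 1) & PAGE_MASK))
                return 0;                /* fits in one page, idmap it directly */

        kvm_hyp_init_page = (void *)__get_free_page(GFP_KERNEL);
        if (!kvm_hyp_init_page)
                return -ENOMEM;

        memcpy(kvm_hyp_init_page, __hyp_idmap_text_start, end - start);
        return 0;                        /* idmap this copy instead */
}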

The freeing stuff was a minor point, that I don't care much about, but
reducing the kernel size is a valid concern I guess.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:

 On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:56, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:45, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
 On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
 With KVM, MMIO is much slower than PIO, due to the need to
 do page walk and emulation. But with EPT, it does not have to be: we
 know the address from the VMCS so if the address is unique, we can 
 look
 up the eventfd directly, bypassing emulation.
 
 Add an interface for userspace to specify this per-address, we can
 use this e.g. for virtio.
 
 The implementation adds a separate bus internally. This serves two
 purposes:
 - minimize overhead for old userspace that does not use PV MMIO
 - minimize disruption in other code (since we don't know the length,
 devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them to handle
  an invalid length)
 
 At the moment, this optimization is only supported for EPT on x86 
 and
 silently ignored for NPT and MMU, so everything works correctly but
 slowly.
 
 TODO: NPT, MMU and non x86 architectures.
 
 The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
 pre-review and suggestions.
 
 Signed-off-by: Michael S. Tsirkin m...@redhat.com
 
  This still uses page fault intercepts which are orders of magnitude 
 slower than hypercalls. Why don't you just create a PV MMIO 
 hypercall that the guest can use to invoke MMIO accesses towards the 
 host based on physical addresses with explicit length encodings?
 
  It is slower, but not an order of magnitude slower. It becomes faster
 with newer HW.
 
 That way you simplify and speed up all code paths, exceeding the 
 speed of PIO exits even. It should also be quite easily portable, as 
 all other platforms have hypercalls available as well.
 
 We are trying to avoid PV as much as possible (well this is also PV,
 but not guest visible
 
 Also, how is this not guest visible? Who sets 
 KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
 that the guest does so, so it is guest visible.
 
 QEMU sets it.
 
 How does QEMU know?
 
 Knows what? When to create such eventfd? virtio device knows.
 
 Where does it know from?
 
 It does it always.
 
 
 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this 
 address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
 Virtio spec will state that access to a kick register needs to be of
 specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it 
 adheres to a newer spec. Thus it's a guest visible change.
 
 There is not virtio spec that has kick register in MMIO. The spec is in
 the works AFAIK. Actually PIO will not be deprecated and my suggestion
 
 So the guest would indicate that it supports a newer revision of the spec 
 (in your case, that it supports MMIO). How is that any different from 
 exposing that it supports a PV MMIO hcall?
 
 Guest will indicate nothing. New driver will use MMIO if PIO is bar is
 not configured. All driver will not work for virtio devices with MMIO
 bar, but not PIO bar.
 
 I can't parse that, sorry :).
 
 It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
 QEMU tells KVM which address driver uses, to make exits faster.  This is not
 different from how eventfd works.  For example if exits to QEMU suddenly 
 become
 very cheap we can remove eventfd completely.
 
 
 is to move to MMIO only when PIO address space is exhausted. For PCI it
 will be never, for PCI-e it will be after ~16 devices.
 
 Ok, let's go back a step here. Are you actually able to measure any speed 
 in performance with this patch applied and without when going through MMIO 
 kicks?
 
 
 That's the question for MST. I think he did only micro benchmarks till
 now and he already posted his result here:
 
 mmio-wildcard-eventfd:pci-mem 3529
 mmio-pv-eventfd:pci-mem 1878
 portio-wildcard-eventfd:pci-io 1846
 
  So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
 
 Those numbers don't align at all with what I measured.
 
 Yep. But why?
 Could be a different hardware. My laptop is i7, what did you measure on?
 processor   : 0
 vendor_id   : GenuineIntel
 cpu family  : 6
 model   : 42
 model name  : Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
 stepping: 7
 microcode   : 0x28
 cpu MHz : 2801.000
 

Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:56:11PM +0200, Paolo Bonzini wrote:
 Il 04/04/2013 15:10, Michael S. Tsirkin ha scritto:
   I would like Avi to comment on this, because I think this is not the
   memory-API approved way of doing things.  You need KVM to define its
   own AddressSpace, and make KVM's listener use
   memory_region_to_address_space to figure out if it is for PV MMIO.
   
   To handle accesses from TCG, the PV AddressSpace can simply have just an
   alias to the actual MemoryRegion where the doorbell is.
  
  This is not really different from other eventfd flags like datamatch.
  Separate address space is not appropriate here.
  This is regular memory space, KVM eventfd simply has a flag that says
  guest is well behaved so please make eventfd go faster.
 
 Having a separate address space would match what you do in the kernel
 though.

Not really; in the kernel the device is on the MMIO bus as usual. The
pv_mmio_bus is an extra data structure to enable efficient lookups when
the length is not known. Userspace does not need to know
about these details.

 I just don't like functions with a dozen arguments...  Can you just make
 datamatch and pv a single flags argument, as is the case with the KVM API?
 
 Paolo

Extra parameters are detected by the compiler :)
I don't really care; does anyone else object to changing
datamatch to a flags argument?

Please let me know to avoid changing it back and forth.
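
For concreteness, the kind of signature change being suggested might look like this (a hypothetical sketch; the function and flag names are illustrative, not the actual QEMU API):

/* Hypothetical sketch of the suggestion: fold the growing list of bool
 * parameters into a single flags word, mirroring the KVM_IOEVENTFD ABI.
 * All names and values here are illustrative. */
#include <stdint.h>
#include <stdbool.h>

#define IOEVENTFD_F_DATAMATCH  (1u << 0)
#define IOEVENTFD_F_PV_MMIO    (1u << 1)

/* before: one bool per feature, another parameter for every new option */
void add_eventfd_bools(uint64_t addr, unsigned size,
                       bool datamatch, bool pv_mmio,
                       uint64_t data, int fd);

/* after: a single flags argument, as in the kernel interface */
void add_eventfd_flags(uint64_t addr, unsigned size,
                       unsigned flags, uint64_t data, int fd);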


-- 
MST
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-04 Thread Jiri Zupka


- Original Message -
 Sorry for not reading the commit message before my previous reply. Now I
 see the origin of the < syntax.
 
 On Fri, Mar 29, 2013 at 06:14:07PM +0100, Jiří Župka wrote:
 [...]
  
  For filtering of named variants the character < is used, because there was
  a problem with a conflict with = in the expression key = value. The char <
  could be changed to something better, but it should be different from =
  for speed optimization.
 
 IMO we need really strong reasons to use anything different from =
 because it is the most obvious choice we have. Using < doesn't make
 any sense to me.


With < it is not necessary to resolve the conflict with = or : in the code; parsing is
straightforward with it. The chars = and : were among my first choices
too, but they bring conflicts in parsing. It could still be changed, because
there were more voices against it. Users may prefer a better syntax over
a small speed improvement.


 
 What kind of speed optimization are you talking about, exactly? We need
 to keep algorithm time/space complexity under control, but making two or
 three additional regexp matches per line won't make the code much
 slower, will it?


Sometimes yes (https://github.com/autotest/virt-test/pull/229).
But I don't think that is the case here. I'll try to think about it more.


 Also: whatever symbol we use, I would really like to make it
 whitespace-insensitive.
 
 I mean: if foo<x or foo=x works, foo < x or foo = x should work,
 too. I am absolutely sure people _will_ eventually try to put whitespace
 around the operator symbol, and this shouldn't cause unpleasant
 surprises.

Thanks a lot for catching this bug. It is only a bug, not intentional.
I forgot one strip(). I will fix the bug after we
finish the discussion about named variants.

 
 
  
  Additionally, a named variant adds keys to the final dictionary; in the
  example it is (virt_system = linux). It should reduce the size of the config file.
  Keys defined in the config and keys defined by named variants are in the same
  namespace.
 
 This is the part I like the most. Thanks!
 
 
  
  Signed-off-by: Jiří Župka jzu...@redhat.com
  ---
   virttest/cartesian_config.py | 138
   ++-
   1 file changed, 124 insertions(+), 14 deletions(-)
  
  diff --git a/virttest/cartesian_config.py b/virttest/cartesian_config.py
  index ef91051..04ed2b5 100755
  --- a/virttest/cartesian_config.py
  +++ b/virttest/cartesian_config.py
  @@ -145,6 +145,74 @@ class MissingIncludeError:
   num_failed_cases = 5
   
   
   +class Label(object):
   +    __slots__ = ["name", "var_name", "long_name", "hash_val", "hash_var"]
   +
   +    def __init__(self, name, next_name=None):
   +        if next_name is None:
   +            self.name = name
   +            self.var_name = None
   +        else:
   +            self.name = next_name
   +            self.var_name = name
   +
   +        if self.var_name is None:
   +            self.long_name = "%s" % (self.name)
   +        else:
   +            self.long_name = "%s<%s" % (self.var_name, self.name)
   +
   +        self.hash_val = self.hash_name()
   +        self.hash_var = None
   +        if self.var_name:
   +            self.hash_var = self.hash_variant()
   +
   +
   +    def __str__(self):
   +        return self.long_name
   +
   +
   +    def __repr__(self):
   +        return self.long_name
   +
   +
   +    def __eq__(self, o):
   +        """
   +        The comparison is asymmetric due to optimization.
   +        """
   +        if o.var_name:
   +            if self.long_name == o.long_name:
   +                return True
   +        else:
   +            if self.name == o.name:
   +                return True
   +        return False
   +
   +
   +    def __ne__(self, o):
   +        """
   +        The comparison is asymmetric due to optimization.
   +        """
   +        if o.var_name:
   +            if self.long_name != o.long_name:
   +                return True
   +        else:
   +            if self.name != o.name:
   +                return True
   +        return False
   +
   +
   +    def __hash__(self):
   +        return self.hash_val
   +
   +
   +    def hash_name(self):
   +        return sum([i + 1 * ord(x) for i, x in enumerate(self.name)])
   +
   +
   +    def hash_variant(self):
   +        return sum([i + 1 * ord(x) for i, x in enumerate(str(self))])
   +
   +
    class Node(object):
        __slots__ = ["name", "dep", "content", "children", "labels",
                     "append_to_shortname", "failed_cases", "default"]
  @@ -212,18 +280,19 @@ class Filter(object):
        def __init__(self, s):
            self.filter = []
            for char in s:
   -            if not (char.isalnum() or char.isspace() or char in ".,_-"):
   +            if not (char.isalnum() or char.isspace() or char in ".,_-<"):
                    raise ParserError("Illegal characters in filter")
            for word in s.replace(",", " ").split():  # OR
                word = [block.split(".") for block in word.split("..")]  # AND
  -for word in s.replace(,,  

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
  
  #define GOAL (1ull << 30)
 
 do {
         iterations *= 2;
         t1 = rdtsc();
 
         for (i = 0; i < iterations; ++i)
                 func();
         t2 = rdtsc();
 } while ((t2 - t1) < GOAL);
 printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
 
 So it's the number of cycles per run.
 
 That means translated my numbers are:
 
   MMIO: 4307
   PIO: 3658
   HCALL: 1756
 
 MMIO - PIO = 649
 
 which aligns roughly with your PV MMIO callback.
 
 My MMIO benchmark was to poke the LAPIC version register. That does go 
 through instruction emulation, no?
 
It should, and PIO 'in' or string I/O also goes through the emulator.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:
 
  On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:56, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:45, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:38, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 14:08, Gleb Natapov wrote:
  
  On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
  
  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
  
  With KVM, MMIO is much slower than PIO, due to the need to
  do page walk and emulation. But with EPT, it does not have to be: 
  we
  know the address from the VMCS so if the address is unique, we 
  can look
  up the eventfd directly, bypassing emulation.
  
  Add an interface for userspace to specify this per-address, we can
  use this e.g. for virtio.
  
  The implementation adds a separate bus internally. This serves two
  purposes:
  - minimize overhead for old userspace that does not use PV MMIO
  - minimize disruption in other code (since we don't know the 
  length,
  devices on the MMIO bus only get a valid address in write, this
   way we don't need to touch all devices to teach them to handle
   an invalid length)
  
  At the moment, this optimization is only supported for EPT on x86 
  and
  silently ignored for NPT and MMU, so everything works correctly 
  but
  slowly.
  
  TODO: NPT, MMU and non x86 architectures.
  
  The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
  pre-review and suggestions.
  
  Signed-off-by: Michael S. Tsirkin m...@redhat.com
  
   This still uses page fault intercepts which are orders of 
   magnitude slower than hypercalls. Why don't you just create a PV 
  MMIO hypercall that the guest can use to invoke MMIO accesses 
  towards the host based on physical addresses with explicit length 
  encodings?
  
   It is slower, but not an order of magnitude slower. It becomes faster
  with newer HW.
  
  That way you simplify and speed up all code paths, exceeding the 
  speed of PIO exits even. It should also be quite easily portable, 
  as all other platforms have hypercalls available as well.
  
  We are trying to avoid PV as much as possible (well this is also PV,
  but not guest visible
  
  Also, how is this not guest visible? Who sets 
  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition 
  indicates that the guest does so, so it is guest visible.
  
  QEMU sets it.
  
  How does QEMU know?
  
  Knows what? When to create such eventfd? virtio device knows.
  
  Where does it know from?
  
  It does it always.
  
  
  
  +/*
  + * PV_MMIO - Guest can promise us that all accesses touching this 
  address
  + * are writes of specified length, starting at the specified 
  address.
  + * If not - it's a Guest bug.
  + * Can not be used together with either PIO or DATAMATCH.
  + */
  
  Virtio spec will state that access to a kick register needs to be of
  specific length. This is reasonable thing for HW to ask.
  
  This is a spec change. So the guest would have to indicate that it 
  adheres to a newer spec. Thus it's a guest visible change.
  
  There is not virtio spec that has kick register in MMIO. The spec is in
  the works AFAIK. Actually PIO will not be deprecated and my suggestion
  
  So the guest would indicate that it supports a newer revision of the 
  spec (in your case, that it supports MMIO). How is that any different 
  from exposing that it supports a PV MMIO hcall?
  
  Guest will indicate nothing. New driver will use MMIO if PIO is bar is
  not configured. All driver will not work for virtio devices with MMIO
  bar, but not PIO bar.
  
  I can't parse that, sorry :).
  
  It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
  QEMU tells KVM which address driver uses, to make exits faster.  This is not
  different from how eventfd works.  For example if exits to QEMU suddenly 
  become
  very cheap we can remove eventfd completely.
  
  
  is to move to MMIO only when PIO address space is exhausted. For PCI it
  will be never, for PCI-e it will be after ~16 devices.
  
  Ok, let's go back a step here. Are you actually able to measure any 
  speed in performance with this patch applied and without when going 
  through MMIO kicks?
  
  
  That's the question for MST. I think he did only micro benchmarks till
  now and he already posted his result here:
  
  mmio-wildcard-eventfd:pci-mem 3529
  mmio-pv-eventfd:pci-mem 1878
  portio-wildcard-eventfd:pci-io 1846
  
   So the patch speeds up MMIO by almost 100%, and it is almost the same as PIO.
  
  Those numbers don't align at all with what I measured.
  
  Yep. But why?
  Could be a different hardware. My laptop is i7, what did you measure on?
  processor   

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 06:36:30PM +0300, Michael S. Tsirkin wrote:
  processor   : 0
  vendor_id   : AuthenticAMD
  cpu family  : 16
  model   : 8
  model name  : Six-Core AMD Opteron(tm) Processor 8435
  stepping: 0
  cpu MHz : 800.000
  cache size  : 512 KB
  physical id : 0
  siblings: 6
  core id : 0
  cpu cores   : 6
  apicid  : 8
  initial apicid  : 0
  fpu : yes
  fpu_exception   : yes
  cpuid level : 5
  wp  : yes
  flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
  mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
  pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc 
  extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
  cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv 
  svm_lock nrip_save pausefilter
  bogomips: 5199.87
  TLB size: 1024 4K pages
  clflush size: 64
  cache_alignment : 64
  address sizes   : 48 bits physical, 48 bits virtual
  power management: ts ttp tm stc 100mhzsteps hwpstate
 
 Hmm, svm code seems less optimized for MMIO, but PIO
 is almost identical. Gleb says the unittest is broken
 on AMD so I'll wait until it's fixed to test.
 
It's not that the unittest is broken, it's my environment that is broken :)

 Did you do PIO reads by chance?
 
   
   Or could be different software, this is on top of 3.9.0-rc5, what
   did you try?
  
  3.0 plus kvm-kmod of whatever was current back in autumn :).
  
   
   MST, could you please do a real world latency benchmark with virtio-net 
   and
   
* normal ioeventfd
* mmio-pv eventfd
* hcall eventfd
   
   I can't do this right away, sorry.  For MMIO we are discussing the new
   layout on the virtio mailing list, guest and qemu need a patch for this
   too.  My hcall patches are stale and would have to be brought up to
   date.
   
   
   to give us some idea how much performance we would gain from each 
   approach? Thoughput should be completely unaffected anyway, since virtio 
   just coalesces kicks internally.
   
   Latency is dominated by the scheduling latency.
   This means virtio-net is not the best benchmark.
  
  So what is a good benchmark?
 
 E.g. a ping-pong stress test will do, but we need to look at CPU utilization;
 that's what is affected, not latency.
 
  Is there any difference in speed at all? I strongly doubt it. One of 
  virtio's main points is to reduce the number of kicks.
 
 For this stage of the project I think microbenchmarks are more appropriate.
 Doubling the price of an exit is likely to be measurable. 30 cycles likely
 not ...
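
For reference, a self-contained user-space analogue of the cycle-counting loop being discussed (the kvm-unit-tests version quoted below runs the same pattern inside the guest; x86 only because of rdtsc):

/* Double the iteration count until the run is long enough, then report
 * the average number of TSC cycles per operation. */
#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
}

static void op_under_test(void)
{
        __asm__ __volatile__("" ::: "memory"); /* replace with the MMIO/PIO/hypercall under test */
}

#define GOAL (1ull << 30)                      /* run for at least 2^30 cycles */

int main(void)
{
        uint64_t t1, t2, i, iterations = 32;

        do {
                iterations *= 2;
                t1 = rdtsc();
                for (i = 0; i < iterations; ++i)
                        op_under_test();
                t2 = rdtsc();
        } while (t2 - t1 < GOAL);

        printf("%d cycles/op\n", (int)((t2 - t1) / iterations));
        return 0;
}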
 
   
   I'm also slightly puzzled why the wildcard eventfd mechanism is so 
   significantly slower, while it was only a few percent on my test system. 
   What are the numbers you're listing above? Cycles? How many cycles do 
   you execute in a second?
   
   
   Alex
   
   
   It's the TSC divided by number of iterations.  kvm unittest this value, 
   here's
   what it does (removed some dead code):
   
    #define GOAL (1ull << 30)
 
    do {
            iterations *= 2;
            t1 = rdtsc();
 
            for (i = 0; i < iterations; ++i)
                    func();
            t2 = rdtsc();
    } while ((t2 - t1) < GOAL);
    printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
  
  So it's the number of cycles per run.
  
  That means translated my numbers are:
  
MMIO: 4307
PIO: 3658
HCALL: 1756
  
  MMIO - PIO = 649
  
  which aligns roughly with your PV MMIO callback.
  
  My MMIO benchmark was to poke the LAPIC version register. That does go 
  through instruction emulation, no?
  
  
  Alex
 
 Why wouldn't it?
 
Intel decodes accesses to the APIC page, but we use that only for fast EOI.

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248
Based on the email thread over the last couple of days, I have
below a more concrete proposal (v2) for new ioctls supporting vfio-pci
on SoCs with the Freescale PAMU.

Example usage is as described by Scott:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can
//   update PCI config space appropriately.

Thanks,
Stuart



The Freescale PAMU is an aperture-based IOMMU with the following 
characteristics.  Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
   -the mapped region in each window can be different from
    the window size... the mapping must be a power of 2
  -physical address of the mapping must be naturally aligned
   with the mapping size

/*
 * VFIO_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
__u32   argsz;
__u32   attribute;
#define VFIO_ATTR_GEOMETRY 1
#define VFIO_ATTR_WINDOWS  2
#define VFIO_ATTR_PAMU_STASH   3

    /* following fields are for VFIO_ATTR_GEOMETRY */
__u64 aperture_start; /* first address that can be mapped*/
__u64 aperture_end;   /* last address that can be mapped */

    /* following fields are for VFIO_ATTR_WINDOWS */
__u32 windows;/* number of windows in the aperture */
  /* initially this will be the max number
   * of windows that can be set
   */

/* following fields are for VFIO_ATTR_PAMU_STASH */
__u32 cpu;/* CPU number for stashing */
__u32 cache;  /* cache ID for stashing */
};
#define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
#define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_banks {
__u32   argsz;
    __u32   bank_count;  /* the number of MSI banks */
};
#define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_banks)

/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};

/*
 * VFIO_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u64   iova;  /* the iova to be unmapped to */
};
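
To make the intended flow concrete, a sketch of how userspace might drive these ioctls (illustrative only: the ioctl numbers above are still placeholders, the VFIO_PAMU_MAP_MSI_BANK request macro is assumed to be defined alongside its struct, and error handling is minimal):

/* Sketch of the proposed usage flow (mirrors the pseudo-code at the top).
 * Assumes 'container' is an open /dev/vfio/vfio fd already switched to the
 * VFIO_PAMU iommu type, and that the definitions proposed above are available. */
#include <sys/ioctl.h>
#include <linux/types.h>

static int map_pamu_msi_banks(int container, __u64 msi_iova_base, __u64 bank_stride)
{
        struct vfio_pamu_msi_banks banks = { .argsz = sizeof(banks) };
        struct vfio_pamu_msi_bank_map map = { .argsz = sizeof(map) };
        __u32 i;

        if (ioctl(container, VFIO_PAMU_GET_MSI_BANK_COUNT, &banks) < 0)
                return -1;

        /* geometry and window count would be set first via VFIO_PAMU_SET_ATTR */

        for (i = 0; i < banks.bank_count; i++) {
                map.msi_bank_index = i;
                map.iova = msi_iova_base + i * bank_stride;  /* one window per bank */
                if (ioctl(container, VFIO_PAMU_MAP_MSI_BANK, &map) < 0)
                        return -1;
        }
        return 0;
}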

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Scott Wood

On 04/04/2013 07:54:20 AM, Alexander Graf wrote:


On 03.04.2013, at 03:57, Scott Wood wrote:

 +  if (opp->mpic_mode_mask == GCR_MODE_PROXY)

Shouldn't this be an &?


The way the mode field was originally documented was a two-bit field,
where 0b11 was external proxy, and 0b10 was reserved.  If we use & it
would have to be:


if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
...

Simply testing opp->mpic_mode_mask & GCR_MODE_PROXY would return true
in the case of GCR_MODE_MIXED.


In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that  
is ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a  
bit more legitimate to view it as a bitmap.  Still, I doubt we'll see  
new mode bits.
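
A tiny stand-alone illustration of that point (the mode values below are assumptions chosen only to show the overlap, not the real GCR encodings):

/* Why a plain bitwise test is not enough when the "proxy" encoding shares
 * bits with another mode.  Values are assumed purely for illustration. */
#include <stdio.h>

#define GCR_MODE_MIXED 0x1   /* assumed value */
#define GCR_MODE_PROXY 0x3   /* assumed: overlaps the mixed bit */

int main(void)
{
        unsigned int mode = GCR_MODE_MIXED;

        /* '&' alone: nonzero for MIXED too, because the encodings overlap */
        printf("& test:  %d\n", !!(mode & GCR_MODE_PROXY));

        /* masked compare: true only when the full proxy encoding is set */
        printf("== test: %d\n", (mode & GCR_MODE_PROXY) == GCR_MODE_PROXY);
        return 0;
}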



 @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 tasklet_kill(vcpu->arch.tasklet);

kvmppc_remove_vcpu_debugfs(vcpu);
 +
 +  switch (vcpu->arch.irq_type) {
 +  case KVMPPC_IRQ_MPIC:
 +  kvmppc_mpic_put(vcpu->arch.mpic);

This doesn't tell the MPIC that this exact CPU is getting killed.  
What if we hotplug remove just a single CPU? Don't we have to  
deregister the CPU with the MPIC?


If we ever support hot vcpu removal, yes.  We'd probably need some MPIC  
code changes to accommodate that, and we wouldn't currently have a way  
to test it, so I'd rather make it obviously not supported for now.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Alex Williamson
On Thu, 2013-04-04 at 17:32 +, Yoder Stuart-B08248 wrote:
 Based on the email thread over the last couple of days, I have
 below a more concrete proposal (v2) for new ioctls supporting vfio-pci
 on SoCs with the Freescale PAMU.
 
 Example usage is as described by Scott:
 
 count = VFIO_IOMMU_GET_MSI_BANK_COUNT
 VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
 VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
 // do other DMA maps now, or later, or not at all, doesn't matter
 for (i = 0; i < count; i++)
 VFIO_IOMMU_MAP_MSI_BANK(iova, i);
 // The kernel now knows where each bank has been mapped, and can
 //   update PCI config space appropriately.
 
 Thanks,
 Stuart
 
 
 
 The Freescale PAMU is an aperture-based IOMMU with the following 
 characteristics.  Each device has an entry in a table in memory
 describing the iova->phys mapping. The mapping has:
-an overall aperture that is power of 2 sized, and has a start iova that
 is naturally aligned
-has 1 or more windows within the aperture
   -number of windows must be power of 2, max is 256
   -size of each window is determined by aperture size / # of windows
   -iova of each window is determined by aperture start iova / # of windows
    -the mapped region in each window can be different from
 the window size... the mapping must be a power of 2
   -physical address of the mapping must be naturally aligned
with the mapping size
 
 /*
  * VFIO_PAMU_GET_ATTR
  *
  * Gets the iommu attributes for the current vfio container.  This 
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets argsz and attribute.  The ioctl fills in
  * the provided struct vfio_pamu_attr based on the attribute
  * value that was set.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 struct vfio_pamu_attr {
 __u32   argsz;
 __u32   attribute;
 #define VFIO_ATTR_GEOMETRY 1
 #define VFIO_ATTR_WINDOWS  2
 #define VFIO_ATTR_PAMU_STASH   3
 
  /* following fields are for VFIO_ATTR_GEOMETRY */
 __u64 aperture_start; /* first address that can be mapped*/
 __u64 aperture_end;   /* last address that can be mapped */
 
  /* following fields are for VFIO_ATTR_WINDOWS */
 __u32 windows;/* number of windows in the aperture */
   /* initially this will be the max number
* of windows that can be set
*/
 
 /* following fields are for VFIO_ATTR_PAMU_STASH */
 __u32 cpu;/* CPU number for stashing */
 __u32 cache;  /* cache ID for stashing */

Can multiple attribute bits be set at once?  If not then the above could
be a union and the attribute could be an enum.  A flags field is
probably still a good idea.

 };
 #define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_attr)
 
 /*
  * VFIO_PAMU_SET_ATTR
  *
  * Sets the iommu attributes for the current vfio container.  This 
  * ioctl is applicable to an iommu type of VFIO_PAMU only.
  * Caller sets struct vfio_pamu attr, including argsz and attribute and
  * setting any fields that are valid for the attribute.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 #define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_attr)
 
 /*
  * VFIO_PAMU_GET_MSI_BANK_COUNT
  *
  * Returns the number of MSI banks for this platform.  This tells user space
  * how many aperture windows should be reserved for MSI banks when setting
  * the PAMU geometry and window count.
  * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 struct vfio_pamu_msi_banks {
 __u32   argsz;
 __u32   bank_count;  /* the number of MSI banks */
 };
 #define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
 struct vfio_pamu_msi_banks)

argsz w/o flags doesn't really buy us much.

 
 /*
  * VFIO_PAMU_MAP_MSI_BANK
  *
  * Maps the MSI bank at the specified index and iova.  User space must
  * call this ioctl once for each MSI bank (count of banks is returned by
  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
  * Operates on VFIO file descriptor (/dev/vfio/vfio).
  * Return: 0 on success, -errno on failure
  */
 
 struct vfio_pamu_msi_bank_map {
 __u32   argsz;
 __u32   msi_bank_index;  /* the index of the MSI bank */
 __u64   iova;  /* the iova the bank is to be mapped to */
 };

Again, flags.  If we dynamically add or remove devices from a container
the bank count can change, right?  If bank count goes from 2 to 3, does
userspace know to assume the new bank is [2]?  If bank count goes from 3 to
2, which index was that?  If removing a device removes an MSI bank then
vfio-pamu will automatically do the unmap, right?

[PATCH] KVM: x86: Fix memory leak in vmx.c

2013-04-04 Thread Andrew Honig
If userspace creates and destroys multiple VMs within the same process
we leak 20k of memory in the userspace process context per VM.  This
patch frees the memory in kvm_arch_destroy_vm.  If the process exits
without closing the VM file descriptor or the file descriptor has been
shared with another process then we don't need to free the memory.

Messing with user space memory from an fd is not ideal, but other changes
would require user space changes and this is consistent with how the
memory is currently allocated.

Tested: Ran several VMs and ran against a test program meant to 
demonstrate the leak (www.spinics.net/lists/kvm/msg83734.html).

Signed-off-by: Andrew Honig aho...@google.com

---
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/kvm/vmx.c  |3 +++
 arch/x86/kvm/x86.c  |   11 +++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..975a74d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -553,6 +553,9 @@ struct kvm_arch {
struct page *ept_identity_pagetable;
bool ept_identity_pagetable_done;
gpa_t ept_identity_map_addr;
+   unsigned long ept_ptr;
+   unsigned long apic_ptr;
+   unsigned long tss_ptr;
 
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..8aa5d81 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3703,6 +3703,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
}
 
kvm-arch.apic_access_page = page;
+   kvm-arch.apic_ptr = kvm_userspace_mem.userspace_addr;
 out:
mutex_unlock(kvm-slots_lock);
return r;
@@ -3733,6 +3734,7 @@ static int alloc_identity_pagetable(struct kvm *kvm)
}
 
kvm-arch.ept_identity_pagetable = page;
+   kvm-arch.ept_ptr = kvm_userspace_mem.userspace_addr;
 out:
mutex_unlock(kvm-slots_lock);
return r;
@@ -4366,6 +4368,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int 
addr)
if (ret)
return ret;
kvm-arch.tss_addr = addr;
+   kvm-arch.tss_ptr = tss_mem.userspace_addr;
if (!init_rmode_tss(kvm))
return  -ENOMEM;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f19ac0a..411ff2a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6812,6 +6812,16 @@ void kvm_arch_sync_events(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+   if (current-mm == kvm-mm) {
+   /*
+* Free pages allocated on behalf of userspace, unless the
+* memory map has changed due to process exit or fd
+* copying.
+*/
+   vm_munmap(kvm-arch.apic_ptr, PAGE_SIZE);
+   vm_munmap(kvm-arch.ept_ptr, PAGE_SIZE);
+   vm_munmap(kvm-arch.tss_ptr, PAGE_SIZE * 3);
+   }
kvm_iommu_unmap_guest(kvm);
kfree(kvm-arch.vpic);
kfree(kvm-arch.vioapic);
@@ -6929,6 +6939,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return PTR_ERR((void *)userspace_addr);
 
memslot-userspace_addr = userspace_addr;
+   mem-userspace_addr = userspace_addr;
}
 
return 0;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] KVM: Call kvm_apic_match_dest() to check destination vcpu

2013-04-04 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 12:34:37PM +0300, Gleb Natapov wrote:
 On Wed, Apr 03, 2013 at 08:47:58AM +0800, Yang Zhang wrote:
  From: Yang Zhang yang.z.zh...@intel.com
  
  For a given vcpu, kvm_apic_match_dest() will tell you whether
  the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap()
  and use kvm_apic_match_dest() instead.
  
  Signed-off-by: Yang Zhang yang.z.zh...@intel.com
 Applied, thanks.
 
The patch does not compile, dropped.

  ---
   arch/x86/kvm/lapic.c |   47 ---
   arch/x86/kvm/lapic.h |4 
   virt/kvm/ioapic.c|9 +++--
   3 files changed, 3 insertions(+), 57 deletions(-)
  
  diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
  index a8e9369..e227474 100644
  --- a/arch/x86/kvm/lapic.c
  +++ b/arch/x86/kvm/lapic.c
  @@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
  return (kvm_apic_get_reg(apic, APIC_ID)  24)  0xff;
   }
   
  -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
  -   struct kvm_lapic_irq *irq,
  -   u64 *eoi_exit_bitmap)
  -{
  -   struct kvm_lapic **dst;
  -   struct kvm_apic_map *map;
  -   unsigned long bitmap = 1;
  -   int i;
  -
  -   rcu_read_lock();
  -   map = rcu_dereference(vcpu-kvm-arch.apic_map);
  -
  -   if (unlikely(!map)) {
  -   __set_bit(irq-vector, (unsigned long *)eoi_exit_bitmap);
  -   goto out;
  -   }
  -
  -   if (irq-dest_mode == 0) { /* physical mode */
  -   if (irq-delivery_mode == APIC_DM_LOWEST ||
  -   irq-dest_id == 0xff) {
  -   __set_bit(irq-vector,
  - (unsigned long *)eoi_exit_bitmap);
  -   goto out;
  -   }
  -   dst = map-phys_map[irq-dest_id  0xff];
  -   } else {
  -   u32 mda = irq-dest_id  (32 - map-ldr_bits);
  -
  -   dst = map-logical_map[apic_cluster_id(map, mda)];
  -
  -   bitmap = apic_logical_id(map, mda);
  -   }
  -
  -   for_each_set_bit(i, bitmap, 16) {
  -   if (!dst[i])
  -   continue;
  -   if (dst[i]-vcpu == vcpu) {
  -   __set_bit(irq-vector,
  - (unsigned long *)eoi_exit_bitmap);
  -   break;
  -   }
  -   }
  -
  -out:
  -   rcu_read_unlock();
  -}
  -
   static void recalculate_apic_map(struct kvm *kvm)
   {
  struct kvm_apic_map *new, *old = NULL;
  diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
  index 2c721b9..baa20cf 100644
  --- a/arch/x86/kvm/lapic.h
  +++ b/arch/x86/kvm/lapic.h
  @@ -160,10 +160,6 @@ static inline u16 apic_logical_id(struct kvm_apic_map 
  *map, u32 ldr)
  return ldr  map-lid_mask;
   }
   
  -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
  -   struct kvm_lapic_irq *irq,
  -   u64 *eoi_bitmap);
  -
   static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
   {
  return vcpu-arch.apic-pending_events;
  diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
  index 5ba005c..bb3906d 100644
  --- a/virt/kvm/ioapic.c
  +++ b/virt/kvm/ioapic.c
  @@ -124,7 +124,6 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu 
  *vcpu,
   {
  struct kvm_ioapic *ioapic = vcpu-kvm-arch.vioapic;
  union kvm_ioapic_redirect_entry *e;
  -   struct kvm_lapic_irq irqe;
  int index;
   
  spin_lock(ioapic-lock);
  @@ -135,11 +134,9 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu 
  *vcpu,
  (e-fields.trig_mode == IOAPIC_LEVEL_TRIG ||
   kvm_irq_has_notifier(ioapic-kvm, KVM_IRQCHIP_IOAPIC,
   index))) {
  -   irqe.dest_id = e-fields.dest_id;
  -   irqe.vector = e-fields.vector;
  -   irqe.dest_mode = e-fields.dest_mode;
  -   irqe.delivery_mode = e-fields.delivery_mode  8;
  -   kvm_calculate_eoi_exitmap(vcpu, irqe, eoi_exit_bitmap);
  +   if (kvm_apic_match_dest(vcpu, NULL, 0,
  +   e-fields.dest_id, e-fields.dest_mode))
  +   __set_bit(e-fileds.vector, (unsigned long 
  *)eoi_exit_bitmap);
  }
  }
  spin_unlock(ioapic-lock);
  -- 
  1.7.1
 
 --
   Gleb.
 --
 To unsubscribe from this list: send the line unsubscribe kvm in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248

  /*
   * VFIO_PAMU_MAP_MSI_BANK
   *
   * Maps the MSI bank at the specified index and iova.  User space must
   * call this ioctl once for each MSI bank (count of banks is returned by
   * VFIO_IOMMU_GET_MSI_BANK_COUNT).
   * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */
 
  struct vfio_pamu_msi_bank_map {
  __u32   argsz;
  __u32   msi_bank_index;  /* the index of the MSI bank */
  __u64   iova;  /* the iova the bank is to be mapped to */
  };
 
 Again, flags.  If we dynamically add or remove devices from a container
 the bank count can change, right?  If bank count goes from 2 to 3, does
 userspace know assume the new bank is [2]?  If bank count goes from 3 to
 2, which index was that?  If removing a device removes an MSI bank then
 vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (number will most likely always be
3 or 4).  So it won't change as devices are added or removed.

If devices are added or removed, the kernel side can enable or disable
the corresponding MSI windows.  But that is hidden from user space.

Stuart




RFC: vfio API changes needed for powerpc (v3)

2013-04-04 Thread Yoder Stuart-B08248
-v3 updates
   -made vfio_pamu_attr a union, added flags
   -s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more
clear which fd is being operated on
   -added flags to vfio_pamu_msi_bank_map/umap
   -VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32
not a struct
   -fixed some typos



The Freescale PAMU is an aperture-based IOMMU with the following
characteristics.  Each device has an entry in a table in memory
describing the iova-phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
   -the mapped region in each window can be different than
    the window size...mapping must be a power of 2
  -physical address of the mapping must be naturally aligned
   with the mapping size
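
As a rough sketch of the geometry arithmetic implied by the list above
(the names and the per-window iova layout are illustrative assumptions,
not part of the proposed API):

#include <stdint.h>

static int is_pow2(uint64_t x)
{
	return x && !(x & (x - 1));
}

/* aperture is power-of-2 sized; window count is a power of 2, max 256 */
static uint64_t pamu_window_size(uint64_t aperture_size, uint32_t windows)
{
	if (!is_pow2(aperture_size) || !is_pow2(windows) || windows > 256)
		return 0;	/* invalid geometry */
	return aperture_size / windows;
}

/* one plausible reading of the window layout: window i starts at
 * aperture_start + i * window_size; the region mapped inside it must
 * itself be a power of 2 and naturally aligned to its size. */
static uint64_t pamu_window_iova(uint64_t aperture_start,
				 uint64_t window_size, uint32_t i)
{
	return aperture_start + (uint64_t)i * window_size;
}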

These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).

/*
 * VFIO_IOMMU_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.

 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
__u32   argsz;
__u32   flags;/* no flags currently */
__u32   attribute;

union {
/* VFIO_ATTR_GEOMETRY */
        struct {
                __u64 aperture_start; /* first addr that can be mapped */
                __u64 aperture_end;   /* last addr that can be mapped */
        } attr;

/* VFIO_ATTR_WINDOWS */
__u32 windows;  /* number of windows in the aperture */
/* initially this will be the max number
 * of windows that can be set
 */

/* VFIO_ATTR_PAMU_STASH */
struct { 
__u32 cpu; /* CPU number for stashing */
__u32 cache;   /* cache ID for stashing */
} stash;
        };
};
#define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Return: 0 on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Return: __u32 bank count on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + x, __u32)

/*
 * VFIO_IOMMU_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   flags; /* no flags currently */
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};
#define VFIO_IOMMU_PAMU_MAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_map )

/*
 * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u32   flags; /* no flags currently */
__u64   iova;  /* the iova to be unmapped to */
};
#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_unmap )
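
A hypothetical userspace sequence using the proposed ioctls above could
look like the sketch below (error handling mostly omitted; the ioctl
numbers are still placeholders, and whether GET_MSI_BANK_COUNT returns
the count directly or through a pointer is an assumption here):

#include <stdint.h>
#include <sys/ioctl.h>

static int setup_pamu(int container_fd, uint64_t msi_iova_base,
		      uint32_t dma_windows)
{
	struct vfio_pamu_attr attr = { .argsz = sizeof(attr) };
	int banks, i;

	/* how many aperture windows to reserve for MSI banks */
	banks = ioctl(container_fd, VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT, 0);
	if (banks < 0)
		return banks;

	/* query the geometry, then set the total window count */
	attr.attribute = VFIO_ATTR_GEOMETRY;
	ioctl(container_fd, VFIO_IOMMU_PAMU_GET_ATTR, &attr);

	attr.attribute = VFIO_ATTR_WINDOWS;
	attr.windows = dma_windows + banks;
	ioctl(container_fd, VFIO_IOMMU_PAMU_SET_ATTR, &attr);

	/* map each MSI bank at an iova chosen by userspace */
	for (i = 0; i < banks; i++) {
		struct vfio_pamu_msi_bank_map map = {
			.argsz = sizeof(map),
			.msi_bank_index = i,
			.iova = msi_iova_base + (uint64_t)i * 4096,
		};
		ioctl(container_fd, VFIO_IOMMU_PAMU_MAP_MSI_BANK, &map);
	}
	return 0;
}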

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Geoff Levand
Hi,

On Tue, 2013-04-02 at 14:25 +0100, Marc Zyngier wrote:
 + @ Jump to the trampoline page
 + ldr r2, =#PAGE_MASK
 + adr r3, target
 + bic r3, r3, r2
 + ldr r2, =#TRAMPOLINE_VA
 + add r3, r3, r2
 + mov pc, r3

I guess you need 'ldr r2, =PAGE_MASK'.

  arch/arm/kvm/init.S:114: Error: bad expression -- `ldr r2,=#(~((112)-1))'
  arch/arm/kvm/init.S:117: Error: bad expression -- `ldr r2,=#0x'

-Geoff


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Alexander Graf


Am 04.04.2013 um 20:41 schrieb Scott Wood scottw...@freescale.com:

 On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
 On 03.04.2013, at 03:57, Scott Wood wrote:
  +if (opp-mpic_mode_mask == GCR_MODE_PROXY)
 Shouldn't this be an ?
 
 The way the mode field was originally documented was a two-bit field, where 
 0b11 was external proxy, and 0b10 was reserved.  If we use  it would have to 
 be:
 
if ((opp-mpic_mode_mask  GCR_MODE_PROXY) == GCR_MODE_PROXY)
...
 
 Simply testing opp-mpic_mode_mask  GCR_MODE_PROXY would return true in 
 the case of GCR_MODE_MIXED.
 
 In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that is 
 ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a bit more 
 legitimate to view it as a bitmap.  Still, I doubt we'll see new mode bits.

Ok, please add a comment about this here then :).

 
  @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 tasklet_kill(vcpu-arch.tasklet);
 
 kvmppc_remove_vcpu_debugfs(vcpu);
  +
  +switch (vcpu-arch.irq_type) {
  +case KVMPPC_IRQ_MPIC:
  +kvmppc_mpic_put(vcpu-arch.mpic);
 This doesn't tell the MPIC that this exact CPU is getting killed. What if we 
 hotplug remove just a single CPU? Don't we have to deregister the CPU with 
 the MPIC?
 
 If we ever support hot vcpu removal, yes.  We'd probably need some MPIC code 
 changes to accommodate that, and we wouldn't currently have a way to test it, 
 so I'd rather make it obviously not supported for now.

Is there any way to break heavily if user space attempts this?

Alex

 
 -Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Scott Wood

On 04/04/2013 05:30:05 PM, Alexander Graf wrote:



Am 04.04.2013 um 20:41 schrieb Scott Wood scottw...@freescale.com:

 On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
 On 03.04.2013, at 03:57, Scott Wood wrote:
  +if (opp-mpic_mode_mask == GCR_MODE_PROXY)
 Shouldn't this be an ?

 The way the mode field was originally documented was a two-bit  
field, where 0b11 was external proxy, and 0b10 was reserved.  If we  
use  it would have to be:


if ((opp-mpic_mode_mask  GCR_MODE_PROXY) == GCR_MODE_PROXY)
...

 Simply testing opp-mpic_mode_mask  GCR_MODE_PROXY would return  
true in the case of GCR_MODE_MIXED.


 In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI])  
that is ignored if the mixed-mode bit (GCR[M]) is not set, which  
makes it a bit more legitimate to view it as a bitmap.  Still, I  
doubt we'll see new mode bits.


Ok, please add a comment about this here then :).


What sort of comment would you like?  Or do you want me to use the (x  
 y) == y version?


  @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu  
*vcpu)

 tasklet_kill(vcpu-arch.tasklet);
 
 kvmppc_remove_vcpu_debugfs(vcpu);
  +
  +switch (vcpu-arch.irq_type) {
  +case KVMPPC_IRQ_MPIC:
  +kvmppc_mpic_put(vcpu-arch.mpic);
 This doesn't tell the MPIC that this exact CPU is getting killed.  
What if we hotplug remove just a single CPU? Don't we have to  
deregister the CPU with the MPIC?


 If we ever support hot vcpu removal, yes.  We'd probably need some  
MPIC code changes to accommodate that, and we wouldn't currently have  
a way to test it, so I'd rather make it obviously not supported for  
now.


Is there any way to break heavily if user space attempts this?


Is there any way for userspace to request this currently?  They can  
close the vcpu fd, but the vcpu won't actually be destroyed until the  
vm goes down.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls

2013-04-04 Thread Paul Mackerras
On Thu, Apr 04, 2013 at 11:49:55AM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 07:37, Paul Mackerras wrote:
 
  On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
  +/* Platform specific hcalls, used by KVM */
  +#define H_RTAS   0xf000
  
  How about you define a different hcall ID for this? Then QEMU would
  create its rtas entry blob such that KVM-routed RTAS handling goes
  to KVM directly.
  
  QEMU can still do that, and I don't see that it would change the
  kernel side if it did.  We would still have to have agreement between
  the kernel and userspace as to what the hcall number for invoking the
  in-kernel RTAS calls was, and the kernel would still have to keep a
  list of token numbers and how they correspond to the functions it
  provides.  The only thing different would be that the in-kernel RTAS
  hcall could return to the guest if it didn't recognize the token
  number, rather than pushing the problem up to userspace.  However,
  that wouldn't make the code any simpler, and it isn't a situation
  where performance is an issue.
  
  Do you see some kernel-side improvements or simplifications from your
  suggestion that I'm missing?  Remember, the guest gets the token
  numbers from the device tree (properties under the /rtas node), so
  they are under the control of userspace/QEMU.
 
 The code flow with this patch:
 
   setup time
 
   foreach (override in overrides)
 ioctl(OVERRIDE_RTAS, ...);
 
   runtime
 
   switch (hcall_id) {
   case QEMU_RTAS_ID:
 foreach (override in kvm_overrides) {
   int rtas_id = ...;
   if (override.rtas_id == rtas_id) {
 handle_rtas();

Actually this is more like: override.handler();

 handled = true;
   }
 }
 if (!handled)
   pass_to_qemu();
 break;
   default:
 pass_to_qemu();
 break
   }
 
 What I'm suggesting:
 
   setup time
 
   nothing from KVM's point of view

Actually, this can't be nothing.

The way the RTAS calls work is that there is a name and a token
(32-bit integer value) for each RTAS call.  The tokens have to be
unique for each different name.  Userspace puts the names and tokens
in the device tree under the /rtas node (a set of properties where the
property name is the RTAS function name and the property value is the
token).  The guest looks up the token for each RTAS function it wants
to use, and passes the token in the argument buffer for the RTAS call.

This means that userspace has to know the names and tokens for all
supported RTAS functions, both the ones implemented in the kernel and
the ones implemented in userspace.

Also, the token numbers are pretty arbitrary, and the token numbers
for the kernel-implemented RTAS functions could be chosen by userspace
or by the kernel.  If they're chosen by the kernel, then userspace
needs a way to discover them (so it can put them in the device tree),
and also has to avoid choosing any token numbers for its functions
that collide with a kernel-chosen token.  If userspace chooses the
token numbers, it has to tell the kernel what token numbers it has
chosen for the kernel-implemented RTAS functions.  We chose the latter
since it gives userspace more control.

So this setup time code has to be either (your suggestion):

foreach RTAS function possibly implemented in kernel {
query kernel token for function, by name
if that gives an error, mark function as needing to be
implemented in userspace
}
(userspace) allocate tokens for remaining functions,
avoiding collisions with kernel-chosen tokens

or else it is (my suggestion):

(userspace) allocate tokens for all RTAS functions
foreach RTAS function possibly implemented in kernel {
tell kernel the (name, token) correspondence
}

   runtime
 
   switch (hcall_id) {
   case KVM_RTAS_ID:
 handle_rtas();

Here, you've compressed details that you expanded in your pseudo-code
above, making this a less than fair comparison.  This handle_rtas()
function has to fetch the token and branch out to the appropriate
handler routine.  Whether that's a switch statement or a loop over
registered handlers doesn't make all that much difference.

 break;
   default:
 pass_to_qemu();
 break;
   }
 
 
 Which one looks easier and less error prone to you? :)
 
 Speaking of which, how does user space know that the kernel actually
 supports a specific RTAS token? 

It's really the names that are more important, the tokens are pretty
arbitrary.  In my scheme, userspace does a KVM_PPC_RTAS_DEFINE_TOKEN
ioctl giving the name and the (userspace-chosen) token, which gets an
error if the kernel doesn't recognize the name.  In your scheme, there
would have to be an equivalent ioctl to query the (kernel-chosen)
token for a given name, which once again would return an error if the
kernel doesn't recognize the name.  Either way the kernel has to have
a list of names that it knows about.
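
For illustration, the userspace side of that scheme could look roughly
like the sketch below; the struct layout, field sizes and token values
are assumptions, only the ioctl name and the name/token semantics come
from the discussion above:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>

struct rtas_token_args {		/* assumed layout, for the sketch only */
	char	name[120];		/* RTAS function name */
	__u64	token;			/* token userspace put in the device tree */
};

static int define_kernel_rtas_tokens(int vm_fd)
{
	static const struct { const char *name; __u64 token; } calls[] = {
		{ "ibm,set-xive", 0x2001 },	/* hypothetical tokens */
		{ "ibm,get-xive", 0x2002 },
	};
	int i, ret;

	for (i = 0; i < 2; i++) {
		struct rtas_token_args args = { .token = calls[i].token };

		strncpy(args.name, calls[i].name, sizeof(args.name) - 1);
		ret = ioctl(vm_fd, KVM_PPC_RTAS_DEFINE_TOKEN, &args);
		if (ret < 0)
			return ret;	/* kernel doesn't recognize the name */
	}
	return 0;
}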

Paul.
--
To 

Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Scott Wood

On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:


  /*
   * VFIO_PAMU_MAP_MSI_BANK
   *
   * Maps the MSI bank at the specified index and iova.  User space  
must
   * call this ioctl once for each MSI bank (count of banks is  
returned by

   * VFIO_IOMMU_GET_MSI_BANK_COUNT).
   * Caller provides struct vfio_pamu_msi_bank_map with all fields  
set.

   * Operates on VFIO file descriptor (/dev/vfio/vfio).
   * Return: 0 on success, -errno on failure
   */
 
  struct vfio_pamu_msi_bank_map {
  __u32   argsz;
  __u32   msi_bank_index;  /* the index of the MSI bank */
  __u64   iova;  /* the iova the bank is to be mapped  
to */

  };

 Again, flags.  If we dynamically add or remove devices from a  
container
 the bank count can change, right?  If bank count goes from 2 to 3,  
does
 userspace know assume the new bank is [2]?  If bank count goes from  
3 to
 2, which index was that?  If removing a device removes an MSI bank  
then

 vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by  
VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (number will mostly likely  
always be

3 or 4).  So it won't change as devices are added or removed.


It should be the actual number of banks used.  This is required if  
we're going to have userspace do the iteration and specify the exact  
iovas to use -- and even if we aren't going to do that, it would be  
more restrictive on available iova-space than is necessary.  Usually  
there will be only one bank in the group.


Actually mapping all of the MSI banks, all the time, would completely  
negate the point of using the separate alias pages.


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248


 -Original Message-
 From: Wood Scott-B07421
 Sent: Thursday, April 04, 2013 5:52 PM
 To: Yoder Stuart-B08248
 Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
 Sethi Varun-B16395;
 kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
 Subject: Re: RFC: vfio API changes needed for powerpc (v2)
 
 On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
 
/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space
  must
 * call this ioctl once for each MSI bank (count of banks is
  returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields
  set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
   
struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped
  to */
};
  
   Again, flags.  If we dynamically add or remove devices from a
  container
   the bank count can change, right?  If bank count goes from 2 to 3,
  does
   userspace know assume the new bank is [2]?  If bank count goes from
  3 to
   2, which index was that?  If removing a device removes an MSI bank
  then
   vfio-pamu will automatically do the unmap, right?
 
  My assumption is that the bank count returned by
  VFIO_IOMMU_GET_MSI_BANK_COUNT
  is the max bank count for a platform.  (number will mostly likely
  always be
  3 or 4).  So it won't change as devices are added or removed.
 
 It should be the actual number of banks used.  This is required if
 we're going to have userspace do the iteration and specify the exact
 iovas to use -- and even if we aren't going to do that, it would be
 more restrictive on available iova-space than is necessary.  Usually
 there will be only one bank in the group.
 
 Actually mapping all of the MSI banks, all the time, would completely
 negate the point of using the separate alias pages.

The geometry, windows, DMA mappings, etc is set on a 'container'
which may have multiple groups in it.  So user space needs to
determine the total number of MSI windows needed when setting
the geometry and window count.

In the flow you proposed:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i  count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);

...that count has to be the total, not the count for 1 of
N possible groups.   So the get count ioctl is not done on
a group.

However, like you pointed out we don't want to negate isolation 
of the separate alias pages.  All this API is doing is telling
the kernel which windows to use for which MSI banks.   It's up
to the kernel to actually enable them as needed.

Say 3 MSI banks exist.  If there are no groups added to the
vfio container and all 3 MAP_MSI_BANK calls occurred
the picture may look like this (based on my earlier example):

     win                 gphys/
#    enabled  iova       phys          size
---  -------  ---------  ------------  ----
5    N        0x1000     0xf_fe044000  4KB   // msi bank 1
6    N        0x1400     0xf_fe045000  4KB   // msi bank 2
7    N        0x1800     0xf_fe046000  4KB   // msi bank 3

User space adds 2 groups that use bank 1:

     win                 gphys/
#    enabled  iova       phys          size
---  -------  ---------  ------------  ----
5    Y        0x1000     0xf_fe044000  4KB   // msi bank 1
6    N        0x1400     0xf_fe045000  4KB   // msi bank 2
7    N        0x1800     0xf_fe046000  4KB   // msi bank 3

User space adds another group that uses bank 3:

     win                 gphys/
#    enabled  iova       phys          size
---  -------  ---------  ------------  ----
5    Y        0x1000     0xf_fe044000  4KB   // msi bank 1
6    N        0x1400     0xf_fe045000  4KB   // msi bank 2
7    Y        0x1800     0xf_fe046000  4KB   // msi bank 3

User space doesn't need to care what is actually enabled,
it just needs to tell the kernel which windows to use
and the kernel can take care of the rest.
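
A kernel-side sketch of that bookkeeping could be as simple as a per-bank
use count (purely illustrative, not actual vfio code):

struct pamu_msi_bank {
	unsigned int users;		/* groups in the container using this bank */
	int win_enabled;
};

static void msi_bank_get(struct pamu_msi_bank *bank)
{
	if (bank->users++ == 0) {
		/* program and enable the PAMU window for this bank here */
		bank->win_enabled = 1;
	}
}

static void msi_bank_put(struct pamu_msi_bank *bank)
{
	if (--bank->users == 0) {
		/* last user gone: disable the window again */
		bank->win_enabled = 0;
	}
}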

Stuart


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Christoffer Dall
[...]

  to give us some idea how much performance we would gain from each 
  approach? Throughput should be completely unaffected anyway, since virtio 
  just coalesces kicks internally.
 
  Latency is dominated by the scheduling latency.
  This means virtio-net is not the best benchmark.

 So what is a good benchmark?

 E.g. ping pong stress will do but need to look at CPU utilization,
 that's what is affected, not latency.

 Is there any difference in speed at all? I strongly doubt it. One of 
 virtio's main points is to reduce the number of kicks.

 For this stage of the project I think microbenchmarks are more appropriate.
 Doubling the price of exit is likely to be measureable. 30 cycles likely
 not ...

I don't quite understand this point here. If we don't have anything
real-world where we can measure a decent difference, then why are we
doing this? I would agree with Alex that the three test scenarios
proposed by him should be tried out before adding this complexity,
measured in CPU utilization or latency as you wish.

FWIW, ARM always uses MMIO and provides hardware decoding of all sane
(not user register-writeback) instruction, but the hypercall vs. mmio
looks like this:

hvc:  4,917
mmio_kernel: 6,248

But I doubt that an hvc wrapper around mmio decoding would take care
of all this difference, because the mmio operation needs to do other
work not realated to emulating the instruction in software, which
you'd have to do for an hvc anyway (populate kvm_mmio structure etc.)

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-04 Thread Scott Wood

On 04/04/2013 12:59:02 AM, Gleb Natapov wrote:

On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
 KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
 the main kvm.h.  The arch header would be OK, since the non-arch
 header includes the arch header, and thus it wouldn't be visible to
 userspace where it is -- if there later is a need for MPIC (or
 whatever other device follows MPIC's example) on another
 architecture, it could be moved without breaking anything.  Or, we
 could just have a header for each device type.

If device will be used by more then one arch it will move into  
virt/kvm

and will have its own header, like ioapic.


virt/kvm/ioapic.h is not uapi.  The ioapic uapi component (e.g. struct  
kvm_ioapic_state) is duplicated between x86 and ia64, which is the sort  
of thing I'd like to avoid.  I'm OK with putting it in the PPC header  
if, upon a later need for multi-architecture support, it could move  
into either the main uapi header or a separate uapi header that the  
main uapi header includes (i.e. no userspace-visible change in which  
header needs to be included).


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Geoff Levand
Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
these with ARCH=arm64:

arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/arm/kvm/arm.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 5a93698..49b4bc6 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -24,6 +24,7 @@
 #include linux/fs.h
 #include linux/mman.h
 #include linux/sched.h
+#include linux/cpu.h
 #include linux/kvm.h
 #include trace/events/kvm.h
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] kvm/arm: Remove unused kvm_arch_set_memory_region

2013-04-04 Thread Geoff Levand
Remove the unused and empty routine kvm_arch_set_memory_region().

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/arm/kvm/arm.c |8 
 1 file changed, 8 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 86becdc..a67b798 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -221,14 +221,6 @@ long kvm_arch_dev_ioctl(struct file *filp,
return -EINVAL;
 }
 
-int kvm_arch_set_memory_region(struct kvm *kvm,
-  struct kvm_userspace_memory_region *mem,
-  struct kvm_memory_slot old,
-  int user_alloc)
-{
-   return 0;
-}
-
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_memory_slot *memslot,
   struct kvm_memory_slot old,
-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/7] KVM minor fixups

2013-04-04 Thread Geoff Levand
Hi Marcelo,

These are a few fixups I found when running sparse and building Marc's 64 bit
ARM tree.  Please consider for 3.10.

-Geoff

The following changes since commit 07961ac7c0ee8b546658717034fe692fd12eefa9:

  Linux 3.9-rc5 (2013-03-31 15:12:43 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/kvm.git for-kvm

for you to fetch changes up to 5d9e775fd2d7c9ab03d70175610a5d8cedcc3799:

  kvm/arm: Remove unused kvm_arch_set_memory_region (2013-04-04 16:08:28 -0700)


Geoff Levand (7):
  kvm: Make local routines static
  kvm: Move vm_list, kvm_lock dec's out of x86
  kvm: Move kvm_spurious_fault dec out of x86
  kvm: Move kvm_rebooting dec out of x86
  kvm/arm: Fix missing include build error
  kvm/arm: Make force_vm_exit static
  kvm/arm: Remove unused kvm_arch_set_memory_region

 arch/arm/include/asm/kvm_host.h |1 -
 arch/arm/kvm/arm.c  |   11 ++-
 arch/x86/include/asm/kvm_host.h |   11 ---
 include/linux/kvm_host.h|   11 +++
 virt/kvm/kvm_main.c |8 
 5 files changed, 17 insertions(+), 25 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] kvm: Move kvm_rebooting dec out of x86

2013-04-04 Thread Geoff Levand
The variable kvm_rebooting is a common kvm variable, so move
its declaration from arch/x86/include/asm/kvm_host.h to
arch/arm/include/asm/kvm_host.h.

Fixes sparse warning when building on arm64:

  virt/kvm/kvm_main.c:warning: symbol 'kvm_rebooting' was not declared. Should 
it be static?

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/x86/include/asm/kvm_host.h |2 --
 include/linux/kvm_host.h|1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da7c126..225139a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -964,8 +964,6 @@ enum {
 #define HF_IRET_MASK   (1  4)
 #define HF_GUEST_MASK  (1  5) /* VCPU is in guest-mode */
 
-extern bool kvm_rebooting;
-
 #define kvm_handle_fault_on_reboot(insn, cleanup_insn) \
666:  insn \n\t \
668: \n\t   \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4bfb062..7165bd8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1037,6 +1037,7 @@ static inline bool kvm_check_request(int req, struct 
kvm_vcpu *vcpu)
  * Trap the fault and ignore the instruction if that happens.
  */
 asmlinkage void kvm_spurious_fault(void);
+extern bool kvm_rebooting;
 
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/7] kvm: Move kvm_spurious_fault dec out of x86

2013-04-04 Thread Geoff Levand
The routine kvm_spurious_fault() is a common kvm routine, so
move its declaration from arch/x86/include/asm/kvm_host.h to
arch/arm/include/asm/kvm_host.h.

Fixes sparse warning when building on arm64:

  virt/kvm/kvm_main.c:warning: symbol 'kvm_spurious_fault' was not declared. 
Should it be static?

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/x86/include/asm/kvm_host.h |6 --
 include/linux/kvm_host.h|7 +++
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40fc39f..da7c126 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -964,12 +964,6 @@ enum {
 #define HF_IRET_MASK   (1  4)
 #define HF_GUEST_MASK  (1  5) /* VCPU is in guest-mode */
 
-/*
- * Hardware virtualization extension instructions may fault if a
- * reboot turns off virtualization while processes are running.
- * Trap the fault and ignore the instruction if that happens.
- */
-asmlinkage void kvm_spurious_fault(void);
 extern bool kvm_rebooting;
 
 #define kvm_handle_fault_on_reboot(insn, cleanup_insn) \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ef2212..4bfb062 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1031,6 +1031,13 @@ static inline bool kvm_check_request(int req, struct 
kvm_vcpu *vcpu)
}
 }
 
+/*
+ * Hardware virtualization extension instructions may fault if a
+ * reboot turns off virtualization while processes are running.
+ * Trap the fault and ignore the instruction if that happens.
+ */
+asmlinkage void kvm_spurious_fault(void);
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] kvm/arm: Make force_vm_exit static

2013-04-04 Thread Geoff Levand
The routine force_vm_exit() is not referenced outside kvm/arm.c,
so make it have static linkage.

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/arm/include/asm/kvm_host.h |1 -
 arch/arm/kvm/arm.c  |2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index d1736a5..1e93cef 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -149,7 +149,6 @@ struct kvm_one_reg;
 int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 u64 kvm_call_hyp(void *hypfn, ...);
-void force_vm_exit(const cpumask_t *mask);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 49b4bc6..86becdc 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -400,7 +400,7 @@ static void exit_vm_noop(void *info)
 {
 }
 
-void force_vm_exit(const cpumask_t *mask)
+static void force_vm_exit(const cpumask_t *mask)
 {
smp_call_function_many(mask, exit_vm_noop, NULL, true);
 }
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] kvm: Move vm_list, kvm_lock dec's out of x86

2013-04-04 Thread Geoff Levand
The variables vm_list and kvm_lock are common to all architectures, so move
the declarations from arch/x86/include/asm/kvm_host.h to
include/linux/kvm_host.h.

Fixes sparse warnings like these when building for arm64:

  virt/kvm/kvm_main.c: warning: symbol 'kvm_lock' was not declared. Should it 
be static?
  virt/kvm/kvm_main.c: warning: symbol 'vm_list' was not declared. Should it be 
static?

Signed-off-by: Geoff Levand ge...@infradead.org
---
 arch/x86/include/asm/kvm_host.h |3 ---
 include/linux/kvm_host.h|3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..40fc39f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -94,9 +94,6 @@
 
 #define ASYNC_PF_PER_VCPU 64
 
-extern raw_spinlock_t kvm_lock;
-extern struct list_head vm_list;
-
 struct kvm_vcpu;
 struct kvm;
 struct kvm_async_pf;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..8ef2212 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -133,6 +133,9 @@ struct kvm;
 struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
 
+extern raw_spinlock_t kvm_lock;
+extern struct list_head vm_list;
+
 struct kvm_io_range {
gpa_t addr;
int len;
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] kvm: Make local routines static

2013-04-04 Thread Geoff Levand
The routines get_user_page_nowait(), kvm_io_bus_sort_cmp(), 
kvm_io_bus_insert_dev()
and kvm_io_bus_get_first_dev()  are only referenced within kvm_main.c, so give 
them
static linkage.

Fixes sparse warnings like these:

  virt/kvm/kvm_main.c: warning: symbol 'get_user_page_nowait' was not declared. 
Should it be static?

Signed-off-by: Geoff Levand ge...@infradead.org
---
 virt/kvm/kvm_main.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adc68fe..82ca8e2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1099,7 +1099,7 @@ static int kvm_read_hva_atomic(void *data, void __user 
*hva, int len)
return __copy_from_user_inatomic(data, hva, len);
 }
 
-int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
+static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int write, struct page **page)
 {
int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
@@ -2631,7 +2631,7 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
kfree(bus);
 }
 
-int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
+static int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 {
const struct kvm_io_range *r1 = p1;
const struct kvm_io_range *r2 = p2;
@@ -2643,7 +2643,7 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
return 0;
 }
 
-int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
+static int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device 
*dev,
  gpa_t addr, int len)
 {
bus-range[bus-dev_count++] = (struct kvm_io_range) {
@@ -2658,7 +2658,7 @@ int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct 
kvm_io_device *dev,
return 0;
 }
 
-int kvm_io_bus_get_first_dev(struct kvm_io_bus *bus,
+static int kvm_io_bus_get_first_dev(struct kvm_io_bus *bus,
 gpa_t addr, int len)
 {
struct kvm_io_range *range, key;
-- 
1.7.9.5


--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 1/6] kvm: add device control API

2013-04-04 Thread Scott Wood

On 04/04/2013 05:41:35 AM, Gleb Natapov wrote:

On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
 +struct kvm_device_attr {
 +  __u32   flags;  /* no flags currently defined */
 +  __u32   group;  /* device-defined */
 +  __u64   attr;   /* group-defined */
 +  __u64   addr;   /* userspace address of attr data */
 +};
 +
Since now each device has its own fd is it an advantage to enforce
common interface between different devices?


I think so, even if only to avoid repeating the various pains  
surrounding adding ioctls.  Not necessarily enforce, just enable.  If  
a device has some sort of command that does not fit neatly into the  
set or get model, it could still add a new ioctl.


If we do so though why not handle file creation, ioctl and file  
descriptor lifetime in the

common code. Common code will have struct kvm_device with struct
kvm_device_arch and struct kvm_device_ops members. Instead of
kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch  
ioctls

to a device using kvm_device-ops-(set|get|has)_attr pointers.


So make it more like the pre-fd version, except for the actual fd  
usage?  It would make destruction a bit simpler (assuming there's no  
need for vcpu destruction code to access a device).  Hopefully nobody  
asks me to change it back again, though. :-)
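
As a sketch of the common dispatch Gleb describes (the ioctl and type
names here are assumptions for illustration, not a settled API):

struct kvm_device;

struct kvm_device_ops {
	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
	void (*destroy)(struct kvm_device *dev);
};

struct kvm_device {
	const struct kvm_device_ops *ops;
	struct kvm *kvm;
	void *private;			/* e.g. the MPIC state */
};

static long kvm_device_ioctl_attr(struct kvm_device *dev, unsigned int ioctl,
				  struct kvm_device_attr *attr)
{
	switch (ioctl) {
	case KVM_SET_DEVICE_ATTR:
		return dev->ops->set_attr(dev, attr);
	case KVM_GET_DEVICE_ATTR:
		return dev->ops->get_attr(dev, attr);
	case KVM_HAS_DEVICE_ATTR:
		return dev->ops->has_attr(dev, attr);
	default:
		return -ENOTTY;
	}
}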


-Scott
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand ge...@infradead.org wrote:
 Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
 these with ARCH=arm64:

 arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared

 Signed-off-by: Geoff Levand ge...@infradead.org
 ---
  arch/arm/kvm/arm.c |1 +
  1 file changed, 1 insertion(+)

 diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
 index 5a93698..49b4bc6 100644
 --- a/arch/arm/kvm/arm.c
 +++ b/arch/arm/kvm/arm.c
 @@ -24,6 +24,7 @@
  #include linux/fs.h
  #include linux/mman.h
  #include linux/sched.h
 +#include linux/cpu.h
  #include linux/kvm.h
  #include trace/events/kvm.h

 --


I'm confused, I don't see this symbol in arm.c - against which tree do
these patches apply exactly? If this is something introduced by arm64,
then it's premature, and should be added to that series.

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] KVM minor fixups

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand ge...@infradead.org wrote:
 Hi Marcelo,

 These are a few fixups I found when running sparse and building Marc's 64 bit
 ARM tree.  Please consider for 3.10.

running sparse - what does this mean?

-Christoffer
--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

