Re: [Qemu-devel] qemu crashed when starting vm(kvm) with vnc connect

2013-04-04 Thread Stefan Hajnoczi
On Tue, Apr 02, 2013 at 09:02:02AM +, Zhanghaoyu (A) wrote:
> I start a kvm VM with vnc(using the zrle protocol) connect, sometimes qemu 
> program crashed during starting period, received signal SIGABRT.
> Trying about 20 times, this crash may be reproduced.
> I guess the cause memory corruption or double free.

Which version of QEMU are you running?

Please try qemu.git/master.

Stefan
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Alexander Graf


Am 05.04.2013 um 00:35 schrieb Scott Wood :

> On 04/04/2013 05:30:05 PM, Alexander Graf wrote:
>> Am 04.04.2013 um 20:41 schrieb Scott Wood :
>> > On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
>> >> On 03.04.2013, at 03:57, Scott Wood wrote:
>> >> > +if (opp->mpic_mode_mask == GCR_MODE_PROXY)
>> >> Shouldn't this be an &?
>> >
>> > The way the mode field was originally documented was a two-bit field, 
>> > where 0b11 was external proxy, and 0b10 was reserved.  If we use & it 
>> > would have to be:
>> >
>> >if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
>> >...
>> >
>> > Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true in 
>> > the case of GCR_MODE_MIXED.
>> >
>> > In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that is 
>> > ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a bit 
>> > more legitimate to view it as a bitmap.  Still, I doubt we'll see new mode 
>> > bits.
>> Ok, please add a comment about this here then :).
> 
> What sort of comment would you like?  Or do you want me to use the "(x & y) 
> == y" version?

/* This might need to be changed if GCR gets extended */

> 
>> >> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>> >> >tasklet_kill(&vcpu->arch.tasklet);
>> >> >
>> >> >kvmppc_remove_vcpu_debugfs(vcpu);
>> >> > +
>> >> > +switch (vcpu->arch.irq_type) {
>> >> > +case KVMPPC_IRQ_MPIC:
>> >> > +kvmppc_mpic_put(vcpu->arch.mpic);
>> >> This doesn't tell the MPIC that this exact CPU is getting killed. What if 
>> >> we hotplug remove just a single CPU? Don't we have to deregister the CPU 
>> >> with the MPIC?
>> >
>> > If we ever support hot vcpu removal, yes.  We'd probably need some MPIC 
>> > code changes to accommodate that, and we wouldn't currently have a way to 
>> > test it, so I'd rather make it obviously not supported for now.
>> Is there any way to break heavily if user space attempts this?
> 
> Is there any way for userspace to request this currently?  They can close the 
> vcpu fd, but the vcpu won't actually be destroyed until the vm goes down.

Are you sure? X86 does CPU hotplug today, so there has to be something :)

Alex

> 
> -Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 1/6] kvm: add device control API

2013-04-04 Thread Paul Mackerras
On Thu, Apr 04, 2013 at 01:41:35PM +0300, Gleb Natapov wrote:

> Since now each device has its own fd is it an advantage to enforce
> common interface between different devices? If we do so though why
> not handle file creation, ioctl and file descriptor lifetime in the
> common code. Common code will have "struct kvm_device" with "struct
> kvm_device_arch" and "struct kvm_device_ops" members. Instead of
> kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch ioctls
> to a device using kvm_device->ops->(set|get|has)_attr pointers.

I thought about making the same request, but when I looked at it, the
amount of code that could be made common in this way is pretty tiny,
and doing that involves a bit of extra complexity, so I thought that
on the whole it wouldn't be worthwhile.

Paul.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Geoff Levand
Hi,

On Thu, 2013-04-04 at 17:15 -0700, Christoffer Dall wrote:
> On Thu, Apr 4, 2013 at 5:04 PM, Geoff Levand  wrote:
> 
> > Sorry, this one is from Marc's kvm-for-next branch.  I'll send it
> > to him to include, unless you just want to take it in preparation.
> >
> Hmm, not sure what the status of his branch is exactly. Does it
> contain the arm64 stuff?

Yes, it is pretty much his most recent.

> I don't really want to merge this before it's needed.

OK, that's fine.  I'll just add it to a few other 64 bit specific
fixups I'll be sending to Marc.

-Geoff




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 5:04 PM, Geoff Levand  wrote:
> Hi Christoffer,
>
> On Thu, 2013-04-04 at 16:51 -0700, Christoffer Dall wrote:
>> On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
>> > Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
>> > these with ARCH=arm64:
>> >
>> > arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared
>> >
>> > Signed-off-by: Geoff Levand 
>> > ---
>> >  arch/arm/kvm/arm.c |1 +
>> >  1 file changed, 1 insertion(+)
>> >
>> > diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
>> > index 5a93698..49b4bc6 100644
>> > --- a/arch/arm/kvm/arm.c
>> > +++ b/arch/arm/kvm/arm.c
>> > @@ -24,6 +24,7 @@
>> >  #include 
>> >  #include 
>> >  #include 
>> > +#include 
>> >  #include 
>> >  #include 
>> >
>> > --
>> >
>>
>> I'm confused, I don't see this symbol in arm.c - against which tree do
>> these patches apply exactly? If this is something introduced by arm64,
>> then it's premature, and should be added to that series.
>
> Sorry, this one is from Marc's kvm-for-next branch.  I'll send it
> to him to include, unless you just want to take it in preparation.
>
Hmm, not sure what the status of his branch is exactly. Does it
contain the arm64 stuff?

I don't really want to merge this before it's needed.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Geoff Levand
Hi Christoffer,

On Thu, 2013-04-04 at 16:51 -0700, Christoffer Dall wrote:
> On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> > Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
> > these with ARCH=arm64:
> >
> > arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared
> >
> > Signed-off-by: Geoff Levand 
> > ---
> >  arch/arm/kvm/arm.c |1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> > index 5a93698..49b4bc6 100644
> > --- a/arch/arm/kvm/arm.c
> > +++ b/arch/arm/kvm/arm.c
> > @@ -24,6 +24,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >
> > --
> >
> 
> I'm confused, I don't see this symbol in arm.c - against which tree do
> these patches apply exactly? If this is something introduced by arm64,
> then it's premature, and should be added to that series.

Sorry, this one is from Marc's kvm-for-next branch.  I'll send it
to him to include, unless you just want to take it in preparation.

-Geoff

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] KVM minor fixups

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:57 PM, Geoff Levand  wrote:
> Hi Christoffer,
>
> On Thu, 2013-04-04 at 16:53 -0700, Christoffer Dall wrote:
>> On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
>> > Hi Marcelo,
>> >
>> > These are a few fixups I found when running sparse and building Marc's 64 
>> > bit
>> > ARM tree.  Please consider for 3.10.
>>
>> "running sparse" - what does this mean?
>
> make C=1.  It runs the checker, by default, sparse.
>
> http://en.wikipedia.org/wiki/Sparse
>
got it, thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 7/7] kvm/arm: Remove unused kvm_arch_set_memory_region

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> Remove the unused and empty routine kvm_arch_set_memory_region().
>
> Signed-off-by: Geoff Levand 
> ---
>  arch/arm/kvm/arm.c |8 
>  1 file changed, 8 deletions(-)
>
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 86becdc..a67b798 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -221,14 +221,6 @@ long kvm_arch_dev_ioctl(struct file *filp,
> return -EINVAL;
>  }
>
> -int kvm_arch_set_memory_region(struct kvm *kvm,
> -  struct kvm_userspace_memory_region *mem,
> -  struct kvm_memory_slot old,
> -  int user_alloc)
> -{
> -   return 0;
> -}
> -
>  int kvm_arch_prepare_memory_region(struct kvm *kvm,
>struct kvm_memory_slot *memslot,
>struct kvm_memory_slot old,
> --
> 1.7.9.5
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

and I can take this one too.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] KVM minor fixups

2013-04-04 Thread Geoff Levand
Hi Christoffer,

On Thu, 2013-04-04 at 16:53 -0700, Christoffer Dall wrote:
> On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> > Hi Marcelo,
> >
> > These are a few fixups I found when running sparse and building Marc's 64 
> > bit
> > ARM tree.  Please consider for 3.10.
> 
> "running sparse" - what does this mean?

make C=1.  It runs the checker, by default, sparse.

http://en.wikipedia.org/wiki/Sparse

-Geoff

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] kvm/arm: Make force_vm_exit static

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> The routine force_vm_exit() is not referenced outside kvm/arm.c,
> so make it have static linkage.
>
> Signed-off-by: Geoff Levand 
> ---
>  arch/arm/include/asm/kvm_host.h |1 -
>  arch/arm/kvm/arm.c  |2 +-
>  2 files changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> index d1736a5..1e93cef 100644
> --- a/arch/arm/include/asm/kvm_host.h
> +++ b/arch/arm/include/asm/kvm_host.h
> @@ -149,7 +149,6 @@ struct kvm_one_reg;
>  int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>  int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
>  u64 kvm_call_hyp(void *hypfn, ...);
> -void force_vm_exit(const cpumask_t *mask);
>
>  #define KVM_ARCH_WANT_MMU_NOTIFIER
>  struct kvm;
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 49b4bc6..86becdc 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -400,7 +400,7 @@ static void exit_vm_noop(void *info)
>  {
>  }
>
> -void force_vm_exit(const cpumask_t *mask)
> +static void force_vm_exit(const cpumask_t *mask)
>  {
> smp_call_function_many(mask, exit_vm_noop, NULL, true);
>  }
> --
> 1.7.9.5
>
>
I can take this one.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/7] KVM minor fixups

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> Hi Marcelo,
>
> These are a few fixups I found when running sparse and building Marc's 64 bit
> ARM tree.  Please consider for 3.10.

"running sparse" - what does this mean?

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 4:33 PM, Geoff Levand  wrote:
> Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
> these with ARCH=arm64:
>
> arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared
>
> Signed-off-by: Geoff Levand 
> ---
>  arch/arm/kvm/arm.c |1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
> index 5a93698..49b4bc6 100644
> --- a/arch/arm/kvm/arm.c
> +++ b/arch/arm/kvm/arm.c
> @@ -24,6 +24,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>
> --
>

I'm confused, I don't see this symbol in arm.c - against which tree do
these patches apply exactly? If this is something introduced by arm64,
then it's premature, and should be added to that series.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 1/6] kvm: add device control API

2013-04-04 Thread Scott Wood

On 04/04/2013 05:41:35 AM, Gleb Natapov wrote:

On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> +struct kvm_device_attr {
> +  __u32   flags;  /* no flags currently defined */
> +  __u32   group;  /* device-defined */
> +  __u64   attr;   /* group-defined */
> +  __u64   addr;   /* userspace address of attr data */
> +};
> +
Since now each device has its own fd is it an advantage to enforce
common interface between different devices?


I think so, even if only to avoid repeating the various pains  
surrounding adding ioctls.  Not necessarily "enforce", just enable.  If  
a device has some sort of command that does not fit neatly into the  
"set or get" model, it could still add a new ioctl.


If we do so though why not handle file creation, ioctl and file  
descriptor lifetime in the

common code. Common code will have "struct kvm_device" with "struct
kvm_device_arch" and "struct kvm_device_ops" members. Instead of
kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch  
ioctls

to a device using kvm_device->ops->(set|get|has)_attr pointers.


So make it more like the pre-fd version, except for the actual fd  
usage?  It would make destruction a bit simpler (assuming there's no  
need for vcpu destruction code to access a device).  Hopefully nobody  
asks me to change it back again, though. :-)


-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/7] kvm: Make local routines static

2013-04-04 Thread Geoff Levand
The routines get_user_page_nowait(), kvm_io_bus_sort_cmp(), 
kvm_io_bus_insert_dev()
and kvm_io_bus_get_first_dev()  are only referenced within kvm_main.c, so give 
them
static linkage.

Fixes sparse warnings like these:

  virt/kvm/kvm_main.c: warning: symbol 'get_user_page_nowait' was not declared. 
Should it be static?

Signed-off-by: Geoff Levand 
---
 virt/kvm/kvm_main.c |8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index adc68fe..82ca8e2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1099,7 +1099,7 @@ static int kvm_read_hva_atomic(void *data, void __user 
*hva, int len)
return __copy_from_user_inatomic(data, hva, len);
 }
 
-int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
+static int get_user_page_nowait(struct task_struct *tsk, struct mm_struct *mm,
unsigned long start, int write, struct page **page)
 {
int flags = FOLL_TOUCH | FOLL_NOWAIT | FOLL_HWPOISON | FOLL_GET;
@@ -2631,7 +2631,7 @@ static void kvm_io_bus_destroy(struct kvm_io_bus *bus)
kfree(bus);
 }
 
-int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
+static int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
 {
const struct kvm_io_range *r1 = p1;
const struct kvm_io_range *r2 = p2;
@@ -2643,7 +2643,7 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2)
return 0;
 }
 
-int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev,
+static int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device 
*dev,
  gpa_t addr, int len)
 {
bus->range[bus->dev_count++] = (struct kvm_io_range) {
@@ -2658,7 +2658,7 @@ int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct 
kvm_io_device *dev,
return 0;
 }
 
-int kvm_io_bus_get_first_dev(struct kvm_io_bus *bus,
+static int kvm_io_bus_get_first_dev(struct kvm_io_bus *bus,
 gpa_t addr, int len)
 {
struct kvm_io_range *range, key;
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/7] kvm: Move vm_list, kvm_lock dec's out of x86

2013-04-04 Thread Geoff Levand
The variables vm_list and kvm_lock are common to all architectures, so move
the declarations from arch/x86/include/asm/kvm_host.h to
include/linux/kvm_host.h.

Fixes sparse warnings like these when building for arm64:

  virt/kvm/kvm_main.c: warning: symbol 'kvm_lock' was not declared. Should it 
be static?
  virt/kvm/kvm_main.c: warning: symbol 'vm_list' was not declared. Should it be 
static?

Signed-off-by: Geoff Levand 
---
 arch/x86/include/asm/kvm_host.h |3 ---
 include/linux/kvm_host.h|3 +++
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..40fc39f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -94,9 +94,6 @@
 
 #define ASYNC_PF_PER_VCPU 64
 
-extern raw_spinlock_t kvm_lock;
-extern struct list_head vm_list;
-
 struct kvm_vcpu;
 struct kvm;
 struct kvm_async_pf;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..8ef2212 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -133,6 +133,9 @@ struct kvm;
 struct kvm_vcpu;
 extern struct kmem_cache *kvm_vcpu_cache;
 
+extern raw_spinlock_t kvm_lock;
+extern struct list_head vm_list;
+
 struct kvm_io_range {
gpa_t addr;
int len;
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 6/7] kvm/arm: Make force_vm_exit static

2013-04-04 Thread Geoff Levand
The routine force_vm_exit() is not referenced outside kvm/arm.c,
so make it have static linkage.

Signed-off-by: Geoff Levand 
---
 arch/arm/include/asm/kvm_host.h |1 -
 arch/arm/kvm/arm.c  |2 +-
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index d1736a5..1e93cef 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -149,7 +149,6 @@ struct kvm_one_reg;
 int kvm_arm_get_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 int kvm_arm_set_reg(struct kvm_vcpu *vcpu, const struct kvm_one_reg *reg);
 u64 kvm_call_hyp(void *hypfn, ...);
-void force_vm_exit(const cpumask_t *mask);
 
 #define KVM_ARCH_WANT_MMU_NOTIFIER
 struct kvm;
diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 49b4bc6..86becdc 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -400,7 +400,7 @@ static void exit_vm_noop(void *info)
 {
 }
 
-void force_vm_exit(const cpumask_t *mask)
+static void force_vm_exit(const cpumask_t *mask)
 {
smp_call_function_many(mask, exit_vm_noop, NULL, true);
 }
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/7] kvm: Move kvm_spurious_fault dec out of x86

2013-04-04 Thread Geoff Levand
The routine kvm_spurious_fault() is a common kvm routine, so
move its declaration from arch/x86/include/asm/kvm_host.h to
arch/arm/include/asm/kvm_host.h.

Fixes sparse warning when building on arm64:

  virt/kvm/kvm_main.c:warning: symbol 'kvm_spurious_fault' was not declared. 
Should it be static?

Signed-off-by: Geoff Levand 
---
 arch/x86/include/asm/kvm_host.h |6 --
 include/linux/kvm_host.h|7 +++
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 40fc39f..da7c126 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -964,12 +964,6 @@ enum {
 #define HF_IRET_MASK   (1 << 4)
 #define HF_GUEST_MASK  (1 << 5) /* VCPU is in guest-mode */
 
-/*
- * Hardware virtualization extension instructions may fault if a
- * reboot turns off virtualization while processes are running.
- * Trap the fault and ignore the instruction if that happens.
- */
-asmlinkage void kvm_spurious_fault(void);
 extern bool kvm_rebooting;
 
 #define kvm_handle_fault_on_reboot(insn, cleanup_insn) \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 8ef2212..4bfb062 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1031,6 +1031,13 @@ static inline bool kvm_check_request(int req, struct 
kvm_vcpu *vcpu)
}
 }
 
+/*
+ * Hardware virtualization extension instructions may fault if a
+ * reboot turns off virtualization while processes are running.
+ * Trap the fault and ignore the instruction if that happens.
+ */
+asmlinkage void kvm_spurious_fault(void);
+
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
 static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/7] kvm: Move kvm_rebooting dec out of x86

2013-04-04 Thread Geoff Levand
The variable kvm_rebooting is a common kvm variable, so move
its declaration from arch/x86/include/asm/kvm_host.h to
arch/arm/include/asm/kvm_host.h.

Fixes sparse warning when building on arm64:

  virt/kvm/kvm_main.c:warning: symbol 'kvm_rebooting' was not declared. Should 
it be static?

Signed-off-by: Geoff Levand 
---
 arch/x86/include/asm/kvm_host.h |2 --
 include/linux/kvm_host.h|1 +
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index da7c126..225139a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -964,8 +964,6 @@ enum {
 #define HF_IRET_MASK   (1 << 4)
 #define HF_GUEST_MASK  (1 << 5) /* VCPU is in guest-mode */
 
-extern bool kvm_rebooting;
-
 #define kvm_handle_fault_on_reboot(insn, cleanup_insn) \
"666: " insn "\n\t" \
"668: \n\t"   \
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4bfb062..7165bd8 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1037,6 +1037,7 @@ static inline bool kvm_check_request(int req, struct 
kvm_vcpu *vcpu)
  * Trap the fault and ignore the instruction if that happens.
  */
 asmlinkage void kvm_spurious_fault(void);
+extern bool kvm_rebooting;
 
 #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/7] KVM minor fixups

2013-04-04 Thread Geoff Levand
Hi Marcelo,

These are a few fixups I found when running sparse and building Marc's 64 bit
ARM tree.  Please consider for 3.10.

-Geoff

The following changes since commit 07961ac7c0ee8b546658717034fe692fd12eefa9:

  Linux 3.9-rc5 (2013-03-31 15:12:43 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/geoff/kvm.git for-kvm

for you to fetch changes up to 5d9e775fd2d7c9ab03d70175610a5d8cedcc3799:

  kvm/arm: Remove unused kvm_arch_set_memory_region (2013-04-04 16:08:28 -0700)


Geoff Levand (7):
  kvm: Make local routines static
  kvm: Move vm_list, kvm_lock dec's out of x86
  kvm: Move kvm_spurious_fault dec out of x86
  kvm: Move kvm_rebooting dec out of x86
  kvm/arm: Fix missing include build error
  kvm/arm: Make force_vm_exit static
  kvm/arm: Remove unused kvm_arch_set_memory_region

 arch/arm/include/asm/kvm_host.h |1 -
 arch/arm/kvm/arm.c  |   11 ++-
 arch/x86/include/asm/kvm_host.h |   11 ---
 include/linux/kvm_host.h|   11 +++
 virt/kvm/kvm_main.c |8 
 5 files changed, 17 insertions(+), 25 deletions(-)

-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 5/7] kvm/arm: Fix missing include build error

2013-04-04 Thread Geoff Levand
Include linux/cpu.h in kvm/arm.c.  Fixes build errors like
these with ARCH=arm64:

arch/arm/kvm/arm.c: error: ‘CPU_STARTING_FROZEN’ undeclared

Signed-off-by: Geoff Levand 
---
 arch/arm/kvm/arm.c |1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 5a93698..49b4bc6 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
-- 
1.7.9.5


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 7/7] kvm/arm: Remove unused kvm_arch_set_memory_region

2013-04-04 Thread Geoff Levand
Remove the unused and empty routine kvm_arch_set_memory_region().

Signed-off-by: Geoff Levand 
---
 arch/arm/kvm/arm.c |8 
 1 file changed, 8 deletions(-)

diff --git a/arch/arm/kvm/arm.c b/arch/arm/kvm/arm.c
index 86becdc..a67b798 100644
--- a/arch/arm/kvm/arm.c
+++ b/arch/arm/kvm/arm.c
@@ -221,14 +221,6 @@ long kvm_arch_dev_ioctl(struct file *filp,
return -EINVAL;
 }
 
-int kvm_arch_set_memory_region(struct kvm *kvm,
-  struct kvm_userspace_memory_region *mem,
-  struct kvm_memory_slot old,
-  int user_alloc)
-{
-   return 0;
-}
-
 int kvm_arch_prepare_memory_region(struct kvm *kvm,
   struct kvm_memory_slot *memslot,
   struct kvm_memory_slot old,
-- 
1.7.9.5

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 5/6] kvm/ppc/mpic: in-kernel MPIC emulation

2013-04-04 Thread Scott Wood

On 04/04/2013 12:59:02 AM, Gleb Natapov wrote:

On Wed, Apr 03, 2013 at 03:58:04PM -0500, Scott Wood wrote:
> KVM_DEV_MPIC_* could go elsewhere if you want to avoid cluttering
> the main kvm.h.  The arch header would be OK, since the non-arch
> header includes the arch header, and thus it wouldn't be visible to
> userspace where it is -- if there later is a need for MPIC (or
> whatever other device follows MPIC's example) on another
> architecture, it could be moved without breaking anything.  Or, we
> could just have a header for each device type.
>
If device will be used by more then one arch it will move into  
virt/kvm

and will have its own header, like ioapic.


virt/kvm/ioapic.h is not uapi.  The ioapic uapi component (e.g. struct  
kvm_ioapic_state) is duplicated between x86 and ia64, which is the sort  
of thing I'd like to avoid.  I'm OK with putting it in the PPC header  
if, upon a later need for multi-architecture support, it could move  
into either the main uapi header or a separate uapi header that the  
main uapi header includes (i.e. no userspace-visible change in which  
header needs to be included).


-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Christoffer Dall
[...]

>> >> to give us some idea how much performance we would gain from each 
>> >> approach? Thoughput should be completely unaffected anyway, since virtio 
>> >> just coalesces kicks internally.
>> >
>> > Latency is dominated by the scheduling latency.
>> > This means virtio-net is not the best benchmark.
>>
>> So what is a good benchmark?
>
> E.g. ping pong stress will do but need to look at CPU utilization,
> that's what is affected, not latency.
>
>> Is there any difference in speed at all? I strongly doubt it. One of 
>> virtio's main points is to reduce the number of kicks.
>
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of exit is likely to be measureable. 30 cycles likely
> not ...
>
I don't quite understand this point here. If we don't have anything
real-world where we can measure a decent difference, then why are we
doing this? I would agree with Alex that the three test scenarios
proposed by him should be tried out before adding this complexity,
measured in CPU utilization or latency as you wish.

FWIW, ARM always uses MMIO and provides hardware decoding of all sane
(not user register-writeback) instruction, but the hypercall vs. mmio
looks like this:

hvc:  4,917
mmio_kernel: 6,248

But I doubt that an hvc wrapper around mmio decoding would take care
of all this difference, because the mmio operation needs to do other
work not realated to emulating the instruction in software, which
you'd have to do for an hvc anyway (populate kvm_mmio structure etc.)

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248


> -Original Message-
> From: Wood Scott-B07421
> Sent: Thursday, April 04, 2013 5:52 PM
> To: Yoder Stuart-B08248
> Cc: Alex Williamson; Wood Scott-B07421; ag...@suse.de; Bhushan Bharat-R65777; 
> Sethi Varun-B16395;
> kvm@vger.kernel.org; qemu-de...@nongnu.org; io...@lists.linux-foundation.org
> Subject: Re: RFC: vfio API changes needed for powerpc (v2)
> 
> On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:
> >
> > > > /*
> > > >  * VFIO_PAMU_MAP_MSI_BANK
> > > >  *
> > > >  * Maps the MSI bank at the specified index and iova.  User space
> > must
> > > >  * call this ioctl once for each MSI bank (count of banks is
> > returned by
> > > >  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
> > > >  * Caller provides struct vfio_pamu_msi_bank_map with all fields
> > set.
> > > >  * Operates on VFIO file descriptor (/dev/vfio/vfio).
> > > >  * Return: 0 on success, -errno on failure
> > > >  */
> > > >
> > > > struct vfio_pamu_msi_bank_map {
> > > > __u32   argsz;
> > > > __u32   msi_bank_index;  /* the index of the MSI bank */
> > > > __u64   iova;  /* the iova the bank is to be mapped
> > to */
> > > > };
> > >
> > > Again, flags.  If we dynamically add or remove devices from a
> > container
> > > the bank count can change, right?  If bank count goes from 2 to 3,
> > does
> > > userspace know assume the new bank is [2]?  If bank count goes from
> > 3 to
> > > 2, which index was that?  If removing a device removes an MSI bank
> > then
> > > vfio-pamu will automatically do the unmap, right?
> >
> > My assumption is that the bank count returned by
> > VFIO_IOMMU_GET_MSI_BANK_COUNT
> > is the max bank count for a platform.  (number will mostly likely
> > always be
> > 3 or 4).  So it won't change as devices are added or removed.
> 
> It should be the actual number of banks used.  This is required if
> we're going to have userspace do the iteration and specify the exact
> iovas to use -- and even if we aren't going to do that, it would be
> more restrictive on available iova-space than is necessary.  Usually
> there will be only one bank in the group.
> 
> Actually mapping all of the MSI banks, all the time, would completely
> negate the point of using the separate alias pages.

The geometry, windows, DMA mappings, etc is set on a 'container'
which may have multiple groups in it.  So user space needs to
determine the total number of MSI windows needed when setting
the geometry and window count.

In the flow you proposed:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);

...that "count" has to be the total, not the count for 1 of
N possible groups.   So the get count ioctl is not done on
a group.

However, like you pointed out we don't want to negate isolation 
of the separate alias pages.  All this API is doing is telling
the kernel which windows to use for which MSI banks.   It's up
to the kernel to actually enable them as needed.

Say 3 MSI banks exist.  If there are no groups added to the
vfio container and all 3 MAP_MSI_BANK calls occurred
the picture may look like this (based on my earlier example):

   win gphys/
#   enabled iovaphys  size
--- ---   
5 N 0x1000  0xf_fe044000  4KB// msi bank 1
6 N 0x1400  0xf_fe045000  4KB// msi bank 2
7 N 0x1800  0xf_fe046000  4KB// msi bank 3

User space adds 2 groups that use bank 1:

   win  gphys/
#   enabled iovaphys  size
--- ---   
5 Y 0x1000  0xf_fe044000  4KB// msi bank 1
6 N 0x1400  0xf_fe045000  4KB// msi bank 2
7 N 0x1800  0xf_fe046000  4KB// msi bank 3

User space adds another group that uses bank 3:

   win  gphys/
#   enabled iovaphys  size
--- ---   
5 Y 0x1000  0xf_fe044000  4KB// msi bank 1
6 N 0x1400  0xf_fe045000  4KB// msi bank 2
7 Y 0x1800  0xf_fe046000  4KB// msi bank 3

User space doesn't need to care what is actually enabled,
it just needs to the the kernel which windows to use
and the kernel can take care of the rest.

Stuart


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Scott Wood

On 04/04/2013 04:38:44 PM, Yoder Stuart-B08248 wrote:


> > /*
> >  * VFIO_PAMU_MAP_MSI_BANK
> >  *
> >  * Maps the MSI bank at the specified index and iova.  User space  
must
> >  * call this ioctl once for each MSI bank (count of banks is  
returned by

> >  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
> >  * Caller provides struct vfio_pamu_msi_bank_map with all fields  
set.

> >  * Operates on VFIO file descriptor (/dev/vfio/vfio).
> >  * Return: 0 on success, -errno on failure
> >  */
> >
> > struct vfio_pamu_msi_bank_map {
> > __u32   argsz;
> > __u32   msi_bank_index;  /* the index of the MSI bank */
> > __u64   iova;  /* the iova the bank is to be mapped  
to */

> > };
>
> Again, flags.  If we dynamically add or remove devices from a  
container
> the bank count can change, right?  If bank count goes from 2 to 3,  
does
> userspace know assume the new bank is [2]?  If bank count goes from  
3 to
> 2, which index was that?  If removing a device removes an MSI bank  
then

> vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by  
VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (number will mostly likely  
always be

3 or 4).  So it won't change as devices are added or removed.


It should be the actual number of banks used.  This is required if  
we're going to have userspace do the iteration and specify the exact  
iovas to use -- and even if we aren't going to do that, it would be  
more restrictive on available iova-space than is necessary.  Usually  
there will be only one bank in the group.


Actually mapping all of the MSI banks, all the time, would completely  
negate the point of using the separate alias pages.


-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/9] KVM: PPC: Book3S: Add infrastructure to implement kernel-side RTAS calls

2013-04-04 Thread Paul Mackerras
On Thu, Apr 04, 2013 at 11:49:55AM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 07:37, Paul Mackerras wrote:
> 
> > On Thu, Mar 21, 2013 at 09:52:16AM +0100, Alexander Graf wrote:
> >>> +/* Platform specific hcalls, used by KVM */
> >>> +#define H_RTAS   0xf000
> >> 
> >> How about you define a different hcall ID for this? Then QEMU would
> >> create its "rtas entry blob" such that KVM-routed RTAS handling goes
> >> to KVM directly.
> > 
> > QEMU can still do that, and I don't see that it would change the
> > kernel side if it did.  We would still have to have agreement between
> > the kernel and userspace as to what the hcall number for invoking the
> > in-kernel RTAS calls was, and the kernel would still have to keep a
> > list of token numbers and how they correspond to the functions it
> > provides.  The only thing different would be that the in-kernel RTAS
> > hcall could return to the guest if it didn't recognize the token
> > number, rather than pushing the problem up to userspace.  However,
> > that wouldn't make the code any simpler, and it isn't a situation
> > where performance is an issue.
> > 
> > Do you see some kernel-side improvements or simplifications from your
> > suggestion that I'm missing?  Remember, the guest gets the token
> > numbers from the device tree (properties under the /rtas node), so
> > they are under the control of userspace/QEMU.
> 
> The code flow with this patch:
> 
>   
> 
>   foreach (override in overrides)
> ioctl(OVERRIDE_RTAS, ...);
> 
>   
> 
>   switch (hcall_id) {
>   case QEMU_RTAS_ID:
> foreach (override in kvm_overrides) {
>   int rtas_id = ...;
>   if (override.rtas_id == rtas_id) {
> handle_rtas();

Actually this is more like: override.handler();

> handled = true;
>   }
> }
> if (!handled)
>   pass_to_qemu();
> break;
>   default:
> pass_to_qemu();
> break
>   }
> 
> What I'm suggesting:
> 
>   
> 
>   nothing from KVM's point of view

Actually, this can't be "nothing".

The way the RTAS calls work is that there is a name and a "token"
(32-bit integer value) for each RTAS call.  The tokens have to be
unique for each different name.  Userspace puts the names and tokens
in the device tree under the /rtas node (a set of properties where the
property name is the RTAS function name and the property value is the
token).  The guest looks up the token for each RTAS function it wants
to use, and passes the token in the argument buffer for the RTAS call.

This means that userspace has to know the names and tokens for all
supported RTAS functions, both the ones implemented in the kernel and
the ones implemented in userspace.

Also, the token numbers are pretty arbitrary, and the token numbers
for the kernel-implemented RTAS functions could be chosen by userspace
or by the kernel.  If they're chosen by the kernel, then userspace
needs a way to discover them (so it can put them in the device tree),
and also has to avoid choosing any token numbers for its functions
that collide with a kernel-chosen token.  If userspace chooses the
token numbers, it has to tell the kernel what token numbers it has
chosen for the kernel-implemented RTAS functions.  We chose the latter
since it gives userspace more control.

So this  code has to be either (your suggestion):

foreach RTAS function possibly implemented in kernel {
query kernel token for function, by name
if that gives an error, mark function as needing to be
implemented in userspace
}
(userspace) allocate tokens for remaining functions,
avoiding collisions with kernel-chosen tokens

or else it is (my suggestion):

(userspace) allocate tokens for all RTAS functions
foreach RTAS function possibly implemented in kernel {
tell kernel the (name, token) correspondence
}

>   
> 
>   switch (hcall_id) {
>   case KVM_RTAS_ID:
> handle_rtas();

Here, you've compressed details that you expanded in your pseudo-code
above, making this a less than fair comparison.  This handle_rtas()
function has to fetch the token and branch out to the appropriate
handler routine.  Whether that's a switch statement or a loop over
registered handlers doesn't make all that much difference.

> break;
>   default:
> pass_to_qemu();
> break;
>   }
> 
> 
> Which one looks easier and less error prone to you? :)
> 
> Speaking of which, how does user space know that the kernel actually
> supports a specific RTAS token? 

It's really the names that are more important, the tokens are pretty
arbitrary.  In my scheme, userspace does a KVM_PPC_RTAS_DEFINE_TOKEN
ioctl giving the name and the (userspace-chosen) token, which gets an
error if the kernel doesn't recognize the name.  In your scheme, there
would have to be an equivalent ioctl to query the (kernel-chosen)
token for a given name, which once again would return an error if the
kernel doesn't recognize the name.  Either way the kernel 

Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Scott Wood

On 04/04/2013 05:30:05 PM, Alexander Graf wrote:



Am 04.04.2013 um 20:41 schrieb Scott Wood :

> On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
>> On 03.04.2013, at 03:57, Scott Wood wrote:
>> > +if (opp->mpic_mode_mask == GCR_MODE_PROXY)
>> Shouldn't this be an &?
>
> The way the mode field was originally documented was a two-bit  
field, where 0b11 was external proxy, and 0b10 was reserved.  If we  
use & it would have to be:

>
>if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
>...
>
> Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return  
true in the case of GCR_MODE_MIXED.

>
> In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI])  
that is ignored if the mixed-mode bit (GCR[M]) is not set, which  
makes it a bit more legitimate to view it as a bitmap.  Still, I  
doubt we'll see new mode bits.


Ok, please add a comment about this here then :).


What sort of comment would you like?  Or do you want me to use the "(x  
& y) == y" version?


>> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu  
*vcpu)

>> >tasklet_kill(&vcpu->arch.tasklet);
>> >
>> >kvmppc_remove_vcpu_debugfs(vcpu);
>> > +
>> > +switch (vcpu->arch.irq_type) {
>> > +case KVMPPC_IRQ_MPIC:
>> > +kvmppc_mpic_put(vcpu->arch.mpic);
>> This doesn't tell the MPIC that this exact CPU is getting killed.  
What if we hotplug remove just a single CPU? Don't we have to  
deregister the CPU with the MPIC?

>
> If we ever support hot vcpu removal, yes.  We'd probably need some  
MPIC code changes to accommodate that, and we wouldn't currently have  
a way to test it, so I'd rather make it obviously not supported for  
now.


Is there any way to break heavily if user space attempts this?


Is there any way for userspace to request this currently?  They can  
close the vcpu fd, but the vcpu won't actually be destroyed until the  
vm goes down.


-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Alexander Graf


Am 04.04.2013 um 20:41 schrieb Scott Wood :

> On 04/04/2013 07:54:20 AM, Alexander Graf wrote:
>> On 03.04.2013, at 03:57, Scott Wood wrote:
>> > +if (opp->mpic_mode_mask == GCR_MODE_PROXY)
>> Shouldn't this be an &?
> 
> The way the mode field was originally documented was a two-bit field, where 
> 0b11 was external proxy, and 0b10 was reserved.  If we use & it would have to 
> be:
> 
>if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
>...
> 
> Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true in 
> the case of GCR_MODE_MIXED.
> 
> In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that is 
> ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a bit more 
> legitimate to view it as a bitmap.  Still, I doubt we'll see new mode bits.

Ok, please add a comment about this here then :).

> 
>> > @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>> >tasklet_kill(&vcpu->arch.tasklet);
>> >
>> >kvmppc_remove_vcpu_debugfs(vcpu);
>> > +
>> > +switch (vcpu->arch.irq_type) {
>> > +case KVMPPC_IRQ_MPIC:
>> > +kvmppc_mpic_put(vcpu->arch.mpic);
>> This doesn't tell the MPIC that this exact CPU is getting killed. What if we 
>> hotplug remove just a single CPU? Don't we have to deregister the CPU with 
>> the MPIC?
> 
> If we ever support hot vcpu removal, yes.  We'd probably need some MPIC code 
> changes to accommodate that, and we wouldn't currently have a way to test it, 
> so I'd rather make it obviously not supported for now.

Is there any way to break heavily if user space attempts this?

Alex

> 
> -Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Geoff Levand
Hi,

On Tue, 2013-04-02 at 14:25 +0100, Marc Zyngier wrote:
> + @ Jump to the trampoline page
> + ldr r2, =#PAGE_MASK
> + adr r3, target
> + bic r3, r3, r2
> + ldr r2, =#TRAMPOLINE_VA
> + add r3, r3, r2
> + mov pc, r3

I guess you need 'ldr r2, =PAGE_MASK'.

  arch/arm/kvm/init.S:114: Error: bad expression -- `ldr r2,=#(~((1<<12)-1))'
  arch/arm/kvm/init.S:117: Error: bad expression -- `ldr r2,=#0x'

-Geoff


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: vfio API changes needed for powerpc (v3)

2013-04-04 Thread Yoder Stuart-B08248
-v3 updates
   -made vfio_pamu_attr a union, added flags
   -s/VFIO_PAMU_/VFIO_IOMMU_PAMU_/ for the ioctls to make it more
clear which fd is being operated on
   -added flags to vfio_pamu_msi_bank_map/umap
   -VFIO_PAMU_GET_MSI_BANK_COUNT now just returns a __u32
not a struct
   -fixed some typos



The Freescale PAMU is an aperture-based IOMMU with the following
characteristics.  Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
  -the mapped region in each window can be different than
   the window size...mapping must power of 2
  -physical address of the mapping must be naturally aligned
   with the mapping size

These ioctls operate on the VFIO file descriptor (/dev/vfio/vfio).

/*
 * VFIO_IOMMU_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.

 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
__u32   argsz;
__u32   flags;/* no flags currently */
__u32   attribute;

union {
/* VFIO_ATTR_GEOMETRY */
struct { 
__u64 aperture_start; /* first addr that can be mapped 
*/
__u64 aperture_end;  /* last addr that can be mapped */
} attr;

/* VFIO_ATTR_WINDOWS */
__u32 windows;  /* number of windows in the aperture */
/* initially this will be the max number
 * of windows that can be set
 */

/* VFIO_ATTR_PAMU_STASH */
struct { 
__u32 cpu; /* CPU number for stashing */
__u32 cache;   /* cache ID for stashing */
} stash;
}
};
#define VFIO_IOMMU_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Return: 0 on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Return: __u32 bank count on success, -errno on failure
 */
#define VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT _IO(VFIO_TYPE, VFIO_BASE + x, __u32)

/*
 * VFIO_IOMMU_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_PAMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   flags; /* no flags currently */
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};
#define VFIO_IOMMU_PAMU_MAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_map )

/*
 * VFIO_IOMMU_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u32   flags; /* no flags currently */
__u64   iova;  /* the iova to be unmapped to */
};
#define VFIO_IOMMU_PAMU_UNMAP_MSI_BANK  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_bank_unmap )

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248

> > /*
> >  * VFIO_PAMU_MAP_MSI_BANK
> >  *
> >  * Maps the MSI bank at the specified index and iova.  User space must
> >  * call this ioctl once for each MSI bank (count of banks is returned by
> >  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
> >  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
> >  * Operates on VFIO file descriptor (/dev/vfio/vfio).
> >  * Return: 0 on success, -errno on failure
> >  */
> >
> > struct vfio_pamu_msi_bank_map {
> > __u32   argsz;
> > __u32   msi_bank_index;  /* the index of the MSI bank */
> > __u64   iova;  /* the iova the bank is to be mapped to */
> > };
> 
> Again, flags.  If we dynamically add or remove devices from a container
> the bank count can change, right?  If bank count goes from 2 to 3, does
> userspace know assume the new bank is [2]?  If bank count goes from 3 to
> 2, which index was that?  If removing a device removes an MSI bank then
> vfio-pamu will automatically do the unmap, right?

My assumption is that the bank count returned by VFIO_IOMMU_GET_MSI_BANK_COUNT
is the max bank count for a platform.  (number will mostly likely always be
3 or 4).  So it won't change as devices are added or removed.

If devices are added or removed, the kernel side can enable or disable
the corresponding MSI windows.  But that is hidden from user space.

Stuart




Re: [PATCH v2] KVM: Call kvm_apic_match_dest() to check destination vcpu

2013-04-04 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 12:34:37PM +0300, Gleb Natapov wrote:
> On Wed, Apr 03, 2013 at 08:47:58AM +0800, Yang Zhang wrote:
> > From: Yang Zhang 
> > 
> > For a given vcpu, kvm_apic_match_dest() will tell you whether
> > the vcpu in the destination list quickly. Drop kvm_calculate_eoi_exitmap()
> > and use kvm_apic_match_dest() instead.
> > 
> > Signed-off-by: Yang Zhang 
> Applied, thanks.
> 
The patch does not compile, dropped.

> > ---
> >  arch/x86/kvm/lapic.c |   47 ---
> >  arch/x86/kvm/lapic.h |4 
> >  virt/kvm/ioapic.c|9 +++--
> >  3 files changed, 3 insertions(+), 57 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index a8e9369..e227474 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -145,53 +145,6 @@ static inline int kvm_apic_id(struct kvm_lapic *apic)
> > return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> >  }
> >  
> > -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
> > -   struct kvm_lapic_irq *irq,
> > -   u64 *eoi_exit_bitmap)
> > -{
> > -   struct kvm_lapic **dst;
> > -   struct kvm_apic_map *map;
> > -   unsigned long bitmap = 1;
> > -   int i;
> > -
> > -   rcu_read_lock();
> > -   map = rcu_dereference(vcpu->kvm->arch.apic_map);
> > -
> > -   if (unlikely(!map)) {
> > -   __set_bit(irq->vector, (unsigned long *)eoi_exit_bitmap);
> > -   goto out;
> > -   }
> > -
> > -   if (irq->dest_mode == 0) { /* physical mode */
> > -   if (irq->delivery_mode == APIC_DM_LOWEST ||
> > -   irq->dest_id == 0xff) {
> > -   __set_bit(irq->vector,
> > - (unsigned long *)eoi_exit_bitmap);
> > -   goto out;
> > -   }
> > -   dst = &map->phys_map[irq->dest_id & 0xff];
> > -   } else {
> > -   u32 mda = irq->dest_id << (32 - map->ldr_bits);
> > -
> > -   dst = map->logical_map[apic_cluster_id(map, mda)];
> > -
> > -   bitmap = apic_logical_id(map, mda);
> > -   }
> > -
> > -   for_each_set_bit(i, &bitmap, 16) {
> > -   if (!dst[i])
> > -   continue;
> > -   if (dst[i]->vcpu == vcpu) {
> > -   __set_bit(irq->vector,
> > - (unsigned long *)eoi_exit_bitmap);
> > -   break;
> > -   }
> > -   }
> > -
> > -out:
> > -   rcu_read_unlock();
> > -}
> > -
> >  static void recalculate_apic_map(struct kvm *kvm)
> >  {
> > struct kvm_apic_map *new, *old = NULL;
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 2c721b9..baa20cf 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -160,10 +160,6 @@ static inline u16 apic_logical_id(struct kvm_apic_map 
> > *map, u32 ldr)
> > return ldr & map->lid_mask;
> >  }
> >  
> > -void kvm_calculate_eoi_exitmap(struct kvm_vcpu *vcpu,
> > -   struct kvm_lapic_irq *irq,
> > -   u64 *eoi_bitmap);
> > -
> >  static inline bool kvm_apic_has_events(struct kvm_vcpu *vcpu)
> >  {
> > return vcpu->arch.apic->pending_events;
> > diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> > index 5ba005c..bb3906d 100644
> > --- a/virt/kvm/ioapic.c
> > +++ b/virt/kvm/ioapic.c
> > @@ -124,7 +124,6 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu 
> > *vcpu,
> >  {
> > struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
> > union kvm_ioapic_redirect_entry *e;
> > -   struct kvm_lapic_irq irqe;
> > int index;
> >  
> > spin_lock(&ioapic->lock);
> > @@ -135,11 +134,9 @@ void kvm_ioapic_calculate_eoi_exitmap(struct kvm_vcpu 
> > *vcpu,
> > (e->fields.trig_mode == IOAPIC_LEVEL_TRIG ||
> >  kvm_irq_has_notifier(ioapic->kvm, KVM_IRQCHIP_IOAPIC,
> >  index))) {
> > -   irqe.dest_id = e->fields.dest_id;
> > -   irqe.vector = e->fields.vector;
> > -   irqe.dest_mode = e->fields.dest_mode;
> > -   irqe.delivery_mode = e->fields.delivery_mode << 8;
> > -   kvm_calculate_eoi_exitmap(vcpu, &irqe, eoi_exit_bitmap);
> > +   if (kvm_apic_match_dest(vcpu, NULL, 0,
> > +   e->fields.dest_id, e->fields.dest_mode))
> > +   __set_bit(e->fileds.vector, (unsigned long 
> > *)eoi_exit_bitmap);
> > }
> > }
> > spin_unlock(&ioapic->lock);
> > -- 
> > 1.7.1
> 
> --
>   Gleb.
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More major

[PATCH] KVM: x86: Fix memory leak in vmx.c

2013-04-04 Thread Andrew Honig
If userspace creates and destroys multiple VMs within the same process
we leak 20k of memory in the userspace process context per VM.  This
patch frees the memory in kvm_arch_destroy_vm.  If the process exits
without closing the VM file descriptor or the file descriptor has been
shared with another process then we don't need to free the memory.

Messing with user space memory from an fd is not ideal, but other changes
would require user space changes and this is consistent with how the
memory is currently allocated.

Tested: Test ran several VMs and ran against test program meant to 
demonstrate the leak (www.spinics.net/lists/kvm/msg83734.html).

Signed-off-by: Andrew Honig 

---
 arch/x86/include/asm/kvm_host.h |3 +++
 arch/x86/kvm/vmx.c  |3 +++
 arch/x86/kvm/x86.c  |   11 +++
 3 files changed, 17 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4979778..975a74d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -553,6 +553,9 @@ struct kvm_arch {
struct page *ept_identity_pagetable;
bool ept_identity_pagetable_done;
gpa_t ept_identity_map_addr;
+   unsigned long ept_ptr;
+   unsigned long apic_ptr;
+   unsigned long tss_ptr;
 
unsigned long irq_sources_bitmap;
s64 kvmclock_offset;
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..8aa5d81 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -3703,6 +3703,7 @@ static int alloc_apic_access_page(struct kvm *kvm)
}
 
kvm->arch.apic_access_page = page;
+   kvm->arch.apic_ptr = kvm_userspace_mem.userspace_addr;
 out:
mutex_unlock(&kvm->slots_lock);
return r;
@@ -3733,6 +3734,7 @@ static int alloc_identity_pagetable(struct kvm *kvm)
}
 
kvm->arch.ept_identity_pagetable = page;
+   kvm->arch.ept_ptr = kvm_userspace_mem.userspace_addr;
 out:
mutex_unlock(&kvm->slots_lock);
return r;
@@ -4366,6 +4368,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int 
addr)
if (ret)
return ret;
kvm->arch.tss_addr = addr;
+   kvm->arch.tss_ptr = tss_mem.userspace_addr;
if (!init_rmode_tss(kvm))
return  -ENOMEM;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f19ac0a..411ff2a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6812,6 +6812,16 @@ void kvm_arch_sync_events(struct kvm *kvm)
 
 void kvm_arch_destroy_vm(struct kvm *kvm)
 {
+   if (current->mm == kvm->mm) {
+   /*
+* Free pages allocated on behalf of userspace, unless the
+* the memory map has changed due to process exit or fd
+* copying.
+*/
+   vm_munmap(kvm->arch.apic_ptr, PAGE_SIZE);
+   vm_munmap(kvm->arch.ept_ptr, PAGE_SIZE);
+   vm_munmap(kvm->arch.tss_ptr, PAGE_SIZE * 3);
+   }
kvm_iommu_unmap_guest(kvm);
kfree(kvm->arch.vpic);
kfree(kvm->arch.vioapic);
@@ -6929,6 +6939,7 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
return PTR_ERR((void *)userspace_addr);
 
memslot->userspace_addr = userspace_addr;
+   mem->userspace_addr = userspace_addr;
}
 
return 0;
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Alex Williamson
On Thu, 2013-04-04 at 17:32 +, Yoder Stuart-B08248 wrote:
> Based on the email thread over the last couple of days, I have
> below an more concrete proposal (v2) for new ioctls supporting vfio-pci
> on SoCs with the Freescale PAMU.
> 
> Example usage is as described by Scott:
> 
> count = VFIO_IOMMU_GET_MSI_BANK_COUNT
> VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
> VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
> // do other DMA maps now, or later, or not at all, doesn't matter
> for (i = 0; i < count; i++)
> VFIO_IOMMU_MAP_MSI_BANK(iova, i);
> // The kernel now knows where each bank has been mapped, and can
> //   update PCI config space appropriately.
> 
> Thanks,
> Stuart
> 
> 
> 
> The Freescale PAMU is an aperture-based IOMMU with the following 
> characteristics.  Each device has an entry in a table in memory
> describing the iova->phys mapping. The mapping has:
>-an overall aperture that is power of 2 sized, and has a start iova that
> is naturally aligned
>-has 1 or more windows within the aperture
>   -number of windows must be power of 2, max is 256
>   -size of each window is determined by aperture size / # of windows
>   -iova of each window is determined by aperture start iova / # of windows
>   -the mapped region in each window can be different than
>the window size...mapping must power of 2 
>   -physical address of the mapping must be naturally aligned
>with the mapping size
> 
> /*
>  * VFIO_PAMU_GET_ATTR
>  *
>  * Gets the iommu attributes for the current vfio container.  This 
>  * ioctl is applicable to an iommu type of VFIO_PAMU only.
>  * Caller sets argsz and attribute.  The ioctl fills in
>  * the provided struct vfio_pamu_attr based on the attribute
>  * value that was set.
>  * Operates on VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_attr {
> __u32   argsz;
> __u32   attribute;
> #define VFIO_ATTR_GEOMETRY 1
> #define VFIO_ATTR_WINDOWS  2
> #define VFIO_ATTR_PAMU_STASH   3
> 
> /* fowlling fields are for VFIO_ATTR_GEOMETRY */
> __u64 aperture_start; /* first address that can be mapped*/
> __u64 aperture_end;   /* last address that can be mapped */
> 
> /* follwing fields are for VFIO_ATTR_WINDOWS */
> __u32 windows;/* number of windows in the aperture */
>   /* initially this will be the max number
>* of windows that can be set
>*/
> 
> /* following fields are for VFIO_ATTR_PAMU_STASH */
> __u32 cpu;/* CPU number for stashing */
> __u32 cache;  /* cache ID for stashing */

Can multiple attribute bits be set at once?  If not then the above could
be a union and the attribute could be an enum.  A flags field is
probably still a good idea.

> };
> #define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
> struct vfio_pamu_attr)
> 
> /*
>  * VFIO_PAMU_SET_ATTR
>  *
>  * Sets the iommu attributes for the current vfio container.  This 
>  * ioctl is applicable to an iommu type of VFIO_PAMU only.
>  * Caller sets struct vfio_pamu attr, including argsz and attribute and
>  * setting any fields that are valid for the attribute.
>  * Operates on VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> #define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
> struct vfio_pamu_attr)
> 
> /*
>  * VFIO_PAMU_GET_MSI_BANK_COUNT
>  *
>  * Returns the number of MSI banks for this platform.  This tells user space
>  * how many aperture windows should be reserved for MSI banks when setting
>  * the PAMU geometry and window count.
>  * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
>  * Operates on VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> struct vfio_pamu_msi_banks {
> __u32   argsz;
> __u32   bank_count;  /* the number of MSI
> };
> #define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
> struct vfio_pamu_msi_banks)

argsz w/o flags doesn't really buy us much.

> 
> /*
>  * VFIO_PAMU_MAP_MSI_BANK
>  *
>  * Maps the MSI bank at the specified index and iova.  User space must
>  * call this ioctl once for each MSI bank (count of banks is returned by
>  * VFIO_IOMMU_GET_MSI_BANK_COUNT).
>  * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
>  * Operates on VFIO file descriptor (/dev/vfio/vfio).
>  * Return: 0 on success, -errno on failure
>  */
> 
> struct vfio_pamu_msi_bank_map {
> __u32   argsz;
> __u32   msi_bank_index;  /* the index of the MSI bank */
> __u64   iova;  /* the iova the bank is to be mapped to */
> };

Again, flags.  If we dynamically add or remove devices from a container
the bank count can change

Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Scott Wood

On 04/04/2013 07:54:20 AM, Alexander Graf wrote:


On 03.04.2013, at 03:57, Scott Wood wrote:

> +  if (opp->mpic_mode_mask == GCR_MODE_PROXY)

Shouldn't this be an &?


The way the mode field was originally documented was a two-bit field,  
where 0b11 was external proxy, and 0b10 was reserved.  If we use & it  
would have to be:


if ((opp->mpic_mode_mask & GCR_MODE_PROXY) == GCR_MODE_PROXY)
...

Simply testing "opp->mpic_mode_mask & GCR_MODE_PROXY" would return true  
in the case of GCR_MODE_MIXED.


In MPIC 4.3 external proxy is defined as a separate bit (GCR[CI]) that  
is ignored if the mixed-mode bit (GCR[M]) is not set, which makes it a  
bit more legitimate to view it as a bitmap.  Still, I doubt we'll see  
new mode bits.



> @@ -460,6 +464,13 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>tasklet_kill(&vcpu->arch.tasklet);
>
>kvmppc_remove_vcpu_debugfs(vcpu);
> +
> +  switch (vcpu->arch.irq_type) {
> +  case KVMPPC_IRQ_MPIC:
> +  kvmppc_mpic_put(vcpu->arch.mpic);

This doesn't tell the MPIC that this exact CPU is getting killed.  
What if we hotplug remove just a single CPU? Don't we have to  
deregister the CPU with the MPIC?


If we ever support hot vcpu removal, yes.  We'd probably need some MPIC  
code changes to accommodate that, and we wouldn't currently have a way  
to test it, so I'd rather make it obviously not supported for now.


-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RFC: vfio API changes needed for powerpc (v2)

2013-04-04 Thread Yoder Stuart-B08248
Based on the email thread over the last couple of days, I have
below an more concrete proposal (v2) for new ioctls supporting vfio-pci
on SoCs with the Freescale PAMU.

Example usage is as described by Scott:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can
//   update PCI config space appropriately.

Thanks,
Stuart



The Freescale PAMU is an aperture-based IOMMU with the following 
characteristics.  Each device has an entry in a table in memory
describing the iova->phys mapping. The mapping has:
   -an overall aperture that is power of 2 sized, and has a start iova that
is naturally aligned
   -has 1 or more windows within the aperture
  -number of windows must be power of 2, max is 256
  -size of each window is determined by aperture size / # of windows
  -iova of each window is determined by aperture start iova / # of windows
  -the mapped region in each window can be different than
   the window size...mapping must power of 2 
  -physical address of the mapping must be naturally aligned
   with the mapping size

/*
 * VFIO_PAMU_GET_ATTR
 *
 * Gets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets argsz and attribute.  The ioctl fills in
 * the provided struct vfio_pamu_attr based on the attribute
 * value that was set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_attr {
__u32   argsz;
__u32   attribute;
#define VFIO_ATTR_GEOMETRY 1
#define VFIO_ATTR_WINDOWS  2
#define VFIO_ATTR_PAMU_STASH   3

/* fowlling fields are for VFIO_ATTR_GEOMETRY */
__u64 aperture_start; /* first address that can be mapped*/
__u64 aperture_end;   /* last address that can be mapped */

/* follwing fields are for VFIO_ATTR_WINDOWS */
__u32 windows;/* number of windows in the aperture */
  /* initially this will be the max number
   * of windows that can be set
   */

/* following fields are for VFIO_ATTR_PAMU_STASH */
__u32 cpu;/* CPU number for stashing */
__u32 cache;  /* cache ID for stashing */
};
#define VFIO_PAMU_GET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_SET_ATTR
 *
 * Sets the iommu attributes for the current vfio container.  This 
 * ioctl is applicable to an iommu type of VFIO_PAMU only.
 * Caller sets struct vfio_pamu attr, including argsz and attribute and
 * setting any fields that are valid for the attribute.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
#define VFIO_PAMU_SET_ATTR  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_attr)

/*
 * VFIO_PAMU_GET_MSI_BANK_COUNT
 *
 * Returns the number of MSI banks for this platform.  This tells user space
 * how many aperture windows should be reserved for MSI banks when setting
 * the PAMU geometry and window count.
 * Fills in provided struct vfio_pamu_msi_banks. Caller sets argsz. 
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */
struct vfio_pamu_msi_banks {
__u32   argsz;
__u32   bank_count;  /* the number of MSI
};
#define VFIO_PAMU_GET_MSI_BANK_COUNT  _IO(VFIO_TYPE, VFIO_BASE + x,
struct vfio_pamu_msi_banks)

/*
 * VFIO_PAMU_MAP_MSI_BANK
 *
 * Maps the MSI bank at the specified index and iova.  User space must
 * call this ioctl once for each MSI bank (count of banks is returned by
 * VFIO_IOMMU_GET_MSI_BANK_COUNT).
 * Caller provides struct vfio_pamu_msi_bank_map with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_map {
__u32   argsz;
__u32   msi_bank_index;  /* the index of the MSI bank */
__u64   iova;  /* the iova the bank is to be mapped to */
};

/*
 * VFIO_PAMU_UNMAP_MSI_BANK
 *
 * Unmaps the MSI bank at the specified iova.
 * Caller provides struct vfio_pamu_msi_bank_unmap with all fields set.
 * Operates on VFIO file descriptor (/dev/vfio/vfio).
 * Return: 0 on success, -errno on failure
 */

struct vfio_pamu_msi_bank_unmap {
__u32   argsz;
__u64   iova;  /* the iova to be unmapped to */
};

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 06:36:30PM +0300, Michael S. Tsirkin wrote:
> > processor   : 0
> > vendor_id   : AuthenticAMD
> > cpu family  : 16
> > model   : 8
> > model name  : Six-Core AMD Opteron(tm) Processor 8435
> > stepping: 0
> > cpu MHz : 800.000
> > cache size  : 512 KB
> > physical id : 0
> > siblings: 6
> > core id : 0
> > cpu cores   : 6
> > apicid  : 8
> > initial apicid  : 0
> > fpu : yes
> > fpu_exception   : yes
> > cpuid level : 5
> > wp  : yes
> > flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> > mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt 
> > pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc 
> > extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic 
> > cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt npt lbrv 
> > svm_lock nrip_save pausefilter
> > bogomips: 5199.87
> > TLB size: 1024 4K pages
> > clflush size: 64
> > cache_alignment : 64
> > address sizes   : 48 bits physical, 48 bits virtual
> > power management: ts ttp tm stc 100mhzsteps hwpstate
> 
> Hmm, svm code seems less optimized for MMIO, but PIO
> is almost identical. Gleb says the unittest is broken
> on AMD so I'll wait until it's fixed to test.
> 
It's not unittest is broken, its my environment is broken :)

> Did you do PIO reads by chance?
> 
> > > 
> > > Or could be different software, this is on top of 3.9.0-rc5, what
> > > did you try?
> > 
> > 3.0 plus kvm-kmod of whatever was current back in autumn :).
> > 
> > > 
> > >> MST, could you please do a real world latency benchmark with virtio-net 
> > >> and
> > >> 
> > >>  * normal ioeventfd
> > >>  * mmio-pv eventfd
> > >>  * hcall eventfd
> > > 
> > > I can't do this right away, sorry.  For MMIO we are discussing the new
> > > layout on the virtio mailing list, guest and qemu need a patch for this
> > > too.  My hcall patches are stale and would have to be brought up to
> > > date.
> > > 
> > > 
> > >> to give us some idea how much performance we would gain from each 
> > >> approach? Thoughput should be completely unaffected anyway, since virtio 
> > >> just coalesces kicks internally.
> > > 
> > > Latency is dominated by the scheduling latency.
> > > This means virtio-net is not the best benchmark.
> > 
> > So what is a good benchmark?
> 
> E.g. ping pong stress will do but need to look at CPU utilization,
> that's what is affected, not latency.
> 
> > Is there any difference in speed at all? I strongly doubt it. One of 
> > virtio's main points is to reduce the number of kicks.
> 
> For this stage of the project I think microbenchmarks are more appropriate.
> Doubling the price of exit is likely to be measureable. 30 cycles likely
> not ...
> 
> > > 
> > >> I'm also slightly puzzled why the wildcard eventfd mechanism is so 
> > >> significantly slower, while it was only a few percent on my test system. 
> > >> What are the numbers you're listing above? Cycles? How many cycles do 
> > >> you execute in a second?
> > >> 
> > >> 
> > >> Alex
> > > 
> > > 
> > > It's the TSC divided by number of iterations.  kvm unittest this value, 
> > > here's
> > > what it does (removed some dead code):
> > > 
> > > #define GOAL (1ull << 30)
> > > 
> > >do {
> > >iterations *= 2;
> > >t1 = rdtsc();
> > > 
> > >for (i = 0; i < iterations; ++i)
> > >func();
> > >t2 = rdtsc();
> > >} while ((t2 - t1) < GOAL);
> > >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> > 
> > So it's the number of cycles per run.
> > 
> > That means translated my numbers are:
> > 
> >   MMIO: 4307
> >   PIO: 3658
> >   HCALL: 1756
> > 
> > MMIO - PIO = 649
> > 
> > which aligns roughly with your PV MMIO callback.
> > 
> > My MMIO benchmark was to poke the LAPIC version register. That does go 
> > through instruction emulation, no?
> > 
> > 
> > Alex
> 
> Why wouldn't it?
> 
Intel decodes access to apic page, but we use it only for fast eoi.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 14:45, Gleb Natapov wrote:
>  
> > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 14:08, Gleb Natapov wrote:
>  
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: 
> >>> we
> >>> know the address from the VMCS so if the address is unique, we 
> >>> can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the 
> >>> length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 
> >>> and
> >>> silently ignored for NPT and MMU, so everything works correctly 
> >>> but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of 
> >> magnitudes slower than hypercalls. Why don't you just create a PV 
> >> MMIO hypercall that the guest can use to invoke MMIO accesses 
> >> towards the host based on physical addresses with explicit length 
> >> encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the 
> >> speed of PIO exits even. It should also be quite easily portable, 
> >> as all other platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
>  
>  Also, how is this not guest visible? Who sets 
>  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition 
>  indicates that the guest does so, so it is guest visible.
>  
> >>> QEMU sets it.
> >> 
> >> How does QEMU know?
> >> 
> > Knows what? When to create such eventfd? virtio device knows.
>  
>  Where does it know from?
>  
> >>> It does it always.
> >>> 
> > 
> >>> 
>  +/*
>  + * PV_MMIO - Guest can promise us that all accesses touching this 
>  address
>  + * are writes of specified length, starting at the specified 
>  address.
>  + * If not - it's a Guest bug.
>  + * Can not be used together with either PIO or DATAMATCH.
>  + */
>  
> >>> Virtio spec will state that access to a kick register needs to be of
> >>> specific length. This is reasonable thing for HW to ask.
> >> 
> >> This is a spec change. So the guest would have to indicate that it 
> >> adheres to a newer spec. Thus it's a guest visible change.
> >> 
> > There is not virtio spec that has kick register in MMIO. The spec is in
> > the works AFAIK. Actually PIO will not be deprecated and my suggestion
>  
>  So the guest would indicate that it supports a newer revision of the 
>  spec (in your case, that it supports MMIO). How is that any different 
>  from exposing that it supports a PV MMIO hcall?
>  
> >>> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> >>> not configured. All driver will not work for virtio devices with MMIO
> >>> bar, but not PIO bar.
> >> 
> >> I can't parse that, sorry :).
> > 
> > It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
> > QEMU tells KVM which addr

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 05:36:40PM +0200, Alexander Graf wrote:
> > 
> > #define GOAL (1ull << 30)
> > 
> >do {
> >iterations *= 2;
> >t1 = rdtsc();
> > 
> >for (i = 0; i < iterations; ++i)
> >func();
> >t2 = rdtsc();
> >} while ((t2 - t1) < GOAL);
> >printf("%s %d\n", test->name, (int)((t2 - t1) / iterations));
> 
> So it's the number of cycles per run.
> 
> That means translated my numbers are:
> 
>   MMIO: 4307
>   PIO: 3658
>   HCALL: 1756
> 
> MMIO - PIO = 649
> 
> which aligns roughly with your PV MMIO callback.
> 
> My MMIO benchmark was to poke the LAPIC version register. That does go 
> through instruction emulation, no?
> 
It should and PIO 'in' or string also goes through the emulator.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [virt-test][PATCH 4/7] virt: Adds named variants to Cartesian config.

2013-04-04 Thread Jiri Zupka


- Original Message -
> Sorry for not reading the commit message before my previous reply. Now I
> see the origin of the ">" syntax.
> 
> On Fri, Mar 29, 2013 at 06:14:07PM +0100, Jiří Župka wrote:
> [...]
> > 
> > For filtering of named variants is used character ">" because there was
> > problem with conflict with = in expression key = value. The char ">"
> > could be changed to something better but it should be different from "="
> > for optimization of speed.
> 
> IMO we need really strong reasons to use anything different from "="
> because it is the most obvious choice we have. Using ">" doesn't make
> any sense to me.


There is not necessary solve conflict with = or : in code. Code parsing is
straightforward with that . Chars "=" and ":" was one of my first selection
too but it brings conflicts in parsing. But it could be changed because
there were more voice against it. Users could prefer better syntax instead
of little improve of speed.


> 
> What kind of speed optimization are you talking about, exactly? We need
> to keep algorithm time/space complexity under control, but making two or
> three additional regexp matches per line won't make the code much
> slower, will it?


Sometime yes (https://github.com/autotest/virt-test/pull/229).
But I don't think that it is this case. I'll will try think about more.


> Also: whatever symbol we use, I would really like to make it
> whitespace-insensitive.
> 
> I mean: if "foo>x" or "foo=x" works, "foo > x" or "foo = x" should work,
> too. I am absolutely sure people _will_ eventually try to put whitespace
> around the operator symbol, and this shouldn't cause unpleasant
> surprises.

Thank a lot that you catch this bug. It is only bug not intention.
I have forgot one strip(). But I will repair the bug after we will
finish discussion about named variant.

> 
> 
> > 
> > Additionally named variant adds keys to final dictionary in case of
> > example is it (virt_system = linux). It should reduce size of config file.
> > Keys defined in config and keys defined by named variants are in same
> > name space.
> 
> This is the part I like the most. Thanks!
> 
> 
> > 
> > Signed-off-by: Jiří Župka 
> > ---
> >  virttest/cartesian_config.py | 138
> >  ++-
> >  1 file changed, 124 insertions(+), 14 deletions(-)
> > 
> > diff --git a/virttest/cartesian_config.py b/virttest/cartesian_config.py
> > index ef91051..04ed2b5 100755
> > --- a/virttest/cartesian_config.py
> > +++ b/virttest/cartesian_config.py
> > @@ -145,6 +145,74 @@ class MissingIncludeError:
> >  num_failed_cases = 5
> >  
> >  
> > +class Label(object):
> > +__slots__ = ["name", "var_name", "long_name", "hash_val", "hash_var"]
> > +
> > +def __init__(self, name, next_name=None):
> > +if next_name is None:
> > +self.name = name
> > +self.var_name = None
> > +else:
> > +self.name = next_name
> > +self.var_name = name
> > +
> > +if self.var_name is None:
> > +self.long_name = "%s" % (self.name)
> > +else:
> > +self.long_name = "%s>%s" % (self.var_name, self.name)
> > +
> > +self.hash_val = self.hash_name()
> > +self.hash_var = None
> > +if self.var_name:
> > +self.hash_var = self.hash_variant()
> > +
> > +
> > +def __str__(self):
> > +return self.long_name
> > +
> > +
> > +def __repr__(self):
> > +return self.long_name
> > +
> > +
> > +def __eq__(self, o):
> > +"""
> > +The comparison is asymmetric due to optimization.
> > +"""
> > +if o.var_name:
> > +if self.long_name == o.long_name:
> > +return True
> > +else:
> > +if self.name == o.name:
> > +return True
> > +return False
> > +
> > +
> > +def __ne__(self, o):
> > +"""
> > +The comparison is asymmetric due to optimization.
> > +"""
> > +if o.var_name:
> > +if self.long_name != o.long_name:
> > +return True
> > +else:
> > +if self.name != o.name:
> > +return True
> > +return False
> > +
> > +
> > +def __hash__(self):
> > +return self.hash_val
> > +
> > +
> > +def hash_name(self):
> > +return sum([i + 1 * ord(x) for i, x in enumerate(self.name)])
> > +
> > +
> > +def hash_variant(self):
> > +return sum([i + 1 * ord(x) for i, x in enumerate(str(self))])
> > +
> > +
> >  class Node(object):
> >  __slots__ = ["name", "dep", "content", "children", "labels",
> >   "append_to_shortname", "failed_cases", "default"]
> > @@ -212,18 +280,19 @@ class Filter(object):
> >  def __init__(self, s):
> >  self.filter = []
> >  for char in s:
> > -if not (char.isalnum() or char.isspace() or char in ".,_-"):
> > +if not (char.isalnum() or cha

Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:56:11PM +0200, Paolo Bonzini wrote:
> Il 04/04/2013 15:10, Michael S. Tsirkin ha scritto:
> > > I would like Avi to comment on this, because I think this is not the
> > > "memory-API approved" way of doing things.  You need KVM to define its
> > > own AddressSpace, and make KVM's listener use
> > > memory_region_to_address_space to figure out if it is for PV MMIO.
> > > 
> > > To handle accesses from TCG, the PV AddressSpace can simply have just an
> > > alias to the actual MemoryRegion where the doorbell is.
> > 
> > This is not really different from other eventfd flags like datamatch.
> > Separate address space is not appropriate here.
> > This is regular memory space, KVM eventfd simply has a flag that says
> > "guest is well behaved so please make eventfd go faster".
> 
> Having a separate address space would match what you do in the kernel
> though.

Not really, in kernel the device is as usual on the MMIO
bus. The pv_mmio_bus is an extra datastructure to enable lookups where
length is not known in an efficient way. Userspace does not need to know
about these details.

> I just don't like functions with a dozen arguments...  Can you just make
> datamatch and pv a single flags argument, as is the case with the KVM API?
> 
> Paolo

Extra parameters are detected by compiler :)
I don't really care, anyone else objects to changing
datamatch to a flags argument?

Please let me know to avoid changing it back and forth.


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 15:33, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:56, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:45, Gleb Natapov wrote:
 
> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>> 
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can 
>>> look
>>> up the eventfd directly, bypassing emulation.
>>> 
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>> 
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>> 
>>> At the moment, this optimization is only supported for EPT on x86 
>>> and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>> 
>>> TODO: NPT, MMU and non x86 architectures.
>>> 
>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>> 
>>> Signed-off-by: Michael S. Tsirkin 
>> 
>> This still uses page fault intercepts which are orders of magnitudes 
>> slower than hypercalls. Why don't you just create a PV MMIO 
>> hypercall that the guest can use to invoke MMIO accesses towards the 
>> host based on physical addresses with explicit length encodings?
>> 
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
> 
>> That way you simplify and speed up all code paths, exceeding the 
>> speed of PIO exits even. It should also be quite easily portable, as 
>> all other platforms have hypercalls available as well.
>> 
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible
 
 Also, how is this not guest visible? Who sets 
 KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
 that the guest does so, so it is guest visible.
 
>>> QEMU sets it.
>> 
>> How does QEMU know?
>> 
> Knows what? When to create such eventfd? virtio device knows.
 
 Where does it know from?
 
>>> It does it always.
>>> 
> 
>>> 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this 
 address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
>>> Virtio spec will state that access to a kick register needs to be of
>>> specific length. This is reasonable thing for HW to ask.
>> 
>> This is a spec change. So the guest would have to indicate that it 
>> adheres to a newer spec. Thus it's a guest visible change.
>> 
> There is not virtio spec that has kick register in MMIO. The spec is in
> the works AFAIK. Actually PIO will not be deprecated and my suggestion
 
 So the guest would indicate that it supports a newer revision of the spec 
 (in your case, that it supports MMIO). How is that any different from 
 exposing that it supports a PV MMIO hcall?
 
>>> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
>>> not configured. All driver will not work for virtio devices with MMIO
>>> bar, but not PIO bar.
>> 
>> I can't parse that, sorry :).
> 
> It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
> QEMU tells KVM which address driver uses, to make exits faster.  This is not
> different from how eventfd works.  For example if exits to QEMU suddenly 
> become
> very cheap we can remove eventfd completely.
> 
>>> 
> is to move to MMIO only when PIO address space is exhausted. For PCI it
> will be never, for PCI-e it will be after ~16 devices.
 
 Ok, let's go back

Re: [PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code

2013-04-04 Thread Christoffer Dall
On Thu, Apr 4, 2013 at 3:47 AM, Marc Zyngier  wrote:
> On 04/04/13 00:15, Christoffer Dall wrote:
>> On Wed, Apr 03, 2013 at 11:00:39AM +0100, Marc Zyngier wrote:
>>> On 03/04/13 10:50, Will Deacon wrote:
 On Tue, Apr 02, 2013 at 02:25:12PM +0100, Marc Zyngier wrote:
> We're about to move to a init procedure where we rely on the
> fact that the init code fits in a single page. Make sure we
> align the idmap text on a page boundary, and that the code is
> not bigger than a single page.
>
> Signed-off-by: Marc Zyngier 
> ---
>  arch/arm/kernel/vmlinux.lds.S | 2 +-
>  arch/arm/kvm/init.S   | 7 +++
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
> index b571484..d9dd265 100644
> --- a/arch/arm/kernel/vmlinux.lds.S
> +++ b/arch/arm/kernel/vmlinux.lds.S
> @@ -20,7 +20,7 @@
>VMLINUX_SYMBOL(__idmap_text_start) = .; \
>*(.idmap.text)  \
>VMLINUX_SYMBOL(__idmap_text_end) = .;   \
> -  ALIGN_FUNCTION();   \
> +  . = ALIGN(PAGE_SIZE);   \
>VMLINUX_SYMBOL(__hyp_idmap_text_start) = .; \
>*(.hyp.idmap.text)  \
>VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
> diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
> index 9f37a79..35a463f 100644
> --- a/arch/arm/kvm/init.S
> +++ b/arch/arm/kvm/init.S
> @@ -111,4 +111,11 @@ __do_hyp_init:
>.globl __kvm_hyp_init_end
>  __kvm_hyp_init_end:
>
> +  /*
> +   * The above code *must* fit in a single page for the trampoline
> +   * madness to work. Whoever decides to change it must make sure
> +   * we map the right amount of memory for the trampoline to work.
> +   * The line below ensures any breakage will get noticed.
>>
>> This comment is a little funny. What's this about changing it? And why
>> madness? OK, it's a little mad.
>>
>> I think the nice thing to explain here is why .org would complain - I
>> had to look in the AS reference to figure out that it will complain if
>> the location pointer is forwarded outside the section - if that is the
>> right reason?
>>
> +   */
> +  .org__kvm_hyp_init + PAGE_SIZE
>.popsection

 What effect does this have on the size of the kernel image? I'd expect the
 idmap code to be pretty small, so aligning to a page might be overkill a 
 lot
 of the time.
>>>
>>> So we're indeed wasting memory between the kernel and the HYP idmaps.
>>> The reason for this is to ensure that the init code is not spread across
>>> two pages, which would make the trampoline harder to allocate.
>>>
>>> If there's a way to ensure that the init.S code will fit a single page,
>>> then we can remove this ALIGN(PAGE_SIZE).
>>
>> Just allocate a temporary page and copy the bit of code into there?  It
>> should be easy enough to make sure it's location independent.  We can
>> even give that page back later if we want and re-do the trick when a CPU
>> gets hot-plugged if we want. No?
>
> I'd really like to avoid freeing stuff, because it gives us even more
> chances to screw our TLBs (we'd have to invalidate on unmap, and deal
> with potential races between CPUs coming up).
>
> We could allocate this page *if* the HYP init code happens to cross a
> page boundary, and only then. I'll have a look at implementing this as
> it should solve Will's issue (which is even worse on arm64 with 64k pages).
>
The freeing stuff was a minor point, that I don't care much about, but
reducing the kernel size is a valid concern I guess.

-Christoffer
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:14:57PM +0300, Gleb Natapov wrote:
> On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> > 
> > On 04.04.2013, at 14:56, Gleb Natapov wrote:
> > 
> > > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> > >> 
> > >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> > >> 
> > >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >  
> >  On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >  
> > > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> > >> 
> > >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> > >> 
> > >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >  
> >  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >  
> > > With KVM, MMIO is much slower than PIO, due to the need to
> > > do page walk and emulation. But with EPT, it does not have to be: 
> > > we
> > > know the address from the VMCS so if the address is unique, we 
> > > can look
> > > up the eventfd directly, bypassing emulation.
> > > 
> > > Add an interface for userspace to specify this per-address, we can
> > > use this e.g. for virtio.
> > > 
> > > The implementation adds a separate bus internally. This serves two
> > > purposes:
> > > - minimize overhead for old userspace that does not use PV MMIO
> > > - minimize disruption in other code (since we don't know the 
> > > length,
> > > devices on the MMIO bus only get a valid address in write, this
> > > way we don't need to touch all devices to teach them handle
> > > an dinvalid length)
> > > 
> > > At the moment, this optimization is only supported for EPT on x86 
> > > and
> > > silently ignored for NPT and MMU, so everything works correctly 
> > > but
> > > slowly.
> > > 
> > > TODO: NPT, MMU and non x86 architectures.
> > > 
> > > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > > pre-review and suggestions.
> > > 
> > > Signed-off-by: Michael S. Tsirkin 
> >  
> >  This still uses page fault intercepts which are orders of 
> >  magnitudes slower than hypercalls. Why don't you just create a PV 
> >  MMIO hypercall that the guest can use to invoke MMIO accesses 
> >  towards the host based on physical addresses with explicit length 
> >  encodings?
> >  
> > >>> It is slower, but not an order of magnitude slower. It become faster
> > >>> with newer HW.
> > >>> 
> >  That way you simplify and speed up all code paths, exceeding the 
> >  speed of PIO exits even. It should also be quite easily portable, 
> >  as all other platforms have hypercalls available as well.
> >  
> > >>> We are trying to avoid PV as much as possible (well this is also PV,
> > >>> but not guest visible
> > >> 
> > >> Also, how is this not guest visible? Who sets 
> > >> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition 
> > >> indicates that the guest does so, so it is guest visible.
> > >> 
> > > QEMU sets it.
> >  
> >  How does QEMU know?
> >  
> > >>> Knows what? When to create such eventfd? virtio device knows.
> > >> 
> > >> Where does it know from?
> > >> 
> > > It does it always.
> > > 
> > >>> 
> > > 
> > >> +/*
> > >> + * PV_MMIO - Guest can promise us that all accesses touching this 
> > >> address
> > >> + * are writes of specified length, starting at the specified 
> > >> address.
> > >> + * If not - it's a Guest bug.
> > >> + * Can not be used together with either PIO or DATAMATCH.
> > >> + */
> > >> 
> > > Virtio spec will state that access to a kick register needs to be of
> > > specific length. This is reasonable thing for HW to ask.
> >  
> >  This is a spec change. So the guest would have to indicate that it 
> >  adheres to a newer spec. Thus it's a guest visible change.
> >  
> > >>> There is not virtio spec that has kick register in MMIO. The spec is in
> > >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> > >> 
> > >> So the guest would indicate that it supports a newer revision of the 
> > >> spec (in your case, that it supports MMIO). How is that any different 
> > >> from exposing that it supports a PV MMIO hcall?
> > >> 
> > > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > > not configured. All driver will not work for virtio devices with MMIO
> > > bar, but not PIO bar.
> > 
> > I can't parse that, sorry :).
> > 
> I am sure MST can explain it better, but I'll try one more time.
> Device will have two BARs with kick register one is PIO another is MMIO.
> Old driver works only with PIO 

RE: [PATCH] bookehv: Handle debug exception on guest exit

2013-04-04 Thread Bhushan Bharat-R65777


> -Original Message-
> From: Alexander Graf [mailto:ag...@suse.de]
> Sent: Thursday, April 04, 2013 6:55 PM
> To: Bhushan Bharat-R65777
> Cc: linuxppc-...@lists.ozlabs.org; kvm@vger.kernel.org; 
> kvm-...@vger.kernel.org;
> Wood Scott-B07421; Bhushan Bharat-R65777
> Subject: Re: [PATCH] bookehv: Handle debug exception on guest exit
> 
> 
> On 20.03.2013, at 18:45, Bharat Bhushan wrote:
> 
> > EPCR.DUVD controls whether the debug events can come in hypervisor
> > mode or not. When KVM guest is using the debug resource then we do not
> > want debug events to be captured in guest entry/exit path. So we set
> > EPCR.DUVD when entering and clears EPCR.DUVD when exiting from guest.
> >
> > Debug instruction complete is a post-completion debug exception but
> > debug event gets posted on the basis of MSR before the instruction is
> > executed. Now if the instruction switches the context from guest mode
> > (MSR.GS = 1) to hypervisor mode (MSR.GS = 0) then the xSRR0 points to
> > first instruction of KVM handler and xSRR1 points that MSR.GS is clear
> > (hypervisor context). Now as xSRR1.GS is used to decide whether KVM
> > handler will be invoked to handle the exception or host host kernel
> > debug handler will be invoked to handle the exception.
> > This leads to host kernel debug handler handling the exception which
> > should either be handled by KVM.
> >
> > This is tested on e500mc in 32 bit mode
> >
> > Signed-off-by: Bharat Bhushan 
> > ---
> > v0:
> > - Do not apply this change for debug_crit as we do not know those chips have
> issue or not.
> > - corrected 64bit case branching
> >
> > arch/powerpc/kernel/exceptions-64e.S |   29 -
> > arch/powerpc/kernel/head_booke.h |   26 ++
> > 2 files changed, 54 insertions(+), 1 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/exceptions-64e.S
> > b/arch/powerpc/kernel/exceptions-64e.S
> > index 4684e33..8b26294 100644
> > --- a/arch/powerpc/kernel/exceptions-64e.S
> > +++ b/arch/powerpc/kernel/exceptions-64e.S
> > @@ -516,6 +516,33 @@ kernel_dbg_exc:
> > andis.  r15,r14,DBSR_IC@h
> > beq+1f
> >
> > +#ifdef CONFIG_KVM_BOOKE_HV
> > +   /*
> > +* EPCR.DUVD controls whether the debug events can come in
> > +* hypervisor mode or not. When KVM guest is using the debug
> > +* resource then we do not want debug events to be captured
> > +* in guest entry/exit path. So we set EPCR.DUVD when entering
> > +* and clears EPCR.DUVD when exiting from guest.
> > +* Debug instruction complete is a post-completion debug
> > +* exception but debug event gets posted on the basis of MSR
> > +* before the instruction is executed. Now if the instruction
> > +* switches the context from guest mode (MSR.GS = 1) to hypervisor
> > +* mode (MSR.GS = 0) then the xSRR0 points to first instruction of
> 
> Can't we just execute that code path with MSR.DE=0?

Single stepping uses DBCR0.IC (instruction complete).
Can you describe how MSR.DE = 0 will work?

> 
> 
> Alex
> 
> > +* KVM handler and xSRR1 points that MSR.GS is clear
> > +* (hypervisor context). Now as xSRR1.GS is used to decide whether
> > +* KVM handler will be invoked to handle the exception or host
> > +* host kernel debug handler will be invoked to handle the exception.
> > +* This leads to host kernel debug handler handling the exception
> > +* which should either be handled by KVM.
> > +*/
> > +   mfspr   r10, SPRN_EPCR
> > +   andis.  r10,r10,SPRN_EPCR_DUVD@h
> > +   beq+2f
> > +
> > +   andis.  r10,r9,MSR_GS@h
> > +   beq+3f
> > +2:
> > +#endif
> > LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
> > LOAD_REG_IMMEDIATE(r15,interrupt_end_book3e)
> > cmpld   cr0,r10,r14
> > @@ -523,7 +550,7 @@ kernel_dbg_exc:
> > blt+cr0,1f
> > bge+cr1,1f
> >
> > -   /* here it looks like we got an inappropriate debug exception. */
> > +3: /* here it looks like we got an inappropriate debug exception. */
> > lis r14,DBSR_IC@h   /* clear the IC event */
> > rlwinm  r11,r11,0,~MSR_DE   /* clear DE in the DSRR1 value */
> > mtspr   SPRN_DBSR,r14
> > diff --git a/arch/powerpc/kernel/head_booke.h
> > b/arch/powerpc/kernel/head_booke.h
> > index 5f051ee..edc6a3b 100644
> > --- a/arch/powerpc/kernel/head_booke.h
> > +++ b/arch/powerpc/kernel/head_booke.h
> > @@ -285,7 +285,33 @@ label:
> > mfspr   r10,SPRN_DBSR;  /* check single-step/branch taken */  \
> > andis.  r10,r10,(DBSR_IC|DBSR_BT)@h;  \
> > beq+2f;   \
> > +#ifdef CONFIG_KVM_BOOKE_HV   \
> > +   /*\
> > +* EPCR.DUVD controls whether the debug events can come in\
> > +* hypervisor mode or not. When KVM guest is using the debug  \
> > +*

Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Paolo Bonzini
Il 04/04/2013 15:10, Michael S. Tsirkin ha scritto:
> > I would like Avi to comment on this, because I think this is not the
> > "memory-API approved" way of doing things.  You need KVM to define its
> > own AddressSpace, and make KVM's listener use
> > memory_region_to_address_space to figure out if it is for PV MMIO.
> > 
> > To handle accesses from TCG, the PV AddressSpace can simply have just an
> > alias to the actual MemoryRegion where the doorbell is.
> 
> This is not really different from other eventfd flags like datamatch.
> Separate address space is not appropriate here.
> This is regular memory space, KVM eventfd simply has a flag that says
> "guest is well behaved so please make eventfd go faster".

Having a separate address space would match what you do in the kernel
though.

I just don't like functions with a dozen arguments...  Can you just make
datamatch and pv a single flags argument, as is the case with the KVM API?

Paolo

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 04:02:57PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>  
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
>  
>  This still uses page fault intercepts which are orders of magnitudes 
>  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
>  that the guest can use to invoke MMIO accesses towards the host based on 
>  physical addresses with explicit length encodings?
>  
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
>  That way you simplify and speed up all code paths, exceeding the speed 
>  of PIO exits even. It should also be quite easily portable, as all other 
>  platforms have hypercalls available as well.
>  
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible). We haven't replaced PIO with hypercall for the
> >>> same reason. My hope is that future HW will provide us with instruction
> >>> decode for basic mov instruction at which point this optimisation can be
> >>> dropped.
> >> 
> >> The same applies to an MMIO hypercall. Once the PV interface becomes 
> >> obsolete, we can drop the capability we expose to the guest.
> > 
> > Yes but unlike a hypercall this optimization does not need special code
> > in the guest. You can use standard OS interfaces for memory access.
> 
> Yes, but let's try to understand the room for optimization we're
> talking about here. The "normal" MMIO case seems excessively slower in
> MST's benchmarks than it did in mine. So maybe we're really just
> looking at a bug here.

Could be. I posted the code (kvm,qemu and test) so please review and try
to spot a bug.

> Also, if hcalls are again only 50% of a fast MMIO callback, it's
> certainly worth checking out what room for improvement we're really
> wasting.
> 
> 
> Alex

Take a look at 'kvm: pci PORT IO MMIO and PV MMIO speed tests'.
Try running the test on your hardware and see what happens.
Or post the test you used, I can try it on my box if you like.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM Guest Lock up (100%) again!

2013-04-04 Thread Phil Daws
One of my KVM guests locked up again at 100% CPU!  Any thoughts on how I can 
diagnose it ? We would love to put into production but am very concerned about 
the current stability. I have tried to re-direct the console, through screen, 
to see whether there is a spin lock that is causing the problem; though all I 
got in the log file was a login prompt. What is the correct way of redirecting 
the console in KVM on a CentOS 6.4 system please ? Thanks.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 14:38, Gleb Natapov wrote:
>  
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>  
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can 
> > look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 
> > and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
>  
>  This still uses page fault intercepts which are orders of magnitudes 
>  slower than hypercalls. Why don't you just create a PV MMIO 
>  hypercall that the guest can use to invoke MMIO accesses towards the 
>  host based on physical addresses with explicit length encodings?
>  
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
>  That way you simplify and speed up all code paths, exceeding the 
>  speed of PIO exits even. It should also be quite easily portable, as 
>  all other platforms have hypercalls available as well.
>  
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets 
> >> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >> that the guest does so, so it is guest visible.
> >> 
> > QEMU sets it.
>  
>  How does QEMU know?
>  
> >>> Knows what? When to create such eventfd? virtio device knows.
> >> 
> >> Where does it know from?
> >> 
> > It does it always.
> > 
> >>> 
> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this 
> >> address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
>  
>  This is a spec change. So the guest would have to indicate that it 
>  adheres to a newer spec. Thus it's a guest visible change.
>  
> >>> There is not virtio spec that has kick register in MMIO. The spec is in
> >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >> 
> >> So the guest would indicate that it supports a newer revision of the spec 
> >> (in your case, that it supports MMIO). How is that any different from 
> >> exposing that it supports a PV MMIO hcall?
> >> 
> > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > not configured. All driver will not work for virtio devices with MMIO
> > bar, but not PIO bar.
> 
> I can't parse that, sorry :).

It's simple. Driver does iowrite16 or whatever is appropriate for the OS.
QEMU tells KVM which address driver uses, to make exits faster.  This is not
different from how eventfd works.  For example if exits to QEMU suddenly become
very cheap we can remove eventfd completely.

> > 
> >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> >>> will be never, for PCI-e it will be after ~16 devices.
> >> 
> >> Ok, let's go back a step here. Are you actually able to measure any speed 
> >> in performan

Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:23:43PM +0100, Peter Maydell wrote:
> On 4 April 2013 14:11, Michael S. Tsirkin  wrote:
> > On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
> >> Can we keep linux-headers updates in separate patches, please
> >> (with commit messages saying what kernel commit/version the
> >> sync is with)?
> 
> > OK, first it needs to get into linux though :).
> 
> In that case you need an RFC tag on your patches :-)
> 
> -- PMM

The cover letter said patches 5 and 6 are RFC,
but patches 1-4 are ready and should go in I think.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Peter Maydell
On 4 April 2013 14:11, Michael S. Tsirkin  wrote:
> On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
>> Can we keep linux-headers updates in separate patches, please
>> (with commit messages saying what kernel commit/version the
>> sync is with)?

> OK, first it needs to get into linux though :).

In that case you need an RFC tag on your patches :-)

-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 03:56:49PM +0300, Gleb Natapov wrote:
> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> > 
> > On 04.04.2013, at 14:45, Gleb Natapov wrote:
> > 
> > > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> > >> 
> > >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> > >> 
> > >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >  
> >  On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >  
> > > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> > >> 
> > >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> > >> 
> > >>> With KVM, MMIO is much slower than PIO, due to the need to
> > >>> do page walk and emulation. But with EPT, it does not have to be: we
> > >>> know the address from the VMCS so if the address is unique, we can 
> > >>> look
> > >>> up the eventfd directly, bypassing emulation.
> > >>> 
> > >>> Add an interface for userspace to specify this per-address, we can
> > >>> use this e.g. for virtio.
> > >>> 
> > >>> The implementation adds a separate bus internally. This serves two
> > >>> purposes:
> > >>> - minimize overhead for old userspace that does not use PV MMIO
> > >>> - minimize disruption in other code (since we don't know the length,
> > >>> devices on the MMIO bus only get a valid address in write, this
> > >>> way we don't need to touch all devices to teach them handle
> > >>> an dinvalid length)
> > >>> 
> > >>> At the moment, this optimization is only supported for EPT on x86 
> > >>> and
> > >>> silently ignored for NPT and MMU, so everything works correctly but
> > >>> slowly.
> > >>> 
> > >>> TODO: NPT, MMU and non x86 architectures.
> > >>> 
> > >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > >>> pre-review and suggestions.
> > >>> 
> > >>> Signed-off-by: Michael S. Tsirkin 
> > >> 
> > >> This still uses page fault intercepts which are orders of magnitudes 
> > >> slower than hypercalls. Why don't you just create a PV MMIO 
> > >> hypercall that the guest can use to invoke MMIO accesses towards the 
> > >> host based on physical addresses with explicit length encodings?
> > >> 
> > > It is slower, but not an order of magnitude slower. It become faster
> > > with newer HW.
> > > 
> > >> That way you simplify and speed up all code paths, exceeding the 
> > >> speed of PIO exits even. It should also be quite easily portable, as 
> > >> all other platforms have hypercalls available as well.
> > >> 
> > > We are trying to avoid PV as much as possible (well this is also PV,
> > > but not guest visible
> >  
> >  Also, how is this not guest visible? Who sets 
> >  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >  that the guest does so, so it is guest visible.
> >  
> > >>> QEMU sets it.
> > >> 
> > >> How does QEMU know?
> > >> 
> > > Knows what? When to create such eventfd? virtio device knows.
> > 
> > Where does it know from?
> > 
> It does it always.
> 
> > > 
> > >>> 
> >  +/*
> >  + * PV_MMIO - Guest can promise us that all accesses touching this 
> >  address
> >  + * are writes of specified length, starting at the specified address.
> >  + * If not - it's a Guest bug.
> >  + * Can not be used together with either PIO or DATAMATCH.
> >  + */
> >  
> > >>> Virtio spec will state that access to a kick register needs to be of
> > >>> specific length. This is reasonable thing for HW to ask.
> > >> 
> > >> This is a spec change. So the guest would have to indicate that it 
> > >> adheres to a newer spec. Thus it's a guest visible change.
> > >> 
> > > There is not virtio spec that has kick register in MMIO. The spec is in
> > > the works AFAIK. Actually PIO will not be deprecated and my suggestion
> > 
> > So the guest would indicate that it supports a newer revision of the spec 
> > (in your case, that it supports MMIO). How is that any different from 
> > exposing that it supports a PV MMIO hcall?
> > 
> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> not configured. All driver will not work for virtio devices with MMIO
> bar, but not PIO bar.
> 
> > > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > > will be never, for PCI-e it will be after ~16 devices.
> > 
> > Ok, let's go back a step here. Are you actually able to measure any speed 
> > in performance with this patch applied and without when going through MMIO 
> > kicks?
> > 
> > 
> That's the question for MST. I think he did only micro benchmarks till
> now and he already posted his result here:
> 
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
> 
> So the patch speedup mmio by almost 100% and it is almost the same as PIO.

Exa

Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 01:50:36PM +0100, Peter Maydell wrote:
> On 4 April 2013 11:40, Michael S. Tsirkin  wrote:
> > Add an option for users to specify PV eventfd
> > listeners, and utilize KVM's PV MMIO underneath.
> > Upodate all callers.
> >
> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  hw/dataplane/hostmem.c|  1 +
> >  hw/ivshmem.c  |  2 ++
> >  hw/pci-testdev.c  |  2 ++
> >  hw/vhost.c|  4 ++--
> >  hw/virtio-pci.c   |  4 ++--
> >  include/exec/memory.h | 10 ++
> >  kvm-all.c | 15 ---
> >  linux-headers/linux/kvm.h |  8 
> >  memory.c  |  9 +++--
> 
> Can we keep linux-headers updates in separate patches, please
> (with commit messages saying what kernel commit/version the
> sync is with)?
> 
> thanks
> -- PMM

OK, first it needs to get into linux though :).

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:46:11PM +0200, Paolo Bonzini wrote:
> Il 04/04/2013 12:40, Michael S. Tsirkin ha scritto:
> > Add an option for users to specify PV eventfd
> > listeners, and utilize KVM's PV MMIO underneath.
> > Upodate all callers.
> 
> I would like Avi to comment on this, because I think this is not the
> "memory-API approved" way of doing things.  You need KVM to define its
> own AddressSpace, and make KVM's listener use
> memory_region_to_address_space to figure out if it is for PV MMIO.
> 
> To handle accesses from TCG, the PV AddressSpace can simply have just an
> alias to the actual MemoryRegion where the doorbell is.
> 
> Paolo

This is not really different from other eventfd flags like datamatch.
Separate address space is not appropriate here.
This is regular memory space, KVM eventfd simply has a flag that says
"guest is well behaved so please make eventfd go faster".

> > Signed-off-by: Michael S. Tsirkin 
> > ---
> >  hw/dataplane/hostmem.c|  1 +
> >  hw/ivshmem.c  |  2 ++
> >  hw/pci-testdev.c  |  2 ++
> >  hw/vhost.c|  4 ++--
> >  hw/virtio-pci.c   |  4 ++--
> >  include/exec/memory.h | 10 ++
> >  kvm-all.c | 15 ---
> >  linux-headers/linux/kvm.h |  8 
> >  memory.c  |  9 +++--
> >  9 files changed, 46 insertions(+), 9 deletions(-)
> > 
> > diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
> > index 380537e..41d43e7 100644
> > --- a/hw/dataplane/hostmem.c
> > +++ b/hw/dataplane/hostmem.c
> > @@ -126,6 +126,7 @@ static void 
> > hostmem_listener_section_dummy(MemoryListener *listener,
> >  
> >  static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
> > MemoryRegionSection *section,
> > +   bool pv,
> > bool match_data, uint64_t data,
> > EventNotifier *e)
> >  {
> > diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> > index 68a2cf2..28e5978 100644
> > --- a/hw/ivshmem.c
> > +++ b/hw/ivshmem.c
> > @@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int 
> > posn, int i)
> >  memory_region_add_eventfd(&s->ivshmem_mmio,
> >DOORBELL,
> >4,
> > +  false,
> >true,
> >(posn << 16) | i,
> >&s->peers[posn].eventfds[i]);
> > @@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int 
> > posn, int i)
> >  memory_region_del_eventfd(&s->ivshmem_mmio,
> >DOORBELL,
> >4,
> > +  false,
> >true,
> >(posn << 16) | i,
> >&s->peers[posn].eventfds[i]);
> > diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
> > index 9486624..f0ebf99 100644
> > --- a/hw/pci-testdev.c
> > +++ b/hw/pci-testdev.c
> > @@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
> >  memory_region_add_eventfd(test->mr,
> >le32_to_cpu(test->hdr->offset),
> >test->size,
> > +  false,
> >test->match_data,
> >test->hdr->data,
> >&test->notifier);
> > @@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
> >  memory_region_del_eventfd(test->mr,
> >le32_to_cpu(test->hdr->offset),
> >test->size,
> > +  false,
> >test->match_data,
> >test->hdr->data,
> >&test->notifier);
> > diff --git a/hw/vhost.c b/hw/vhost.c
> > index 4d6aee3..9ad7101 100644
> > --- a/hw/vhost.c
> > +++ b/hw/vhost.c
> > @@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev 
> > *dev,
> >  }
> >  
> >  static void vhost_eventfd_add(MemoryListener *listener,
> > -  MemoryRegionSection *section,
> > +  MemoryRegionSection *section, bool pv,
> >bool match_data, uint64_t data, 
> > EventNotifier *e)
> >  {
> >  }
> >  
> >  static void vhost_eventfd_del(MemoryListener *listener,
> > -  MemoryRegionSection *section,
> > +  MemoryRegionSection *section, bool pv,
> >bool match_data, uint64_t data, 
> > EventNotifier *e)
> >  {
> >  }
> > diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> > index 19965e5..f9ff994 100644
> > --- a/hw/virtio-pci.c
> > +++ b/hw/virtio-pci.c
> > @@ -189,10 +189,10 

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>  
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
>  
>  This still uses page fault intercepts which are orders of magnitudes 
>  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
>  that the guest can use to invoke MMIO accesses towards the host based on 
>  physical addresses with explicit length encodings?
>  
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
>  That way you simplify and speed up all code paths, exceeding the speed 
>  of PIO exits even. It should also be quite easily portable, as all other 
>  platforms have hypercalls available as well.
>  
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
> >> The comment above its definition indicates that the guest does so, so it 
> >> is guest visible.
> >> 
> > QEMU sets it.
> 
> How does QEMU know?
> 
> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
> 
> This is a spec change. So the guest would have to indicate that it adheres to 
> a newer spec. Thus it's a guest visible change.
> 
> 
> Alex

To use MMIO is a spec change, yes.
What Gleb refers to is that the optimization is not guest visible.
If you have hardware that can quickly decode MMIO you can
drop the PV ioctl from qemu without changing guests.


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
> 
> Also, how is this not guest visible?

It's visible but it does not require special code in the guest.
Only on qemu. Guest can use standard iowritel/w/b.

> Who sets
> KVM_IOEVENTFD_FLAG_PV_MMIO?

It's an ioctl so qemu does this.

> The comment above its definition indicates
> that the guest does so, so it is guest visible.
> 
> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
> 
> 
> Alex

The requirement is to only access a specific address by a single aligned
instruction.  For example, only aligned signle-word accesses are
allowed.

This is a standard practice with many devices, in fact,
virtio spec already requires this for notifications.


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:58, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>> 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible). We haven't replaced PIO with hypercall for the
>>> same reason. My hope is that future HW will provide us with instruction
>>> decode for basic mov instruction at which point this optimisation can be
>>> dropped.
>> 
>> The same applies to an MMIO hypercall. Once the PV interface becomes 
>> obsolete, we can drop the capability we expose to the guest.
> 
> Yes but unlike a hypercall this optimization does not need special code
> in the guest. You can use standard OS interfaces for memory access.

Yes, but let's try to understand the room for optimization we're talking about 
here. The "normal" MMIO case seems excessively slower in MST's benchmarks than 
it did in mine. So maybe we're really just looking at a bug here.

Also, if hcalls are again only 50% of a fast MMIO callback, it's certainly 
worth checking out what room for improvement we're really wasting.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
> 
> The same applies to an MMIO hypercall. Once the PV interface becomes 
> obsolete, we can drop the capability we expose to the guest.

Yes but unlike a hypercall this optimization does not need special code
in the guest. You can use standard OS interfaces for memory access.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
> 
> Can't we simply register a hypercall ID range with Microsoft?
> 
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
> 
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
> 
> 
> Alex


-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: [PATCH 1/7 v2] KVM: PPC: e500: Expose MMU registers via ONE_REG

2013-04-04 Thread Caraman Mihai Claudiu-B02008
> -Original Message-
> From: Wood Scott-B07421
> Sent: Wednesday, March 27, 2013 12:43 AM
> To: Caraman Mihai Claudiu-B02008
> Cc: kvm-...@vger.kernel.org; kvm@vger.kernel.org; ag...@suse.de
> Subject: Re: [PATCH 1/7 v2] KVM: PPC: e500: Expose MMU registers via
> ONE_REG
> 
> On 03/26/2013 05:05:06 PM, Mihai Caraman wrote:
> > +int kvmppc_set_one_reg_e500_tlb(struct kvm_vcpu *vcpu, u64 id,
> > +  union kvmppc_one_reg *val)
> > +{
> > +   int r = 0;
> > +   long int i;
> > +
> > +   switch (id) {
> > +   case KVM_REG_PPC_MAS0:
> > +   vcpu->arch.shared->mas0 = set_reg_val(id, *val);
> > +   break;
> > +   case KVM_REG_PPC_MAS1:
> > +   vcpu->arch.shared->mas1 = set_reg_val(id, *val);
> > +   break;
> > +   case KVM_REG_PPC_MAS2:
> > +   vcpu->arch.shared->mas2 = set_reg_val(id, *val);
> > +   break;
> > +   case KVM_REG_PPC_MAS7_3:
> > +   vcpu->arch.shared->mas7_3 = set_reg_val(id, *val);
> > +   break;
> > +   case KVM_REG_PPC_MAS4:
> > +   vcpu->arch.shared->mas4 = set_reg_val(id, *val);
> > +   break;
> > +   case KVM_REG_PPC_MAS6:
> > +   vcpu->arch.shared->mas6 = set_reg_val(id, *val);
> > +   break;
> > +   /* Only allow MMU registers to be set to the config supported
> > by KVM */
> > +   case KVM_REG_PPC_MMUCFG: {
> > +   if (set_reg_val(id, *val) != vcpu->arch.mmucfg)
> > +   r = -EINVAL;
> > +   break;
> > +   }
> > +   case KVM_REG_PPC_TLB0CFG:
> > +   case KVM_REG_PPC_TLB1CFG:
> > +   case KVM_REG_PPC_TLB2CFG:
> > +   case KVM_REG_PPC_TLB3CFG: {
> > +   /* MMU geometry (N_ENTRY/ASSOC) can be set only using
> > SW_TLB */
> > +   i = id - KVM_REG_PPC_TLB0CFG;
> > +   if (set_reg_val(id, *val) != vcpu->arch.tlbcfg[i])
> > +   r = -EINVAL;
> > +   break;
> > +   }
> 
> Am I the only one that finds the set_reg_val/get_reg_val naming
> confusing?  At first glance it looks like it sets TLBnCFG and then
> later tests whether it should have. :-P
> 
> Functions should be named for what they do, not for what context you
> use them in.

I also found the existing code difficult to read. If we stick to reg and
val varible names what about val_to_reg/reg_to_val.
 
-Mike

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] bookehv: Handle debug exception on guest exit

2013-04-04 Thread Alexander Graf

On 20.03.2013, at 18:45, Bharat Bhushan wrote:

> EPCR.DUVD controls whether the debug events can come in
> hypervisor mode or not. When KVM guest is using the debug
> resource then we do not want debug events to be captured
> in guest entry/exit path. So we set EPCR.DUVD when entering
> and clears EPCR.DUVD when exiting from guest.
> 
> Debug instruction complete is a post-completion debug
> exception but debug event gets posted on the basis of MSR
> before the instruction is executed. Now if the instruction
> switches the context from guest mode (MSR.GS = 1) to hypervisor
> mode (MSR.GS = 0) then the xSRR0 points to first instruction of
> KVM handler and xSRR1 points that MSR.GS is clear
> (hypervisor context). Now as xSRR1.GS is used to decide whether
> KVM handler will be invoked to handle the exception or host
> host kernel debug handler will be invoked to handle the exception.
> This leads to host kernel debug handler handling the exception
> which should either be handled by KVM.
> 
> This is tested on e500mc in 32 bit mode
> 
> Signed-off-by: Bharat Bhushan 
> ---
> v0:
> - Do not apply this change for debug_crit as we do not know those chips have 
> issue or not.
> - corrected 64bit case branching
> 
> arch/powerpc/kernel/exceptions-64e.S |   29 -
> arch/powerpc/kernel/head_booke.h |   26 ++
> 2 files changed, 54 insertions(+), 1 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/exceptions-64e.S 
> b/arch/powerpc/kernel/exceptions-64e.S
> index 4684e33..8b26294 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -516,6 +516,33 @@ kernel_dbg_exc:
>   andis.  r15,r14,DBSR_IC@h
>   beq+1f
> 
> +#ifdef CONFIG_KVM_BOOKE_HV
> + /*
> +  * EPCR.DUVD controls whether the debug events can come in
> +  * hypervisor mode or not. When KVM guest is using the debug
> +  * resource then we do not want debug events to be captured
> +  * in guest entry/exit path. So we set EPCR.DUVD when entering
> +  * and clears EPCR.DUVD when exiting from guest.
> +  * Debug instruction complete is a post-completion debug
> +  * exception but debug event gets posted on the basis of MSR
> +  * before the instruction is executed. Now if the instruction
> +  * switches the context from guest mode (MSR.GS = 1) to hypervisor
> +  * mode (MSR.GS = 0) then the xSRR0 points to first instruction of

Can't we just execute that code path with MSR.DE=0?


Alex

> +  * KVM handler and xSRR1 points that MSR.GS is clear
> +  * (hypervisor context). Now as xSRR1.GS is used to decide whether
> +  * KVM handler will be invoked to handle the exception or host
> +  * host kernel debug handler will be invoked to handle the exception.
> +  * This leads to host kernel debug handler handling the exception
> +  * which should either be handled by KVM.
> +  */
> + mfspr   r10, SPRN_EPCR
> + andis.  r10,r10,SPRN_EPCR_DUVD@h
> + beq+2f
> +
> + andis.  r10,r9,MSR_GS@h
> + beq+3f
> +2:
> +#endif
>   LOAD_REG_IMMEDIATE(r14,interrupt_base_book3e)
>   LOAD_REG_IMMEDIATE(r15,interrupt_end_book3e)
>   cmpld   cr0,r10,r14
> @@ -523,7 +550,7 @@ kernel_dbg_exc:
>   blt+cr0,1f
>   bge+cr1,1f
> 
> - /* here it looks like we got an inappropriate debug exception. */
> +3:   /* here it looks like we got an inappropriate debug exception. */
>   lis r14,DBSR_IC@h   /* clear the IC event */
>   rlwinm  r11,r11,0,~MSR_DE   /* clear DE in the DSRR1 value */
>   mtspr   SPRN_DBSR,r14
> diff --git a/arch/powerpc/kernel/head_booke.h 
> b/arch/powerpc/kernel/head_booke.h
> index 5f051ee..edc6a3b 100644
> --- a/arch/powerpc/kernel/head_booke.h
> +++ b/arch/powerpc/kernel/head_booke.h
> @@ -285,7 +285,33 @@ label:
>   mfspr   r10,SPRN_DBSR;  /* check single-step/branch taken */  \
>   andis.  r10,r10,(DBSR_IC|DBSR_BT)@h;  \
>   beq+2f;   \
> +#ifdef CONFIG_KVM_BOOKE_HV \
> + /*\
> +  * EPCR.DUVD controls whether the debug events can come in\
> +  * hypervisor mode or not. When KVM guest is using the debug  \
> +  * resource then we do not want debug events to be captured   \
> +  * in guest entry/exit path. So we set EPCR.DUVD when entering\
> +  * and clears EPCR.DUVD when exiting from guest.  \
> +  * Debug instruction complete is a post-completion debug  \
> +  * exception but debug event gets posted on the basis of MSR  \
> +  * before the instruction is executed. Now if the instruction \
> +  * switches the context from guest mode (MSR.GS = 1) to hypervisor

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 03:06:42PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:56, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 14:38, Gleb Natapov wrote:
>  
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>  
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can 
> > look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 
> > and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
>  
>  This still uses page fault intercepts which are orders of magnitudes 
>  slower than hypercalls. Why don't you just create a PV MMIO 
>  hypercall that the guest can use to invoke MMIO accesses towards the 
>  host based on physical addresses with explicit length encodings?
>  
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
>  That way you simplify and speed up all code paths, exceeding the 
>  speed of PIO exits even. It should also be quite easily portable, as 
>  all other platforms have hypercalls available as well.
>  
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets 
> >> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
> >> that the guest does so, so it is guest visible.
> >> 
> > QEMU sets it.
>  
>  How does QEMU know?
>  
> >>> Knows what? When to create such eventfd? virtio device knows.
> >> 
> >> Where does it know from?
> >> 
> > It does it always.
> > 
> >>> 
> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this 
> >> address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
>  
>  This is a spec change. So the guest would have to indicate that it 
>  adheres to a newer spec. Thus it's a guest visible change.
>  
> >>> There is not virtio spec that has kick register in MMIO. The spec is in
> >>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
> >> 
> >> So the guest would indicate that it supports a newer revision of the spec 
> >> (in your case, that it supports MMIO). How is that any different from 
> >> exposing that it supports a PV MMIO hcall?
> >> 
> > Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> > not configured. All driver will not work for virtio devices with MMIO
> > bar, but not PIO bar.
> 
> I can't parse that, sorry :).
> 
I am sure MST can explain it better, but I'll try one more time.
Device will have two BARs with kick register one is PIO another is MMIO.
Old driver works only with PIO new one support both. MMIO is used only
when PIO space is exhausted. So old driver will not be able to drive new
virtio device that have no PIO bar configured.

> > 
> >>> is to move to MMIO only when PIO address space is exhausted. For PCI it
> >>> will be never, for PCI-e it will be after ~16 devices.
> >> 
> >> Ok, let's go back a step here. Are yo

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:56, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:45, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:38, Gleb Natapov wrote:
 
> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can 
> look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall 
 that the guest can use to invoke MMIO accesses towards the host based 
 on physical addresses with explicit length encodings?
 
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>> 
 That way you simplify and speed up all code paths, exceeding the speed 
 of PIO exits even. It should also be quite easily portable, as all 
 other platforms have hypercalls available as well.
 
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible
>> 
>> Also, how is this not guest visible? Who sets 
>> KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
>> that the guest does so, so it is guest visible.
>> 
> QEMU sets it.
 
 How does QEMU know?
 
>>> Knows what? When to create such eventfd? virtio device knows.
>> 
>> Where does it know from?
>> 
> It does it always.
> 
>>> 
> 
>> +/*
>> + * PV_MMIO - Guest can promise us that all accesses touching this 
>> address
>> + * are writes of specified length, starting at the specified address.
>> + * If not - it's a Guest bug.
>> + * Can not be used together with either PIO or DATAMATCH.
>> + */
>> 
> Virtio spec will state that access to a kick register needs to be of
> specific length. This is reasonable thing for HW to ask.
 
 This is a spec change. So the guest would have to indicate that it adheres 
 to a newer spec. Thus it's a guest visible change.
 
>>> There is not virtio spec that has kick register in MMIO. The spec is in
>>> the works AFAIK. Actually PIO will not be deprecated and my suggestion
>> 
>> So the guest would indicate that it supports a newer revision of the spec 
>> (in your case, that it supports MMIO). How is that any different from 
>> exposing that it supports a PV MMIO hcall?
>> 
> Guest will indicate nothing. New driver will use MMIO if PIO is bar is
> not configured. All driver will not work for virtio devices with MMIO
> bar, but not PIO bar.

I can't parse that, sorry :).

> 
>>> is to move to MMIO only when PIO address space is exhausted. For PCI it
>>> will be never, for PCI-e it will be after ~16 devices.
>> 
>> Ok, let's go back a step here. Are you actually able to measure any speed in 
>> performance with this patch applied and without when going through MMIO 
>> kicks?
>> 
>> 
> That's the question for MST. I think he did only micro benchmarks till
> now and he already posted his result here:
> 
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
> 
> So the patch speedup mmio by almost 100% and it is almost the same as PIO.

Those numbers don't align at all with what I measured.

MST, could you please do a real world latency benchmark with virtio-net and

  * normal ioeventfd
  * mmio-pv eventfd
  * hcall 

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:49:39PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:45, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 14:08, Gleb Natapov wrote:
>  
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can 
> >>> look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
> >> that the guest can use to invoke MMIO accesses towards the host based 
> >> on physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed 
> >> of PIO exits even. It should also be quite easily portable, as all 
> >> other platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
>  
>  Also, how is this not guest visible? Who sets 
>  KVM_IOEVENTFD_FLAG_PV_MMIO? The comment above its definition indicates 
>  that the guest does so, so it is guest visible.
>  
> >>> QEMU sets it.
> >> 
> >> How does QEMU know?
> >> 
> > Knows what? When to create such eventfd? virtio device knows.
> 
> Where does it know from?
> 
It does it always.

> > 
> >>> 
>  +/*
>  + * PV_MMIO - Guest can promise us that all accesses touching this 
>  address
>  + * are writes of specified length, starting at the specified address.
>  + * If not - it's a Guest bug.
>  + * Can not be used together with either PIO or DATAMATCH.
>  + */
>  
> >>> Virtio spec will state that access to a kick register needs to be of
> >>> specific length. This is reasonable thing for HW to ask.
> >> 
> >> This is a spec change. So the guest would have to indicate that it adheres 
> >> to a newer spec. Thus it's a guest visible change.
> >> 
> > There is not virtio spec that has kick register in MMIO. The spec is in
> > the works AFAIK. Actually PIO will not be deprecated and my suggestion
> 
> So the guest would indicate that it supports a newer revision of the spec (in 
> your case, that it supports MMIO). How is that any different from exposing 
> that it supports a PV MMIO hcall?
> 
Guest will indicate nothing. New driver will use MMIO if PIO is bar is
not configured. All driver will not work for virtio devices with MMIO
bar, but not PIO bar.

> > is to move to MMIO only when PIO address space is exhausted. For PCI it
> > will be never, for PCI-e it will be after ~16 devices.
> 
> Ok, let's go back a step here. Are you actually able to measure any speed in 
> performance with this patch applied and without when going through MMIO kicks?
> 
> 
That's the question for MST. I think he did only micro benchmarks till
now and he already posted his result here:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So the patch speedup mmio by almost 100% and it is almost the same as PIO.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 6/6] kvm/ppc/mpic: add KVM_CAP_IRQ_MPIC

2013-04-04 Thread Alexander Graf

On 03.04.2013, at 03:57, Scott Wood wrote:

> Enabling this capability connects the vcpu to the designated in-kernel
> MPIC.  Using explicit connections between vcpus and irqchips allows
> for flexibility, but the main benefit at the moment is that it
> simplifies the code -- KVM doesn't need vm-global state to remember
> which MPIC object is associated with this vm, and it doesn't need to
> care about ordering between irqchip creation and vcpu creation.
> 
> Signed-off-by: Scott Wood 
> ---
> Documentation/virtual/kvm/api.txt   |8 ++
> arch/powerpc/include/asm/kvm_host.h |8 ++
> arch/powerpc/include/asm/kvm_ppc.h  |2 ++
> arch/powerpc/kvm/booke.c|4 ++-
> arch/powerpc/kvm/mpic.c |   49 +++
> arch/powerpc/kvm/powerpc.c  |   26 +++
> include/uapi/linux/kvm.h|1 +
> 7 files changed, 92 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index d52f3f9..4c326ae 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2728,3 +2728,11 @@ to receive the topmost interrupt vector.
> When disabled (args[0] == 0), behavior is as if this facility is unsupported.
> 
> When this capability is enabled, KVM_EXIT_EPR can occur.
> +
> +6.6 KVM_CAP_IRQ_MPIC
> +
> +Architectures: ppc
> +Parameters: args[0] is the MPIC device fd
> +args[1] is the MPIC CPU number for this vcpu
> +
> +This capability connects the vcpu to an in-kernel MPIC device.
> diff --git a/arch/powerpc/include/asm/kvm_host.h 
> b/arch/powerpc/include/asm/kvm_host.h
> index 7e7aef9..2a2e235 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -375,6 +375,11 @@ struct kvmppc_booke_debug_reg {
>   u64 dac[KVMPPC_BOOKE_MAX_DAC];
> };
> 
> +#define KVMPPC_IRQ_DEFAULT   0
> +#define KVMPPC_IRQ_MPIC  1
> +
> +struct openpic;
> +
> struct kvm_vcpu_arch {
>   ulong host_stack;
>   u32 host_pid;
> @@ -554,6 +559,9 @@ struct kvm_vcpu_arch {
>   unsigned long magic_page_pa; /* phys addr to map the magic page to */
>   unsigned long magic_page_ea; /* effect. addr to map the magic page to */
> 
> + int irq_type;   /* one of KVM_IRQ_* */
> + struct openpic *mpic;   /* KVM_IRQ_MPIC */
> +
> #ifdef CONFIG_KVM_BOOK3S_64_HV
>   struct kvm_vcpu_arch_shared shregs;
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
> b/arch/powerpc/include/asm/kvm_ppc.h
> index 3b63b97..f54707f 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -276,6 +276,8 @@ static inline void kvmppc_set_epr(struct kvm_vcpu *vcpu, 
> u32 epr)
> }
> 
> void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu);
> +int kvmppc_mpic_connect_vcpu(struct file *mpic_filp, struct kvm_vcpu *vcpu,
> +  u32 cpu);
> 
> int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
> struct kvm_config_tlb *cfg);
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index cddc6b3..7d00222 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -430,8 +430,10 @@ static int kvmppc_booke_irqprio_deliver(struct kvm_vcpu 
> *vcpu,
>   if (update_epr == true) {
>   if (vcpu->arch.epr_flags & KVMPPC_EPR_USER)
>   kvm_make_request(KVM_REQ_EPR_EXIT, vcpu);
> - else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL)
> + else if (vcpu->arch.epr_flags & KVMPPC_EPR_KERNEL) {
> + BUG_ON(vcpu->arch.irq_type != KVMPPC_IRQ_MPIC);
>   kvmppc_mpic_set_epr(vcpu);
> + }
>   }
> 
>   new_msr &= msr_mask;
> diff --git a/arch/powerpc/kvm/mpic.c b/arch/powerpc/kvm/mpic.c
> index 8cda2fa..caffe3b 100644
> --- a/arch/powerpc/kvm/mpic.c
> +++ b/arch/powerpc/kvm/mpic.c
> @@ -1159,7 +1159,7 @@ static uint32_t openpic_iack(struct openpic *opp, 
> struct irq_dest *dst,
> 
> void kvmppc_mpic_set_epr(struct kvm_vcpu *vcpu)
> {
> - struct openpic *opp = vcpu->arch.irqchip_priv;
> + struct openpic *opp = vcpu->arch.mpic;
>   int cpu = vcpu->vcpu_id;
>   unsigned long flags;
> 
> @@ -1442,10 +1442,10 @@ static void map_mmio(struct openpic *opp)
> 
> static void unmap_mmio(struct openpic *opp)
> {
> - BUG_ON(opp->mmio_mapped);
> - opp->mmio_mapped = false;
> -
> - kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
> + if (opp->mmio_mapped) {
> + opp->mmio_mapped = false;
> + kvm_io_bus_unregister_dev(opp->kvm, KVM_MMIO_BUS, &opp->mmio);
> + }
> }
> 
> static int set_base_addr(struct openpic *opp, struct kvm_device_attr *attr)
> @@ -1681,6 +1681,45 @@ static const struct file_operations kvm_mpic_fops = {
>   .release = kvm_mpic_release,
> };
> 

Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
> On Tue, Apr 02, 2013 at 02:25:14PM +0100, Marc Zyngier wrote:
>> Our HYP init code suffers from two major design issues:
>> - it cannot support CPU hotplug, as we tear down the idmap very early
>> - it cannot perform a TLB invalidation when switching from init to
>>   runtime mappings, as pages are manipulated from PL1 exclusively
>>
>> The hotplug problem mandates that we keep two sets of page tables
>> (boot and runtime). The TLB problem mandates that we're able to
>> transition from one PGD to another while in HYP, invalidating the TLBs
>> in the process.
>>
>> To be able to do this, we need to share a page between the two page
>> tables. A page that will have the same VA in both configurations. All we
>> need is a VA that has the following properties:
>> - This VA can't be used to represent a kernel mapping.
>> - This VA will not conflict with the physical address of the kernel text
>>
>> The vectors page seems to satisfy this requirement:
>> - The kernel never maps anything else there
>> - The kernel text being copied at the beginning of the physical memory,
>>   it is unlikely to use the last 64kB (I doubt we'll ever support KVM
>>   on a system with something like 4MB of RAM, but patches are very
>>   welcome).
>>
>> Let's call this VA the trampoline VA.
>>
>> Now, we map our init page at 3 locations:
>> - idmap in the boot pgd
>> - trampoline VA in the boot pgd
>> - trampoline VA in the runtime pgd
>>
>> The init scenario is now the following:
>> - We jump in HYP with four parameters: boot HYP pgd, runtime HYP pgd,
>>   runtime stack, runtime vectors
>> - Enable the MMU with the boot pgd
>> - Jump to a target into the trampoline page (remember, this is the same
>>   physical page!)
>> - Now switch to the runtime pgd (same VA, and still the same physical
>>   page!)
>> - Invalidate TLBs
>> - Set stack and vectors
>> - Profit! (or eret, if you only care about the code).
> 
> So I'm going to do my usual commenting routine. Was it an idea to insert
> this commit text (which I really liked by the way!) into init.S where
> the current comment is a little lacking giving the massive complexity
> this is turning into, madness?

Will do.

[...]

>> diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
>> index 35a463f..b2c6967 100644
>> --- a/arch/arm/kvm/init.S
>> +++ b/arch/arm/kvm/init.S
>> @@ -21,6 +21,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>
>>  /
>>   * Hypervisor initialization
>> @@ -47,6 +48,9 @@ __kvm_hyp_init:
>>   W(b).
>>
>>  __do_hyp_init:
>> + cmp r2, #0  @ We have a SP?
>> + bne phase2  @ Yes, second stage init
>> +
>>   @ Set the HTTBR to point to the hypervisor PGD pointer passed
>>   mcrrp15, 4, r0, r1, c2
>>
>> @@ -96,14 +100,35 @@ __do_hyp_init:
>>   orr r0, r0, r1
>>   isb
>>   mcr p15, 4, r0, c1, c0, 0   @ HSCR
>> - isb
>>
>> - @ Set stack pointer and return to the kernel
>> + eret
> 
> Could you add some comment here to indicate we're done with phase1, it
> seems like this eret should not go unnoticed by casual readers (ok, they
> shouldn't read this code casually, but anyway..., it will make me sleep
> better)

Sure, no problem.

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Qemu-devel] [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Peter Maydell
On 4 April 2013 11:40, Michael S. Tsirkin  wrote:
> Add an option for users to specify PV eventfd
> listeners, and utilize KVM's PV MMIO underneath.
> Upodate all callers.
>
> Signed-off-by: Michael S. Tsirkin 
> ---
>  hw/dataplane/hostmem.c|  1 +
>  hw/ivshmem.c  |  2 ++
>  hw/pci-testdev.c  |  2 ++
>  hw/vhost.c|  4 ++--
>  hw/virtio-pci.c   |  4 ++--
>  include/exec/memory.h | 10 ++
>  kvm-all.c | 15 ---
>  linux-headers/linux/kvm.h |  8 
>  memory.c  |  9 +++--

Can we keep linux-headers updates in separate patches, please
(with commit messages saying what kernel commit/version the
sync is with)?

thanks
-- PMM
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:45, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:38, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 14:08, Gleb Natapov wrote:
 
> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>> 
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>> 
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>> 
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>> 
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>> 
>>> TODO: NPT, MMU and non x86 architectures.
>>> 
>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>> 
>>> Signed-off-by: Michael S. Tsirkin 
>> 
>> This still uses page fault intercepts which are orders of magnitudes 
>> slower than hypercalls. Why don't you just create a PV MMIO hypercall 
>> that the guest can use to invoke MMIO accesses towards the host based on 
>> physical addresses with explicit length encodings?
>> 
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
> 
>> That way you simplify and speed up all code paths, exceeding the speed 
>> of PIO exits even. It should also be quite easily portable, as all other 
>> platforms have hypercalls available as well.
>> 
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible
 
 Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
 The comment above its definition indicates that the guest does so, so it 
 is guest visible.
 
>>> QEMU sets it.
>> 
>> How does QEMU know?
>> 
> Knows what? When to create such eventfd? virtio device knows.

Where does it know from?

> 
>>> 
 +/*
 + * PV_MMIO - Guest can promise us that all accesses touching this address
 + * are writes of specified length, starting at the specified address.
 + * If not - it's a Guest bug.
 + * Can not be used together with either PIO or DATAMATCH.
 + */
 
>>> Virtio spec will state that access to a kick register needs to be of
>>> specific length. This is reasonable thing for HW to ask.
>> 
>> This is a spec change. So the guest would have to indicate that it adheres 
>> to a newer spec. Thus it's a guest visible change.
>> 
> There is not virtio spec that has kick register in MMIO. The spec is in
> the works AFAIK. Actually PIO will not be deprecated and my suggestion

So the guest would indicate that it supports a newer revision of the spec (in 
your case, that it supports MMIO). How is that any different from exposing that 
it supports a PV MMIO hcall?

> is to move to MMIO only when PIO address space is exhausted. For PCI it
> will be never, for PCI-e it will be after ~16 devices.

Ok, let's go back a step here. Are you actually able to measure any speed in 
performance with this patch applied and without when going through MMIO kicks?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:39:51PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:38, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> >> 
> >>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>  
>  On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>  
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> > devices on the MMIO bus only get a valid address in write, this
> > way we don't need to touch all devices to teach them handle
> > an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
>  
>  This still uses page fault intercepts which are orders of magnitudes 
>  slower than hypercalls. Why don't you just create a PV MMIO hypercall 
>  that the guest can use to invoke MMIO accesses towards the host based on 
>  physical addresses with explicit length encodings?
>  
> >>> It is slower, but not an order of magnitude slower. It become faster
> >>> with newer HW.
> >>> 
>  That way you simplify and speed up all code paths, exceeding the speed 
>  of PIO exits even. It should also be quite easily portable, as all other 
>  platforms have hypercalls available as well.
>  
> >>> We are trying to avoid PV as much as possible (well this is also PV,
> >>> but not guest visible
> >> 
> >> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
> >> The comment above its definition indicates that the guest does so, so it 
> >> is guest visible.
> >> 
> > QEMU sets it.
> 
> How does QEMU know?
> 
Knows what? When to create such eventfd? virtio device knows.

> > 
> >> +/*
> >> + * PV_MMIO - Guest can promise us that all accesses touching this address
> >> + * are writes of specified length, starting at the specified address.
> >> + * If not - it's a Guest bug.
> >> + * Can not be used together with either PIO or DATAMATCH.
> >> + */
> >> 
> > Virtio spec will state that access to a kick register needs to be of
> > specific length. This is reasonable thing for HW to ask.
> 
> This is a spec change. So the guest would have to indicate that it adheres to 
> a newer spec. Thus it's a guest visible change.
> 
There is not virtio spec that has kick register in MMIO. The spec is in
the works AFAIK. Actually PIO will not be deprecated and my suggestion
is to move to MMIO only when PIO address space is exhausted. For PCI it
will be never, for PCI-e it will be after ~16 devices.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Paolo Bonzini
Il 04/04/2013 12:40, Michael S. Tsirkin ha scritto:
> Add an option for users to specify PV eventfd
> listeners, and utilize KVM's PV MMIO underneath.
> Upodate all callers.

I would like Avi to comment on this, because I think this is not the
"memory-API approved" way of doing things.  You need KVM to define its
own AddressSpace, and make KVM's listener use
memory_region_to_address_space to figure out if it is for PV MMIO.

To handle accesses from TCG, the PV AddressSpace can simply have just an
alias to the actual MemoryRegion where the doorbell is.

Paolo

> Signed-off-by: Michael S. Tsirkin 
> ---
>  hw/dataplane/hostmem.c|  1 +
>  hw/ivshmem.c  |  2 ++
>  hw/pci-testdev.c  |  2 ++
>  hw/vhost.c|  4 ++--
>  hw/virtio-pci.c   |  4 ++--
>  include/exec/memory.h | 10 ++
>  kvm-all.c | 15 ---
>  linux-headers/linux/kvm.h |  8 
>  memory.c  |  9 +++--
>  9 files changed, 46 insertions(+), 9 deletions(-)
> 
> diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
> index 380537e..41d43e7 100644
> --- a/hw/dataplane/hostmem.c
> +++ b/hw/dataplane/hostmem.c
> @@ -126,6 +126,7 @@ static void hostmem_listener_section_dummy(MemoryListener 
> *listener,
>  
>  static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
> MemoryRegionSection *section,
> +   bool pv,
> bool match_data, uint64_t data,
> EventNotifier *e)
>  {
> diff --git a/hw/ivshmem.c b/hw/ivshmem.c
> index 68a2cf2..28e5978 100644
> --- a/hw/ivshmem.c
> +++ b/hw/ivshmem.c
> @@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int 
> posn, int i)
>  memory_region_add_eventfd(&s->ivshmem_mmio,
>DOORBELL,
>4,
> +  false,
>true,
>(posn << 16) | i,
>&s->peers[posn].eventfds[i]);
> @@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int 
> posn, int i)
>  memory_region_del_eventfd(&s->ivshmem_mmio,
>DOORBELL,
>4,
> +  false,
>true,
>(posn << 16) | i,
>&s->peers[posn].eventfds[i]);
> diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
> index 9486624..f0ebf99 100644
> --- a/hw/pci-testdev.c
> +++ b/hw/pci-testdev.c
> @@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
>  memory_region_add_eventfd(test->mr,
>le32_to_cpu(test->hdr->offset),
>test->size,
> +  false,
>test->match_data,
>test->hdr->data,
>&test->notifier);
> @@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
>  memory_region_del_eventfd(test->mr,
>le32_to_cpu(test->hdr->offset),
>test->size,
> +  false,
>test->match_data,
>test->hdr->data,
>&test->notifier);
> diff --git a/hw/vhost.c b/hw/vhost.c
> index 4d6aee3..9ad7101 100644
> --- a/hw/vhost.c
> +++ b/hw/vhost.c
> @@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
>  }
>  
>  static void vhost_eventfd_add(MemoryListener *listener,
> -  MemoryRegionSection *section,
> +  MemoryRegionSection *section, bool pv,
>bool match_data, uint64_t data, EventNotifier 
> *e)
>  {
>  }
>  
>  static void vhost_eventfd_del(MemoryListener *listener,
> -  MemoryRegionSection *section,
> +  MemoryRegionSection *section, bool pv,
>bool match_data, uint64_t data, EventNotifier 
> *e)
>  {
>  }
> diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
> index 19965e5..f9ff994 100644
> --- a/hw/virtio-pci.c
> +++ b/hw/virtio-pci.c
> @@ -189,10 +189,10 @@ static int 
> virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
>  }
>  virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
>  memory_region_add_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
> -  true, n, notifier);
> +  false, true, n, notifier);
>  } else {
>  memory_region_del_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
> -  true, n, notifier);
> +

Re: [PATCH 2/7] ARM: KVM: fix HYP mapping limitations around zero

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:14, Christoffer Dall wrote:
> On Tue, Apr 02, 2013 at 02:25:10PM +0100, Marc Zyngier wrote:
>> The current code for creating HYP mapping doesn't like to wrap
>> around zero, which prevents from mapping anything into the last
>> page of the virtual address space.
>>
>> It doesn't take much effort to remove this limitation, making
>> the code more consistent with the rest of the kernel in the process.
> 
> This is quite funny, the code used to behave like this, but
> reviewers said that this was old fashioned mm-style code that was
> unwanted and wrapping around zero should be avoided for all costs :)

My original attempt at this series was using the very last page of the
memory, and it took me a while to notice that the code couldn't map
anything in the last page.

Just for this reason, the code needed to be fixed.

> In any case, I welcome the familiarity.

Amen.

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:38, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>> 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible
>> 
>> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? 
>> The comment above its definition indicates that the guest does so, so it is 
>> guest visible.
>> 
> QEMU sets it.

How does QEMU know?

> 
>> +/*
>> + * PV_MMIO - Guest can promise us that all accesses touching this address
>> + * are writes of specified length, starting at the specified address.
>> + * If not - it's a Guest bug.
>> + * Can not be used together with either PIO or DATAMATCH.
>> + */
>> 
> Virtio spec will state that access to a kick register needs to be of
> specific length. This is reasonable thing for HW to ask.

This is a spec change. So the guest would have to indicate that it adheres to a 
newer spec. Thus it's a guest visible change.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:32:08PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible
> 
> Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
> comment above its definition indicates that the guest does so, so it is guest 
> visible.
> 
QEMU sets it.

> +/*
> + * PV_MMIO - Guest can promise us that all accesses touching this address
> + * are writes of specified length, starting at the specified address.
> + * If not - it's a Guest bug.
> + * Can not be used together with either PIO or DATAMATCH.
> + */
> 
Virtio spec will state that access to a kick register needs to be of
specific length. This is reasonable thing for HW to ask.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:34, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 14:08, Gleb Natapov wrote:
>> 
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 
 
 This still uses page fault intercepts which are orders of magnitudes 
 slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
 the guest can use to invoke MMIO accesses towards the host based on 
 physical addresses with explicit length encodings?
 
>>> It is slower, but not an order of magnitude slower. It become faster
>>> with newer HW.
>>> 
 That way you simplify and speed up all code paths, exceeding the speed of 
 PIO exits even. It should also be quite easily portable, as all other 
 platforms have hypercalls available as well.
 
>>> We are trying to avoid PV as much as possible (well this is also PV,
>>> but not guest visible). We haven't replaced PIO with hypercall for the
>>> same reason. My hope is that future HW will provide us with instruction
>>> decode for basic mov instruction at which point this optimisation can be
>>> dropped.
>> 
>> The same applies to an MMIO hypercall. Once the PV interface becomes 
>> obsolete, we can drop the capability we expose to the guest.
>> 
> Disable it on newer HW is easy, but it is not that simple to get rid of the 
> guest code.

You need to have guest code that runs with non-PV hcalls anyways, since you 
want to be able to run on older KVM versions. So what's the problem?

> 
>>> And hypercall has its own set of problems with Windows guests.
>>> When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
>>> hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
>> 
>> Can't we simply register a hypercall ID range with Microsoft?
> Doubt it. This is not only about the rang though. The calling convention
> is completely different.

There is no calling convention for KVM hypercalls anymore, or is there?
Either way, we would of course just adhere to the MS calling convention. We're 
only talking about register values here...

> 
>> 
>>> may also affect WHQLing Windows drivers since driver will talk to HW
>>> bypassing Windows interfaces.
>> 
>> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
>> 
> All Windows drivers have to be WHQL'ed, saying that is like saying that
> we do not care about Windows guest virtio speed.

I don't mind if a Linux guest is faster than a Windows guest, if that's what 
you're asking.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:22:09PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 14:08, Gleb Natapov wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes 
> >> slower than hypercalls. Why don't you just create a PV MMIO hypercall that 
> >> the guest can use to invoke MMIO accesses towards the host based on 
> >> physical addresses with explicit length encodings?
> >> 
> > It is slower, but not an order of magnitude slower. It become faster
> > with newer HW.
> > 
> >> That way you simplify and speed up all code paths, exceeding the speed of 
> >> PIO exits even. It should also be quite easily portable, as all other 
> >> platforms have hypercalls available as well.
> >> 
> > We are trying to avoid PV as much as possible (well this is also PV,
> > but not guest visible). We haven't replaced PIO with hypercall for the
> > same reason. My hope is that future HW will provide us with instruction
> > decode for basic mov instruction at which point this optimisation can be
> > dropped.
> 
> The same applies to an MMIO hypercall. Once the PV interface becomes 
> obsolete, we can drop the capability we expose to the guest.
> 
Disable it on newer HW is easy, but it is not that simple to get rid of the 
guest code.

> > And hypercall has its own set of problems with Windows guests.
> > When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> > hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
> 
> Can't we simply register a hypercall ID range with Microsoft?
Doubt it. This is not only about the rang though. The calling convention
is completely different.

> 
> > may also affect WHQLing Windows drivers since driver will talk to HW
> > bypassing Windows interfaces.
> 
> Then the WHQL'ed driver doesn't support the PV MMIO hcall?
> 
All Windows drivers have to be WHQL'ed, saying that is like saying that
we do not care about Windows guest virtio speed.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/7] ARM: KVM: simplify HYP mapping population

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:13, Christoffer Dall wrote:
> On Tue, Apr 02, 2013 at 02:25:09PM +0100, Marc Zyngier wrote:
>> The way we populate HYP mappings is a bit convoluted, to say the least.
>> Passing a pointer around to keep track of the current PFN is quite
>> odd, and we end-up having two different PTE accessors for no good
>> reason.
>>
>> Simplify the whole thing by unifying the two PTE accessors, passing
>> a pgprot_t around, and moving the various validity checks to the
>> upper layers.
>>
>> Signed-off-by: Marc Zyngier 
>> ---
>>  arch/arm/kvm/mmu.c | 100 
>> ++---
>>  1 file changed, 41 insertions(+), 59 deletions(-)
>>
>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> index 2f12e40..24811d1 100644
>> --- a/arch/arm/kvm/mmu.c
>> +++ b/arch/arm/kvm/mmu.c
>> @@ -125,54 +125,34 @@ void free_hyp_pmds(void)
>>  }
>>  
>>  static void create_hyp_pte_mappings(pmd_t *pmd, unsigned long start,
>> -unsigned long end)
>> +unsigned long end, unsigned long pfn,
>> +pgprot_t prot)
>>  {
>>  pte_t *pte;
>>  unsigned long addr;
>> -struct page *page;
>>  
>> -for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
>> -unsigned long hyp_addr = KERN_TO_HYP(addr);
>> -
>> -pte = pte_offset_kernel(pmd, hyp_addr);
>> -BUG_ON(!virt_addr_valid(addr));
>> -page = virt_to_page(addr);
>> -kvm_set_pte(pte, mk_pte(page, PAGE_HYP));
>> -}
>> -}
>> -
>> -static void create_hyp_io_pte_mappings(pmd_t *pmd, unsigned long start,
>> -   unsigned long end,
>> -   unsigned long *pfn_base)
>> -{
>> -pte_t *pte;
>> -unsigned long addr;
>> -
>> -for (addr = start & PAGE_MASK; addr < end; addr += PAGE_SIZE) {
>> -unsigned long hyp_addr = KERN_TO_HYP(addr);
>> -
>> -pte = pte_offset_kernel(pmd, hyp_addr);
>> -BUG_ON(pfn_valid(*pfn_base));
>> -kvm_set_pte(pte, pfn_pte(*pfn_base, PAGE_HYP_DEVICE));
>> -(*pfn_base)++;
>> +for (addr = start; addr < end; addr += PAGE_SIZE) {
>> +pte = pte_offset_kernel(pmd, addr);
>> +kvm_set_pte(pte, pfn_pte(pfn, prot));
>> +pfn++;
>>  }
>>  }
>>  
>>  static int create_hyp_pmd_mappings(pud_t *pud, unsigned long start,
>> -   unsigned long end, unsigned long *pfn_base)
>> +   unsigned long end, unsigned long pfn,
>> +   pgprot_t prot)
>>  {
>>  pmd_t *pmd;
>>  pte_t *pte;
>>  unsigned long addr, next;
>>  
>>  for (addr = start; addr < end; addr = next) {
>> -unsigned long hyp_addr = KERN_TO_HYP(addr);
>> -pmd = pmd_offset(pud, hyp_addr);
>> +pmd = pmd_offset(pud, addr);
>>  
>>  BUG_ON(pmd_sect(*pmd));
>>  
>>  if (pmd_none(*pmd)) {
>> -pte = pte_alloc_one_kernel(NULL, hyp_addr);
>> +pte = pte_alloc_one_kernel(NULL, addr);
>>  if (!pte) {
>>  kvm_err("Cannot allocate Hyp pte\n");
>>  return -ENOMEM;
>> @@ -182,25 +162,17 @@ static int create_hyp_pmd_mappings(pud_t *pud, 
>> unsigned long start,
>>  
>>  next = pmd_addr_end(addr, end);
>>  
>> -/*
>> - * If pfn_base is NULL, we map kernel pages into HYP with the
>> - * virtual address. Otherwise, this is considered an I/O
>> - * mapping and we map the physical region starting at
>> - * *pfn_base to [start, end[.
>> - */
>> -if (!pfn_base)
>> -create_hyp_pte_mappings(pmd, addr, next);
>> -else
>> -create_hyp_io_pte_mappings(pmd, addr, next, pfn_base);
>> +create_hyp_pte_mappings(pmd, addr, next, pfn, prot);
>> +pfn += (next - addr) >> PAGE_SHIFT;
> 
> so this scheme always assumes a physically contiguous memory reason,
> and I wasn't sure if we ever wanted to support mapping vmalloc'ed
> regions into Hyp mode, which is why I wrote the code the way it was
> (which goes to your "for no good reason" comment above).

So let's look at what we're actually mapping:
- kernel code/kmalloc-ed data: this is always physically contiguous
- MMIO: While this is mapped in the vmalloc region, this is actually
physically contiguous.

> I'm fine with assuming that this only works for contiguous regions, but
> I think it deserves a line in the comments on the exported functions
> (the non-IO one anyway).

Sure, I'll add that.

>>  }
>>  
>>  return 0;
>>  }
>>  
>> -static int __create_hyp_mappings(void *from, void *to, unsigned long 
>> *pfn_base)
>> +static int __create_hyp_mappings(pgd_t *pgdp,
>> +  

Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:08, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>> 
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>> 
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>> 
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>> 
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>> 
>>> TODO: NPT, MMU and non x86 architectures.
>>> 
>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>> 
>>> Signed-off-by: Michael S. Tsirkin 
>> 
>> This still uses page fault intercepts which are orders of magnitudes slower 
>> than hypercalls. Why don't you just create a PV MMIO hypercall that the 
>> guest can use to invoke MMIO accesses towards the host based on physical 
>> addresses with explicit length encodings?
>> 
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
> 
>> That way you simplify and speed up all code paths, exceeding the speed of 
>> PIO exits even. It should also be quite easily portable, as all other 
>> platforms have hypercalls available as well.
>> 
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible

Also, how is this not guest visible? Who sets KVM_IOEVENTFD_FLAG_PV_MMIO? The 
comment above its definition indicates that the guest does so, so it is guest 
visible.

+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> > 
> > Not really. Here's a test:
> > compare vmcall to portio:
> > 
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> > 
> > compare portio to mmio:
> > 
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> > 
> > So not orders of magnitude.
> 
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
> 
> Check out page 41. Higher is better (number is number of loop cycles in a 
> second). My test system was an AMD Istanbul based box.

Wow 2x difference.  The difference seems to be much smaller now. Newer
hardware? Newer software?

> > 
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >> 
> >> 
> >> Alex
> > 
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> > 
> > A fast way to do MMIO is probably useful in any case ...
> 
> Yes, but at least according to my numbers optimizing anything that is not 
> hcalls is a waste of time.
> 
> 
> Alex

This was the implementation: 'kvm_para: add mmio word store hypercall'.
Again, I don't mind.

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:19, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
>> 
>>> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
 
 On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
 
> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
> devices on the MMIO bus only get a valid address in write, this
> way we don't need to touch all devices to teach them handle
> an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 
 
 This still uses page fault intercepts which are orders of magnitudes
 slower than hypercalls.
>>> 
>>> Not really. Here's a test:
>>> compare vmcall to portio:
>>> 
>>> vmcall 1519
>>> ...
>>> outl_to_kernel 1745
>>> 
>>> compare portio to mmio:
>>> 
>>> mmio-wildcard-eventfd:pci-mem 3529
>>> mmio-pv-eventfd:pci-mem 1878
>>> portio-wildcard-eventfd:pci-io 1846
>>> 
>>> So not orders of magnitude.
>> 
>> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
>> 
>> Check out page 41. Higher is better (number is number of loop cycles in a 
>> second). My test system was an AMD Istanbul based box.
>> 
> Have you bypassed instruction emulation in your testing?

PIO doesn't do instruction emulation.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 14:08, Gleb Natapov wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>> 
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>> 
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>> 
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>> 
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>> 
>>> TODO: NPT, MMU and non x86 architectures.
>>> 
>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>> 
>>> Signed-off-by: Michael S. Tsirkin 
>> 
>> This still uses page fault intercepts which are orders of magnitudes slower 
>> than hypercalls. Why don't you just create a PV MMIO hypercall that the 
>> guest can use to invoke MMIO accesses towards the host based on physical 
>> addresses with explicit length encodings?
>> 
> It is slower, but not an order of magnitude slower. It become faster
> with newer HW.
> 
>> That way you simplify and speed up all code paths, exceeding the speed of 
>> PIO exits even. It should also be quite easily portable, as all other 
>> platforms have hypercalls available as well.
>> 
> We are trying to avoid PV as much as possible (well this is also PV,
> but not guest visible). We haven't replaced PIO with hypercall for the
> same reason. My hope is that future HW will provide us with instruction
> decode for basic mov instruction at which point this optimisation can be
> dropped.

The same applies to an MMIO hypercall. Once the PV interface becomes obsolete, 
we can drop the capability we expose to the guest.

> And hypercall has its own set of problems with Windows guests.
> When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
> hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It

Can't we simply register a hypercall ID range with Microsoft?

> may also affect WHQLing Windows drivers since driver will talk to HW
> bypassing Windows interfaces.

Then the WHQL'ed driver doesn't support the PV MMIO hcall?


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 02:09:53PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:
> 
> > On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> >> 
> >> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> >> 
> >>> With KVM, MMIO is much slower than PIO, due to the need to
> >>> do page walk and emulation. But with EPT, it does not have to be: we
> >>> know the address from the VMCS so if the address is unique, we can look
> >>> up the eventfd directly, bypassing emulation.
> >>> 
> >>> Add an interface for userspace to specify this per-address, we can
> >>> use this e.g. for virtio.
> >>> 
> >>> The implementation adds a separate bus internally. This serves two
> >>> purposes:
> >>> - minimize overhead for old userspace that does not use PV MMIO
> >>> - minimize disruption in other code (since we don't know the length,
> >>> devices on the MMIO bus only get a valid address in write, this
> >>> way we don't need to touch all devices to teach them handle
> >>> an dinvalid length)
> >>> 
> >>> At the moment, this optimization is only supported for EPT on x86 and
> >>> silently ignored for NPT and MMU, so everything works correctly but
> >>> slowly.
> >>> 
> >>> TODO: NPT, MMU and non x86 architectures.
> >>> 
> >>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> >>> pre-review and suggestions.
> >>> 
> >>> Signed-off-by: Michael S. Tsirkin 
> >> 
> >> This still uses page fault intercepts which are orders of magnitudes
> >> slower than hypercalls.
> > 
> > Not really. Here's a test:
> > compare vmcall to portio:
> > 
> > vmcall 1519
> > ...
> > outl_to_kernel 1745
> > 
> > compare portio to mmio:
> > 
> > mmio-wildcard-eventfd:pci-mem 3529
> > mmio-pv-eventfd:pci-mem 1878
> > portio-wildcard-eventfd:pci-io 1846
> > 
> > So not orders of magnitude.
> 
> https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf
> 
> Check out page 41. Higher is better (number is number of loop cycles in a 
> second). My test system was an AMD Istanbul based box.
> 
Have you bypassed instruction emulation in your testing?

> > 
> >> Why don't you just create a PV MMIO hypercall
> >> that the guest can use to invoke MMIO accesses towards the host based
> >> on physical addresses with explicit length encodings?
> >> That way you simplify and speed up all code paths, exceeding the speed
> >> of PIO exits even. It should also be quite easily portable, as all
> >> other platforms have hypercalls available as well.
> >> 
> >> 
> >> Alex
> > 
> > I sent such a patch, but maintainers seem reluctant to add hypercalls.
> > Gleb, could you comment please?
> > 
> > A fast way to do MMIO is probably useful in any case ...
> 
> Yes, but at least according to my numbers optimizing anything that is not 
> hcalls is a waste of time.
> 
> 
> Alex

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 13:04, Michael S. Tsirkin wrote:

> On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
>> 
>> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
>> 
>>> With KVM, MMIO is much slower than PIO, due to the need to
>>> do page walk and emulation. But with EPT, it does not have to be: we
>>> know the address from the VMCS so if the address is unique, we can look
>>> up the eventfd directly, bypassing emulation.
>>> 
>>> Add an interface for userspace to specify this per-address, we can
>>> use this e.g. for virtio.
>>> 
>>> The implementation adds a separate bus internally. This serves two
>>> purposes:
>>> - minimize overhead for old userspace that does not use PV MMIO
>>> - minimize disruption in other code (since we don't know the length,
>>> devices on the MMIO bus only get a valid address in write, this
>>> way we don't need to touch all devices to teach them handle
>>> an dinvalid length)
>>> 
>>> At the moment, this optimization is only supported for EPT on x86 and
>>> silently ignored for NPT and MMU, so everything works correctly but
>>> slowly.
>>> 
>>> TODO: NPT, MMU and non x86 architectures.
>>> 
>>> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
>>> pre-review and suggestions.
>>> 
>>> Signed-off-by: Michael S. Tsirkin 
>> 
>> This still uses page fault intercepts which are orders of magnitudes
>> slower than hypercalls.
> 
> Not really. Here's a test:
> compare vmcall to portio:
> 
> vmcall 1519
> ...
> outl_to_kernel 1745
> 
> compare portio to mmio:
> 
> mmio-wildcard-eventfd:pci-mem 3529
> mmio-pv-eventfd:pci-mem 1878
> portio-wildcard-eventfd:pci-io 1846
> 
> So not orders of magnitude.

https://dl.dropbox.com/u/8976842/KVM%20Forum%202012/MMIO%20Tuning.pdf

Check out page 41. Higher is better (number is number of loop cycles in a 
second). My test system was an AMD Istanbul based box.

> 
>> Why don't you just create a PV MMIO hypercall
>> that the guest can use to invoke MMIO accesses towards the host based
>> on physical addresses with explicit length encodings?
>> That way you simplify and speed up all code paths, exceeding the speed
>> of PIO exits even. It should also be quite easily portable, as all
>> other platforms have hypercalls available as well.
>> 
>> 
>> Alex
> 
> I sent such a patch, but maintainers seem reluctant to add hypercalls.
> Gleb, could you comment please?
> 
> A fast way to do MMIO is probably useful in any case ...

Yes, but at least according to my numbers optimizing anything that is not 
hcalls is a waste of time.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Gleb Natapov
On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> 
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> >  devices on the MMIO bus only get a valid address in write, this
> >  way we don't need to touch all devices to teach them handle
> >  an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> 
> This still uses page fault intercepts which are orders of magnitudes slower 
> than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
> can use to invoke MMIO accesses towards the host based on physical addresses 
> with explicit length encodings?
> 
It is slower, but not an order of magnitude slower. It become faster
with newer HW.

> That way you simplify and speed up all code paths, exceeding the speed of PIO 
> exits even. It should also be quite easily portable, as all other platforms 
> have hypercalls available as well.
> 
We are trying to avoid PV as much as possible (well this is also PV,
but not guest visible). We haven't replaced PIO with hypercall for the
same reason. My hope is that future HW will provide us with instruction
decode for basic mov instruction at which point this optimisation can be
dropped. And hypercall has its own set of problems with Windows guests.
When KVM runs in Hyper-V emulation mode it expects to get Hyper-V
hypercalls.  Mixing KVM hypercalls and Hyper-V requires some tricks. It
may also affect WHQLing Windows drivers since driver will talk to HW
bypassing Windows interfaces.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
On Thu, Apr 04, 2013 at 01:57:34PM +0200, Alexander Graf wrote:
> 
> On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:
> 
> > With KVM, MMIO is much slower than PIO, due to the need to
> > do page walk and emulation. But with EPT, it does not have to be: we
> > know the address from the VMCS so if the address is unique, we can look
> > up the eventfd directly, bypassing emulation.
> > 
> > Add an interface for userspace to specify this per-address, we can
> > use this e.g. for virtio.
> > 
> > The implementation adds a separate bus internally. This serves two
> > purposes:
> > - minimize overhead for old userspace that does not use PV MMIO
> > - minimize disruption in other code (since we don't know the length,
> >  devices on the MMIO bus only get a valid address in write, this
> >  way we don't need to touch all devices to teach them handle
> >  an dinvalid length)
> > 
> > At the moment, this optimization is only supported for EPT on x86 and
> > silently ignored for NPT and MMU, so everything works correctly but
> > slowly.
> > 
> > TODO: NPT, MMU and non x86 architectures.
> > 
> > The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> > pre-review and suggestions.
> > 
> > Signed-off-by: Michael S. Tsirkin 
> 
> This still uses page fault intercepts which are orders of magnitudes
> slower than hypercalls.

Not really. Here's a test:
compare vmcall to portio:

vmcall 1519
...
outl_to_kernel 1745

compare portio to mmio:

mmio-wildcard-eventfd:pci-mem 3529
mmio-pv-eventfd:pci-mem 1878
portio-wildcard-eventfd:pci-io 1846

So not orders of magnitude.

> Why don't you just create a PV MMIO hypercall
> that the guest can use to invoke MMIO accesses towards the host based
> on physical addresses with explicit length encodings?
> That way you simplify and speed up all code paths, exceeding the speed
> of PIO exits even. It should also be quite easily portable, as all
> other platforms have hypercalls available as well.
> 
> 
> Alex

I sent such a patch, but maintainers seem reluctant to add hypercalls.
Gleb, could you comment please?

A fast way to do MMIO is probably useful in any case ...

>  Alex
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v7 4/7] KVM: Add reset/restore rtc_status support

2013-04-04 Thread Gleb Natapov
On Mon, Apr 01, 2013 at 08:40:13AM +0800, Yang Zhang wrote:
> From: Yang Zhang 
> 
> Signed-off-by: Yang Zhang 
> ---
>  arch/x86/kvm/lapic.c |9 +
>  arch/x86/kvm/lapic.h |2 ++
>  virt/kvm/ioapic.c|   43 +++
>  virt/kvm/ioapic.h|1 +
>  4 files changed, 55 insertions(+), 0 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 96ab160..9c041fa 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -94,6 +94,14 @@ static inline int apic_test_vector(int vec, void *bitmap)
>   return test_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
>  }
>  
> +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector)
> +{
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + return apic_test_vector(vector, apic->regs + APIC_ISR) ||
> + apic_test_vector(vector, apic->regs + APIC_IRR);
> +}
> +
>  static inline void apic_set_vector(int vec, void *bitmap)
>  {
>   set_bit(VEC_POS(vec), (bitmap) + REG_POS(vec));
> @@ -1665,6 +1673,7 @@ void kvm_apic_post_state_restore(struct kvm_vcpu *vcpu,
>   apic->highest_isr_cache = -1;
>   kvm_x86_ops->hwapic_isr_update(vcpu->kvm, apic_find_highest_isr(apic));
>   kvm_make_request(KVM_REQ_EVENT, vcpu);
> + kvm_rtc_irq_restore(vcpu);
>  }
>  
>  void __kvm_migrate_apic_timer(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 967519c..004d2ad 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -170,4 +170,6 @@ static inline bool kvm_apic_has_events(struct kvm_vcpu 
> *vcpu)
>   return vcpu->arch.apic->pending_events;
>  }
>  
> +bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
> +
>  #endif
> diff --git a/virt/kvm/ioapic.c b/virt/kvm/ioapic.c
> index 8664812..0b12b17 100644
> --- a/virt/kvm/ioapic.c
> +++ b/virt/kvm/ioapic.c
> @@ -90,6 +90,47 @@ static unsigned long ioapic_read_indirect(struct 
> kvm_ioapic *ioapic,
>   return result;
>  }
>  
> +static void rtc_irq_reset(struct kvm_ioapic *ioapic)
> +{
> + ioapic->rtc_status.pending_eoi = 0;
> + bitmap_zero(ioapic->rtc_status.dest_map, KVM_MAX_VCPUS);
> +}
> +
> +static void rtc_irq_restore(struct kvm_ioapic *ioapic)
> +{
> + struct kvm_vcpu *vcpu;
> + int vector, i, pending_eoi = 0;
> +
> + if (RTC_GSI >= IOAPIC_NUM_PINS)
> + return;
> +
> + vector = ioapic->redirtbl[RTC_GSI].fields.vector;
> + kvm_for_each_vcpu(i, vcpu, ioapic->kvm) {
> + if (kvm_apic_pending_eoi(vcpu, vector)) {
> + pending_eoi++;
> + __set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
You should cleat dest_map at the beginning to get rid of stale bits.

> + }
> + }
> + ioapic->rtc_status.pending_eoi = pending_eoi;
> +}
> +
> +void kvm_rtc_irq_restore(struct kvm_vcpu *vcpu)
> +{
> + struct kvm_ioapic *ioapic = vcpu->kvm->arch.vioapic;
> + int vector;
> +
> + if (!ioapic)
> + return;
> +
Can this be called if ioapic == NULL?
Should check for if (RTC_GSI >= IOAPIC_NUM_PINS) here too.

> + spin_lock(&ioapic->lock);
> + vector = ioapic->redirtbl[RTC_GSI].fields.vector;
> + if (kvm_apic_pending_eoi(vcpu, vector)) {
> + __set_bit(vcpu->vcpu_id, ioapic->rtc_status.dest_map);
> + ioapic->rtc_status.pending_eoi++;
The bit may have been set already. The logic should be:

new_val = kvm_apic_pending_eoi(vcpu, vector)
old_val = set_bit(vcpu_id, dest_map)

if (new_val == old_val)
return;

if (new_val) {
__set_bit(vcpu_id, dest_map);
pending_eoi++;
} else {
__clear_bit(vcpu_id, dest_map);
pending_eoi--;
}

The naming of above two functions are not good either. Call
them something like kvm_rtc_eoi_tracking_restore_all() and
kvm_rtc_eoi_tracking_restore_one().  And _all should call _one() for
each vcpu. Make __rtc_irq_eoi_tracking_restore_one() that does not
take ioapic lock and call it from kvm_rtc_eoi_tracking_restore_one()
surrounded by locks.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 12:50, Michael S. Tsirkin wrote:

> With KVM, MMIO is much slower than PIO, due to the need to
> do page walk and emulation. But with EPT, it does not have to be: we
> know the address from the VMCS so if the address is unique, we can look
> up the eventfd directly, bypassing emulation.
> 
> Add an interface for userspace to specify this per-address, we can
> use this e.g. for virtio.
> 
> The implementation adds a separate bus internally. This serves two
> purposes:
> - minimize overhead for old userspace that does not use PV MMIO
> - minimize disruption in other code (since we don't know the length,
>  devices on the MMIO bus only get a valid address in write, this
>  way we don't need to touch all devices to teach them handle
>  an dinvalid length)
> 
> At the moment, this optimization is only supported for EPT on x86 and
> silently ignored for NPT and MMU, so everything works correctly but
> slowly.
> 
> TODO: NPT, MMU and non x86 architectures.
> 
> The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
> pre-review and suggestions.
> 
> Signed-off-by: Michael S. Tsirkin 

This still uses page fault intercepts which are orders of magnitudes slower 
than hypercalls. Why don't you just create a PV MMIO hypercall that the guest 
can use to invoke MMIO accesses towards the host based on physical addresses 
with explicit length encodings?

That way you simplify and speed up all code paths, exceeding the speed of PIO 
exits even. It should also be quite easily portable, as all other platforms 
have hypercalls available as well.


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH RFC] kvm: add PV MMIO EVENTFD

2013-04-04 Thread Michael S. Tsirkin
With KVM, MMIO is much slower than PIO, due to the need to
do page walk and emulation. But with EPT, it does not have to be: we
know the address from the VMCS so if the address is unique, we can look
up the eventfd directly, bypassing emulation.

Add an interface for userspace to specify this per-address, we can
use this e.g. for virtio.

The implementation adds a separate bus internally. This serves two
purposes:
- minimize overhead for old userspace that does not use PV MMIO
- minimize disruption in other code (since we don't know the length,
  devices on the MMIO bus only get a valid address in write, this
  way we don't need to touch all devices to teach them handle
  an dinvalid length)

At the moment, this optimization is only supported for EPT on x86 and
silently ignored for NPT and MMU, so everything works correctly but
slowly.

TODO: NPT, MMU and non x86 architectures.

The idea was suggested by Peter Anvin.  Lots of thanks to Gleb for
pre-review and suggestions.

Signed-off-by: Michael S. Tsirkin 
---
 arch/x86/kvm/vmx.c   |  4 
 arch/x86/kvm/x86.c   |  1 +
 include/linux/kvm_host.h |  1 +
 include/uapi/linux/kvm.h |  9 +
 virt/kvm/eventfd.c   | 47 ++-
 virt/kvm/kvm_main.c  |  1 +
 6 files changed, 58 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6667042..cdaac9b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5127,6 +5127,10 @@ static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
gpa_t gpa;
 
gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
+   if (!kvm_io_bus_write(vcpu->kvm, KVM_PV_MMIO_BUS, gpa, 0, NULL)) {
+   skip_emulated_instruction(vcpu);
+   return 1;
+   }
 
ret = handle_mmio_page_fault_common(vcpu, gpa, true);
if (likely(ret == 1))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f19ac0a..b9223d9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2483,6 +2483,7 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_ASSIGN_DEV_IRQ:
case KVM_CAP_IRQFD:
case KVM_CAP_IOEVENTFD:
+   case KVM_CAP_IOEVENTFD_PV_MMIO:
case KVM_CAP_PIT2:
case KVM_CAP_PIT_STATE2:
case KVM_CAP_SET_IDENTITY_MAP_ADDR:
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cad77fe..35b74cd 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -149,6 +149,7 @@ struct kvm_io_bus {
 enum kvm_bus {
KVM_MMIO_BUS,
KVM_PIO_BUS,
+   KVM_PV_MMIO_BUS,
KVM_NR_BUSES
 };
 
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 3c56ba3..61783ee 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -449,11 +449,19 @@ enum {
kvm_ioeventfd_flag_nr_datamatch,
kvm_ioeventfd_flag_nr_pio,
kvm_ioeventfd_flag_nr_deassign,
+   kvm_ioeventfd_flag_nr_pv_mmio,
kvm_ioeventfd_flag_nr_max,
 };
 
 #define KVM_IOEVENTFD_FLAG_DATAMATCH (1 << kvm_ioeventfd_flag_nr_datamatch)
 #define KVM_IOEVENTFD_FLAG_PIO   (1 << kvm_ioeventfd_flag_nr_pio)
+/*
+ * PV_MMIO - Guest can promise us that all accesses touching this address
+ * are writes of specified length, starting at the specified address.
+ * If not - it's a Guest bug.
+ * Can not be used together with either PIO or DATAMATCH.
+ */
+#define KVM_IOEVENTFD_FLAG_PV_MMIO   (1 << kvm_ioeventfd_flag_nr_pv_mmio)
 #define KVM_IOEVENTFD_FLAG_DEASSIGN  (1 << kvm_ioeventfd_flag_nr_deassign)
 
 #define KVM_IOEVENTFD_VALID_FLAG_MASK  ((1 << kvm_ioeventfd_flag_nr_max) - 1)
@@ -665,6 +673,7 @@ struct kvm_ppc_smmu_info {
 #define KVM_CAP_PPC_EPR 86
 #define KVM_CAP_ARM_PSCI 87
 #define KVM_CAP_ARM_SET_DEVICE_ADDR 88
+#define KVM_CAP_IOEVENTFD_PV_MMIO 89
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 93e5b05..1b7619e 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -579,6 +579,7 @@ struct _ioeventfd {
struct kvm_io_device dev;
u8   bus_idx;
bool wildcard;
+   bool pvmmio;
 };
 
 static inline struct _ioeventfd *
@@ -600,7 +601,15 @@ ioeventfd_in_range(struct _ioeventfd *p, gpa_t addr, int 
len, const void *val)
 {
u64 _val;
 
-   if (!(addr == p->addr && len == p->length))
+   if (addr != p->addr)
+   /* address must be precise for a hit */
+   return false;
+
+   if (p->pvmmio)
+   /* pvmmio only looks at the address, so always a hit */
+   return true;
+
+   if (len != p->length)
/* address-range must be precise for a hit */
return false;
 
@@ -671,9 +680,11 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
 
list_for_each_entry(_p, &kvm->ioeventfds, list)
if (_p->bus_idx == p->bus_idx &&
-   _p->addr == p->addr && _p->length == p->

[PATCH v2 6/6] pci-testdev: add pv mmio test

2013-04-04 Thread Michael S. Tsirkin
Add ability to test the speed of PV MMIO.

Signed-off-by: Michael S. Tsirkin 
---
 hw/pci-testdev.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
index f0ebf99..9a50631 100644
--- a/hw/pci-testdev.c
+++ b/hw/pci-testdev.c
@@ -19,6 +19,7 @@ typedef struct IOTest {
 EventNotifier notifier;
 bool hasnotifier;
 unsigned size;
+bool pv;
 bool match_data;
 PCITestDevHdr *hdr;
 unsigned bufsize;
@@ -33,7 +34,8 @@ typedef struct IOTest {
 static const char *iotest_test[] = {
 "no-eventfd",
 "wildcard-eventfd",
-"datamatch-eventfd"
+"datamatch-eventfd",
+"pv-eventfd"
 };
 
 static const char *iotest_type[] = {
@@ -80,7 +82,7 @@ static int pci_testdev_start(IOTest *test)
 memory_region_add_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
-  false,
+  test->pv,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -95,7 +97,7 @@ static void pci_testdev_stop(IOTest *test)
 memory_region_del_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
-  false,
+  test->pv,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -237,7 +239,8 @@ static int pci_testdev_init(PCIDevice *pci_dev)
 g_free(name);
 test->hdr->offset = cpu_to_le32(IOTEST_SIZE(i) + i * 
IOTEST_ACCESS_WIDTH);
 test->size = IOTEST_ACCESS_WIDTH;
-test->match_data = strcmp(IOTEST_TEST(i), "wildcard-eventfd");
+test->pv = !strcmp(IOTEST_TEST(i), "pv-eventfd");
+test->match_data = !strcmp(IOTEST_TEST(i), "datamatch-eventfd");
 test->hdr->test = i;
 test->hdr->data = test->match_data ? IOTEST_DATAMATCH : IOTEST_NOMATCH;
 test->hdr->width = IOTEST_ACCESS_WIDTH;
-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 5/6] kvm: add PV MMIO

2013-04-04 Thread Michael S. Tsirkin
Add an option for users to specify PV eventfd
listeners, and utilize KVM's PV MMIO underneath.
Upodate all callers.

Signed-off-by: Michael S. Tsirkin 
---
 hw/dataplane/hostmem.c|  1 +
 hw/ivshmem.c  |  2 ++
 hw/pci-testdev.c  |  2 ++
 hw/vhost.c|  4 ++--
 hw/virtio-pci.c   |  4 ++--
 include/exec/memory.h | 10 ++
 kvm-all.c | 15 ---
 linux-headers/linux/kvm.h |  8 
 memory.c  |  9 +++--
 9 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/hw/dataplane/hostmem.c b/hw/dataplane/hostmem.c
index 380537e..41d43e7 100644
--- a/hw/dataplane/hostmem.c
+++ b/hw/dataplane/hostmem.c
@@ -126,6 +126,7 @@ static void hostmem_listener_section_dummy(MemoryListener 
*listener,
 
 static void hostmem_listener_eventfd_dummy(MemoryListener *listener,
MemoryRegionSection *section,
+   bool pv,
bool match_data, uint64_t data,
EventNotifier *e)
 {
diff --git a/hw/ivshmem.c b/hw/ivshmem.c
index 68a2cf2..28e5978 100644
--- a/hw/ivshmem.c
+++ b/hw/ivshmem.c
@@ -352,6 +352,7 @@ static void ivshmem_add_eventfd(IVShmemState *s, int posn, 
int i)
 memory_region_add_eventfd(&s->ivshmem_mmio,
   DOORBELL,
   4,
+  false,
   true,
   (posn << 16) | i,
   &s->peers[posn].eventfds[i]);
@@ -362,6 +363,7 @@ static void ivshmem_del_eventfd(IVShmemState *s, int posn, 
int i)
 memory_region_del_eventfd(&s->ivshmem_mmio,
   DOORBELL,
   4,
+  false,
   true,
   (posn << 16) | i,
   &s->peers[posn].eventfds[i]);
diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
index 9486624..f0ebf99 100644
--- a/hw/pci-testdev.c
+++ b/hw/pci-testdev.c
@@ -80,6 +80,7 @@ static int pci_testdev_start(IOTest *test)
 memory_region_add_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
+  false,
   test->match_data,
   test->hdr->data,
   &test->notifier);
@@ -94,6 +95,7 @@ static void pci_testdev_stop(IOTest *test)
 memory_region_del_eventfd(test->mr,
   le32_to_cpu(test->hdr->offset),
   test->size,
+  false,
   test->match_data,
   test->hdr->data,
   &test->notifier);
diff --git a/hw/vhost.c b/hw/vhost.c
index 4d6aee3..9ad7101 100644
--- a/hw/vhost.c
+++ b/hw/vhost.c
@@ -750,13 +750,13 @@ static void vhost_virtqueue_stop(struct vhost_dev *dev,
 }
 
 static void vhost_eventfd_add(MemoryListener *listener,
-  MemoryRegionSection *section,
+  MemoryRegionSection *section, bool pv,
   bool match_data, uint64_t data, EventNotifier *e)
 {
 }
 
 static void vhost_eventfd_del(MemoryListener *listener,
-  MemoryRegionSection *section,
+  MemoryRegionSection *section, bool pv,
   bool match_data, uint64_t data, EventNotifier *e)
 {
 }
diff --git a/hw/virtio-pci.c b/hw/virtio-pci.c
index 19965e5..f9ff994 100644
--- a/hw/virtio-pci.c
+++ b/hw/virtio-pci.c
@@ -189,10 +189,10 @@ static int 
virtio_pci_set_host_notifier_internal(VirtIOPCIProxy *proxy,
 }
 virtio_queue_set_host_notifier_fd_handler(vq, true, set_handler);
 memory_region_add_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
-  true, n, notifier);
+  false, true, n, notifier);
 } else {
 memory_region_del_eventfd(&proxy->bar, VIRTIO_PCI_QUEUE_NOTIFY, 2,
-  true, n, notifier);
+  false, true, n, notifier);
 virtio_queue_set_host_notifier_fd_handler(vq, false, false);
 event_notifier_cleanup(notifier);
 }
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2322732..3612004 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -215,8 +215,10 @@ struct MemoryListener {
 void (*log_global_start)(MemoryListener *listener);
 void (*log_global_stop)(MemoryListener *listener);
 void (*eventfd_add)(MemoryListener *listener, MemoryRegionSection *section,
+bool pv,
 bool match_data, uint64_t data, EventNotifier *e)

[PATCH v2 3/6] kvm: support non datamatch ioeventfd

2013-04-04 Thread Michael S. Tsirkin
Adding restrictions just adds code.

Signed-off-by: Michael S. Tsirkin 
---
 kvm-all.c | 34 ++
 1 file changed, 18 insertions(+), 16 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index 589e37c..ce823f9 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -501,21 +501,24 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 }
 
 static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val,
-  bool assign, uint32_t size)
+  bool assign, uint32_t size, bool datamatch)
 {
 int ret;
 struct kvm_ioeventfd iofd;
 
-iofd.datamatch = val;
+iofd.datamatch = datamatch ? val : 0;
 iofd.addr = addr;
 iofd.len = size;
-iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.flags = 0;
 iofd.fd = fd;
 
 if (!kvm_enabled()) {
 return -ENOSYS;
 }
 
+if (datamatch) {
+iofd.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
+}
 if (!assign) {
 iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
 }
@@ -530,19 +533,22 @@ static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, 
uint32_t val,
 }
 
 static int kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint16_t val,
- bool assign, uint32_t size)
+ bool assign, uint32_t size, bool datamatch)
 {
 struct kvm_ioeventfd kick = {
-.datamatch = val,
+.datamatch = datamatch ? val : 0,
 .addr = addr,
+.flags = KVM_IOEVENTFD_FLAG_PIO,
 .len = size,
-.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
 .fd = fd,
 };
 int r;
 if (!kvm_enabled()) {
 return -ENOSYS;
 }
+if (datamatch) {
+kick.flags |= KVM_IOEVENTFD_FLAG_DATAMATCH;
+}
 if (!assign) {
 kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
 }
@@ -571,7 +577,7 @@ static int kvm_check_many_ioeventfds(void)
 if (ioeventfds[i] < 0) {
 break;
 }
-ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2);
+ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2, true);
 if (ret < 0) {
 close(ioeventfds[i]);
 break;
@@ -582,7 +588,7 @@ static int kvm_check_many_ioeventfds(void)
 ret = i == ARRAY_SIZE(ioeventfds);
 
 while (i-- > 0) {
-kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2);
+kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2, true);
 close(ioeventfds[i]);
 }
 return ret;
@@ -802,10 +808,8 @@ static void kvm_mem_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size <= 8);
-
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
-   data, true, section->size);
+   data, true, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -820,7 +824,7 @@ static void kvm_mem_ioeventfd_del(MemoryListener *listener,
 int r;
 
 r = kvm_set_ioeventfd_mmio(fd, section->offset_within_address_space,
-   data, false, section->size);
+   data, false, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -834,10 +838,8 @@ static void kvm_io_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size <= 8);
-
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
-  data, true, section->size);
+  data, true, section->size, match_data);
 if (r < 0) {
 abort();
 }
@@ -853,7 +855,7 @@ static void kvm_io_ioeventfd_del(MemoryListener *listener,
 int r;
 
 r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
-  data, false, section->size);
+  data, false, section->size, match_data);
 if (r < 0) {
 abort();
 }
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 2/6] kvm: support any size for pio eventfd

2013-04-04 Thread Michael S. Tsirkin
Signed-off-by: Michael S. Tsirkin 
---
 kvm-all.c | 24 
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/kvm-all.c b/kvm-all.c
index ca9775d..589e37c 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -500,8 +500,8 @@ int kvm_check_extension(KVMState *s, unsigned int extension)
 return ret;
 }
 
-static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool 
assign,
-  uint32_t size)
+static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val,
+  bool assign, uint32_t size)
 {
 int ret;
 struct kvm_ioeventfd iofd;
@@ -529,13 +529,13 @@ static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, 
uint32_t val, bool assi
 return 0;
 }
 
-static int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val,
-  bool assign)
+static int kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint16_t val,
+ bool assign, uint32_t size)
 {
 struct kvm_ioeventfd kick = {
 .datamatch = val,
 .addr = addr,
-.len = 2,
+.len = size,
 .flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
 .fd = fd,
 };
@@ -571,7 +571,7 @@ static int kvm_check_many_ioeventfds(void)
 if (ioeventfds[i] < 0) {
 break;
 }
-ret = kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, true);
+ret = kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, true, 2);
 if (ret < 0) {
 close(ioeventfds[i]);
 break;
@@ -582,7 +582,7 @@ static int kvm_check_many_ioeventfds(void)
 ret = i == ARRAY_SIZE(ioeventfds);
 
 while (i-- > 0) {
-kvm_set_ioeventfd_pio_word(ioeventfds[i], 0, i, false);
+kvm_set_ioeventfd_pio(ioeventfds[i], 0, i, false, 2);
 close(ioeventfds[i]);
 }
 return ret;
@@ -834,10 +834,10 @@ static void kvm_io_ioeventfd_add(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-assert(match_data && section->size == 2);
+assert(match_data && section->size <= 8);
 
-r = kvm_set_ioeventfd_pio_word(fd, section->offset_within_address_space,
-   data, true);
+r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
+  data, true, section->size);
 if (r < 0) {
 abort();
 }
@@ -852,8 +852,8 @@ static void kvm_io_ioeventfd_del(MemoryListener *listener,
 int fd = event_notifier_get_fd(e);
 int r;
 
-r = kvm_set_ioeventfd_pio_word(fd, section->offset_within_address_space,
-   data, false);
+r = kvm_set_ioeventfd_pio(fd, section->offset_within_address_space,
+  data, false, section->size);
 if (r < 0) {
 abort();
 }
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 4/6] pci: add pci test device

2013-04-04 Thread Michael S. Tsirkin
This device is used for kvm unit tests,
currently it supports testing performance of ioeventfd.
Using updated kvm unittest, here's an example output:
mmio-no-eventfd:pci-mem 8796
mmio-wildcard-eventfd:pci-mem 3609
mmio-datamatch-eventfd:pci-mem 3685
portio-no-eventfd:pci-io 5287
portio-wildcard-eventfd:pci-io 1762
portio-datamatch-eventfd:pci-io 1777

Signed-off-by: Michael S. Tsirkin 
---
 docs/specs/pci-testdev.txt |  26 
 hw/Makefile.objs   |   1 +
 hw/pci-testdev.c   | 306 +
 hw/pci/pci.h   |   1 +
 4 files changed, 334 insertions(+)
 create mode 100644 docs/specs/pci-testdev.txt
 create mode 100644 hw/pci-testdev.c

diff --git a/docs/specs/pci-testdev.txt b/docs/specs/pci-testdev.txt
new file mode 100644
index 000..128ae22
--- /dev/null
+++ b/docs/specs/pci-testdev.txt
@@ -0,0 +1,26 @@
+pci-test is a device used for testing low level IO
+
+device implements up to two BARs: BAR0 and BAR1.
+Each BAR can be memory or IO. Guests must detect
+BAR type and act accordingly.
+
+Each BAR size is up to 4K bytes.
+Each BAR starts with the following header:
+
+typedef struct PCITestDevHdr {
+uint8_t test;  <- write-only, starts a given test number
+uint8_t width_type; <- read-only, type and width of access for a given 
test.
+   1,2,4 for byte,word or long write.
+   any other value if test not supported on this BAR
+uint8_t pad0[2];
+uint32_t offset; <- read-only, offset in this BAR for a given test
+uint32_t data;<- read-only, data to use for a given test
+uint32_t count;  <- for debugging. number of writes detected.
+uint8_t name[]; <- for debugging. 0-terminated ASCII string.
+} PCITestDevHdr;
+
+All registers are little endian.
+
+device is expected to always implement tests 0 to N on each BAR, and to add new
+tests with higher numbers.  In this way a guest can scan test numbers until it
+detects an access type that it does not support on this BAR, then stop.
diff --git a/hw/Makefile.objs b/hw/Makefile.objs
index eb7eb31..b4e3b1d 100644
--- a/hw/Makefile.objs
+++ b/hw/Makefile.objs
@@ -12,6 +12,7 @@ common-obj-$(CONFIG_VIRTIO_PCI) += virtio-pci.o
 common-obj-$(CONFIG_VIRTIO) += virtio-bus.o
 common-obj-y += fw_cfg.o
 common-obj-$(CONFIG_PCI) += pci_bridge_dev.o
+common-obj-$(CONFIG_PCI) += pci-testdev.o
 common-obj-$(CONFIG_PCI) += ioh3420.o xio3130_upstream.o xio3130_downstream.o
 common-obj-$(CONFIG_PCI) += i82801b11.o
 common-obj-y += watchdog.o
diff --git a/hw/pci-testdev.c b/hw/pci-testdev.c
new file mode 100644
index 000..9486624
--- /dev/null
+++ b/hw/pci-testdev.c
@@ -0,0 +1,306 @@
+#include "hw/hw.h"
+#include "hw/pci/pci.h"
+#include "qemu/event_notifier.h"
+#include "qemu/osdep.h"
+
+typedef struct PCITestDevHdr {
+uint8_t test;
+uint8_t width;
+uint8_t pad0[2];
+uint32_t offset;
+uint8_t data;
+uint8_t pad1[3];
+uint32_t count;
+uint8_t name[];
+} PCITestDevHdr;
+
+typedef struct IOTest {
+MemoryRegion *mr;
+EventNotifier notifier;
+bool hasnotifier;
+unsigned size;
+bool match_data;
+PCITestDevHdr *hdr;
+unsigned bufsize;
+} IOTest;
+
+#define IOTEST_DATAMATCH 0xFA
+#define IOTEST_NOMATCH   0xCE
+
+#define IOTEST_IOSIZE 128
+#define IOTEST_MEMSIZE 2048
+
+static const char *iotest_test[] = {
+"no-eventfd",
+"wildcard-eventfd",
+"datamatch-eventfd"
+};
+
+static const char *iotest_type[] = {
+"mmio",
+"portio"
+};
+
+#define IOTEST_TEST(i) (iotest_test[((i) % ARRAY_SIZE(iotest_test))])
+#define IOTEST_TYPE(i) (iotest_type[((i) / ARRAY_SIZE(iotest_test))])
+#define IOTEST_MAX_TEST (ARRAY_SIZE(iotest_test))
+#define IOTEST_MAX_TYPE (ARRAY_SIZE(iotest_type))
+#define IOTEST_MAX (IOTEST_MAX_TEST * IOTEST_MAX_TYPE)
+
+enum {
+IOTEST_ACCESS_NAME,
+IOTEST_ACCESS_DATA,
+IOTEST_ACCESS_MAX,
+};
+
+#define IOTEST_ACCESS_TYPE uint8_t
+#define IOTEST_ACCESS_WIDTH (sizeof(uint8_t))
+
+typedef struct PCITestDevState {
+PCIDevice dev;
+MemoryRegion mmio;
+MemoryRegion portio;
+IOTest *tests;
+int current;
+} PCITestDevState;
+
+#define IOTEST_IS_MEM(i) (strcmp(IOTEST_TYPE(i), "portio"))
+#define IOTEST_REGION(d, i) (IOTEST_IS_MEM(i) ?  &(d)->mmio : &(d)->portio)
+#define IOTEST_SIZE(i) (IOTEST_IS_MEM(i) ? IOTEST_MEMSIZE : IOTEST_IOSIZE)
+#define IOTEST_PCI_BAR(i) (IOTEST_IS_MEM(i) ? PCI_BASE_ADDRESS_SPACE_MEMORY : \
+   PCI_BASE_ADDRESS_SPACE_IO)
+
+static int pci_testdev_start(IOTest *test)
+{
+test->hdr->count = 0;
+if (!test->hasnotifier) {
+return 0;
+}
+event_notifier_test_and_clear(&test->notifier);
+memory_region_add_eventfd(test->mr,
+  le32_to_cpu(test->hdr->offset),
+  test->size,
+  test->match_data,
+  test->hdr->data,
+ 

[PATCH v2 1/6] kvm: remove unused APIs

2013-04-04 Thread Michael S. Tsirkin
There are only used internally now, move them
out of header and out of stub.

Signed-off-by: Michael S. Tsirkin 
---
 include/sysemu/kvm.h |   4 --
 kvm-all.c| 107 ++-
 kvm-stub.c   |  10 -
 3 files changed, 54 insertions(+), 67 deletions(-)

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index f2d97b5..4a65d9f 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -283,10 +283,6 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void 
*ram_addr,
 #endif
 
 #endif
-int kvm_set_ioeventfd_mmio(int fd, uint32_t adr, uint32_t val, bool assign,
-   uint32_t size);
-
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t adr, uint16_t val, bool 
assign);
 
 int kvm_irqchip_add_msi_route(KVMState *s, MSIMessage msg);
 int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg);
diff --git a/kvm-all.c b/kvm-all.c
index 9b433d3..ca9775d 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -500,6 +500,60 @@ int kvm_check_extension(KVMState *s, unsigned int 
extension)
 return ret;
 }
 
+static int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool 
assign,
+  uint32_t size)
+{
+int ret;
+struct kvm_ioeventfd iofd;
+
+iofd.datamatch = val;
+iofd.addr = addr;
+iofd.len = size;
+iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
+iofd.fd = fd;
+
+if (!kvm_enabled()) {
+return -ENOSYS;
+}
+
+if (!assign) {
+iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+}
+
+ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
+
+if (ret < 0) {
+return -errno;
+}
+
+return 0;
+}
+
+static int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val,
+  bool assign)
+{
+struct kvm_ioeventfd kick = {
+.datamatch = val,
+.addr = addr,
+.len = 2,
+.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
+.fd = fd,
+};
+int r;
+if (!kvm_enabled()) {
+return -ENOSYS;
+}
+if (!assign) {
+kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
+}
+r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
+if (r < 0) {
+return r;
+}
+return 0;
+}
+
+
 static int kvm_check_many_ioeventfds(void)
 {
 /* Userspace can use ioeventfd for io notification.  This requires a host
@@ -1971,59 +2025,6 @@ int kvm_set_signal_mask(CPUArchState *env, const 
sigset_t *sigset)
 
 return r;
 }
-
-int kvm_set_ioeventfd_mmio(int fd, uint32_t addr, uint32_t val, bool assign,
-   uint32_t size)
-{
-int ret;
-struct kvm_ioeventfd iofd;
-
-iofd.datamatch = val;
-iofd.addr = addr;
-iofd.len = size;
-iofd.flags = KVM_IOEVENTFD_FLAG_DATAMATCH;
-iofd.fd = fd;
-
-if (!kvm_enabled()) {
-return -ENOSYS;
-}
-
-if (!assign) {
-iofd.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
-}
-
-ret = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &iofd);
-
-if (ret < 0) {
-return -errno;
-}
-
-return 0;
-}
-
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool 
assign)
-{
-struct kvm_ioeventfd kick = {
-.datamatch = val,
-.addr = addr,
-.len = 2,
-.flags = KVM_IOEVENTFD_FLAG_DATAMATCH | KVM_IOEVENTFD_FLAG_PIO,
-.fd = fd,
-};
-int r;
-if (!kvm_enabled()) {
-return -ENOSYS;
-}
-if (!assign) {
-kick.flags |= KVM_IOEVENTFD_FLAG_DEASSIGN;
-}
-r = kvm_vm_ioctl(kvm_state, KVM_IOEVENTFD, &kick);
-if (r < 0) {
-return r;
-}
-return 0;
-}
-
 int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 {
 return kvm_arch_on_sigbus_vcpu(cpu, code, addr);
diff --git a/kvm-stub.c b/kvm-stub.c
index 760aadc..ef1f201 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -102,16 +102,6 @@ int kvm_set_signal_mask(CPUArchState *env, const sigset_t 
*sigset)
 }
 #endif
 
-int kvm_set_ioeventfd_pio_word(int fd, uint16_t addr, uint16_t val, bool 
assign)
-{
-return -ENOSYS;
-}
-
-int kvm_set_ioeventfd_mmio(int fd, uint32_t adr, uint32_t val, bool assign, 
uint32_t len)
-{
-return -ENOSYS;
-}
-
 int kvm_on_sigbus_vcpu(CPUState *cpu, int code, void *addr)
 {
 return 1;
-- 
MST

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 0/6] kvm: pci PORT IO MMIO and PV MMIO speed tests

2013-04-04 Thread Michael S. Tsirkin
These patches add a test device, useful to measure speed of MMIO versus PIO, in
different configurations. As I didn't want to reserve a hardcoded range
of memory, I added pci device for this instead.  Used together with the
kvm unittest patches I posted on kvm mailing list.

To use, simply add the device on the pci bus.
Example test output:
vmcall 1519

outl_to_kernel 1745

mmio-no-eventfd:pci-mem 9075
mmio-wildcard-eventfd:pci-mem 3529
mmio-datamatch-eventfd:pci-mem 3509
mmio-pv-eventfd:pci-mem 1878
portio-no-eventfd:pci-io 5535
portio-wildcard-eventfd:pci-io 1846
portio-datamatch-eventfd:pci-io 1848
portio-pv-eventfd:pci-io 1842


First interesting conclusion is that the overhead of MMIO
exit to QEMU as compared to PIO is double that of
MMIO ioeventfd as compared to PIO eventfd. Is this a known fact?

Second is that PV MMIO is almost as fast as portio, so it
looks like a viable replacement when we have a lot of
devices such that we run out of IO space.

Third is that eventfd has small but measureable overhead,
and that hypercalls are measureably faster than port io.
The difference is probably in hardware. Whether this justifies
considering hypercalls for e.g. virtio, is open for discussion.

I also had to extend kvm in a minor way, making all ioeventfd
options accessible through the API. This actually needs less code
than checking that users DTRT.

Patches 1-5 are useful in themselves.
Patches 5 and 6 add support for PV MMIO eventfds and
add the test for it, tested against a kernel patch
I'll send shortly. They can be applied separately when
we are sure the kernel interface is going to be there.

Michael S. Tsirkin (6):
  kvm: remove unused APIs
  kvm: support any size for pio eventfd
  kvm: support non datamatch ioeventfd
  pci: add pci test device
  kvm: add PV MMIO
  pci-testdev: add pv mmio test

 docs/specs/pci-testdev.txt |  26 
 hw/Makefile.objs   |   1 +
 hw/dataplane/hostmem.c |   1 +
 hw/ivshmem.c   |   2 +
 hw/pci-testdev.c   | 311 +
 hw/pci/pci.h   |   1 +
 hw/vhost.c |   4 +-
 hw/virtio-pci.c|   4 +-
 include/exec/memory.h  |  10 ++
 include/sysemu/kvm.h   |   4 -
 kvm-all.c  | 142 +++--
 kvm-stub.c |  10 --
 linux-headers/linux/kvm.h  |   8 ++
 memory.c   |   9 +-
 14 files changed, 448 insertions(+), 85 deletions(-)
 create mode 100644 docs/specs/pci-testdev.txt
 create mode 100644 hw/pci-testdev.c

-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Alexander Graf

On 04.04.2013, at 12:20, Cornelia Huck wrote:

> On Thu, 4 Apr 2013 11:28:35 +0200
> Alexander Graf  wrote:
> 
>> 
>> On 04.04.2013, at 10:25, Cornelia Huck wrote:
>> 
>>> ccw_io_helper neglected to reset vcdev->err after a new channel
>>> program had been successfully started, resulting in stale errors
>>> delivered after one I/O failed. Reset the error after a new
>>> channel program has been successfully started with no old I/O
>>> pending.
>>> 
>>> Signed-off-by: Cornelia Huck 
>> 
>> Shouldn't this be CC: stable?
> 
> I don't think so, current code will not continue after errors anyway.

Yeah, I got confused by the cover letter for a single patch. Usually, when you 
only send a single patch, you wouldn't also send a cover letter :).


Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] kvm: fix MMIO/PIO collision misdetection

2013-04-04 Thread Michael S. Tsirkin
PIO and MMIO are separate address spaces, but
ioeventfd registration code mistakenly detected
two eventfds as duplicate if they use the same address,
even if one is PIO and another one MMIO.

Signed-off-by: Michael S. Tsirkin 
---
 virt/kvm/eventfd.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index adb17f2..93e5b05 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -577,6 +577,7 @@ struct _ioeventfd {
struct eventfd_ctx  *eventfd;
u64  datamatch;
struct kvm_io_device dev;
+   u8   bus_idx;
bool wildcard;
 };
 
@@ -669,7 +670,8 @@ ioeventfd_check_collision(struct kvm *kvm, struct 
_ioeventfd *p)
struct _ioeventfd *_p;
 
list_for_each_entry(_p, &kvm->ioeventfds, list)
-   if (_p->addr == p->addr && _p->length == p->length &&
+   if (_p->bus_idx == p->bus_idx &&
+   _p->addr == p->addr && _p->length == p->length &&
(_p->wildcard || p->wildcard ||
 _p->datamatch == p->datamatch))
return true;
@@ -717,6 +719,7 @@ kvm_assign_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd 
*args)
 
INIT_LIST_HEAD(&p->list);
p->addr= args->addr;
+   p->bus_idx = bus_idx;
p->length  = args->len;
p->eventfd = eventfd;
 
@@ -775,7 +778,8 @@ kvm_deassign_ioeventfd(struct kvm *kvm, struct 
kvm_ioeventfd *args)
list_for_each_entry_safe(p, tmp, &kvm->ioeventfds, list) {
bool wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);
 
-   if (p->eventfd != eventfd  ||
+   if (p->bus_idx != bus_idx ||
+   p->eventfd != eventfd  ||
p->addr != args->addr  ||
p->length != args->len ||
p->wildcard != wildcard)
-- 
MST
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 6/7] ARM: KVM: switch to a dual-step HYP init code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
> On Wed, Apr 03, 2013 at 11:38:30AM +0100, Marc Zyngier wrote:
>> On 03/04/13 11:07, Will Deacon wrote:
>>> On Tue, Apr 02, 2013 at 02:25:14PM +0100, Marc Zyngier wrote:
 Our HYP init code suffers from two major design issues:
 - it cannot support CPU hotplug, as we tear down the idmap very early
 - it cannot perform a TLB invalidation when switching from init to
   runtime mappings, as pages are manipulated from PL1 exclusively
>>>
>>> [...]
>>>
 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 35a463f..b2c6967 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -21,6 +21,7 @@
  #include 
  #include 
  #include 
 +#include 
  
  /
   * Hypervisor initialization
 @@ -47,6 +48,9 @@ __kvm_hyp_init:
W(b).
  
  __do_hyp_init:
 +  cmp r2, #0  @ We have a SP?
 +  bne phase2  @ Yes, second stage init
 +
@ Set the HTTBR to point to the hypervisor PGD pointer passed
mcrrp15, 4, r0, r1, c2
  
 @@ -96,14 +100,35 @@ __do_hyp_init:
orr r0, r0, r1
isb
mcr p15, 4, r0, c1, c0, 0   @ HSCR
 -  isb
  
 -  @ Set stack pointer and return to the kernel
 +  eret
 +
 +phase2:
 +  @ Set stack pointer
mov sp, r2
  
@ Set HVBAR to point to the HYP vectors
mcr p15, 4, r3, c12, c0, 0  @ HVBAR
  
 +  @ Jump to the trampoline page
 +  ldr r2, =#PAGE_MASK
>>>
>>> Shifting right by PAGE_SHIFT can avoid the load.
>>
>> Not really. We're masking out the top bits of "target" and adding them
>> to the trampoline base address, so shifting doesn't help.
>>
>> But, as you suggested offline, BFI can come to the rescue and make that
>> code totally fun and unreadable. How about (untested):
>>
>>  ldr r2, =#TRAMPOLINE_VA
>>  adr r3, target
>>  bfi r2, r3, #0, #PAGE_SHIFT
>>  mov pc, r2
>>
>> I really like it! :)
>>
> 
> What kind of drugs are you on?
> 
> Ok, I actually like it too.

Implemented, tested, works.

>>>
 +  adr r3, target
 +  bic r3, r3, r2
 +  ldr r2, =#TRAMPOLINE_VA
 +  add r3, r3, r2
 +  mov pc, r3
 +
 +  nop
>>>
>>> 
>>
>> ... "You know I'll never sleep no more" ...
>>
> 
> Seriously, what kind of drugs are you guys on?

Someone did comment last year about the quality of the water in
Cambridge. He may have been right. But in this occurrence, it's only a
mild case of Frank Zappatis (Zomby Woof variety).

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/7] ARM: KVM: enforce page alignment for identity mapped code

2013-04-04 Thread Marc Zyngier
On 04/04/13 00:15, Christoffer Dall wrote:
> On Wed, Apr 03, 2013 at 11:00:39AM +0100, Marc Zyngier wrote:
>> On 03/04/13 10:50, Will Deacon wrote:
>>> On Tue, Apr 02, 2013 at 02:25:12PM +0100, Marc Zyngier wrote:
 We're about to move to a init procedure where we rely on the
 fact that the init code fits in a single page. Make sure we
 align the idmap text on a page boundary, and that the code is
 not bigger than a single page.

 Signed-off-by: Marc Zyngier 
 ---
  arch/arm/kernel/vmlinux.lds.S | 2 +-
  arch/arm/kvm/init.S   | 7 +++
  2 files changed, 8 insertions(+), 1 deletion(-)

 diff --git a/arch/arm/kernel/vmlinux.lds.S b/arch/arm/kernel/vmlinux.lds.S
 index b571484..d9dd265 100644
 --- a/arch/arm/kernel/vmlinux.lds.S
 +++ b/arch/arm/kernel/vmlinux.lds.S
 @@ -20,7 +20,7 @@
VMLINUX_SYMBOL(__idmap_text_start) = .; \
*(.idmap.text)  \
VMLINUX_SYMBOL(__idmap_text_end) = .;   \
 -  ALIGN_FUNCTION();   \
 +  . = ALIGN(PAGE_SIZE);   \
VMLINUX_SYMBOL(__hyp_idmap_text_start) = .; \
*(.hyp.idmap.text)  \
VMLINUX_SYMBOL(__hyp_idmap_text_end) = .;
 diff --git a/arch/arm/kvm/init.S b/arch/arm/kvm/init.S
 index 9f37a79..35a463f 100644
 --- a/arch/arm/kvm/init.S
 +++ b/arch/arm/kvm/init.S
 @@ -111,4 +111,11 @@ __do_hyp_init:
.globl __kvm_hyp_init_end
  __kvm_hyp_init_end:
  
 +  /*
 +   * The above code *must* fit in a single page for the trampoline
 +   * madness to work. Whoever decides to change it must make sure
 +   * we map the right amount of memory for the trampoline to work.
 +   * The line below ensures any breakage will get noticed.
> 
> This comment is a little funny. What's this about changing it? And why
> madness? OK, it's a little mad.
> 
> I think the nice thing to explain here is why .org would complain - I
> had to look in the AS reference to figure out that it will complain if
> the location pointer is forwarded outside the section - if that is the
> right reason?
> 
 +   */
 +  .org__kvm_hyp_init + PAGE_SIZE
.popsection
>>>
>>> What effect does this have on the size of the kernel image? I'd expect the
>>> idmap code to be pretty small, so aligning to a page might be overkill a lot
>>> of the time.
>>
>> So we're indeed wasting memory between the kernel and the HYP idmaps.
>> The reason for this is to ensure that the init code is not spread across
>> two pages, which would make the trampoline harder to allocate.
>>
>> If there's a way to ensure that the init.S code will fit a single page,
>> then we can remove this ALIGN(PAGE_SIZE).
> 
> Just allocate a temporary page and copy the bit of code into there?  It
> should be easy enough to make sure it's location independent.  We can
> even give that page back later if we want and re-do the trick when a CPU
> gets hot-plugged if we want. No?

I'd really like to avoid freeing stuff, because it gives us even more
chances to screw our TLBs (we'd have to invalidate on unmap, and deal
with potential races between CPUs coming up).

We could allocate this page *if* the HYP init code happens to cross a
page boundary, and only then. I'll have a look at implementing this as
it should solve Will's issue (which is even worse on arm64 with 64k pages).

M.
-- 
Jazz is not dead. It just smells funny...

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v3 1/6] kvm: add device control API

2013-04-04 Thread Gleb Natapov
On Tue, Apr 02, 2013 at 08:57:48PM -0500, Scott Wood wrote:
> Currently, devices that are emulated inside KVM are configured in a
> hardcoded manner based on an assumption that any given architecture
> only has one way to do it.  If there's any need to access device state,
> it is done through inflexible one-purpose-only IOCTLs (e.g.
> KVM_GET/SET_LAPIC).  Defining new IOCTLs for every little thing is
> cumbersome and depletes a limited numberspace.
> 
> This API provides a mechanism to instantiate a device of a certain
> type, returning an ID that can be used to set/get attributes of the
> device.  Attributes may include configuration parameters (e.g.
> register base address), device state, operational commands, etc.  It
> is similar to the ONE_REG API, except that it acts on devices rather
> than vcpus.
> 
> Both device types and individual attributes can be tested without having
> to create the device or get/set the attribute, without the need for
> separately managing enumerated capabilities.
> 
> Signed-off-by: Scott Wood 
> ---
> v3: remove some changes that were merged into this patch by accident,
> and fix the error documentation for KVM_CREATE_DEVICE.
> 
> NOTE: I had some difficulty figuring out what ioctl numbers I should
> assign...  it seems that at one point care was taken to keep vcpu and
> vm ioctls separate, but some overlap exists now (despite not exhausing
> the ioctl space).  Some of that was my fault, but not all of it. :-)
> I moved to a new ioctl range for device control -- please let me know
> if there's something else you'd prefer I do.
> ---
>  Documentation/virtual/kvm/api.txt|   70 
> ++
>  Documentation/virtual/kvm/devices/README |1 +
>  include/uapi/linux/kvm.h |   27 
>  virt/kvm/kvm_main.c  |   31 +
>  4 files changed, 129 insertions(+)
>  create mode 100644 Documentation/virtual/kvm/devices/README
> 
> diff --git a/Documentation/virtual/kvm/api.txt 
> b/Documentation/virtual/kvm/api.txt
> index 976eb65..d52f3f9 100644
> --- a/Documentation/virtual/kvm/api.txt
> +++ b/Documentation/virtual/kvm/api.txt
> @@ -2173,6 +2173,76 @@ header; first `n_valid' valid entries with contents 
> from the data
>  written, then `n_invalid' invalid entries, invalidating any previously
>  valid entries found.
>  
> +4.79 KVM_CREATE_DEVICE
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: vm ioctl
> +Parameters: struct kvm_create_device (in/out)
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENODEV: The device type is unknown or unsupported
> +  EEXIST: Device already created, and this type of device may not
> +  be instantiated multiple times
> +
> +  Other error conditions may be defined by individual device types or
> +  have their standard meanings.
> +
> +Creates an emulated device in the kernel.  The file descriptor returned
> +in fd can be used with KVM_SET/GET/HAS_DEVICE_ATTR.
> +
> +If the KVM_CREATE_DEVICE_TEST flag is set, only test whether the
> +device type is supported (not necessarily whether it can be created
> +in the current vm).
> +
> +Individual devices should not define flags.  Attributes should be used
> +for specifying any behavior that is not implied by the device type
> +number.
> +
> +struct kvm_create_device {
> + __u32   type;   /* in: KVM_DEV_TYPE_xxx */
> + __u32   fd; /* out: device handle */
> + __u32   flags;  /* in: KVM_CREATE_DEVICE_xxx */
> +};
> +
> +4.80 KVM_SET_DEVICE_ATTR/KVM_GET_DEVICE_ATTR
> +
> +Capability: KVM_CAP_DEVICE_CTRL
> +Type: device ioctl
> +Parameters: struct kvm_device_attr
> +Returns: 0 on success, -1 on error
> +Errors:
> +  ENXIO:  The group or attribute is unknown/unsupported for this device
> +  EPERM:  The attribute cannot (currently) be accessed this way
> +  (e.g. read-only attribute, or attribute that only makes
> +  sense when the device is in a different state)
> +
> +  Other error conditions may be defined by individual device types.
> +
> +Gets/sets a specified piece of device configuration and/or state.  The
> +semantics are device-specific.  See individual device documentation in
> +the "devices" directory.  As with ONE_REG, the size of the data
> +transferred is defined by the particular attribute.
> +
> +struct kvm_device_attr {
> + __u32   flags;  /* no flags currently defined */
> + __u32   group;  /* device-defined */
> + __u64   attr;   /* group-defined */
> + __u64   addr;   /* userspace address of attr data */
> +};
> +
Since now each device has its own fd is it an advantage to enforce
common interface between different devices? If we do so though why
not handle file creation, ioctl and file descriptor lifetime in the
common code. Common code will have "struct kvm_device" with "struct
kvm_device_arch" and "struct kvm_device_ops" members. Instead of
kvm_mpic_ioctl there will be kvm_device_ioctl which will despatch ioctls
to a devi

Re: [PATCH] KVM: s390: virtio_ccw: reset errors for new I/O.

2013-04-04 Thread Cornelia Huck
On Thu, 4 Apr 2013 11:28:35 +0200
Alexander Graf  wrote:

> 
> On 04.04.2013, at 10:25, Cornelia Huck wrote:
> 
> > ccw_io_helper neglected to reset vcdev->err after a new channel
> > program had been successfully started, resulting in stale errors
> > delivered after one I/O failed. Reset the error after a new
> > channel program has been successfully started with no old I/O
> > pending.
> > 
> > Signed-off-by: Cornelia Huck 
> 
> Shouldn't this be CC: stable?

I don't think so, current code will not continue after errors anyway.

> 
> 
> Alex
> 
> > ---
> > drivers/s390/kvm/virtio_ccw.c | 5 -
> > 1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/s390/kvm/virtio_ccw.c b/drivers/s390/kvm/virtio_ccw.c
> > index 42d507c..5948f2a 100644
> > --- a/drivers/s390/kvm/virtio_ccw.c
> > +++ b/drivers/s390/kvm/virtio_ccw.c
> > @@ -133,8 +133,11 @@ static int ccw_io_helper(struct virtio_ccw_device 
> > *vcdev,
> > do {
> > spin_lock_irqsave(get_ccwdev_lock(vcdev->cdev), flags);
> > ret = ccw_device_start(vcdev->cdev, ccw, intparm, 0, 0);
> > -   if (!ret)
> > +   if (!ret) {
> > +   if (!vcdev->curr_io)
> > +   vcdev->err = 0;
> > vcdev->curr_io |= flag;
> > +   }
> > spin_unlock_irqrestore(get_ccwdev_lock(vcdev->cdev), flags);
> > cpu_relax();
> > } while (ret == -EBUSY);
> > -- 
> > 1.7.12.4
> > 
> 

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH v2 1/6] kvm: add device control API

2013-04-04 Thread Gleb Natapov
On Wed, Apr 03, 2013 at 07:39:52PM +0200, Alexander Graf wrote:
> 
> On 03.04.2013, at 19:37, Scott Wood wrote:
> 
> > On 04/03/2013 08:22:37 AM, Alexander Graf wrote:
> >> On 03.04.2013, at 04:17, Paul Mackerras wrote:
> >> > On Tue, Apr 02, 2013 at 08:19:56PM -0500, Scott Wood wrote:
> >> >> On 04/02/2013 08:02:39 PM, Paul Mackerras wrote:
> >> >>> On Mon, Apr 01, 2013 at 05:47:48PM -0500, Scott Wood wrote:
> >>  +4.79 KVM_CREATE_DEVICE
> >>  +
> >>  +Capability: KVM_CAP_DEVICE_CTRL
> >> >>>
> >> >>> I notice this patch doesn't add this capability;
> >> >>
> >> >> Yes, it does (see below).
> >> >>
> >> >>> you add it in a later patch.
> >> >>
> >> >> Maybe you're thinking of KVM_CAP_IRQ_MPIC?
> >> >
> >> > No, I was referring to the addition to kvm_dev_ioctl_check_extension()
> >> > of a KVM_CAP_DEVICE_CTRL case.  Since this patch adds the code to handle
> >> > KVM_CREATE_DEVICE ioctl it should also add the code to return 1 if
> >> > userspace queries the KVM_CAP_DEVICE_CTRL capability.
> >> >
> >>  +/* ioctl for vm fd */
> >>  +#define KVM_CREATE_DEVICE  _IOWR(KVMIO,  0xe0, struct
> >> >>> kvm_create_device)
> >> >>>
> >> >>> This define should go with the other VM ioctls, otherwise the next
> >> >>> person to add a VM ioctl will probably miss it and reuse the 0xe0
> >> >>> code.
> >> >>
> >> >> That's actually why I moved it to a new section, with device control
> >> >> ioctls getting their own range, as the legacy "device model" and
> >> >> some other things did.  0xe0 is not the next ioctl that would be
> >> >> used for either vm or vcpu.  The ioctl numbering is actually already
> >> >> a mess, with sometimes care being taken to keep vcpu and vm ioctls
> >> >> from overlapping, but on other places overlapping does happen.  I'm
> >> >> not sure what exactly I should do here.
> >> >
> >> > Well, even if you are using a new range, I still think that
> >> > KVM_CREATE_DEVICE, being a VM ioctl, should go next to the other VM
> >> > ioctls.  I guess it's ultimately up to the maintainers.
> >> I agree. Things get confusing for VM ioctls otherwise.
> > 
> > Things are already confusing. :-)
> > 
> > I can move KVM_CREATE_DEVICE back with the other VM ioctls, but what number 
> > should it get?  The last VM ioctl is 0xab (which is also a VCPU ioctl).  
> > Should I use 0xac (which is also a VCPU ioctl)?  Or should I try to avoid a 
> > conflict, as was sometimes done in the past -- in which case, which number 
> > should I use?
> 
> Gleb, Marcelo?
> 
> 
Yes, ioctls number assignments are a little bit of a mess :( There are 14
numbers that are used twice and some of them are used twice for the same
type of fd, but with different direction bits and there is one, 0x9b,
that is used twice and have the same IO direction and even, potentially,
same parameter size (just checked it has the same parameter size)! Good that
the second use is only on IA64 which is almost dead.

Although ioctls numbers between different types of fds should not conflict
lets try and make them unique. Scott, please put KVM_CREATE_DEVICE near
device model ioctls and give it a unique number.

--
Gleb.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


  1   2   >