Re: [PATCH v2] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-10 Thread Avi Kivity
On 09/10/2012 06:29 AM, Hao, Xudong wrote:
>> 
>> Doesn't help.  We can have:
>> 
>> host: deactivate fpu for some reason
>> guest: set cr4.osxsave, xcr0.bit3
>> host: enter guest with deactivated fpu
>> guest: touch fpu
>> 
>> result: host fpu corrupted.
> 
> Avi, I'm not sure I fully understand you. Do you mean entering the guest with
> fpu_active=0, so the FPU is not restored?

Yes.

> If so, I will add fpu_active=1 in the no-lazy case.
> 
> -   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> +   if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
> +   (vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))) {
> +   kvm_x86_ops->fpu_activate(vcpu);
> +   vcpu->fpu_active=1;
> +   }
> +   else
> +   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> 

It doesn't help here.

  1 guest boot
  2 kvm_userspace_exit (deactivates fpu)
  3 XSETBV exit that sets xcr0.new_bit
  4 kvm_enter

There is no call to kvm_put_guest_fpu() between 3 and 4, you need
something in __kvm_set_xcr() to activate the fpu.

kvm_put_guest_fpu() doesn't need to activate the fpu then, just to avoid
deactivating it.
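
A minimal sketch of that idea, reusing the KVM_XSTATE_LAZY mask from the
patch above and the existing kvm_x86_ops->fpu_activate() hook (an
illustration of the suggestion, not the change that was actually merged):

	/* in __kvm_set_xcr(), after the new xcr0 has been validated and stored */
	vcpu->arch.xcr0 = xcr0;
	if (xcr0 & ~((u64)KVM_XSTATE_LAZY)) {
		/* non-lazily switched xsave state enabled: keep the FPU loaded */
		kvm_x86_ops->fpu_activate(vcpu);
		vcpu->fpu_active = 1;
	}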

Note you also need to consider writes to xcr0 and cr4 that happen in the
reverse order due to live migration.




Re: [PATCH] KVM: tsc deadline timer works only when hrtimer high resolution configured

2012-09-10 Thread Avi Kivity
On 09/09/2012 06:10 PM, Liu, Jinsong wrote:
> Avi Kivity wrote:
>> On 09/09/2012 05:54 PM, Liu, Jinsong wrote:
 
 hrtimers is an intrusive feature, I don't think we should
 force-enable it.  Please change it to a depends on.
>>> 
>>> Hmm, if it is changed to
>>> config KVM
>>> depends on HIGH_RES_TIMERS
>>> then the item 'Kernel-based Virtual Machine (KVM) support (NEW)' does not
>>> even appear to the user in make menuconfig (when HIGH_RES_TIMERS is
>>> disabled).
>>> 
>>> Is that good? I just have a little concern here :)
>> 
>> It's not good, but that's what we have.
>> 
>> It's okay to force-enable low-impact features (like preempt notifiers).
>> 
>> hrtimers, on the other hand, change kernel behaviour quite deeply.
>> 
>> Maybe over time someone will fix the config tools to unhide features
>> that can be enabled by turning on a dependency.
> 
> OK, updated as attached.

Thanks, applied.




Re: [PATCH 1/2] kvm tools: Export DISPLAY ENV as our default host ip address

2012-09-10 Thread Avi Kivity
On 09/10/2012 04:26 AM, Asias He wrote:
>>
>>>
 Or you can
 make the guest talk to an internal unix-domain socket, tunnel that
 through virtio-serial, terminate virtio-serial in lkvm, and direct it
 towards the local X socket.
>>>
>>> Doesn't this require some user agent or config modification to the guest?
>>
>> It does, a daemon that listens locally and forwards data over
>> virtio-serial.  But you build your own initrd anyway, don't you?
> 
> Using our custom init file is one use case. Users may also use a distro disk
> image as the guest.

We could push it into the distros, it should be useful for qemu as well.
 This could be done even more properly by adding support for
virtio-serial in X itself (is anything really needed on the client side?)

The only problem is that it is insecure, but maybe
http://www.nsa.gov/research/_files/selinux/papers/x11/t1.shtml can help.
 I don't know if it's integrated yet.
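
For reference, a rough sketch of such a guest agent: listen on a local
unix-domain socket (where X clients would connect) and shuttle bytes over a
virtio-serial port. The port name and socket path are made-up placeholders,
and a real agent would need to multiplex multiple connections; this only
handles one.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <sys/un.h>

#define VPORT "/dev/virtio-ports/org.example.x11"  /* placeholder port name */
#define LSOCK "/tmp/.X11-unix/X0"                  /* guest-side X socket */

static void pump(int from, int to)
{
	char buf[4096];
	ssize_t n = read(from, buf, sizeof(buf));
	if (n <= 0)
		exit(0);
	if (write(to, buf, n) != n)
		exit(1);
}

int main(void)
{
	struct sockaddr_un sa = { .sun_family = AF_UNIX };
	int vport, lsock, client, maxfd;
	fd_set rfds;

	vport = open(VPORT, O_RDWR);
	if (vport < 0) { perror("open vport"); return 1; }

	lsock = socket(AF_UNIX, SOCK_STREAM, 0);
	strncpy(sa.sun_path, LSOCK, sizeof(sa.sun_path) - 1);
	unlink(LSOCK);
	if (bind(lsock, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
	    listen(lsock, 1) < 0) { perror("bind/listen"); return 1; }

	client = accept(lsock, NULL, NULL);
	if (client < 0) { perror("accept"); return 1; }

	maxfd = (vport > client ? vport : client) + 1;
	for (;;) {
		FD_ZERO(&rfds);
		FD_SET(vport, &rfds);
		FD_SET(client, &rfds);
		if (select(maxfd, &rfds, NULL, NULL, NULL) < 0)
			break;
		if (FD_ISSET(vport, &rfds))
			pump(vport, client);	/* host -> X client */
		if (FD_ISSET(client, &rfds))
			pump(client, vport);	/* X client -> host */
	}
	return 0;
}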

> 
>> Another option is ppp-over-virtio-serial.
> 
> Seems this still uses tcp where the link layer changes from ethernet to 
> serial.

Right, you have to terminate it on the host side without routing.




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
> We can not directly call kvm_release_pfn_clean to release the pfn
> since we can meet noslot pfn which is used to cache mmio info into
> spte
> 
> Introduce mmu_release_pfn_clean to do this kind of thing
> 
> Signed-off-by: Xiao Guangrong 
> ---
>  arch/x86/kvm/mmu.c |   19 ++-
>  arch/x86/kvm/paging_tmpl.h |4 ++--
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 399c177..3c10bca 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -2432,6 +2432,16 @@ done:
>   return ret;
>  }
> 
> +/*
> + * The primary user is page fault path which call it to properly
> + * release noslot_pfn.
> + */
> +static void mmu_release_pfn_clean(pfn_t pfn)
> +{
> + if (!is_error_pfn(pfn))
> + kvm_release_pfn_clean(pfn);
> +}
> +

Too many APIs, each slightly different.  How do I know which one to call?

Please change kvm_release_pfn_*() instead, calling some arch hook (or
even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.
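
A sketch of that direction (the hook name is assumed, this is not an actual
patch): keep the mmu callers on the generic helper and let it tolerate the
special pfn itself:

void kvm_release_pfn_clean(pfn_t pfn)
{
	/* hypothetical arch hook covering the x86 noslot / mmio-cache pfn */
	if (kvm_arch_is_error_pfn(pfn))
		return;
	if (!kvm_is_mmio_pfn(pfn))
		put_page(pfn_to_page(pfn));
}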






Re: [PATCH 3/3] KVM: MMU: remove unnecessary check

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:15 AM, Xiao Guangrong wrote:
> Checking the return of kvm_mmu_get_page is unnecessary since it is
> guaranteed by memory cache
> 


Thanks, applied.




Re: [PATCH 2/3] KVM: fix release error page

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:14 AM, Xiao Guangrong wrote:
> This bug was triggered:
> [ 4220.198458] BUG: unable to handle kernel paging request at fffe
> [ 4220.203907] IP: [] put_page+0xf/0x34
> ..
> [ 4220.237326] Call Trace:
> [ 4220.237361]  [] kvm_arch_destroy_vm+0xf9/0x101 [kvm]
> [ 4220.237382]  [] kvm_put_kvm+0xcc/0x127 [kvm]
> [ 4220.237401]  [] kvm_vcpu_release+0x18/0x1c [kvm]
> [ 4220.237407]  [] __fput+0x111/0x1ed
> [ 4220.237411]  [] fput+0xe/0x10
> [ 4220.237418]  [] task_work_run+0x5d/0x88
> [ 4220.237424]  [] do_exit+0x2bf/0x7ca
> 
> The test case:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <fcntl.h>
> 
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/ioctl.h>
> #include <pthread.h>
> 
> #include <linux/kvm.h>
> 
> #define die(fmt, args...) do {\
>   printf(fmt, ##args);\
>   exit(-1);} while (0)
> 
> static int create_vm(void)
> {
>   int sys_fd, vm_fd;
> 
>   sys_fd = open("/dev/kvm", O_RDWR);
>   if (sys_fd < 0)
>   die("open /dev/kvm fail.\n");
> 
>   vm_fd = ioctl(sys_fd, KVM_CREATE_VM, 0);
>   if (vm_fd < 0)
>   die("KVM_CREATE_VM fail.\n");
> 
>   return vm_fd;
> }
> 
> static int create_vcpu(int vm_fd)
> {
>   int vcpu_fd;
> 
>   vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
>   if (vcpu_fd < 0)
>   die("KVM_CREATE_VCPU ioctl.\n");
>   printf("Create vcpu.\n");
>   return vcpu_fd;
> }
> 
> static void *vcpu_thread(void *arg)
> {
>   int vm_fd = (int)(long)arg;
> 
>   create_vcpu(vm_fd);
>   return NULL;
> }
> 
> int main(int argc, char *argv[])
> {
>   pthread_t thread;
>   int vm_fd;
> 
>   (void)argc;
>   (void)argv;
> 
>   vm_fd = create_vm();
>   pthread_create(&thread, NULL, vcpu_thread, (void *)(long)vm_fd);
>   printf("Exit.\n");
>   return 0;
> }
> 
> It is caused by releasing kvm->arch.ept_identity_map_addr, which is the
> error page.
> 
> The parent thread can send a KILL signal to the vcpu thread while it is
> exiting, which stops faulting pages and potentially allocating memory,
> so gfn_to_pfn/gfn_to_page may fail at this time.
> 
> Fixed by checking the page before it is used.
> 

Thanks, applied to master for 3.6.




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 04:22 PM, Avi Kivity wrote:
> On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
>> We can not directly call kvm_release_pfn_clean to release the pfn
>> since we can meet noslot pfn which is used to cache mmio info into
>> spte
>>
>> Introduce mmu_release_pfn_clean to do this kind of thing
>>
>> Signed-off-by: Xiao Guangrong 
>> ---
>>  arch/x86/kvm/mmu.c |   19 ++-
>>  arch/x86/kvm/paging_tmpl.h |4 ++--
>>  2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>> index 399c177..3c10bca 100644
>> --- a/arch/x86/kvm/mmu.c
>> +++ b/arch/x86/kvm/mmu.c
>> @@ -2432,6 +2432,16 @@ done:
>>  return ret;
>>  }
>>
>> +/*
>> + * The primary user is page fault path which call it to properly
>> + * release noslot_pfn.
>> + */
>> +static void mmu_release_pfn_clean(pfn_t pfn)
>> +{
>> +if (!is_error_pfn(pfn))
>> +kvm_release_pfn_clean(pfn);
>> +}
>> +
> 
> Too many APIs, each slightly different.  How do I know which one to call?

It is only used in mmu and it is a static function.

> 
> Please change kvm_release_pfn_*() instead, calling some arch hook (or
> even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.

We only need to call it on page fault path. If we change the common API
other x86 components have to suffer from it.



Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Avi Kivity
On 09/10/2012 11:37 AM, Xiao Guangrong wrote:
> On 09/10/2012 04:22 PM, Avi Kivity wrote:
>> On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
>>> We can not directly call kvm_release_pfn_clean to release the pfn
>>> since we can meet noslot pfn which is used to cache mmio info into
>>> spte
>>>
>>> Introduce mmu_release_pfn_clean to do this kind of thing
>>>
>>> Signed-off-by: Xiao Guangrong 
>>> ---
>>>  arch/x86/kvm/mmu.c |   19 ++-
>>>  arch/x86/kvm/paging_tmpl.h |4 ++--
>>>  2 files changed, 16 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
>>> index 399c177..3c10bca 100644
>>> --- a/arch/x86/kvm/mmu.c
>>> +++ b/arch/x86/kvm/mmu.c
>>> @@ -2432,6 +2432,16 @@ done:
>>> return ret;
>>>  }
>>>
>>> +/*
>>> + * The primary user is page fault path which call it to properly
>>> + * release noslot_pfn.
>>> + */
>>> +static void mmu_release_pfn_clean(pfn_t pfn)
>>> +{
>>> +   if (!is_error_pfn(pfn))
>>> +   kvm_release_pfn_clean(pfn);
>>> +}
>>> +
>> 
>> Too many APIs, each slightly different.  How do I know which one to call?
> 
> It is only used in mmu and it is a static function.

Still, how do I know which one to call?  The name tells me nothing.
When I read the code, how do I know if a call is correct or not?

> 
>> 
>> Please change kvm_release_pfn_*() instead, calling some arch hook (or
>> even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.
> 
> We only need to call it on page fault path. If we change the common API
> other x86 components have to suffer from it.

This way, I have to suffer from it.

btw, what about another approach, to avoid those paths completely?
Avoid calling __direct_map() with error_pfn, and jump to a label after
kvm_release_pfn_clean() in page_fault(), etc?
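
Roughly something like this (the label and surrounding lines are only
illustrative, not the real page_fault() body):

	if (is_error_pfn(pfn))
		goto no_release;	/* never handed to __direct_map() */

	spin_lock(&vcpu->kvm->mmu_lock);
	r = __direct_map(vcpu, v, write, map_writable, level, gfn, pfn, prefault);
	spin_unlock(&vcpu->kvm->mmu_lock);
	kvm_release_pfn_clean(pfn);
	return r;

no_release:
	/* error/noslot handling continues here without touching the pfn */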



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> This implementation of a virtio balloon driver uses the page cache to
> "store" pages that have been released to the host.  The communication
> (outside of target counts) is one way--the guest notifies the host when
> it adds a page to the page cache, allowing the host to madvise(2) with
> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> (via the regular page reclaim).  This means that inflating the balloon
> is similar to the existing balloon mechanism, but the deflate is
> different--it re-uses existing Linux kernel functionality to
> automatically reclaim.
> 
> Signed-off-by: Frank Swiderski 

I've been trying to understand this, and I have
a question: what exactly is the benefit
of this new device?

Note that users could not care less about how a driver
is implemented internally.

Is there some workload where you see VM working better with
this than regular balloon? Any numbers?

Also, can't we just replace the existing balloon implementation
with this one?  Why is it so important to deflate silently?
I guess the filesystem does not currently get a callback
before a page is reclaimed, but this is an implementation detail -
maybe this can be fixed?

Also can you pls answer Avi's question?
How is overcommit managed?


> ---
>  drivers/virtio/Kconfig  |   13 +
>  drivers/virtio/Makefile |1 +
>  drivers/virtio/virtio_fileballoon.c |  636 
> +++
>  include/linux/virtio_balloon.h  |9 +
>  include/linux/virtio_ids.h  |1 +
>  5 files changed, 660 insertions(+), 0 deletions(-)
>  create mode 100644 drivers/virtio/virtio_fileballoon.c
> 
> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
> index f38b17a..cffa2a7 100644
> --- a/drivers/virtio/Kconfig
> +++ b/drivers/virtio/Kconfig
> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>  
>If unsure, say M.
>  
> +config VIRTIO_FILEBALLOON
> + tristate "Virtio page cache-backed balloon driver"
> + select VIRTIO
> + select VIRTIO_RING
> + ---help---
> +  This driver supports decreasing and automatically reclaiming the
> +  memory within a guest VM.  Unlike VIRTIO_BALLOON, this driver instead
> +  tries to maintain a specific target balloon size using the page cache.
> +  This allows the guest to implicitly deflate the balloon by flushing
> +  pages from the cache and touching the page.
> +
> +  If unsure, say N.
> +
>   config VIRTIO_MMIO
>   tristate "Platform bus driver for memory mapped virtio devices 
> (EXPERIMENTAL)"
>   depends on HAS_IOMEM && EXPERIMENTAL
> diff --git a/drivers/virtio/Makefile b/drivers/virtio/Makefile
> index 5a4c63c..7ca0a3f 100644
> --- a/drivers/virtio/Makefile
> +++ b/drivers/virtio/Makefile
> @@ -3,3 +3,4 @@ obj-$(CONFIG_VIRTIO_RING) += virtio_ring.o
>  obj-$(CONFIG_VIRTIO_MMIO) += virtio_mmio.o
>  obj-$(CONFIG_VIRTIO_PCI) += virtio_pci.o
>  obj-$(CONFIG_VIRTIO_BALLOON) += virtio_balloon.o
> +obj-$(CONFIG_VIRTIO_FILEBALLOON) += virtio_fileballoon.o
> diff --git a/drivers/virtio/virtio_fileballoon.c 
> b/drivers/virtio/virtio_fileballoon.c
> new file mode 100644
> index 000..ff252ec
> --- /dev/null
> +++ b/drivers/virtio/virtio_fileballoon.c
> @@ -0,0 +1,636 @@
> +/* Virtio file (page cache-backed) balloon implementation, inspired by
> + * Dor Laor and Marcelo Tosatti's implementations, and based on Rusty Russell's
> + * implementation.
> + *
> + * This implementation of the virtio balloon driver re-uses the page cache to
> + * allow memory consumed by inflating the balloon to be reclaimed by linux.  
> It
> + * creates and mounts a bare-bones filesystem containing a single inode.  
> When
> + * the host requests the balloon to inflate, it does so by "reading" pages at
> + * offsets into the inode mapping's page_tree.  The host is notified when the
> + * pages are added to the page_tree, allowing it (the host) to madvise(2) the
> + * corresponding host memory, reducing the RSS of the virtual machine.  In 
> this
> + * implementation, the host is only notified when a page is added to the
> + * balloon.  Reclaim happens under the existing TTFP logic, which flushes 
> unused
> + * pages in the page cache.  If the host used MADV_DONTNEED, then when the 
> guest
> + * uses the page, the zero page will be mapped in, allowing automatic (and 
> fast,
> + * compared to requiring a host notification via a virtio queue to get memory
> + * back) reclaim.
> + *
> + *  Copyright 2008 Rusty Russell IBM Corporation
> + *  Copyright 2011 Frank Swiderski Google Inc
> + *
> + *  This program is free software; you can redistribute it and/or modify
> + *  it under the terms of the GNU General Public License as published by
> + *  the Free Software Foundation; either version 2 of the License, or
> + *  (at your option) any later version.
> + *
> + *  This program is distributed in the hope that it will be useful,
> + *  but WITHOUT ANY W

Re: [PATCH v2] KVM: trace the events of mmu_notifier

2012-09-10 Thread Avi Kivity
On 09/07/2012 09:16 AM, Xiao Guangrong wrote:
> mmu_notifier is the interface used to broadcast mm events to KVM; the
> tracepoints introduced in this patch can trace all these events, which is
> very helpful for noticing and fixing bugs caused by mm.

There is nothing kvm specific here.  Perhaps this can be made generic
(with a mm parameter so we can filter by process).
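
For instance, a generic tracepoint carrying the mm could look roughly like
this (the event name and fields are illustrative, not an existing event):

TRACE_EVENT(mmu_notifier_invalidate_page,
	TP_PROTO(struct mm_struct *mm, unsigned long address),
	TP_ARGS(mm, address),
	TP_STRUCT__entry(
		__field(struct mm_struct *, mm)
		__field(unsigned long, address)
	),
	TP_fast_assign(
		__entry->mm	 = mm;
		__entry->address = address;
	),
	TP_printk("mm %p address %#lx", __entry->mm, __entry->address)
);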




Re: [PATCH 1/3] KVM: MMU: release noslot pfn on the fail path properly

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 05:02 PM, Avi Kivity wrote:
> On 09/10/2012 11:37 AM, Xiao Guangrong wrote:
>> On 09/10/2012 04:22 PM, Avi Kivity wrote:
>>> On 09/07/2012 09:13 AM, Xiao Guangrong wrote:
 We can not directly call kvm_release_pfn_clean to release the pfn
 since we can meet noslot pfn which is used to cache mmio info into
 spte

 Introduce mmu_release_pfn_clean to do this kind of thing

 Signed-off-by: Xiao Guangrong 
 ---
  arch/x86/kvm/mmu.c |   19 ++-
  arch/x86/kvm/paging_tmpl.h |4 ++--
  2 files changed, 16 insertions(+), 7 deletions(-)

 diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
 index 399c177..3c10bca 100644
 --- a/arch/x86/kvm/mmu.c
 +++ b/arch/x86/kvm/mmu.c
 @@ -2432,6 +2432,16 @@ done:
return ret;
  }

 +/*
 + * The primary user is page fault path which call it to properly
 + * release noslot_pfn.
 + */
 +static void mmu_release_pfn_clean(pfn_t pfn)
 +{
 +  if (!is_error_pfn(pfn))
 +  kvm_release_pfn_clean(pfn);
 +}
 +
>>>
>>> Too many APIs, each slightly different.  How do I know which one to call?
>>
>> It is only used in mmu and it is a static function.
> 
> Still, how do I know which one to call?  The name tells me nothing.
> When I read the code, how do I know if a call is correct or not?
> 
>>
>>>
>>> Please change kvm_release_pfn_*() instead, calling some arch hook (or
>>> even #ifdef CONFIG_KVM_HAS_FAST_MMIO) to check for the special case.
>>
>> We only need to call it on page fault path. If we change the common API
>> other x86 components have to suffer from it.
> 
> This way, I have to suffer from it.

Sorry. :(

> 
> btw, what about another approach, to avoid those paths completely?
> Avoid calling __direct_map() with error_pfn, and jump to a label after
> kvm_release_pfn_clean() in page_fault(), etc?

I will try it.




Re: [User question] Huge buffer size on KVM host

2012-09-10 Thread Martin Wawro
On 08/16/2012 04:57 PM, Avi Kivity wrote:

Hi Avi,

No, there was no reason and we disabled it there too. Interestingly, the
buffer size did not go down significantly, even when manually flushing the
pages using /proc/sys/vm/drop_caches (3). Finally, after rebooting the host,
the buffer size was back to normal again, and with all caches disabled we
were also able to manually drop the buffered pages.

However, after 2 weeks or so, we encountered significant instabilities in the
system. Inside the guest OS, the load went up like crazy (way past 30) and
the guest was virtually unusable. Checking the host, buffer size, load,
memory and I/O were all in a well-defined range.

When rebooting the guest OS, the problem manifested again after about
20-30 minutes (reproducible a couple of times). The only thing that worked to
stop this (I am afraid only for a couple of weeks or so) was rebooting the
host system. On the guest and the host there was nothing suspicious in the
logs or dmesg output.


Best regards,

Martin

> Hi Avi,
>
> It seems that the kernel on that particular machine is too old, those entries 
> are
> not featured. We checked on a comparable setup with a newer kernel and both 
> entries
> were set to 512.
>
> We also did have a third more thorough look on the caching. It turns out that 
> the
> virt-manager does not seem to honor the caching adjusted in the GUI correctly.
> We disabled caching on all virtual devices for this particular VM and checking
> with "ps -fxal" revealed, that only one of those devices (and a rather small 
> one too)
> had this set. We corrected this in the XML file directly and the buffer size
> currently resides at around 1.8 GB after rebooting the VM (the only virtio 
> device
> not having the cache=none option set is now the (non-mounted) cdrom).
>
> cc += libvirt-list
>
> Is there a reason that cdroms don't get cache=none?
>
>


-- 

Martin Wawro   | Digital Medics GmbH 
Managing Director  |  Otto-Hahn-Str. 15, 44227 Dortmund, Germany
Tel. +49-231-9742-6622 |  Fax: +49-231-9742-6623
Key: 0xB0A225BD|Registered at AG Dortmund, HRB 19360




Re: [PATCH 3/3] support readonly memory feature in qemu

2012-09-10 Thread Jan Kiszka
On 2012-09-09 17:45, Avi Kivity wrote:
> On 09/07/2012 11:50 AM, Jan Kiszka wrote:
>>
>>> +} else {
>>> +cpu_physical_memory_rw(run->mmio.phys_addr,
>>> +   run->mmio.data,
>>> +   run->mmio.len,
>>> +   run->mmio.is_write);
>>> +}
>>> +
>>>  ret = 0;
>>>  break;
>>>  case KVM_EXIT_IRQ_WINDOW_OPEN:
>>>
>>
>> Great to see this feature for KVM finally! I'm just afraid that this
>> will finally break good old isapc - due to broken Seabios. KVM used to
>> "unbreak" it as it didn't respect write protections. ;)
> 
> Can you describe the breakage?

Try "qemu -machine isapc [-enable-kvm]". Seabios is writing to some
read-only marked area. Don't recall where precisely.

Jan



Re: [PATCH v2] KVM: trace the events of mmu_notifier

2012-09-10 Thread Xiao Guangrong
On 09/10/2012 05:09 PM, Avi Kivity wrote:
> On 09/07/2012 09:16 AM, Xiao Guangrong wrote:
>> mmu_notifier is the interface used to broadcast mm events to KVM; the
>> tracepoints introduced in this patch can trace all these events, which is
>> very helpful for noticing and fixing bugs caused by mm.
> 
> There is nothing kvm specific here.  Perhaps this can be made generic
> (with a mm parameter so we can filter by process).

Hmm, I would like to put these tracepoints inside the mmu-lock; then we can
clearly see the ordering between mm and the kvm mmu. It is useful for
detecting issues/races between them.



Re: [PATCH v3 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Jan Kiszka
On 2012-09-09 16:13, Avi Kivity wrote:
> On 09/06/2012 11:44 AM, Jan Kiszka wrote:
>> On 2012-08-30 20:30, Jan Kiszka wrote:
>>> This adds PCI device assignment for i386 targets using the classic KVM
>>> interfaces. This version is 100% identical to what is being maintained
>>> in qemu-kvm for several years and is supported by libvirt as well. It is
>>> expected to remain relevant for another couple of years until kernels
>>> without full-features and performance-wise equivalent VFIO support are
>>> obsolete.
>>>
>>> A refactoring to-do that should be done in-tree is to model MSI and
>>> MSI-X support via the generic PCI layer, similar to what VFIO is already
>>> doing for MSI-X. This should improve the correctness and clean up the
>>> code from duplicate logic.
>>>
>>> Signed-off-by: Jan Kiszka 
>>> ---
>>>
>>> Changes in v3:
>>>  - addressed comment by Peter (changed device name to kvm-pci-assign +
>>>alias)
>>>  - addressed (most) comments by Michael
>>>  - fixed INT pin regression
>>
>> Does someone _disagree_ that there are no open (and reasonably solvable)
>> issues and that this can now be merged through uq/master?
> 
> Is patch 4 the only one that is at v3, and the rest are to be taken from
> the original posting?

That is correct.

Jan



Re: [PATCH uq/master] kvm: Clean up irqfd API

2012-09-10 Thread Jan Kiszka
On 2012-09-09 16:01, Avi Kivity wrote:
> On 08/20/2012 11:55 AM, Jan Kiszka wrote:
>> No need to expose the fd-based interface, everyone will already be fine
>> with the more handy EventNotifier variant. Rename the latter to clarify
>> that we are still talking about irqfds here.
> 
> Thanks, applied.
> 
>>  
>> -int kvm_irqchip_add_irqfd(KVMState *s, int fd, int virq);
>> -int kvm_irqchip_remove_irqfd(KVMState *s, int fd, int virq);
>> -int kvm_irqchip_add_irq_notifier(KVMState *s, EventNotifier *n, int virq);
>> -int kvm_irqchip_remove_irq_notifier(KVMState *s, EventNotifier *n, int 
>> virq);
>> +int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n, int virq);
>> +int kvm_irqchip_remove_irqfd_notifier(KVMState *s, EventNotifier *n, int 
>> virq);
>>  #endif
> 
> Those names aren't particularly satisfying.  add_irqfd_notifier implies
> you want to be notified about irqfd events, but that's not what the
> function does.  Not sure what a good name would be.

Now that there are no more variants, we could also drop the "notifier"
from the name again. Better?

Jan



[ANNOUNCE] qemu-kvm-1.1.2

2012-09-10 Thread Avi Kivity
qemu-kvm-1.1.2 is now available. This release is based on the upstream
qemu 1.1.2, plus kvm-specific enhancements. Please see the
original QEMU 1.1.2 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.5.

http://www.linux-kvm.org

[1] http://lists.nongnu.org/archive/html/qemu-devel/2012-09/msg00587.html


[ANNOUNCE] qemu-kvm-1.2.0

2012-09-10 Thread Avi Kivity
qemu-kvm-1.2.0 is now available. This release is based on the upstream
qemu 1.2.0, plus kvm-specific enhancements. Please see the
original QEMU 1.2.0 release announcement [1] for details.

This release can be used with the kvm kernel modules provided by your
distribution kernel, or by the modules in the kvm-kmod package, such
as kvm-kmod-3.5.

http://www.linux-kvm.org

[1] http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg00506.html


Re: [ANNOUNCE] qemu-kvm-1.2.0

2012-09-10 Thread Jan Kiszka
On 2012-09-10 11:53, Avi Kivity wrote:
> qemu-kvm-1.2.0 is now available. This release is based on the upstream
> qemu 1.2.0, plus kvm-specific enhancements. Please see the
> original QEMU 1.2.0 release announcement [1] for details.

To be more precise about the kvm-specific enhancements: The only major
difference is now PCI passthrough. Beyond that, there are smaller
differences like old special command-line switches (for which upstream
equivalents exist), some adjustments that allow migration from older
qemu-kvm versions (cannot be merged upstream), and the test device for
kvm unit tests. That's it.

IOW: Anyone who can reboot his/her guests for an upgrade and doesn't
depend on PCI passthrough can safely use upstream QEMU 1.2.

Jan

> 
> This release can be used with the kvm kernel modules provided by your
> distribution kernel, or by the modules in the kvm-kmod package, such
> as kvm-kmod-3.5.
> 
> http://www.linux-kvm.org
> 
> [1] http://lists.gnu.org/archive/html/qemu-devel/2012-09/msg00506.html



qemu-kvm log

2012-09-10 Thread Liu, Jinsong
Hi,

I'm currently debugging a qemu-kvm issue. I added some print code like
'fprintf(stderr, ...)', but I fail to see any output on stdio. Can anyone
tell me where the qemu-kvm logfile is, or what I need to do to record my
fprintf output?

Thanks,
Jinsong


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Jason Wang

On 09/10/2012 02:33 PM, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:27:38AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 09:16:29AM +0300, Michael S. Tsirkin wrote:

On Mon, Sep 10, 2012 at 11:42:25AM +0930, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

This only works sometimes.  For example it's common to pin netperf to a
cpu to get consistent performance.  Proper hardware must obey what
applications want it to do, not the other way around.


For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/

I think you missed this:

Some network interfaces can help with the distribution of incoming
packets; they have multiple receive queues and multiple interrupt lines.
Others, though, are equipped with a single queue, meaning that the
driver for that hardware must deal with all incoming packets in a
single, serialized stream. Parallelizing such a stream requires some
intelligence on the part of the host operating system.

In other words RPS is a hack to speed up networking on cheapo hardware;
this is one of the reasons it is off by default.
Good hardware has multiple receive queues.
We can implement a good one so we do not need RPS.

Also not all guest OS-es support RPS.

Does this clarify?

I would like to add that on many processors, sending
IPIs between guest CPUs requires exits on the sending *and*
receiving paths, making it very expensive.

A final addition: what you suggest above would be
"TX follows RX", right?
It is in anticipation of something like that, that I made
steering programming so generic.
I think TX follows RX is more immediately useful for reasons above
but we can add both to spec and let drivers and devices
decide what they want to support.
Pls let me know.


AFAIK, ixgbe does "rx follows tx". The only difference between ixgbe and 
virtio-net is that the ixgbe driver programs the flow director during packet 
transmission, while we suggest doing it silently in the device for 
simplicity. Even with this, more co-operation is still needed from the 
driver (e.g. ixgbe tries to use a per-cpu queue by setting the affinity hint 
and using the cpuid to choose the txq, which could be reused in the 
virtio-net driver).



---
Transmit Packet Steering

When VIRTIO_NET_F_MULTIQUEUE feature bit is negotiated, guest can use any of 
multiple configured transmit queues to transmit a given packet. To avoid packet 
reordering by device (which generally leads to performance degradation) driver 
should attempt to utilize the same transmit virtqueue for all packets of a 
given transmit flow. For bi-directional protocols (in practice, TCP), a given 
network connection can utilize both transmit and receive queues. For best 
performance, packets from a single connection should utilize the paired 
transmit and receive queues from the same virtqueue pair; for example both 
transmitqN and receiveqN. This rule makes it possible to optimize processing on 
the device side, but this is not a hard requirement: devices should function 
correctly even when this rule is not followed.

Driver selects an active steering rule using VIRTIO_NET_CTRL_STEERING command 
(this controls both which virtqueue is selected for a given packet for receive 
and notifies the device which virtqueues are about to be used for transmit).

This command accepts a single out argument in the following format:

#define VIRTIO_NET_CTRL_STEERING   4

The field rule specifies the function used to select the transmit virtqueue 
for a given packet; the field param makes it possible to pass an extra 
parameter if appropriate. When rule is set to VIRTIO_NET_CTRL_STEERING_SINGLE 
(the default), all packets are steered to the default virtqueue transmitq (1) 
and param is unused. When rule is set to 
VIRTIO_NET_CTRL_STEERING_RX_FOLLOWS_TX, packets are steered by the driver to 
the first N=(param+1) multiqueue virtqueues transmitq1...transmitqN; the 
default transmitq is unused. Driver must have configured all these (param+1) 
virtqueues beforehand.

Supported steering rules can be added and removed in the future. Driver should 
check that the request to change the steering rule was successful by checking 
ack values of the command. As selecting a specific steering is an optimization 
feature, drivers should avoid hard failure and fall back on using a supported 
steering rule if this command fails. The default steering rule is 
VIRTIO_NET_CTRL_STEERING_SINGLE. It will not be removed.

When the steering rule 

Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/06/12 16:58, Avi Kivity wrote:

On 08/22/2012 06:06 PM, Peter Lieven wrote:

Hi,

has anyone ever tested to run memtest with -cpu host flag passed to
qemu-kvm?
For me it resets when probing the chipset. With -cpu qemu64 it works
just fine.

Maybe this is specific to memtest, but it might be something that can happen
in other applications too.

Any thoughts?

Try to identify the cpu flag that causes this by removing them
successively (-cpu host,-flag...).  Alternatively capture a trace
(http://www.linux-kvm.org/page/Tracing) look for TRIPLE_FAULT (Intel),
and post the few hundred lines preceding it.


Here we go:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EXCEPTION_NMI 
rip 0xd185 info 0 8307

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: load
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: unload
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfe0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfe size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EXCEPTION_NMI 
rip 0xd185 info 0 8307

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: load
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_fpu: unload
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc60 info cf80003 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_write at 0xcf8 
size 4 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason IO_INSTRUCTION 
rip 0xcc29 info cfc0009 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=52265 csbase=0 len=2 insn=fí%ÿÿ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_pio: pio_read at 0xcfc size 
2 count 1
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_userspace_exit: reason 
KVM_EXIT_IO (2)

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason EPT_MISCONFIG 
rip 0x86e0 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_emulate_insn: [FAILED TO 
PARSE] rip=34528 csbase=0 len=3 insn=ˆF@Š„Òuõ‰L$ flags=5 failed=0
qemu-kvm-1.0.1-5107 [007] 410771.148000: vcpu_match_mmio: gva 0xb873c 
gpa 0xb873c Write GPA
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_mmio: mmio write len 1 gpa 
0xb873c val 0x6f

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: 

Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 13:06, Peter Lieven ha scritto:
> 
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
> 0x11478 info 0 0
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
> rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
unsigned int msr_lo, msr_hi;
float coef;

/* Find multiplier (by MSR) */
/* First, check if Flexible Ratio is Enabled */
rdmsr(0x194, msr_lo, msr_hi);
if((msr_lo >> 16) & 1){
coef = (msr_lo >> 8) & 0xFF;
 } else {
rdmsr(0xCE, msr_lo, msr_hi);
coef = (msr_lo >> 8) & 0xFF;
 }

return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
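
One possible direction (illustrative only; the constant name is assumed and
this is not an applied patch) would be to let the guest read this MSR as zero
instead of taking a #GP, e.g. in kvm_get_msr_common():

#define MSR_FLEX_RATIO	0x194	/* assumed name, not defined in kvm here */

	case MSR_FLEX_RATIO:
		/* report no flex-ratio override; note that memtest will then
		 * fall back to rdmsr(0xCE), which needs the same treatment */
		data = 0;
		break;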


Re: [RFC v2 PATCH 00/21] KVM: x86: CPU isolation and direct interrupts delivery to guests

2012-09-10 Thread Tomoki Sekiyama
Hi Jan,

On 2012/09/07 17:26, Jan Kiszka wrote:

> On 2012-09-06 13:27, Tomoki Sekiyama wrote:
>> This RFC patch series provides facility to dedicate CPUs to KVM guests
>> and enable the guests to handle interrupts from passed-through PCI devices
>> directly (without VM exit and relay by the host).
>>
>> With this feature, we can improve the throughput and response time of the
>> device, as well as the host's CPU usage, by reducing the overhead of
>> interrupt handling. This is good for applications using very high
>> throughput / frequent interrupt devices (e.g. a 10GbE NIC).
>> Real-time applications also benefit from the CPU isolation feature, which
>> reduces interference from host kernel tasks and scheduling delay.
>>
>> The overview of this patch series is presented in CloudOpen 2012.
>> The slides are available at:
>> http://events.linuxfoundation.org/images/stories/pdf/lcna_co2012_sekiyama.pdf
> 
> One question regarding your benchmarks: If you measured against standard
> KVM, were the vCPU thread running on an isolcpus core of its own as
> well? If not, your numbers about the impact of these patches on maximum,
> maybe also average latencies are likely too good. Also, using a non-RT
> host and possibly a non-prioritized vCPU thread for the standard
> scenario looks like an unfair comparison.


In the standard KVM benchmark, the vCPU thread is pinned down to its own
CPU core. In addition, the vCPU thread and irq/*-kvm threads are both set
to the max priority with SCHED_RR policy.

As you said, an RT host may result in better max latencies, as below.
(But min/average latencies became worse; this might be an issue with our
setup.)
                                      Min / Avg / Max latencies
Normal KVM
  RT-host (3.4.4-rt14)               15us / 21us / 117us
  non RT-host (3.5.0-rc6)             6us / 11us / 152us
KVM + Direct IRQ
  non RT-host (3.5.0-rc6 + patch)     1us /  2us /  14us

Thanks,
-- 
Tomoki Sekiyama 
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 13:29, Paolo Bonzini wrote:

Il 10/09/2012 13:06, Peter Lieven ha scritto:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
0x11478 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
 unsigned int msr_lo, msr_hi;
 float coef;

 /* Find multiplier (by MSR) */
 /* First, check if Flexible Ratio is Enabled */
 rdmsr(0x194, msr_lo, msr_hi);
    if ((msr_lo >> 16) & 1) {
        coef = (msr_lo >> 8) & 0xFF;
    } else {
        rdmsr(0xCE, msr_lo, msr_hi);
        coef = (msr_lo >> 8) & 0xFF;
    }

 return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

I/O error.

Peter



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 13:29, Paolo Bonzini wrote:

Il 10/09/2012 13:06, Peter Lieven ha scritto:

qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
0x11478 info 0 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
rip 0x11478 info 0 0

Memory controller MSR:

static float getNHMmultiplier(void)
{
 unsigned int msr_lo, msr_hi;
 float coef;

 /* Find multiplier (by MSR) */
 /* First, check if Flexible Ratio is Enabled */
 rdmsr(0x194, msr_lo, msr_hi);
    if ((msr_lo >> 16) & 1) {
        coef = (msr_lo >> 8) & 0xFF;
    } else {
        rdmsr(0xCE, msr_lo, msr_hi);
        coef = (msr_lo >> 8) & 0xFF;
    }

 return coef;
}

Looks like we need to emulate it since memtest only looks at the cpuid
to detect an integrated memory controller.  What does this return for you?

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

It only works without the skip, but the msr device returns all zeroes.

peter



Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Avi Kivity
On 09/10/2012 02:29 PM, Paolo Bonzini wrote:
> Il 10/09/2012 13:06, Peter Lieven ha scritto:
>> 
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason MSR_READ rip
>> 0x11478 info 0 0
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_msr: msr_read 194 = 0x0 (#GP)
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_inj_exception: #GP (0x0)
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_entry: vcpu 0
>> qemu-kvm-1.0.1-5107 [007] 410771.148000: kvm_exit: reason TRIPLE_FAULT
>> rip 0x11478 info 0 0
> 
> Memory controller MSR:
> 
> static float getNHMmultiplier(void)
> {
> unsigned int msr_lo, msr_hi;
> float coef;
> 
> /* Find multiplier (by MSR) */
> /* First, check if Flexible Ratio is Enabled */
> rdmsr(0x194, msr_lo, msr_hi);
> if((msr_lo >> 16) & 1){
> coef = (msr_lo >> 8) & 0xFF;
>  } else {
> rdmsr(0xCE, msr_lo, msr_hi);
> coef = (msr_lo >> 8) & 0xFF;
>  }
> 
> return coef;
> }
> 

The SDM only mentions 0x194 as a machine check exception register
(recorded R12).

In the Architectural MSRs (how I love that name) section 18AH-197H are
listed as reserved.

In the Nehalem section they're not there at all.





Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Paolo Bonzini
Il 10/09/2012 13:52, Peter Lieven ha scritto:
>> dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
>> dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
> it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to 0x194 * 8 because bs is 8).  You can try this program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int rdmsr(int fd, long reg)
{
char msg[40];
long long val;
sprintf(msg, "rdmsr(%#x)", reg);
if (pread(fd, &val, 8, reg) < 0) {
perror(msg);
} else {
printf("%s: %#016llx\n", msg, val);
fflush(stdout);
}
}


int main()
{
int fd = open("/dev/cpu/0/msr", O_RDONLY);
if (fd < 0) { perror("open"); exit(1); }
rdmsr(fd, 0x194);
rdmsr(fd, 0xCE);
}

Paolo


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:
> Il 10/09/2012 13:52, Peter Lieven ha scritto:
> >> dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
> >> dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
> > it only works without the skip. but the msr device returns all zeroes.
> 
> Hmm, the strange API of the MSR device doesn't work well with dd (dd
> skips to 0x194 * 8 because bs is 8.  You can try this program:
> 
There is rdmsr/wrmsr in msr-tools.

> #include 
> #include 
> #include 
> 
> int rdmsr(int fd, long reg)
> {
> char msg[40];
> long long val;
> sprintf(msg, "rdmsr(%#x)", reg);
> if (pread(fd, &val, 8, reg) < 0) {
> perror(msg);
> } else {
> printf("%s: %#016llx\n", msg, val);
> fflush(stdout);
> }
> }
> 
> 
> int main()
> {
> int fd = open("/dev/cpu/0/msr", O_RDONLY);
> if (fd < 0) { perror("open"); exit(1); }
> rdmsr(fd, 0x194);
> rdmsr(fd, 0xCE);
> }
> 
> Paolo

--
Gleb.


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 14:21, Gleb Natapov wrote:

On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:

Il 10/09/2012 13:52, Peter Lieven ha scritto:

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to 0x194 * 8 because bs is 8.  You can try this program:


There is rdmsr/wrmsr in msr-tools.
rdmsr reports that it cannot read those MSRs, regardless of whether I use
-cpu host or -cpu qemu64.


peter

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int rdmsr(int fd, long reg)
{
 char msg[40];
 long long val;
 sprintf(msg, "rdmsr(%#x)", reg);
    if (pread(fd, &val, 8, reg) < 0) {
 perror(msg);
 } else {
 printf("%s: %#016llx\n", msg, val);
 fflush(stdout);
 }
}


int main()
{
 int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }
 rdmsr(fd, 0x194);
 rdmsr(fd, 0xCE);
}

Paolo

--
Gleb.




Re: [PATCH v3 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Avi Kivity
On 09/10/2012 12:26 PM, Jan Kiszka wrote:
>> 
>> Is patch 4 the only one that is at v3, and the rest are to be taken from
>> the original posting?
> 
> That is correct.

Thanks, applied to uq/master, will push shortly.




KVM call agenda for Tuesday, September 11th

2012-09-10 Thread Juan Quintela
Hi

Please send in any agenda items you are interested in covering.

Thanks, Juan.


Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Avi Kivity
On 09/10/2012 03:29 PM, Peter Lieven wrote:
> On 09/10/12 14:21, Gleb Natapov wrote:
>> On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:
>>> Il 10/09/2012 13:52, Peter Lieven ha scritto:
> dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
> dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd
 it only works without the skip. but the msr device returns all zeroes.
>>> Hmm, the strange API of the MSR device doesn't work well with dd (dd
>>> skips to 0x194 * 8 because bs is 8.  You can try this program:
>>>
>> There is rdmsr/wrmsr in msr-tools.
> rdmsr returns it cannot read those MSRs. regardless if I use -cpu host
> or -cpu qemu64.

On the host.




Re: memtest 4.20+ does not work with -cpu host

2012-09-10 Thread Peter Lieven

On 09/10/12 14:32, Avi Kivity wrote:

On 09/10/2012 03:29 PM, Peter Lieven wrote:

On 09/10/12 14:21, Gleb Natapov wrote:

On Mon, Sep 10, 2012 at 02:15:49PM +0200, Paolo Bonzini wrote:

Il 10/09/2012 13:52, Peter Lieven ha scritto:

dd if=/dev/cpu/0/msr skip=$((0x194)) bs=8 count=1 | xxd
dd if=/dev/cpu/0/msr skip=$((0xCE)) bs=8 count=1 | xxd

it only works without the skip. but the msr device returns all zeroes.

Hmm, the strange API of the MSR device doesn't work well with dd (dd
skips to 0x194 * 8 because bs is 8.  You can try this program:


There is rdmsr/wrmsr in msr-tools.

rdmsr returns it cannot read those MSRs. regardless if I use -cpu host
or -cpu qemu64.

On the host.

aaah ok:

#rdmsr -0 0x194
00011100
#rdmsr -0 0xce
0c0004011103
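
Plugging those values into the getNHMmultiplier() logic quoted earlier (just
arithmetic on the numbers above, not output from the tool):

/* 0x194 = 0x11100: bit 16 is set, so the flex ratio path is taken and
 * coef = (0x11100 >> 8) & 0xFF = 0x11 = 17.
 * The 0xCE fallback gives (0xc0004011103 >> 8) & 0xFF = 0x11 = 17 as well.
 * So memtest reads a multiplier of 17 on the host, while the same rdmsr
 * in the guest currently raises #GP and triple-faults. */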

Peter







[RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
Most interrupts are delivered to only one vcpu. Use pre-built tables to
find the interrupt destination instead of looping through all vcpus.

Signed-off-by: Gleb Natapov 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 64adb61..121f308 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
 };
 
+struct kvm_apic_map {
+   struct rcu_head rcu;
+   bool flat;
+   u8 ldr_bits;
+   struct kvm_lapic *phys_map[255];
+   struct kvm_lapic *logical_map[16][16];
+};
+
 struct kvm_arch {
unsigned int n_used_mmu_pages;
unsigned int n_requested_mmu_pages;
@@ -528,6 +536,8 @@ struct kvm_arch {
struct kvm_ioapic *vioapic;
struct kvm_pit *vpit;
int vapics_in_nmi_mode;
+   struct kvm_apic_map *apic_map;
+   struct mutex apic_map_lock;
 
unsigned int tss_addr;
struct page *apic_access_page;
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 07ad628..d18ddd2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic *apic)
(LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
 APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
 
+static inline int apic_x2apic_mode(struct kvm_lapic *apic)
+{
+   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
+}
+
+
 static inline int kvm_apic_id(struct kvm_lapic *apic)
 {
return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
 }
 
+static void rcu_free_apic_map(struct rcu_head *head)
+{
+   struct kvm_apic_map *p = container_of(head,
+   struct kvm_apic_map, rcu);
+
+   kfree(p);
+}
+
+static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
+   u16 *cid, u16 *lid)
+{
+   if (ldr_bits == 32) {
+   *cid = ldr >> 16;
+   *lid = ldr & 0x;
+   } else {
+   ldr = GET_APIC_LOGICAL_ID(ldr);
+
+   if (flat) {
+   *cid = 0;
+   *lid = ldr;
+   } else {
+   *cid = ldr >> 4;
+   *lid = ldr & 0xf;
+   }
+   }
+}
+
+static inline int recalculate_apic_map(struct kvm *kvm)
+{
+   struct kvm_apic_map *new, *old = NULL;
+   struct kvm_vcpu *vcpu;
+   int i;
+
+   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
+
+   if (!new)
+   return -ENOMEM;
+
+   new->ldr_bits = 8;
+   new->flat = true;
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   u16 cid, lid;
+   struct kvm_lapic *apic = vcpu->arch.apic;
+
+   if (!kvm_apic_present(vcpu))
+   continue;
+
+   if (apic_x2apic_mode(apic)) {
+   new->ldr_bits = 32;
+   new->flat = false;
+   } else if (kvm_apic_sw_enabled(apic) && new->flat &&
+   kvm_apic_get_reg(apic, APIC_DFR) == APIC_DFR_CLUSTER)
+   new->flat = false;
+
+   new->phys_map[kvm_apic_id(apic)] = apic;
+   kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
+   new->flat, new->ldr_bits, &cid, &lid);
+
+   if (lid)
+   new->logical_map[cid][ffs(lid) - 1] = apic;
+   }
+   mutex_lock(&kvm->arch.apic_map_lock);
+   old = kvm->arch.apic_map;
+   rcu_assign_pointer(kvm->arch.apic_map, new);
+   mutex_unlock(&kvm->arch.apic_map_lock);
+
+   if (old)
+   call_rcu(&old->rcu, rcu_free_apic_map);
+
+   return 0;
+}
+
+static inline int kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
+{
+   apic_set_reg(apic, APIC_ID, id << 24);
+   return recalculate_apic_map(apic->vcpu->kvm);
+}
+
+static inline int kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
+{
+   apic_set_reg(apic, APIC_LDR, id);
+   return recalculate_apic_map(apic->vcpu->kvm);
+}
+
 static inline int apic_lvt_enabled(struct kvm_lapic *apic, int lvt_type)
 {
return !(kvm_apic_get_reg(apic, lvt_type) & APIC_LVT_MASKED);
@@ -193,11 +283,6 @@ void kvm_apic_set_version(struct kvm_vcpu *vcpu)
apic_set_reg(apic, APIC_LVR, v);
 }
 
-static inline int apic_x2apic_mode(struct kvm_lapic *apic)
-{
-   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
-}
-
 static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
@@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct kvm_lapic *source,
return result;
 }
 
+bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
+   struct kvm_lapic_irq *irq, int *r)
+{
+   unsigned l

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote:
> > 
> > signed-off-by: Andrew Theurer 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..c767915 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> > preempt)
> > 
> >  again:
> > p_rq = task_rq(p);
> > +   if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags &
> > PF_VCPU)) {
> > +   goto out_no_unlock;
> > +   }
> > double_rq_lock(rq, p_rq);
> > while (task_rq(p) != p_rq) {
> > double_rq_unlock(rq, p_rq);
> > @@ -4856,8 +4859,6 @@ again:
> > if (curr->sched_class != p->sched_class)
> > goto out;
> > 
> > -   if (task_running(p_rq, p) || p->state)
> > -   goto out;
> 
> Is it possible that by this time the current thread takes double rq
> lock, thread p could actually be running?  i.e is there merit to keep
> this check around even with your similar check above?

I think that's a good idea.  I'll add that back in.
> 
> > 
> > yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> > if (yielded) {
> > @@ -4879,6 +4880,7 @@ again:
> > 
> >  out:
> > double_rq_unlock(rq, p_rq);
> > +out_no_unlock:
> > local_irq_restore(flags);
> > 
> > if (yielded)
> > 
> > 
> 

-Andrew Theurer




Re: [Qemu-devel] qemu-kvm log

2012-09-10 Thread Avi Kivity
On 09/10/2012 01:44 PM, Liu, Jinsong wrote:
> Hi,
> 
> I'm recently debugging a qemu-kvm issue. I add some print code like 
> 'fprintf(stderr, ...)', however I fail to see any info at stdio. Anyone can 
> tell me where is qemu-kvm logfile, or, what I need do to record my fprintf 
> info?


If you're running via libvirt, the log is in /var/log/libvirt/qemu.  If
you're running from the command line, the printf()s should end up on your
terminal.


-- 
error compiling committee.c: too many arguments to function


RE: [Qemu-devel] qemu-kvm log

2012-09-10 Thread Liu, Jinsong
Avi Kivity wrote:
> On 09/10/2012 01:44 PM, Liu, Jinsong wrote:
>> Hi,
>> 
>> I'm recently debugging a qemu-kvm issue. I add some print code like
>> 'fprintf(stderr, ...)', however I fail to see any info at stdio.
>> Anyone can tell me where is qemu-kvm logfile, or, what I need do to
>> record my fprintf info?   
> 
> 
> If you're running via libvirt, the log is in /var/log/libvirt/qemu. 
> If you're running from the command line, the printf()s should end up on
> your terminal.

Thanks!

I run qemu-kvm from the command line, but since SDL could not be initialized, I run
the qemu command via VNC, like:
1). test kvm-unit-tests: qemu-system-x86_64 -device testdev,chardev=testlog 
-chardev file,id=testlog,path=apic.out -serial stdio -kernel ./x86/apic.flat 
-cpu host
2). test kvm guest: qemu-system-x86_64 -smp2 -m 512 -hda test.qcow.img -cpu host
For case 1, there is some printf output from kvm-unit-tests on the terminal but no
qemu-kvm printf output;
for case 2, there is no qemu-kvm printf output on the terminal.

Best Regards,
Jinsong


Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 04:09:15PM +0300, Gleb Natapov wrote:
> Most interrupts are delivered to only one vcpu. Use pre-built tables to
> find the interrupt destination instead of looping through all vcpus.
> 
> Signed-off-by: Gleb Natapov 

Looks good overall. I think I see some bugs, with rcu use and others.
The most suspicious thing is that the code seems to use rcu but
there are no calls to rcu_dereference anywhere.
Pls see below.

Thanks for looking into this; once it is ready I intend to rework
direct injection to work on top of this.


> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 64adb61..121f308 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
>   struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
>  };
>  
> +struct kvm_apic_map {
> + struct rcu_head rcu;
> + bool flat;
> + u8 ldr_bits;
> + struct kvm_lapic *phys_map[255];

Not 256? The apic id is sometimes used in the lookup - is it guaranteed
that the guest cannot set it to 0xff? If it can, this will overflow.

> + struct kvm_lapic *logical_map[16][16];


These are large arrays: 2 Kbytes each, which is bad
for the cache.
Is it not better to store the vcpu index there?
That would reduce the cache footprint by 8x.

> +};
> +
>  struct kvm_arch {
>   unsigned int n_used_mmu_pages;
>   unsigned int n_requested_mmu_pages;
> @@ -528,6 +536,8 @@ struct kvm_arch {
>   struct kvm_ioapic *vioapic;
>   struct kvm_pit *vpit;
>   int vapics_in_nmi_mode;
> + struct kvm_apic_map *apic_map;
> + struct mutex apic_map_lock;
>  
>   unsigned int tss_addr;
>   struct page *apic_access_page;
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 07ad628..d18ddd2 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic *apic)
>   (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
>APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
>  
> +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> +{
> + return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> +}
> +
> +

Two empty lines are not needed.

>  static inline int kvm_apic_id(struct kvm_lapic *apic)
>  {
>   return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
>  }
>  
> +static void rcu_free_apic_map(struct rcu_head *head)
> +{
> + struct kvm_apic_map *p = container_of(head,
> + struct kvm_apic_map, rcu);

why break this line? it is not too long as is.

> +
> + kfree(p);
> +}
> +
> +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> + u16 *cid, u16 *lid)
> +{
> + if (ldr_bits == 32) {
> + *cid = ldr >> 16;
> + *lid = ldr & 0x;
> + } else {
> + ldr = GET_APIC_LOGICAL_ID(ldr);
> +
> + if (flat) {
> + *cid = 0;
> + *lid = ldr;
> + } else {
> + *cid = ldr >> 4;
> + *lid = ldr & 0xf;
> + }
> + }
> +}
> +
> +static inline int recalculate_apic_map(struct kvm *kvm)

__must_check?

> +{
> + struct kvm_apic_map *new, *old = NULL;
> + struct kvm_vcpu *vcpu;
> + int i;
> +
> + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> +

empty line not needed here :)

> + if (!new)
> + return -ENOMEM;
> +
> + new->ldr_bits = 8;
> + new->flat = true;
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + u16 cid, lid;
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (apic_x2apic_mode(apic)) {
> + new->ldr_bits = 32;
> + new->flat = false;
> + } else if (kvm_apic_sw_enabled(apic) && new->flat &&
> + kvm_apic_get_reg(apic, APIC_DFR) == 
> APIC_DFR_CLUSTER)
> + new->flat = false;
> +
> + new->phys_map[kvm_apic_id(apic)] = apic;
> + kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> + new->flat, new->ldr_bits, &cid, &lid);
> +
> + if (lid)
> + new->logical_map[cid][ffs(lid) - 1] = apic;
> + }
> + mutex_lock(&kvm->arch.apic_map_lock);
> + old = kvm->arch.apic_map;

rcu_dereference_protected?

> + rcu_assign_pointer(kvm->arch.apic_map, new);
> + mutex_unlock(&kvm->arch.apic_map_lock);
> +
> + if (old)
> + call_rcu(&old->rcu, rcu_free_apic_map);

What guarantees that rcu_free_apic_map is called before the module goes away?
I suspect we need at least an rcu_barrier here:
https://lwn.net/Articles/217484/
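
Something along these lines in the module teardown path would close that
hole (a sketch only, the exit-function name is made up), or the module-local
callback can be avoided entirely with kfree_rcu():

static void __exit kvm_exit_example(void)      /* illustrative name only */
{
        /*
         * Wait for every call_rcu(&old->rcu, rcu_free_apic_map) queued by
         * recalculate_apic_map() to run before the module text (and with it
         * rcu_free_apic_map) disappears.
         */
        rcu_barrier();
        /* ... rest of the teardown ... */
}

        /* or drop rcu_free_apic_map altogether: */
        if (old)
                kfree_rcu(old, rcu);    /* no module-local callback left pending */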

Also, this is done upon guest request?
This allocates 2K and delays free until next rcu grace
period. Guest can buffer up *a lot of* host kernel
memory before this 

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Raghavendra K T

On 09/08/2012 01:12 AM, Andrew Theurer wrote:

On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:

CCing PeterZ also.

On 09/07/2012 06:41 PM, Andrew Theurer wrote:

I have noticed recently that PLE/yield_to() is still not that scalable
for really large guests, sometimes even with no CPU over-commit.  I have
a small change that make a very big difference.

[...]

We are indeed avoiding CPUs in guest mode when we check
task->flags & PF_VCPU in the vcpu_on_spin path.  Doesn't that suffice?

My understanding is that it checks if the candidate vcpu task is in
guest mode (let's call this vcpu g1vcpuN), and that vcpu will not be a
target to yield to if it is already in guest mode.  I am concerned about
a different vcpu, possibly from a different VM (let's call it g2vcpuN),
but it also located on the same runqueue as g1vcpuN -and- running.  That
vcpu, g2vcpuN, may also be doing a directed yield, and it may already be
holding the rq lock.  Or it could be in guest mode.  If it is in guest
mode, then let's still target this rq, and try to yield to g1vcpuN.
However, if g2vcpuN is not in guest mode, then don't bother trying.


- If a non-vcpu task was currently running, this change can ignore a
request to yield to a target vcpu. The target vcpu could be the most
eligible vcpu, causing other vcpus to do PLE exits.

Is it possible to modify the check to deal with only vcpu tasks?

- Should we use p_rq->cfs_rq->skip instead to let us know that some 
yield was active at this time?


-

Cpu 1                      cpu2                          cpu3
a1                         a2                            a3
b1                         b2                            b3
                           c2 (yield target of a1)       c3 (yield target of a2)

If vcpu a1 is doing a directed yield to vcpu c2 while the current vcpu a2 on the target
cpu is also doing a directed yield (to some vcpu c3), then this change
will only allow vcpu a2 to do a schedule() to b2 (if the a2 -> c3 yield is
successful). Do we miss yielding to vcpu c2?
a1 might not find a suitable vcpu to yield to and might go back to
spinning. Is my understanding correct?



Patch include below.

Here's the new, v2 result with the previous two:

10 VMs, 16-way each, all running dbench (2x cpu over-commit)
 throughput +/- stddev
  - -
ple on:   2552 +/- .70%
ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
ple on: w/fixv2:  6115*   (139% improvement)



The numbers look great.


[*] I do not have stdev yet because all 10 runs are not complete

for v1 to v2, host CPU dropped from 60% to 50%.  Time in spin_lock() is
also dropping:


[...]


So this seems to be working.  However I wonder just how far we can take
this.  Ideally we need to be in <3-4% in host for PLE work, like I
observe for the 8-way VMs.  We are still way off.

-Andrew


signed-off-by: Andrew Theurer

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..c767915 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
preempt)

  again:
p_rq = task_rq(p);
+   if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags&
PF_VCPU)) {


While we are checking the flags of p_rq->curr task, the task p can 
migrate to some other runqueue. In this case will we miss yielding to 
the most eligible vcpu?



+   goto out_no_unlock;
+   }


Nit:
We don't need the parentheses above.



Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Avi Kivity
On 09/10/2012 04:09 PM, Gleb Natapov wrote:
> Most interrupts are delivered to only one vcpu. Use pre-built tables to
> find the interrupt destination instead of looping through all vcpus.
> 
> +
> +static inline int recalculate_apic_map(struct kvm *kvm)
> +{
> + struct kvm_apic_map *new, *old = NULL;
> + struct kvm_vcpu *vcpu;
> + int i;
> +
> + new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> +
> + if (!new)
> + return -ENOMEM;
> +
> + new->ldr_bits = 8;
> + new->flat = true;
> + kvm_for_each_vcpu(i, vcpu, kvm) {
> + u16 cid, lid;
> + struct kvm_lapic *apic = vcpu->arch.apic;
> +
> + if (!kvm_apic_present(vcpu))
> + continue;
> +
> + if (apic_x2apic_mode(apic)) {
> + new->ldr_bits = 32;
> + new->flat = false;
> + } else if (kvm_apic_sw_enabled(apic) && new->flat &&
> + kvm_apic_get_reg(apic, APIC_DFR) == 
> APIC_DFR_CLUSTER)
> + new->flat = false;
> +
> + new->phys_map[kvm_apic_id(apic)] = apic;
> + kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> + new->flat, new->ldr_bits, &cid, &lid);
> +
> + if (lid)
> + new->logical_map[cid][ffs(lid) - 1] = apic;
> + }
> + mutex_lock(&kvm->arch.apic_map_lock);
> + old = kvm->arch.apic_map;
> + rcu_assign_pointer(kvm->arch.apic_map, new);
> + mutex_unlock(&kvm->arch.apic_map_lock);

is this not racy?

cpu 0                   cpu 1
-----                   -----
new apic id
recalculate
                        new apic id
                        recalculate
                        take lock
                        install map
                        drop lock
take lock
install map
drop lock


> +
> + if (old)
> + call_rcu(&old->rcu, rcu_free_apic_map);
> +
> + return 0;
> +}
> +

> -
>  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
>   LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
>   LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
> @@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
> kvm_lapic *source,
>   return result;
>  }
>  
> +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> + struct kvm_lapic_irq *irq, int *r)
> +{
> + unsigned long bitmap = 1;
> + struct kvm_lapic **dst;
> + int i;
> +
> + if (irq->shorthand == APIC_DEST_SELF) {
> + *r = kvm_apic_set_irq(src->vcpu, irq);
> + return true;
> + }
> +
> + if (irq->shorthand)
> + return false;
> +
> + if (irq->dest_mode == 0) { /* physical mode */
> + if (irq->delivery_mode == APIC_DM_LOWEST ||
> + irq->dest_id == 0xff)
> + return false;
> + rcu_read_lock();

Unmatched rcu_read_lock() inside a block is ugly, please start it earlier.

> + dst = &kvm->arch.apic_map->phys_map[irq->dest_id & 0xff];

Like mst says, rcu_dereference().

> + } else {
> + u16 cid, lid;
> + u32 mda = irq->dest_id;
> +
> + rcu_read_lock();
> + if (kvm->arch.apic_map->ldr_bits == 8)
> + mda <<= 24;
> +
> + kvm_apic_get_logical_id(mda, kvm->arch.apic_map->flat,
> + kvm->arch.apic_map->ldr_bits, &cid, &lid);
> + dst = kvm->arch.apic_map->logical_map[cid];
> +
> + bitmap = lid;
> + if (irq->delivery_mode == APIC_DM_LOWEST) {
> + int l = -1;
> + for_each_set_bit(i, &bitmap, 16) {
> + if (!dst[i])
> + continue;
> + if (l < 0)
> + l = i;
> + else if (kvm_apic_compare_prio(dst[i]->vcpu, 
> dst[l]->vcpu) < 0)
> + l = i;
> + }
> +
> + bitmap = (l >= 0) ? 1 << l : 0;
> + }
> + }
> +
> + for_each_set_bit(i, &bitmap, 16) {
> + if (!dst[i])
> + continue;
> + if (*r < 0)
> + *r = 0;
> + *r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> + }
> +
> + rcu_read_unlock();
> + return true;
> +}
> +
>  /*
>   * Add a pending IRQ into lapic.
>   * Return 1 if successfully added and 0 if discarded.
> @@ -880,7 +1026,7 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 
> reg, u32 val)
>   switch (reg) {
>   case APIC_ID:   /* Local APIC ID */
>   if (!apic_x2apic_mode(apic))
> - apic_set_reg(apic, APIC_ID, val);
> + kvm_apic_set_id(apic, val >> 24);
>   else
>   ret = 1;
> 

[PULL] KVM updates

2012-09-10 Thread Avi Kivity
Please pull from:

  git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master

to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
assignment.  With this there are no significant changes left between qemu and
qemu-kvm (though some work remains).


Christian Borntraeger (1):
  qemu: Use valgrind annotations to mark kvm guest memory as defined

Jan Kiszka (5):
  kvm: Clean up irqfd API
  kvm: Introduce kvm_irqchip_update_msi_route
  kvm: Introduce kvm_has_intx_set_mask
  kvm: i386: Add services required for PCI device assignment
  kvm: i386: Add classic PCI device assignment

 configure  |3 +-
 hw/kvm/Makefile.objs   |2 +-
 hw/kvm/pci-assign.c| 1915 
 hw/qdev-monitor.c  |1 +
 hw/virtio-pci.c|4 +-
 kvm-all.c  |   75 +-
 kvm-stub.c |   14 +-
 kvm.h  |8 +-
 target-i386/kvm.c  |  141 
 target-i386/kvm_i386.h |   22 +
 10 files changed, 2151 insertions(+), 34 deletions(-)
 create mode 100644 hw/kvm/pci-assign.c

-- 
error compiling committee.c: too many arguments to function


Re: [PULL] KVM updates

2012-09-10 Thread Jan Kiszka
On 2012-09-10 17:25, Avi Kivity wrote:
> Please pull from:
> 
>   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
> 
> to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
> assignment.  With this there are no significant changes left between qemu and
> qemu-kvm (though some work remains).
> 
> 
> Christian Borntraeger (1):
>   qemu: Use valgrind annotations to mark kvm guest memory as defined
> 
> Jan Kiszka (5):
>   kvm: Clean up irqfd API
>   kvm: Introduce kvm_irqchip_update_msi_route
>   kvm: Introduce kvm_has_intx_set_mask
>   kvm: i386: Add services required for PCI device assignment
>   kvm: i386: Add classic PCI device assignment

uq/master used to contain two patches from Peter regarding the kernel
headers and my coalesced MMIO decoupling series - dropped intentionally?

Jan

> 
>  configure  |3 +-
>  hw/kvm/Makefile.objs   |2 +-
>  hw/kvm/pci-assign.c| 1915 
> 
>  hw/qdev-monitor.c  |1 +
>  hw/virtio-pci.c|4 +-
>  kvm-all.c  |   75 +-
>  kvm-stub.c |   14 +-
>  kvm.h  |8 +-
>  target-i386/kvm.c  |  141 
>  target-i386/kvm_i386.h |   22 +
>  10 files changed, 2151 insertions(+), 34 deletions(-)
>  create mode 100644 hw/kvm/pci-assign.c
> 

-- 
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux


Re: [Qemu-devel] [PATCH 4/4] kvm: i386: Add classic PCI device assignment

2012-09-10 Thread Andreas Färber
On 05.09.2012 22:46, Anthony Liguori wrote:
> What do we do when the FSF comes out with the GPLv4 and relicenses again
> in an incompatible fashion?  Do we do this exercise every couple of
> years?

That's exactly why I suggested GPLv2+ because it was supposed to be a
preparation for the future. I'm pretty sure I mentioned the potential
GPLv4 back then!

Andreas

-- 
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg, Germany
GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer; HRB 16746 AG Nürnberg


[GIT PULL] KVM updates for 3.6-rc5

2012-09-10 Thread Avi Kivity
Linus, please pull from the repo and branch at:

  git://git.kernel.org/pub/scm/virt/kvm/kvm.git tags/kvm-3.6-2

to receive a trio of KVM fixes: a fix for incorrect guest cpuid lookup,
an uninitialized-variable fix, and an error-path cleanup fix.

Shortlog/diffstat follow.


KVM updates for 3.6-rc5


Jamie Iles (1):
  KVM: PIC: fix use of uninitialised variable.

Ren, Yongjie (1):
  KVM: x86: Check INVPCID feature bit in EBX of leaf 7

Xiao Guangrong (1):
  KVM: fix error paths for failed gfn_to_page() calls

 arch/x86/kvm/i8259.c |  2 +-
 arch/x86/kvm/vmx.c   | 23 ++-
 arch/x86/kvm/x86.c   | 13 ++---
 3 files changed, 29 insertions(+), 9 deletions(-)

-- 
error compiling committee.c: too many arguments to function


Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 06:09:01PM +0300, Avi Kivity wrote:
> On 09/10/2012 04:09 PM, Gleb Natapov wrote:
> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > find the interrupt destination instead of looping through all vcpus.
> > 
> > +
> > +static inline int recalculate_apic_map(struct kvm *kvm)
> > +{
> > +   struct kvm_apic_map *new, *old = NULL;
> > +   struct kvm_vcpu *vcpu;
> > +   int i;
> > +
> > +   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > +
> > +   if (!new)
> > +   return -ENOMEM;
> > +
> > +   new->ldr_bits = 8;
> > +   new->flat = true;
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   u16 cid, lid;
> > +   struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +   if (!kvm_apic_present(vcpu))
> > +   continue;
> > +
> > +   if (apic_x2apic_mode(apic)) {
> > +   new->ldr_bits = 32;
> > +   new->flat = false;
> > +   } else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > +   kvm_apic_get_reg(apic, APIC_DFR) == 
> > APIC_DFR_CLUSTER)
> > +   new->flat = false;
> > +
> > +   new->phys_map[kvm_apic_id(apic)] = apic;
> > +   kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > +   new->flat, new->ldr_bits, &cid, &lid);
> > +
> > +   if (lid)
> > +   new->logical_map[cid][ffs(lid) - 1] = apic;
> > +   }
> > +   mutex_lock(&kvm->arch.apic_map_lock);
> > +   old = kvm->arch.apic_map;
> > +   rcu_assign_pointer(kvm->arch.apic_map, new);
> > +   mutex_unlock(&kvm->arch.apic_map_lock);
> 
> is this not racy?
> 
> cpu 0                   cpu 1
> -----                   -----
> new apic id
> recalculate
>                         new apic id
>                         recalculate
>                         take lock
>                         install map
>                         drop lock
> take lock
> install map
> drop lock
> 
Hmm, yes. Will have to hold the lock during recalculation. Not a big deal.
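Roughly like this, i.e. take the mutex around the whole rebuild instead of
only around the pointer swap (untested sketch; the kvm_for_each_vcpu loop
that fills the tables stays exactly as in the patch and is elided here):

static int recalculate_apic_map(struct kvm *kvm)
{
        struct kvm_apic_map *new, *old;

        new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
        if (!new)
                return -ENOMEM;

        /* serialize the whole rebuild, not just the pointer swap */
        mutex_lock(&kvm->arch.apic_map_lock);

        /* ... fill new->phys_map/new->logical_map as in the patch ... */

        old = rcu_dereference_protected(kvm->arch.apic_map,
                        lockdep_is_held(&kvm->arch.apic_map_lock));
        rcu_assign_pointer(kvm->arch.apic_map, new);
        mutex_unlock(&kvm->arch.apic_map_lock);

        if (old)
                call_rcu(&old->rcu, rcu_free_apic_map);

        return 0;
}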

> 
> > +
> > +   if (old)
> > +   call_rcu(&old->rcu, rcu_free_apic_map);
> > +
> > +   return 0;
> > +}
> > +
> 
> > -
> >  static const unsigned int apic_lvt_mask[APIC_LVT_NUM] = {
> > LVT_MASK ,  /* part LVTT mask, timer mode mask added at runtime */
> > LVT_MASK | APIC_MODE_MASK,  /* LVTTHMR */
> > @@ -477,6 +562,67 @@ int kvm_apic_match_dest(struct kvm_vcpu *vcpu, struct 
> > kvm_lapic *source,
> > return result;
> >  }
> >  
> > +bool kvm_irq_delivery_to_apic_fast(struct kvm *kvm, struct kvm_lapic *src,
> > +   struct kvm_lapic_irq *irq, int *r)
> > +{
> > +   unsigned long bitmap = 1;
> > +   struct kvm_lapic **dst;
> > +   int i;
> > +
> > +   if (irq->shorthand == APIC_DEST_SELF) {
> > +   *r = kvm_apic_set_irq(src->vcpu, irq);
> > +   return true;
> > +   }
> > +
> > +   if (irq->shorthand)
> > +   return false;
> > +
> > +   if (irq->dest_mode == 0) { /* physical mode */
> > +   if (irq->delivery_mode == APIC_DM_LOWEST ||
> > +   irq->dest_id == 0xff)
> > +   return false;
> > +   rcu_read_lock();
> 
> Unmatched rcu_read_lock() inside a block is ugly, please start it earlier.
> 
Doing rcu_read_unlock() there is ugly too, but if you prefer it this way...

> > +   dst = &kvm->arch.apic_map->phys_map[irq->dest_id & 0xff];
> 
> Like mst says, rcu_dereference().
> 
Ugh.

> > +   } else {
> > +   u16 cid, lid;
> > +   u32 mda = irq->dest_id;
> > +
> > +   rcu_read_lock();
> > +   if (kvm->arch.apic_map->ldr_bits == 8)
> > +   mda <<= 24;
> > +
> > +   kvm_apic_get_logical_id(mda, kvm->arch.apic_map->flat,
> > +   kvm->arch.apic_map->ldr_bits, &cid, &lid);
> > +   dst = kvm->arch.apic_map->logical_map[cid];
> > +
> > +   bitmap = lid;
> > +   if (irq->delivery_mode == APIC_DM_LOWEST) {
> > +   int l = -1;
> > +   for_each_set_bit(i, &bitmap, 16) {
> > +   if (!dst[i])
> > +   continue;
> > +   if (l < 0)
> > +   l = i;
> > +   else if (kvm_apic_compare_prio(dst[i]->vcpu, 
> > dst[l]->vcpu) < 0)
> > +   l = i;
> > +   }
> > +
> > +   bitmap = (l >= 0) ? 1 << l : 0;
> > +   }
> > +   }
> > +
> > +   for_each_set_bit(i, &bitmap, 16) {
> > +   if (!dst[i])
> > +   continue;
> > +   if (*r < 0)
> > +   *r = 0;
> > +   *r += kvm_apic_set_irq(dst[i]->vcpu, irq);
> > +   }
> > +
> > +   rcu_read_unlock();
> > +   return true;
> > +}
> > +
> >  /*
> >   * Add a pending IRQ into lapic.
> >   * Return 1 if successfully added and 0 if discarded.
> > @@ -880,7 +1026,7 @@ static int apic_reg_w

Re: [PULL] KVM updates

2012-09-10 Thread Marcelo Tosatti
On Mon, Sep 10, 2012 at 05:32:39PM +0200, Jan Kiszka wrote:
> On 2012-09-10 17:25, Avi Kivity wrote:
> > Please pull from:
> > 
> >   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
> > 
> > to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
> > assignment.  With this there are no significant changes left between qemu 
> > and
> > qemu-kvm (though some work remains).
> > 
> > 
> > Christian Borntraeger (1):
> >   qemu: Use valgrind annotations to mark kvm guest memory as defined
> > 
> > Jan Kiszka (5):
> >   kvm: Clean up irqfd API
> >   kvm: Introduce kvm_irqchip_update_msi_route
> >   kvm: Introduce kvm_has_intx_set_mask
> >   kvm: i386: Add services required for PCI device assignment
> >   kvm: i386: Add classic PCI device assignment
> 
> uq/master used to contain two patches from Peter regarding the kernel
> headers and my coalesced MMIO decoupling series - dropped intentionally?
> 
> Jan

Probably unintentionally; they haven't been lost though (they will be in the
next batch once this batch is pulled).



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Paolo Bonzini
On 06/09/2012 07:02, Michael S. Tsirkin wrote:
>> > It might be worth just unconditionally having a cache for the 2
>> > descriptor case.  This is what I get with qemu tap, though for some
>> > reason the device features don't have guest or host CSUM, so my setup is
>> > probably screwed:
> Yes without checksum net core always linearizes packets, so yes it is
> screwed.
> For -net, skb always allocates space for 17 frags + linear part so
> it seems sane to do same in virtio core, and allocate, for -net,
> up to max_frags + 1 from cache.
> We can adjust it: no _SG -> 2 otherwise 18.
> 
> Not sure about other drivers, maybe really use 2 there for now.

2 should also be good for virtio-blk and virtio-scsi 4KB random rw
workloads.

Paolo



Re: [PULL] KVM updates

2012-09-10 Thread Avi Kivity
On 09/10/2012 06:49 PM, Marcelo Tosatti wrote:
> On Mon, Sep 10, 2012 at 05:32:39PM +0200, Jan Kiszka wrote:
>> On 2012-09-10 17:25, Avi Kivity wrote:
>> > Please pull from:
>> > 
>> >   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
>> > 
>> > to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio 
>> > device
>> > assignment.  With this there are no significant changes left between qemu 
>> > and
>> > qemu-kvm (though some work remains).
>> > 
>> > 
>> > Christian Borntraeger (1):
>> >   qemu: Use valgrind annotations to mark kvm guest memory as defined
>> > 
>> > Jan Kiszka (5):
>> >   kvm: Clean up irqfd API
>> >   kvm: Introduce kvm_irqchip_update_msi_route
>> >   kvm: Introduce kvm_has_intx_set_mask
>> >   kvm: i386: Add services required for PCI device assignment
>> >   kvm: i386: Add classic PCI device assignment
>> 
>> uq/master used to contain two patches from Peter regarding the kernel
>> headers and my coalesced MMIO decoupling series - dropped intentionally?

Thanks for checking.

> Probably unintentionally; they haven't been lost though (they will be in the
> next batch once this batch is pulled).

Any idea how they were lost?

-- 
error compiling committee.c: too many arguments to function


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > @@ -4856,8 +4859,6 @@ again:
> > > if (curr->sched_class != p->sched_class)
> > > goto out;
> > > 
> > > -   if (task_running(p_rq, p) || p->state)
> > > -   goto out;
> > 
> > Is it possible that by this time the current thread takes double rq
> > lock, thread p could actually be running?  i.e is there merit to keep
> > this check around even with your similar check above?
> 
> I think that's a good idea.  I'll add that back in. 

Right, it still needs to be there: the test before acquiring p_rq is an
optimistic test to avoid work, but you still have to test it once you
acquire p_rq, since the rest of the code relies on this not being so.

How about something like this instead.. ?

---
 kernel/sched/core.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46a011..c9ecab2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4300,6 +4300,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p)
+{
+   if (!curr->sched_class->yield_to_task)
+   return false;
+
+   if (curr->sched_class != p->sched_class)
+   return false;
+
+   if (task_running(p_rq, p) || p->state)
+   return false;
+
+   return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
rq = this_rq();
 
 again:
+   /* optimistic test to avoid taking locks */
+   if (!__yield_to_candidate(curr, p))
+   goto out_irq;
+
p_rq = task_rq(p);
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
@@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
goto again;
}
 
-   if (!curr->sched_class->yield_to_task)
-   goto out;
-
-   if (curr->sched_class != p->sched_class)
-   goto out;
-
-   if (task_running(p_rq, p) || p->state)
-   goto out;
+   /* validate state, holding p_rq ensures p's state cannot change */
+   if (!__yield_to_candidate(curr, p))
+   goto out_unlock;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
resched_task(p_rq->curr);
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
if (yielded)



Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 10:47:15AM -0500, Thomas Lendacky wrote:
> On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
> 
> > "Michael S. Tsirkin"  writes:
> 
> > > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
> 
> > >> "Michael S. Tsirkin"  writes:
> 
> > >> > Yes without checksum net core always linearizes packets, so yes it is
> 
> > >> > screwed.
> 
> > >> > For -net, skb always allocates space for 17 frags + linear part so
> 
> > >> > it seems sane to do same in virtio core, and allocate, for -net,
> 
> > >> > up to max_frags + 1 from cache.
> 
> > >> > We can adjust it: no _SG -> 2 otherwise 18.
> 
> > >>
> 
> > >> But I thought it used individual buffers these days?
> 
> > >
> 
> > > Yes for receive, no for transmit. That's probably why
> 
> > > we should have the threshold per vq, not per device, BTW.
> 
> >
> 
> > Can someone actually run with my histogram patch and see what the real
> 
> > numbers are?
> 
> >
> 
>  
> 
> I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
> 
> guest-to-host, with a form of the histogram patch applied against a
> 
> RHEL6.3 kernel. The histogram values were reset after each test.
> 
>  
> 
> Here are the results:
> 
>  
> 
> 60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
> 
> response for 60 seconds:
> 
>  
> 
> Queue histogram for virtio1:
> 
> Size distribution for input (max=7818456):
> 
> 1: 7818456 
> 
> Size distribution for output (max=7816698):
> 
> 2: 149
> 
> 3: 7816698 

Here, a threshold would help.
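
I.e. something like the following in vring_add_indirect() would already
cover the 3-descriptor case that dominates above (a sketch only; the
indirect_cache/indirect_cache_thresh fields are illustrative and not from
the posted patch):

        struct vring_desc *desc;

        /* small requests come from a preallocated per-vq cache,
         * larger ones fall back to kmalloc as today */
        if (out + in <= vq->indirect_cache_thresh)
                desc = kmem_cache_alloc(vq->indirect_cache, gfp);
        else
                desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);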

> 
> 4: 2
> 
> 5: 1
> 
> Size distribution for control (max=1):
> 
> 0: 0
> 
>  
> 
>  
> 
> 4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:
> 
>  
> 
> Queue histogram for virtio1:
> 
> Size distribution for input (max=16050941):
> 
> 1: 16050941 
> 
> Size distribution for output (max=1877796):
> 
> 2: 1877796 
> 
> 3: 5
> 
> Size distribution for control (max=1):
> 
> 0: 0
> 
>  
> 
> 4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:
> 
>  
> 
> Queue histogram for virtio1:
> 
> Size distribution for input (max=16831151):
> 
> 1: 16831151 
> 
> Size distribution for output (max=1923965):
> 
> 2: 1923965 
> 
> 3: 5
> 
> Size distribution for control (max=1):
> 
> 0: 0
> 

Hmm, for virtio-net output we always use 2 s/g; this is because of a
qemu bug. Maybe it's time we fixed this and added a feature bit?
That would fix the above without threshold hacks.


> 
> 4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:
> 
>  
> 
> Queue histogram for virtio1:
> 
> Size distribution for input (max=1316069):
> 
> 1: 1316069 
> 
> Size distribution for output (max=879213):
> 
> 2: 24
> 
> 3: 24097 #
> 
> 4: 23176 #
> 
> 5: 3412
> 
> 6: 4446
> 
> 7: 4663
> 
> 8: 4195
> 
> 9: 3772
> 
> 10: 3388
> 
> 11: 3666
> 
> 12: 2885
> 
> 13: 2759
> 
> 14: 2997
> 
> 15: 3060
> 
> 16: 2651
> 
> 17: 2235
> 
> 18: 92721 ##
> 
> 19: 879213 
> 
> Size distribution for control (max=1):
> 
> 0: 0
> 
>  
> 
> 4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:
> 
>  
> 
> Queue histogram for virtio1:
> 
> Size distribution for input (max=1428590):
> 
> 1: 1428590 
> 
> Size distribution for output (max=957774):
> 
> 2: 20
> 
> 3: 54955 ###
> 
> 4: 34281 ##
> 
> 5: 2967
> 
> 6: 3394
> 
> 7: 9400
> 
> 8: 3061
> 
> 9: 3397
> 
> 10: 3258
> 
> 11: 3275
> 
> 12: 3147
> 
> 13: 2876
> 
> 14: 2747
> 
> 15: 2832
> 
> 16: 2013
> 
> 17: 1670
> 
> 18: 100369 ##
> 
> 19: 957774 
> 
> Size distribution for control (max=1):
> 
> 0: 0
> 
>  
> 
> Thanks,
> 
> Tom

In these tests we would have to set the threshold pretty high.
I wonder whether the following makes any difference; the idea is to
A. get less false cache sharing by allocating full cache lines
B. get better locality by using the same cache for multiple sizes

So we get some of the wins of the threshold without bothering
with a cache.

Will try to test but not until later this week.

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 5aa43c3..c184712 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -132,7 +132,8 @@ static int vring_add_indirect(struct vring_virtqueue *vq,
unsigned head;
int i;
 
-   desc = kmalloc((out + in) * sizeof(struct vring_desc), gfp);
+   desc = k
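
(Since the diff above is truncated, the allocation side of the A/B idea
would look roughly like the line below -- an illustration only, not the
actual patch:)

        /* round the indirect block up to whole cache lines: adjacent
         * allocations never share a line, and several request sizes map
         * onto the same kmalloc cache */
        desc = kmalloc(L1_CACHE_ALIGN((out + in) * sizeof(struct vring_desc)), gfp);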

Re: [PATCH v2 2/2] virtio-ring: Allocate indirect buffers from cache when possible

2012-09-10 Thread Thomas Lendacky
On Friday, September 07, 2012 09:19:04 AM Rusty Russell wrote:
> "Michael S. Tsirkin"  writes:
> > On Thu, Sep 06, 2012 at 05:27:23PM +0930, Rusty Russell wrote:
> >> "Michael S. Tsirkin"  writes:
> >> > Yes without checksum net core always linearizes packets, so yes it is
> >> > screwed.
> >> > For -net, skb always allocates space for 17 frags + linear part so
> >> > it seems sane to do same in virtio core, and allocate, for -net,
> >> > up to max_frags + 1 from cache.
> >> > We can adjust it: no _SG -> 2 otherwise 18.
> >> 
> >> But I thought it used individual buffers these days?
> > 
> > Yes for receive, no for transmit. That's probably why
> > we should have the threshold per vq, not per device, BTW.
> 
> Can someone actually run with my histogram patch and see what the real
> numbers are?

Somehow some HTML got in my first reply, resending...

I ran some TCP_RR and TCP_STREAM sessions, both host-to-guest and
guest-to-host, with a form of the histogram patch applied against a
RHEL6.3 kernel. The histogram values were reset after each test.

Here are the results:

60 session TCP_RR from host-to-guest with 256 byte request and 256 byte
response for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=7818456):
 1: 7818456 
Size distribution for output (max=7816698):
 2: 149 
 3: 7816698 
 4: 2   
 5: 1   
Size distribution for control (max=1):
 0: 0


4 session TCP_STREAM from host-to-guest with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16050941):
 1: 16050941 
Size distribution for output (max=1877796):
 2: 1877796 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from host-to-guest with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=16831151):
 1: 16831151 
Size distribution for output (max=1923965):
 2: 1923965 
 3: 5   
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 4K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1316069):
 1: 1316069 
Size distribution for output (max=879213):
 2: 24  
 3: 24097   #
 4: 23176   #
 5: 3412
 6: 4446
 7: 4663
 8: 4195
 9: 3772
10: 3388
11: 3666
12: 2885
13: 2759
14: 2997
15: 3060
16: 2651
17: 2235
18: 92721   ##  
19: 879213  
Size distribution for control (max=1):
 0: 0

4 session TCP_STREAM from guest-to-host with 16K message size for 60 seconds:

Queue histogram for virtio1:
Size distribution for input (max=1428590):
 1: 1428590 
Size distribution for output (max=957774):
 2: 20
 3: 54955   ###
 4: 34281   ##
 5: 2967
 6: 3394
 7: 9400
 8: 3061
 9: 3397
10: 3258
11: 3275
12: 3147
13: 2876
14: 2747
15: 2832
16: 2013
17: 1670
18: 100369  ##
19: 957774  
Size distribution for control (max=1):
 0: 0

Thanks,
Tom

> 
> I'm not convinced that the ideal 17-buffer case actually happens as much
> as we think.  And if it's not happening with this netperf test, we're
> testing the wrong thing.
> 
> Thanks,
> Rusty.
> 



Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 05:44:38PM +0300, Michael S. Tsirkin wrote:
> On Mon, Sep 10, 2012 at 04:09:15PM +0300, Gleb Natapov wrote:
> > Most interrupts are delivered to only one vcpu. Use pre-built tables to
> > find the interrupt destination instead of looping through all vcpus.
> > 
> > Signed-off-by: Gleb Natapov 
> 
> Looks good overall. I think I see some bugs, with rcu use and others.
> The most suspicious thing is that the code seems to use rcu but
> there are no calls to rcu_dereference anywhere.
Yep. Forgot about it.
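For the fast path that means something like this (sketch; every access goes
through the dereferenced 'map' pointer rather than kvm->arch.apic_map):

        struct kvm_apic_map *map;

        rcu_read_lock();
        map = rcu_dereference(kvm->arch.apic_map);
        dst = &map->phys_map[irq->dest_id & 0xff];
        /* ... rest of the lookup, touching only 'map' ... */
        rcu_read_unlock();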

> Pls see below.
> 
> Thanks for looking into this; once it is ready I intend to rework
> direct injection to work on top of this.
> 
> 
> > diff --git a/arch/x86/include/asm/kvm_host.h 
> > b/arch/x86/include/asm/kvm_host.h
> > index 64adb61..121f308 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -511,6 +511,14 @@ struct kvm_arch_memory_slot {
> > struct kvm_lpage_info *lpage_info[KVM_NR_PAGE_SIZES - 1];
> >  };
> >  
> > +struct kvm_apic_map {
> > +   struct rcu_head rcu;
> > +   bool flat;
> > +   u8 ldr_bits;
> > +   struct kvm_lapic *phys_map[255];
> 
> Not 256? The apic id is sometimes used in the lookup - is it guaranteed
> that the guest cannot set it to 0xff? If it can, this will overflow.
> 
Yes. It needs to be 256.

> > +   struct kvm_lapic *logical_map[16][16];
> 
> 
> These are large arrays: 2 Kbytes each, which is bad
> for the cache.
> Is it not better to store the vcpu index there?
> That would reduce the cache footprint by 8x.
> 
We do not have vcpu indexes, only vcpu ids, which do not correspond to
a vcpu's place in the vcpu array. The vcpu array will likely be replaced by a list when
vcpu destruction functionality is implemented, so I do not want to
introduce any dependencies on the current implementation details.

> > +};
> > +
> >  struct kvm_arch {
> > unsigned int n_used_mmu_pages;
> > unsigned int n_requested_mmu_pages;
> > @@ -528,6 +536,8 @@ struct kvm_arch {
> > struct kvm_ioapic *vioapic;
> > struct kvm_pit *vpit;
> > int vapics_in_nmi_mode;
> > +   struct kvm_apic_map *apic_map;
> > +   struct mutex apic_map_lock;
> >  
> > unsigned int tss_addr;
> > struct page *apic_access_page;
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 07ad628..d18ddd2 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -139,11 +139,101 @@ static inline int apic_enabled(struct kvm_lapic 
> > *apic)
> > (LVT_MASK | APIC_MODE_MASK | APIC_INPUT_POLARITY | \
> >  APIC_LVT_REMOTE_IRR | APIC_LVT_LEVEL_TRIGGER)
> >  
> > +static inline int apic_x2apic_mode(struct kvm_lapic *apic)
> > +{
> > +   return apic->vcpu->arch.apic_base & X2APIC_ENABLE;
> > +}
> > +
> > +
> 
> Two empty lines are not needed.
> 
That's terrible, I agree.
 
> >  static inline int kvm_apic_id(struct kvm_lapic *apic)
> >  {
> > return (kvm_apic_get_reg(apic, APIC_ID) >> 24) & 0xff;
> >  }
> >  
> > +static void rcu_free_apic_map(struct rcu_head *head)
> > +{
> > +   struct kvm_apic_map *p = container_of(head,
> > +   struct kvm_apic_map, rcu);
> 
> why break this line? it is not too long as is.
> 
Because it was longer, but then the structure was renamed to something shorter :)

> > +
> > +   kfree(p);
> > +}
> > +
> > +static void kvm_apic_get_logical_id(u32 ldr, bool flat, u8 ldr_bits,
> > +   u16 *cid, u16 *lid)
> > +{
> > +   if (ldr_bits == 32) {
> > +   *cid = ldr >> 16;
> > +   *lid = ldr & 0x;
> > +   } else {
> > +   ldr = GET_APIC_LOGICAL_ID(ldr);
> > +
> > +   if (flat) {
> > +   *cid = 0;
> > +   *lid = ldr;
> > +   } else {
> > +   *cid = ldr >> 4;
> > +   *lid = ldr & 0xf;
> > +   }
> > +   }
> > +}
> > +
> > +static inline int recalculate_apic_map(struct kvm *kvm)
> 
> __must_check?
What is it?
> 
> > +{
> > +   struct kvm_apic_map *new, *old = NULL;
> > +   struct kvm_vcpu *vcpu;
> > +   int i;
> > +
> > +   new = kzalloc(sizeof(struct kvm_apic_map), GFP_KERNEL);
> > +
> 
> empty line not needed here :)
I like it that way. You think it will break compilation or runtime?

> 
> > +   if (!new)
> > +   return -ENOMEM;
> > +
> > +   new->ldr_bits = 8;
> > +   new->flat = true;
> > +   kvm_for_each_vcpu(i, vcpu, kvm) {
> > +   u16 cid, lid;
> > +   struct kvm_lapic *apic = vcpu->arch.apic;
> > +
> > +   if (!kvm_apic_present(vcpu))
> > +   continue;
> > +
> > +   if (apic_x2apic_mode(apic)) {
> > +   new->ldr_bits = 32;
> > +   new->flat = false;
> > +   } else if (kvm_apic_sw_enabled(apic) && new->flat &&
> > +   kvm_apic_get_reg(apic, APIC_DFR) == 
> > APIC_DFR_CLUSTER)
> > +   new->flat = false;
> > +
> > +   new->phys_map[kvm_apic_id(apic)] = apic;
> > +   kvm_apic_get_logical_id(kvm_apic_get_reg(apic, APIC_LDR),
> > +   

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Srikar Dronamraju
* Peter Zijlstra  [2012-09-10 18:03:55]:

> On Mon, 2012-09-10 at 08:16 -0500, Andrew Theurer wrote:
> > > > @@ -4856,8 +4859,6 @@ again:
> > > > if (curr->sched_class != p->sched_class)
> > > > goto out;
> > > > 
> > > > -   if (task_running(p_rq, p) || p->state)
> > > > -   goto out;
> > > 
> > > Is it possible that by this time the current thread takes double rq
> > > lock, thread p could actually be running?  i.e is there merit to keep
> > > this check around even with your similar check above?
> > 
> > I think that's a good idea.  I'll add that back in. 
> 
> Right, it still needs to be there: the test before acquiring p_rq is an
> optimistic test to avoid work, but you still have to test it once you
> acquire p_rq, since the rest of the code relies on this not being so.
> 
> How about something like this instead.. ?
> 
> ---
>  kernel/sched/core.c | 35 ++-
>  1 file changed, 26 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c46a011..c9ecab2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4300,6 +4300,23 @@ void __sched yield(void)
>  }
>  EXPORT_SYMBOL(yield);
>  
> +/*
> + * Tests preconditions required for sched_class::yield_to().
> + */
> +static bool __yield_to_candidate(struct task_struct *curr, struct 
> task_struct *p)
> +{
> + if (!curr->sched_class->yield_to_task)
> + return false;
> +
> + if (curr->sched_class != p->sched_class)
> + return false;


Peter, 

Should we also add a check whether the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set?  Something akin
to:

if (p_rq->cfs_rq->skip)
return false;

So if somebody has already acquired the double run queue lock and almost
set the next buddy, we don't need to take the run queue lock, and we also avoid
overwriting the already-set skip buddy.

> +
> + if (task_running(p_rq, p) || p->state)
> + return false;
> +
> + return true;
> +}
> +
>  /**
>   * yield_to - yield the current processor to another thread in
>   * your thread group, or accelerate that thread toward the
> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   rq = this_rq();
>  
>  again:
> + /* optimistic test to avoid taking locks */
> + if (!__yield_to_candidate(curr, p))
> + goto out_irq;
> +
>   p_rq = task_rq(p);
>   double_rq_lock(rq, p_rq);
>   while (task_rq(p) != p_rq) {
> @@ -4330,14 +4351,9 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   goto again;
>   }
>  
> - if (!curr->sched_class->yield_to_task)
> - goto out;
> -
> - if (curr->sched_class != p->sched_class)
> - goto out;
> -
> - if (task_running(p_rq, p) || p->state)
> - goto out;
> + /* validate state, holding p_rq ensures p's state cannot change */
> + if (!__yield_to_candidate(curr, p))
> + goto out_unlock;
>  
>   yielded = curr->sched_class->yield_to_task(rq, p, preempt);
>   if (yielded) {
> @@ -4350,8 +4366,9 @@ bool __sched yield_to(struct task_struct *p, bool 
> preempt)
>   resched_task(p_rq->curr);
>   }
>  
> -out:
> +out_unlock:
>   double_rq_unlock(rq, p_rq);
> +out_irq:
>   local_irq_restore(flags);
>  
>   if (yielded)
> 



Re: [RFC PATCH] KVM: optimize apic interrupt delivery

2012-09-10 Thread Gleb Natapov
On Mon, Sep 10, 2012 at 07:17:54PM +0300, Gleb Natapov wrote:
> > > + return 0;
> > > +}
> > > +
> > > +static inline int kvm_apic_set_id(struct kvm_lapic *apic, u8 id)
> > > +{
> > > + apic_set_reg(apic, APIC_ID, id << 24);
> > > + return recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > > +static inline int kvm_apic_set_ldr(struct kvm_lapic *apic, u32 id)
> > > +{
> > > + apic_set_reg(apic, APIC_LDR, id);
> > > + return recalculate_apic_map(apic->vcpu->kvm);
> > > +}
> > > +
> > 
> > return value of these functions seems never checked.
> > 
> Yes, the problem is that we can do nothing about the failure if the failure
> happens during a guest write.
> 
Actually, I have an idea how to handle the error: never return one. If
the map cannot be allocated, always go down the slow path. phys_map obviously
has to be checked for NULL during delivery in this case.
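
I.e. roughly (untested sketch):

        rcu_read_lock();
        map = rcu_dereference(kvm->arch.apic_map);
        if (!map) {
                /* map allocation failed earlier: let the caller scan all vcpus */
                rcu_read_unlock();
                return false;
        }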

--
Gleb.


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > +static bool __yield_to_candidate(struct task_struct *curr, struct 
> > task_struct *p)
> > +{
> > + if (!curr->sched_class->yield_to_task)
> > + return false;
> > +
> > + if (curr->sched_class != p->sched_class)
> > + return false;
> 
> 
> Peter, 
> 
> Should we also add a check whether the runq has a skip buddy (as pointed out
> by Raghu) and return if the skip buddy is already set?

Oh right, I missed that suggestion... the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since it's not a strict prerequisite.

> > 
> > + if (task_running(p_rq, p) || p->state)
> > + return false;
> > +
> > + return true;
> > +} 


> > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> bool preempt)
> >   rq = this_rq();
> >  
> >  again:
> > + /* optimistic test to avoid taking locks */
> > + if (!__yield_to_candidate(curr, p))
> > + goto out_irq;
> > +

So add something like:

/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;
> 
> 
> >   p_rq = task_rq(p);
> >   double_rq_lock(rq, p_rq);
> 
> 
But I do have a question on this optimization though: why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?




problems using virt-manager

2012-09-10 Thread Lentes, Bernd
Hi,

I am trying to run virt-manager on a SLES 11 SP1 box. I'm using kernel 2.6.32.12 and
virt-manager 0.9.4-106.1.x86_64.
The system is a 64bit box.

Here is the output:
=
pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1 # virt-manager &
[1] 9659
pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1 # Traceback (most recent call last):
  File "/usr/share/virt-manager/virt-manager.py", line 386, in <module>
main()
  File "/usr/share/virt-manager/virt-manager.py", line 247, in main
from virtManager import cli
  File "/usr/share/virt-manager/virtManager/cli.py", line 29, in 
import libvirt
  File "/usr/lib64/python2.6/site-packages/libvirt.py", line 25, in 
raise lib_e
ImportError: /usr/lib64/libvirt.so.0: undefined symbol: 
selinux_virtual_domain_context_path

[1]+  Exit 1  virt-manager
=

Thanks for any hint.


Bernd

--
Bernd Lentes

Systemadministration
Institut für Entwicklungsgenetik
Gebäude 35.34 - Raum 208
HelmholtzZentrum münchen
bernd.len...@helmholtz-muenchen.de
phone: +49 89 3187 1241
fax:   +49 89 3187 2294
http://www.helmholtz-muenchen.de/idg

Wir sollten nicht den Tod fürchten, sondern
das schlechte Leben

Helmholtz Zentrum München
Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH)
Ingolstädter Landstr. 1
85764 Neuherberg
www.helmholtz-muenchen.de
Aufsichtsratsvorsitzende: MinDir´in Bärbel Brumme-Bothe
Geschäftsführer: Prof. Dr. Günther Wess und Dr. Nikolaus Blum
Registergericht: Amtsgericht München HRB 6466
USt-IdNr: DE 129521671


Re: Sending a VM for restart (after it has been shutdown and saved)

2012-09-10 Thread Marcelo Tosatti
On Mon, Sep 03, 2012 at 06:35:24PM +0100, Michael Johns wrote:
> Hi list,
> 
> I have been hacking the KVM-QEMU code, but need some help to be able
> to perform a particular operation.
> 
> Currently, I perform some operations on the VM image after it has
> received a shutdown call, and after the image itself has been fully
> saved before KVM-QEMU shuts down. I perform these operations within
> the bdrv_close_all () method within block.c, and it works exactly as I
> want it to.
> 
> What I would like to be able to do is to send the VM in question back
> for a restart from that part of the code, and I was wondering if this
> was possible with a given command. When I say send it back, I mean
> literally to have the KVM-QEMU treat the VM as if has just been
> started/restarted. Although I know a few things, there is still a lot
> I don't know so it would be great to get some help on this.
> 
> Thanks,
> 
> M

Do you mean the system_reset command from the QEMU monitor?



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
> On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> This implementation of a virtio balloon driver uses the page cache to
>> "store" pages that have been released to the host.  The communication
>> (outside of target counts) is one way--the guest notifies the host when
>> it adds a page to the page cache, allowing the host to madvise(2) with
>> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> (via the regular page reclaim).  This means that inflating the balloon
>> is similar to the existing balloon mechanism, but the deflate is
>> different--it re-uses existing Linux kernel functionality to
>> automatically reclaim.
>>
>> Signed-off-by: Frank Swiderski 

Hi Michael,

I'm very sorry that Frank and I have been silent on these threads.
I've been out of the office and Frank has been been swamped :)

I'll take a stab at answering some of your questions below, and
hopefully we can end up on the same page.

> I've been trying to understand this, and I have
> a question: what exactly is the benefit
> of this new device?

The key difference between this device/driver and the pre-existing
virtio_balloon device/driver is in how the memory pressure loop is
controlled.

With the pre-existing balloon device/driver, the control loop for how
much memory a given VM is allowed to use is controlled completely by
the host.  This is probably fine if the goal is to pack as much work
on a given host as possible, but it says nothing about the expected
performance that any given VM is expecting to have.  Specifically, it
allows the host to set a target goal for the size of a VM, and the
driver in the guest does whatever is needed to get to that goal.  This
is great for systems where one wants to "grow or shrink" a VM from the
outside.


This behaviour however doesn't match what applications actually expect
from a memory control loop however.  In a native setup, an application
can usually expect to allocate memory from the kernel on an as-needed
basis, and can in turn return memory back to the system (using a heap
implementation that actually releases memory that is).  The dynamic
size of an application is completely controlled by the application,
and there is very little that cluster management software can do to
ensure that the application fits some prescribed size.

We recognized this in the development of our cluster management
software long ago, so our systems are designed for managing tasks that
have a dynamic memory footprint.  Overcommit is possible (as most
applications do not use the full reservation of memory they asked for
originally), letting us do things like schedule lower priority/lower
service-classification work using resources that are otherwise
available in stand-by for high-priority/low-latency workloads.

>
> Note that users could not care less about how a driver
> is implemented internally.
>
> Is there some workload where you see VM working better with
> this than regular balloon? Any numbers?

This device is less about performance as it is about getting the
memory size of a job (or in this case, a job in a VM) to grow and
shrink as the application workload sees fit, much like how processes
today can grow and shrink without external direction.

>
> Also, can't we just replace existing balloon implementation
> with this one?

Perhaps, but as described above, both devices have very different
characteristics.

> Why it is so important to deflate silently?

It may not be so important to deflate silently.  I'm not sure why it
is important that we deflate "loudly" though either :)  Doing so seems
like unnecessary guest/host communication IMO, especially if the guest
is expecting to be able to grow to totalram (and the host isn't able
to nack any pages reclaimed anyway...).

> I guess filesystem does not currently get a callback
> before page is reclaimed but this is an implementation detail -
> maybe this can be fixed?

I do not follow this question.

>
> Also can you pls answer Avi's question?
> How is overcommit managed?

Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.

>
>
>> ---
>>  drivers/virtio/Kconfig  |   13 +
>>  drivers/virtio/Makefile |1 +
>>  drivers/virtio/virtio_fileballoon.c |  636 
>> +++
>>  include/linux/virtio_balloon.h  |9 +
>>  include/linux/virtio_ids.h  |1 +
>>  5 files changed, 660 insertions(+), 0 deletions(-)
>>  create mode 100644 drivers/virtio/virtio_fileballoon.c
>>
>> diff --git a/drivers/virtio/Kconfig b/drivers/virtio/Kconfig
>> index f38b17a..cffa2a7 100644
>> --- a/drivers/virtio/Kconfig
>> +++ b/drivers/virtio/Kconfig
>> @@ -35,6 +35,19 @@ config VIRTIO_BALLOON
>>
>>If unsure, say M.
>>
>> +config VIRTIO_FILEBALLOON
>> + tristate "Virtio page cache-backed balloon

Re: [libvirt-users] Kernel unresponsive after booting 700+ vm's on a single host

2012-09-10 Thread Eric Blake
On 09/10/2012 07:51 AM, Alfred Bratterud wrote:
> For a research project we are trying to boot a very large number of tiny, 
> custom-built VMs on KVM/Ubuntu. The maximum VM count achieved was 1000, but 
> with substantial slowness, and eventually kernel failure, while the 
> cpu/memory loads were nowhere near limits. Where is the likely bottleneck? 
> Any solutions, workarounds, hacks or dirty tricks?

Are you using cgroups?  There have been some known bottlenecks in the
kernel cgroup code, where it scales very miserably; and since libvirt
uses a different cgroup per VM by default when cgroups are enabled, that
might explain part of the problem.

Other than that, if you can profile the slowdowns, I'm sure people would
be interested in the results.

-- 
Eric Blake   ebl...@redhat.com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Rik van Riel

On 09/10/2012 01:37 PM, Mike Waychison wrote:

On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:



Also can you pls answer Avi's question?
How is overcommit managed?


Overcommit in our deployments is managed using memory cgroups on the
host.  This allows us to have very directed policies as to how
competing VMs on a host may overcommit.


How do your memory cgroups lead to guests inflating their balloons?

--
All rights reversed


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 2:04 PM, Rik van Riel  wrote:
> On 09/10/2012 01:37 PM, Mike Waychison wrote:
>>
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin 
>> wrote:
>
>
>>> Also can you pls answer Avi's question?
>>> How is overcommit managed?
>>
>>
>> Overcommit in our deployments is managed using memory cgroups on the
>> host.  This allows us to have very directed policies as to how
>> competing VMs on a host may overcommit.
>
>
> How do your memory cgroups lead to guests inflating their balloons?

The control loop that is driving the cgroup on the host can still move
the balloon target page count causing the balloon in the guest to try
and inflate.  This allows the host to effectively slowly grow the
balloon in the guest, allowing reclaim of guest free memory, followed
by guest page cache (and memory on the host system).  This can then be
compared with the subsequent growth (as this balloon setup allows the
guest to grow as it sees fit), which in effect gives us a memory
pressure indicator on the host, allowing it to back-off shrinking the
guest if the guest balloon quickly deflates.

The net effect is an opportunistic release of memory from the guest
back to the host, and the ability to quickly grow a VM's memory
footprint as the workload within it requires.

This dynamic memory sizing of VMs is much more in line with what we
can expect from native tasks today (and which is what our resource
management systems are designed to handle).
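
A toy sketch, in C with entirely made-up names (this is not QEMU, KVM, or any real management-daemon code), of the host-side policy loop described above: slowly raise the balloon target, watch whether the guest quickly pulls pages back, and treat rapid deflation as pressure worth backing off from.

#include <stdio.h>

/* Stand-in for the guest side: it cedes up to the target, but keeps
 * whatever its (simulated) workload currently needs. */
static unsigned long target;    /* pages the host asks the guest to cede */
static unsigned long actual;    /* pages the guest has actually ceded    */

static void guest_step(unsigned long needed, unsigned long total)
{
    unsigned long spare = total - needed;
    actual = target < spare ? target : spare;
}

int main(void)
{
    const unsigned long total = 262144;   /* a 1GB guest in 4K pages      */
    const unsigned long step  = 4096;     /* grow the target 16MB a round */
    unsigned long workload    = 131072;   /* guest initially needs 512MB  */

    for (int round = 0; round < 20; round++) {
        unsigned long before = actual;

        target += step;                   /* ask the guest for a bit more */
        if (round == 10)
            workload = 250000;            /* ...then its workload grows   */
        guest_step(workload, total);

        if (actual + step / 2 < before) { /* guest yanked pages back      */
            printf("round %2d: pressure seen, backing off to %lu pages\n",
                   round, actual);
            target = actual;              /* stop squeezing the guest     */
        } else {
            printf("round %2d: guest has ceded %lu pages\n", round, actual);
        }
    }
    return 0;
}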


Re: [PATCHv4] virtio-spec: virtio network device multiqueue support

2012-09-10 Thread Rick Jones

On 09/09/2012 07:12 PM, Rusty Russell wrote:

OK, I read the spec (pasted below for easy of reading), but I'm still
confused over how this will work.

I thought normal net drivers have the hardware provide an rxhash for
each packet, and we map that to CPU to queue the packet on[1].  We hope
that the receiving process migrates to that CPU, so xmit queue
matches.

For virtio this would mean a new per-packet rxhash value, right?

Why are we doing something different?  What am I missing?

Thanks,
Rusty.
[1] Everything I Know About Networking I Learned From LWN:
 https://lwn.net/Articles/362339/


In my taxonomy at least, "multi-queue" predates RPS and RFS and is 
simply where the NIC via some means, perhaps a header hash, separates 
incoming frames to different queues.


RPS can be thought of as doing something similar inside the host.  That 
could be used to get a spread from an otherwise "dumb" NIC (certainly 
that is what one of its predecessors - Inbound Packet Scheduling - used 
it for in HP-UX 10.20), or it could be used to augment the multi-queue 
support of a not-so-dumb NIC - say if said NIC had a limit of queues 
that was rather lower than the number of cores/threads in the host. 
Indeed some driver/NIC combinations provide a hash value to the host for 
the host to use as it sees fit.


However, there is still the matter of a single thread of an application 
servicing multiple connections, each of which would hash to different 
locations.


RFS  (Receive Flow Steering) then goes one step further, and looks-up 
where the flow endpoint was last accessed and steers the traffic there. 
 The idea being that a thread of execution servicing multiple flows 
will have the traffic of those flows sent to the same place.  It then 
allows the scheduler to decide where things should be run rather than 
the networking code.
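
To make the distinction concrete, here is a minimal sketch in C (made-up names, not kernel code) of the selection logic described above: a static hash-indexed table gives the multi-queue/RPS-style spread, and a per-flow "last CPU" table gives the RFS-style override.

#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE 128                    /* power of two */

struct steering {
    uint16_t spread_cpu[TABLE_SIZE];      /* hash -> CPU, configured up front   */
    int      last_cpu[TABLE_SIZE];        /* hash -> consumer's last CPU, or -1 */
};

static int select_cpu(const struct steering *s, uint32_t rxhash)
{
    uint32_t idx = rxhash & (TABLE_SIZE - 1);

    if (s->last_cpu[idx] >= 0)            /* RFS: follow the consuming thread */
        return s->last_cpu[idx];
    return s->spread_cpu[idx];            /* RPS/multi-queue: static spread   */
}

int main(void)
{
    struct steering s;
    for (int i = 0; i < TABLE_SIZE; i++) {
        s.spread_cpu[i] = i % 4;          /* pretend the host has 4 CPUs */
        s.last_cpu[i] = -1;
    }
    s.last_cpu[0x1234 & (TABLE_SIZE - 1)] = 3;  /* consumer last ran on CPU 3 */

    printf("flow 0x1234 -> cpu %d\n", select_cpu(&s, 0x1234));
    printf("flow 0x4321 -> cpu %d\n", select_cpu(&s, 0x4321));
    return 0;
}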


rick jones



Win2k8 guest with very poor disk performance on KVM host

2012-09-10 Thread Daniel Tschritter
Hi everybody,

I got a server with CentOS 6.3 and KVM as a host and a Windows 2k8 
guest.

The Windows machine's disk performance is very poor.
The Windows guest uses VirtIO disk drivers, no cache, and an LVM 
partition on a RAID1.

atop shows 100% disk utilization as soon as the Windows guest accesses 
the HDD; the data transfer figures given are usually less than 1MB/s 
r/w, with peaks around 3MB/s r/w.

I've run a few tests to see what's going on:

Creating a 10GB test file on CentOS (guests switched off):
time dd if=/dev/random of=testfile bs=1 count=0 seek=10G
0+0 records in
0+0 records out
0 bytes (0 B) copied, 5.777e-06 s, 0.0 kB/s
real 0m0.001s
user 0m0.000s
sys 0m0.000s
 
Create copy of test file created above:
time cp testfile testfile2
real 0m3.136s
user 0m0.440s
sys 0m2.693s
That looks ok to me. According to atop data transfer rates are between 
130 and 180MB/s.

Create copy of test file above while Windows guest boots up:
time cp testfile testfile2
real 0m3.367s
user 0m0.515s
sys 0m2.826s
not much different...

Creating copy of test file above within the Win2k8R2 guest:
Current Time: 20:12:32,41
copy testfile testfile2
1 file(s) copied
Current Time: 20:22:08,64
576,23s
That takes about 170 times longer than the copy under CentOS!

I've run the same test on a CentOS6.3 guest with the following results:
time cp testfile testfile2
real 0m3.950s
user 0m0.470s
sys 0m3.383s
that's almost as quick as the host...

I've run these tests a few times, always giving about the same result.

Why is the disk performance in the Win guest that poor?

What can be done to improve things?

Comments and ideas welcome!



Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Raghavendra K T

On 09/10/2012 10:42 PM, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:

+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;



Peter,

Should we also add a check if the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set.


Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.



+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}




@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,

bool preempt)

   rq = this_rq();

  again:
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p))
+ goto out_irq;
+


So add something like:

/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;




   p_rq = task_rq(p);
   double_rq_lock(rq, p_rq);




But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?



Yes, that is the intention (I mean checking p_rq->cfs->skip). Though we
may be overdoing this check in the scenario where multiple vcpus of the same
VM are pinned to the same CPU.

I am testing the above patch. Hope to be able to get back with the
results tomorrow.



Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> This implementation of a virtio balloon driver uses the page cache to
> >> "store" pages that have been released to the host.  The communication
> >> (outside of target counts) is one way--the guest notifies the host when
> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> (via the regular page reclaim).  This means that inflating the balloon
> >> is similar to the existing balloon mechanism, but the deflate is
> >> different--it re-uses existing Linux kernel functionality to
> >> automatically reclaim.
> >>
> >> Signed-off-by: Frank Swiderski 
> 
> Hi Michael,
> 
> I'm very sorry that Frank and I have been silent on these threads.
> I've been out of the office and Frank has been been swamped :)
> 
> I'll take a stab at answering some of your questions below, and
> hopefully we can end up on the same page.
> 
> > I've been trying to understand this, and I have
> > a question: what exactly is the benefit
> > of this new device?
> 
> The key difference between this device/driver and the pre-existing
> virtio_balloon device/driver is in how the memory pressure loop is
> controlled.
> 
> With the pre-existing balloon device/driver, the control loop for how
> much memory a given VM is allowed to use is controlled completely by
> the host.  This is probably fine if the goal is to pack as much work
> on a given host as possible, but it says nothing about the expected
> performance that any given VM is expecting to have.  Specifically, it
> allows the host to set a target goal for the size of a VM, and the
> driver in the guest does whatever is needed to get to that goal.  This
> is great for systems where one wants to "grow or shrink" a VM from the
> outside.
> 
> 
> This behaviour however doesn't match what applications actually expect
> from a memory control loop however.  In a native setup, an application
> can usually expect to allocate memory from the kernel on an as-needed
> basis, and can in turn return memory back to the system (using a heap
> implementation that actually releases memory that is).  The dynamic
> size of an application is completely controlled by the application,
> and there is very little that cluster management software can do to
> ensure that the application fits some prescribed size.
> 
> We recognized this in the development of our cluster management
> software long ago, so our systems are designed for managing tasks that
> have a dynamic memory footprint.  Overcommit is possible (as most
> applications do not use the full reservation of memory they asked for
> originally), letting us do things like schedule lower priority/lower
> service-classification work using resources that are otherwise
> available in stand-by for high-priority/low-latency workloads.

OK, I am not sure I got this right, so please tell me if this summary is
correct (note: this does not talk about what the guest does with memory, 
just what it is that the device does):

- existing balloon is told lower limit on target size by host and pulls in at 
least
  target size. Guest can inflate > target size if it likes
  and then it is OK to deflate back to target size but not less.
- your balloon is told upper limit on target size by host and pulls at most
  target size. Guest can deflate down to 0 at any point.

If so I think both approaches make sense and in fact they
can be useful at the same time for the same guest.
In that case, I see two ways how this can be done:

1. two devices: existing balloon + cache balloon
2. add "upper limit" to existing balloon

A single device looks a bit more natural in that we don't
really care in which balloon a page is as long as we
are between lower and upper limit. Right?
From an implementation POV we could have it use
pagecache for pages above lower limit but that
is a separate question about driver design,
I would like to make sure I understand the high
level design first.
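
For what it's worth, at the config-space level option 2 could be as small as one extra field. A purely hypothetical sketch (invented name, not the virtio spec):

#include <stdint.h>

/* Hypothetical only: the existing balloon config plus an upper bound,
 * roughly matching option 2 above. */
struct virtio_balloon_config_sketch {
    uint32_t num_pages;   /* existing: target the guest should inflate to (lower bound) */
    uint32_t actual;      /* existing: pages the guest currently holds                  */
    uint32_t upper_pages; /* new: guest never inflates past this, may deflate to zero   */
};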





> >
> > Note that users could not care less about how a driver
> > is implemented internally.
> >
> > Is there some workload where you see VM working better with
> > this than regular balloon? Any numbers?
> 
> This device is less about performance as it is about getting the
> memory size of a job (or in this case, a job in a VM) to grow and
> shrink as the application workload sees fit, much like how processes
> today can grow and shrink without external direction.

Still, e.g. swap in host achieves more or less the same functionality.
I am guessing balloon can work better by getting more cooperation
from guest but aren't there any tests showing this is true in practice?


> >
> > Also, can't we just replace existing balloon implementation
> > with this one?
> 
> Perhaps, but as described above, both devices have very diff

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > +static bool __yield_to_candidate(struct task_struct *curr, struct 
> > > task_struct *p)
> > > +{
> > > + if (!curr->sched_class->yield_to_task)
> > > + return false;
> > > +
> > > + if (curr->sched_class != p->sched_class)
> > > + return false;
> > 
> > 
> > Peter, 
> > 
> > Should we also add a check if the runq has a skip buddy (as pointed out
> > by Raghu) and return if the skip buddy is already set. 
> 
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
> 
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
> 
> > > 
> > > + if (task_running(p_rq, p) || p->state)
> > > + return false;
> > > +
> > > + return true;
> > > +} 
> 
> 
> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > bool preempt)
> > >   rq = this_rq();
> > >  
> > >  again:
> > > + /* optimistic test to avoid taking locks */
> > > + if (!__yield_to_candidate(curr, p))
> > > + goto out_irq;
> > > +
> 
> So add something like:
> 
>   /* Optimistic, if we 'raced' with another yield_to(), don't bother */
>   if (p_rq->cfs_rq->skip)
>   goto out_irq;
> > 
> > 
> > >   p_rq = task_rq(p);
> > >   double_rq_lock(rq, p_rq);
> > 
> > 
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> 
> That is, I'd like to see this thing explained a little better.
> 
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?

Here's two new versions, both include a __yield_to_candidate(): "v3"
uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.

Results:
> ple on:   2552 +/- .70%
> ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
> ple on: w/fixv2:  6115   (139% improvement)
   v3:  5735   (124% improvement)
   v4:  4524   (  3% regression)

Both patches included below

-Andrew

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p, struct rq *p_rq)
+{
+   if (!curr->sched_class->yield_to_task)
+   return false;
+
+   if (curr->sched_class != p->sched_class)
+   return false;
+
+   if (task_running(p_rq, p) || p->state)
+   return false;
+
+   return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
 
 again:
p_rq = task_rq(p);
+
+   /* optimistic test to avoid taking locks */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_irq;
+
+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
 
-   if (!curr->sched_class->yield_to_task)
-   goto out;
-
-   if (curr->sched_class != p->sched_class)
-   goto out;
-
-   if (task_running(p_rq, p) || p->state)
-   goto out;
+   /* validate state, holding p_rq ensures p's state cannot change */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_unlock;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4901,9 @@ again:
rq->skip_clock_update = 0;
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
if (yielded)




v4:


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candida

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Peter Zijlstra
On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:
> +   /*
> +* if the target task is not running, then only yield if the
> +* current task is in guest mode
> +*/
> +   if (!(p_rq->curr->flags & PF_VCPU))
> +   goto out_irq; 

This would make yield_to() only ever work on KVM, not that I mind this
too much, it's a horrid thing and making it less useful for (ab)use is a
good thing, still this probably wants mention somewhere :-)


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Rik van Riel

On 09/10/2012 04:19 PM, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 15:12 -0500, Andrew Theurer wrote:

+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;


This would make yield_to() only ever work on KVM, not that I mind this
too much, it's a horrid thing and making it less useful for (ab)use is a
good thing, still this probably wants mention somewhere :-)


Also, it would not preempt a non-kvm task, even if we need
to do that to boost a VCPU. I think the lines above should
be dropped.

--
All rights reversed


Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Mike Waychison
On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin  wrote:
> On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
>> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  wrote:
>> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
>> >> This implementation of a virtio balloon driver uses the page cache to
>> >> "store" pages that have been released to the host.  The communication
>> >> (outside of target counts) is one way--the guest notifies the host when
>> >> it adds a page to the page cache, allowing the host to madvise(2) with
>> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
>> >> (via the regular page reclaim).  This means that inflating the balloon
>> >> is similar to the existing balloon mechanism, but the deflate is
>> >> different--it re-uses existing Linux kernel functionality to
>> >> automatically reclaim.
>> >>
>> >> Signed-off-by: Frank Swiderski 
>>
>> Hi Michael,
>>
>> I'm very sorry that Frank and I have been silent on these threads.
>> I've been out of the office and Frank has been been swamped :)
>>
>> I'll take a stab at answering some of your questions below, and
>> hopefully we can end up on the same page.
>>
>> > I've been trying to understand this, and I have
>> > a question: what exactly is the benefit
> >> > of this new device?
>>
>> The key difference between this device/driver and the pre-existing
>> virtio_balloon device/driver is in how the memory pressure loop is
>> controlled.
>>
>> With the pre-existing balloon device/driver, the control loop for how
>> much memory a given VM is allowed to use is controlled completely by
>> the host.  This is probably fine if the goal is to pack as much work
>> on a given host as possible, but it says nothing about the expected
>> performance that any given VM is expecting to have.  Specifically, it
>> allows the host to set a target goal for the size of a VM, and the
>> driver in the guest does whatever is needed to get to that goal.  This
>> is great for systems where one wants to "grow or shrink" a VM from the
>> outside.
>>
>>
>> This behaviour however doesn't match what applications actually expect
>> from a memory control loop however.  In a native setup, an application
>> can usually expect to allocate memory from the kernel on an as-needed
>> basis, and can in turn return memory back to the system (using a heap
>> implementation that actually releases memory that is).  The dynamic
>> size of an application is completely controlled by the application,
>> and there is very little that cluster management software can do to
>> ensure that the application fits some prescribed size.
>>
>> We recognized this in the development of our cluster management
>> software long ago, so our systems are designed for managing tasks that
>> have a dynamic memory footprint.  Overcommit is possible (as most
>> applications do not use the full reservation of memory they asked for
>> originally), letting us do things like schedule lower priority/lower
>> service-classification work using resources that are otherwise
>> available in stand-by for high-priority/low-latency workloads.
>
> OK I am not sure I got this right so pls tell me if this summary is
> correct (note: this does not talk about what guest does with memory,
> ust what it is that device does):
>
> - existing balloon is told lower limit on target size by host and pulls in at 
> least
>   target size. Guest can inflate > target size if it likes
>   and then it is OK to deflate back to target size but not less.

Is this true?   I take it nothing is keeping the existing balloon
driver from going over the target, but the same can be said about
either balloon implementation.

> - your balloon is told upper limit on target size by host and pulls at most
>   target size. Guest can deflate down to 0 at any point.
>
> If so I think both approaches make sense and in fact they
> can be useful at the same time for the same guest.
> In that case, I see two ways how this can be done:
>
> 1. two devices: existing balloon + cache balloon
> 2. add "upper limit" to existing balloon
>
> A single device looks a bit more natural in that we don't
> really care in which balloon a page is as long as we
> are between lower and upper limit. Right?

I agree that this may be better done using a single device if possible.

> From implementation POV we could have it use
> pagecache for pages above lower limit but that
> is a separate question about driver design,
> I would like to make sure I understand the high
> level design first.

I agree that this is an implementation detail that is separate from
discussions of high and low limits.  That said, there are several
advantages to pushing these pages to the 

Re: [PATCH] Add a page cache-backed balloon device driver.

2012-09-10 Thread Michael S. Tsirkin
On Mon, Sep 10, 2012 at 04:49:40PM -0400, Mike Waychison wrote:
> On Mon, Sep 10, 2012 at 3:59 PM, Michael S. Tsirkin  wrote:
> > On Mon, Sep 10, 2012 at 01:37:06PM -0400, Mike Waychison wrote:
> >> On Mon, Sep 10, 2012 at 5:05 AM, Michael S. Tsirkin  
> >> wrote:
> >> > On Tue, Jun 26, 2012 at 01:32:58PM -0700, Frank Swiderski wrote:
> >> >> This implementation of a virtio balloon driver uses the page cache to
> >> >> "store" pages that have been released to the host.  The communication
> >> >> (outside of target counts) is one way--the guest notifies the host when
> >> >> it adds a page to the page cache, allowing the host to madvise(2) with
> >> >> MADV_DONTNEED.  Reclaim in the guest is therefore automatic and implicit
> >> >> (via the regular page reclaim).  This means that inflating the balloon
> >> >> is similar to the existing balloon mechanism, but the deflate is
> >> >> different--it re-uses existing Linux kernel functionality to
> >> >> automatically reclaim.
> >> >>
> >> >> Signed-off-by: Frank Swiderski 
> >>
> >> Hi Michael,
> >>
> >> I'm very sorry that Frank and I have been silent on these threads.
> >> I've been out of the office and Frank has been been swamped :)
> >>
> >> I'll take a stab at answering some of your questions below, and
> >> hopefully we can end up on the same page.
> >>
> >> > I've been trying to understand this, and I have
> >> > a question: what exactly is the benefit
> >> > of this new device?
> >>
> >> The key difference between this device/driver and the pre-existing
> >> virtio_balloon device/driver is in how the memory pressure loop is
> >> controlled.
> >>
> >> With the pre-existing balloon device/driver, the control loop for how
> >> much memory a given VM is allowed to use is controlled completely by
> >> the host.  This is probably fine if the goal is to pack as much work
> >> on a given host as possible, but it says nothing about the expected
> >> performance that any given VM is expecting to have.  Specifically, it
> >> allows the host to set a target goal for the size of a VM, and the
> >> driver in the guest does whatever is needed to get to that goal.  This
> >> is great for systems where one wants to "grow or shrink" a VM from the
> >> outside.
> >>
> >>
> >> This behaviour however doesn't match what applications actually expect
> >> from a memory control loop however.  In a native setup, an application
> >> can usually expect to allocate memory from the kernel on an as-needed
> >> basis, and can in turn return memory back to the system (using a heap
> >> implementation that actually releases memory that is).  The dynamic
> >> size of an application is completely controlled by the application,
> >> and there is very little that cluster management software can do to
> >> ensure that the application fits some prescribed size.
> >>
> >> We recognized this in the development of our cluster management
> >> software long ago, so our systems are designed for managing tasks that
> >> have a dynamic memory footprint.  Overcommit is possible (as most
> >> applications do not use the full reservation of memory they asked for
> >> originally), letting us do things like schedule lower priority/lower
> >> service-classification work using resources that are otherwise
> >> available in stand-by for high-priority/low-latency workloads.
> >
> > OK I am not sure I got this right so pls tell me if this summary is
> > correct (note: this does not talk about what guest does with memory,
> > ust what it is that device does):
> >
> > - existing balloon is told lower limit on target size by host and pulls in 
> > at least
> >   target size. Guest can inflate > target size if it likes
> >   and then it is OK to deflate back to target size but not less.
> 
> Is this true?   I take it nothing is keeping the existing balloon
> driver from going over the target, but the same can be said about
> either balloon implementation.
> 
> > - your balloon is told upper limit on target size by host and pulls at most
> >   target size. Guest can deflate down to 0 at any point.
> >
> > If so I think both approaches make sense and in fact they
> > can be useful at the same time for the same guest.
> > In that case, I see two ways how this can be done:
> >
> > 1. two devices: existing balloon + cache balloon
> > 2. add "upper limit" to existing balloon
> >
> > A single device looks a bit more natural in that we don't
> > really care in which balloon a page is as long as we
> > are between lower and upper limit. Right?
> 
> I agree that this may be better done using a single device if possible.

I am not sure myself, just asking.

> > From implementation POV we could have it use
> > pagecache for pages above lower limit but that
> > is a separate question about driver design,
> > I would like to make sure I understand the high

Re: [PATCH v6 12/12] KVM: indicate readonly access fault

2012-09-10 Thread Marcelo Tosatti
On Fri, Sep 07, 2012 at 05:56:39PM +0800, Xiao Guangrong wrote:
> On 09/06/2012 10:09 PM, Avi Kivity wrote:
> > On 08/22/2012 03:47 PM, Xiao Guangrong wrote:
> >> On 08/22/2012 08:06 PM, Avi Kivity wrote:
> >>> On 08/21/2012 06:03 AM, Xiao Guangrong wrote:
>  Introduce write_readonly_mem in mmio-exit-info to indicate this exit is
>  caused by write access on readonly memslot
> >>>
> >>> Please document this in chapter 5 of api.txt.
> >>>
> >>
> >> Okay, please review this one.
> >>
> >> Subject: [PATCH v6 12/12] KVM: indicate readonly access fault
> >>
> >> Introduce write_readonly_mem in mmio-exit-info to indicate this exit is
> >> caused by write access on readonly memslot
> >>
> > 
> > I'm not sure whether this indication can be trusted by userspace.  By
> > the time userspace gets to process this, the slot may no longer exist,
> > or it may be writable.
> 
> The case of deleting memslot is ok, because userspace just skips this fault
> if no readonly mem or no fault handler can be found.
> 
> Switching a memslot from readonly to writable sounds strange; I agree with you
> that this flag is untrustworthy in that case.
> 
> Marcelo, any comments?

The same can happen with slot deletion, for example. 

Userspace (which performed the modification that can result in faults
to a non-existent/read-only/.../new-tag memslot) must handle the faults 
properly, or avoid the possibility of references to memslot information 
from the past.

I think its worthwhile to add a note about this in the API
documentation: "The user of this interface is responsible for handling 
references to stale memslot information, either by handling
exit notifications which reference stale memslot information or not
allowing these notifications to exist by stopping all vcpus in userspace
before performing modifications to the memslots map".

> > (in the same way an mmio exit might actually hit RAM)
> 
> So, in userspace, to be safe, we should walk all memslots, not
> just the mmio handlers?
> 


Re: [PULL] KVM updates

2012-09-10 Thread Anthony Liguori
Avi Kivity  writes:

> Please pull from:
>
>   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master
>
> to merge some kvm updates, most notably a port of qemu-kvm's pre-vfio device
> assignment.  With this there are no significant changes left between qemu and
> qemu-kvm (though some work remains).

Pulled. Thanks.

I took a look at the difference; it appears to be some deprecated options
and CPU hotplug.  What's the plan there?

Is CPU hotplug functional in qemu-kvm?

Regards,

Anthony Liguori

>
> 
> Christian Borntraeger (1):
>   qemu: Use valgrind annotations to mark kvm guest memory as defined
>
> Jan Kiszka (5):
>   kvm: Clean up irqfd API
>   kvm: Introduce kvm_irqchip_update_msi_route
>   kvm: Introduce kvm_has_intx_set_mask
>   kvm: i386: Add services required for PCI device assignment
>   kvm: i386: Add classic PCI device assignment
>
>  configure  |3 +-
>  hw/kvm/Makefile.objs   |2 +-
>  hw/kvm/pci-assign.c| 1915 
> 
>  hw/qdev-monitor.c  |1 +
>  hw/virtio-pci.c|4 +-
>  kvm-all.c  |   75 +-
>  kvm-stub.c |   14 +-
>  kvm.h  |8 +-
>  target-i386/kvm.c  |  141 
>  target-i386/kvm_i386.h |   22 +
>  10 files changed, 2151 insertions(+), 34 deletions(-)
>  create mode 100644 hw/kvm/pci-assign.c
>
> -- 
> error compiling committee.c: too many arguments to function


Re: [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-10 Thread Maciej W. Rozycki
On Sun, 9 Sep 2012, Matthew Ogilvie wrote:

> This bug manifested itself when the guest was Microport UNIX
> System V/386 v2.1 (ca. 1987), because it would sometimes mask
> off IRQ14 in the slave IMR after it had already been asserted.
> The master would still try to deliver an interrupt even though
> IRQ2 had dropped again, resulting in a spurious interrupt
> (IRQ15) and a panicked UNIX kernel.

 That is quite weird actually -- from my experience the spurious vector is 
never sent from a slave (quite understandably -- since the interrupt is 
gone and no other is pending, the master has no reason to select a slave 
to supply a vector and therefore supplies the spurious vector itself) and 
therefore a spurious IRQ7 is always issued regardless of whether the 
discarded request came from a slave or from the master.

 Is there a bug elsewhere then too?  I would have expected a reasonable 
(and especially an old-school) x86 OS to be able to cope with spurious 
8259A interrupts, but then obviously one would expect them on IRQ7 only.
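
(For reference, the textbook way an OS copes with these is sketched below: on the spurious vector the handler reads the in-service register via OCW3 and, if the corresponding ISR bit is clear, drops the interrupt, EOI-ing only the master in the IRQ15 case. This is a generic illustration, not code from QEMU or from the guest OS discussed here.)

#include <stdint.h>

#define PIC1_CMD 0x20   /* master 8259A command port       */
#define PIC2_CMD 0xa0   /* slave 8259A command port        */
#define OCW3_ISR 0x0b   /* "read ISR on next read" command */
#define EOI      0x20

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

static inline uint8_t inb(uint16_t port)
{
    uint8_t v;
    __asm__ volatile ("inb %1, %0" : "=a"(v) : "Nd"(port));
    return v;
}

/* Returns 1 if the IRQ15 that just fired was spurious. */
static int irq15_is_spurious(void)
{
    outb(PIC2_CMD, OCW3_ISR);
    if (!(inb(PIC2_CMD) & 0x80)) {   /* slave ISR bit 7 is clear     */
        outb(PIC1_CMD, EOI);         /* the master's IRQ2 was real;  */
        return 1;                    /* the slave has nothing to EOI */
    }
    return 0;
}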

  Maciej


Re: problems using virt-manager

2012-09-10 Thread 王金浦
2012/9/11 Lentes, Bernd 
>
> Hi,
>
> i try to run virt-manager on a SLES 11 SP1 box. I'm using kernel 2.6.32.12
> and virt-manager 0.9.4-106.1.x86_64 .
> The system is a 64bit box.
>
> Here is the output:
> =
>
> pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
> # virt-manager &
> [1] 9659
>
> pc56846:/media/idg2/SysAdmin_AG_Wurst/software_und_treiber/virt_manager/sles_11_sp1
> # Traceback (most recent call last):
>   File "/usr/share/virt-manager/virt-manager.py", line 386, in 
> main()
>   File "/usr/share/virt-manager/virt-manager.py", line 247, in main
> from virtManager import cli
>   File "/usr/share/virt-manager/virtManager/cli.py", line 29, in 
> import libvirt
>   File "/usr/lib64/python2.6/site-packages/libvirt.py", line 25, in
> 
> raise lib_e
> ImportError: /usr/lib64/libvirt.so.0: undefined symbol:
> selinux_virtual_domain_context_path
>
> [1]+  Exit 1  virt-manager
> =
>
It seems libvirt raised an ImportError, so the libvirt list may be a good place
to ask; adding it to cc.

Jack


Re: [PATCH 3/3] support readonly memory feature in qemu

2012-09-10 Thread Kevin O'Connor
On Mon, Sep 10, 2012 at 11:25:38AM +0200, Jan Kiszka wrote:
> On 2012-09-09 17:45, Avi Kivity wrote:
> > On 09/07/2012 11:50 AM, Jan Kiszka wrote:
> >>
> >>> +} else {
> >>> +cpu_physical_memory_rw(run->mmio.phys_addr,
> >>> +   run->mmio.data,
> >>> +   run->mmio.len,
> >>> +   run->mmio.is_write);
> >>> +}
> >>> +
> >>>  ret = 0;
> >>>  break;
> >>>  case KVM_EXIT_IRQ_WINDOW_OPEN:
> >>>
> >>
> >> Great to see this feature for KVM finally! I'm just afraid that this
> >> will finally break good old isapc - due to broken Seabios. KVM used to
> >> "unbreak" it as it didn't respect write protections. ;)
> > 
> > Can you describe the breakage?
> 
> Try "qemu -machine isapc [-enable-kvm]". Seabios is writing to some
> read-only marked area. Don't recall where precisely.

On boot, QEMU marks the memory at 0xc0000-0x100000 as read-only.
SeaBIOS then makes the area read-write, performs its init, and then
makes portions of it read-only before launching the OS.

The registers SeaBIOS uses to make the memory read-write are on a PCI
device.  On isapc, this device is not reachable, and thus SeaBIOS
can't make the memory writable.
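
(On the default pc machine the registers in question are the i440FX PAM registers, config offsets 0x59-0x5f on the host bridge at 00:00.0. A from-memory sketch of roughly what "make it writable" boils down to -- illustrative only, not SeaBIOS source:)

#include <stdint.h>

static inline void outl(uint16_t port, uint32_t val)
{
    __asm__ volatile ("outl %0, %1" : : "a"(val), "Nd"(port));
}

static inline void outb(uint16_t port, uint8_t val)
{
    __asm__ volatile ("outb %0, %1" : : "a"(val), "Nd"(port));
}

/* Type-1 config write to bus 0, dev 0, fn 0 (the i440FX host bridge). */
static void i440fx_cfg_writeb(uint8_t offset, uint8_t val)
{
    outl(0xcf8, 0x80000000u | (offset & ~3u));
    outb(0xcfc + (offset & 3u), val);
}

static void make_bios_region_writable(void)
{
    /* PAM1..PAM6 cover 0xc0000-0xeffff, two 16KB chunks per register;
     * 0x3 in a nibble means reads and writes both go to RAM. */
    for (uint8_t pam = 0x5a; pam <= 0x5f; pam++)
        i440fx_cfg_writeb(pam, 0x33);
    /* PAM0's high nibble covers the 0xf0000-0xfffff BIOS segment. */
    i440fx_cfg_writeb(0x59, 0x30);
}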

The easiest way to fix this is to change QEMU to boot with the area
read-write.  There's no real gain in booting with the memory read-only
as the first thing SeaBIOS does is make it read-write.

-Kevin


Re: [PATCH 0/5] vhost-scsi: Add support for host virtualized target

2012-09-10 Thread Asias He
Hello Nicholas,

On 09/07/2012 02:48 PM, Nicholas A. Bellinger wrote:
> From: Nicholas Bellinger 
> 
> Hello Anthony & Co,
> 
> This is the fourth installment to add host virtualized target support for
> the mainline tcm_vhost fabric driver using Linux v3.6-rc into QEMU 1.3.0-rc.
> 
> The series is available directly from the following git branch:
> 
>git://git.kernel.org/pub/scm/virt/kvm/nab/qemu-kvm.git vhost-scsi-for-1.3
> 
> Note the code is cut against yesterday's QEMU head, and despite the name
> of the tree is based upon mainline qemu.org git code + has thus far been
> running overnight with > 100K IOPs small block 4k workloads using v3.6-rc2+
> based target code with RAMDISK_DR backstores.

Are you still seeing the performance degradation discussed in the thread

 "vhost-scsi port to v1.1.0 + MSI-X performance regression"

?

> Other than some minor fuzz between jumping from QEMU 1.2.0 -> 1.2.50, this
> series is functionally identical to what's been posted for vhost-scsi RFC-v3
> to qemu-devel.
> 
> Please consider applying these patches for an initial vhost-scsi merge into
> QEMU 1.3.0-rc code, or let us know what else you'd like to see addressed for
> this series to in order to merge.
> 
> Thank you!
> 
> --nab
> 
> Nicholas Bellinger (2):
>   monitor: Rename+move net_handle_fd_param -> monitor_handle_fd_param
>   virtio-scsi: Set max_target=0 during vhost-scsi operation
> 
> Stefan Hajnoczi (3):
>   vhost: Pass device path to vhost_dev_init()
>   vhost-scsi: add -vhost-scsi host device for use with tcm-vhost
>   virtio-scsi: Add start/stop functionality for vhost-scsi
> 
>  configure|   10 +++
>  hw/Makefile.objs |1 +
>  hw/qdev-properties.c |   41 +++
>  hw/vhost-scsi.c  |  190 
> ++
>  hw/vhost-scsi.h  |   62 
>  hw/vhost.c   |5 +-
>  hw/vhost.h   |3 +-
>  hw/vhost_net.c   |2 +-
>  hw/virtio-pci.c  |2 +
>  hw/virtio-scsi.c |   55 ++-
>  hw/virtio-scsi.h |1 +
>  monitor.c|   18 +
>  monitor.h|1 +
>  net.c|   18 -
>  net.h|2 -
>  net/socket.c |2 +-
>  net/tap.c|4 +-
>  qemu-common.h|1 +
>  qemu-config.c|   19 +
>  qemu-options.hx  |4 +
>  vl.c |   18 +
>  21 files changed, 431 insertions(+), 28 deletions(-)
>  create mode 100644 hw/vhost-scsi.c
>  create mode 100644 hw/vhost-scsi.h
> 


-- 
Asias


Re: [PATCH 1/2] KVM: fix i8259 interrupt high to low transition logic

2012-09-10 Thread Matthew Ogilvie
On Tue, Sep 11, 2012 at 01:49:51AM +0100, Maciej W. Rozycki wrote:
> On Sun, 9 Sep 2012, Matthew Ogilvie wrote:
> 
> > This bug manifested itself when the guest was Microport UNIX
> > System V/386 v2.1 (ca. 1987), because it would sometimes mask
> > off IRQ14 in the slave IMR after it had already been asserted.
> > The master would still try to deliver an interrupt even though
> > IRQ2 had dropped again, resulting in a spurious interrupt
> > (IRQ15) and a panicked UNIX kernel.
> 
>  That is quite weird actually -- from my experience the spurious vector is 
> never sent from a slave (quite understandably -- since the interrupt is 
> gone and no other is pending, the master has no reason to select a slave 
> to supply a vector and therefore supplies the spurious vector itself) and 
> therefore a spurious IRQ7 is always issued regardless of whether the 
> discarded request came from a slave or from the master.

Keep in mind that this paragraph is describing QEMU's 8259 device
model behavior (and also KVM's), not real hardware.  Reading the
unpatched code, the master clearly latches on to the momentary IRQ2,
does not cancel it when it is cleared again, and ultimately delivers
a spurious IRQ15.

As for what the OS is doing with the IRQ15 (or IRQ7), I only have a large
disassembly listing (with only a vague idea of its overall interrupt
handling strategy), and some printf logs of stuff happening in the
8259 model when the OS is running (more useful).

> 
>  Is there a bug elsewhere then too?  I would have expected a reasonable 
> (and especially an old-school) x86 OS to be able to cope with spurious 
> 8259A interrupts, but then obviously one would expect them on IRQ7 only.
> 
>   Maciej


Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Raghavendra K T

On 09/11/2012 01:42 AM, Andrew Theurer wrote:

On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:

On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:

+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p)
+{
+ if (!curr->sched_class->yield_to_task)
+ return false;
+
+ if (curr->sched_class != p->sched_class)
+ return false;



Peter,

Should we also add a check if the runq has a skip buddy (as pointed out
by Raghu) and return if the skip buddy is already set.


Oh right, I missed that suggestion.. the performance improvement went
from 81% to 139% using this, right?

It might make more sense to keep that separate, outside of this
function, since its not a strict prerequisite.



+ if (task_running(p_rq, p) || p->state)
+ return false;
+
+ return true;
+}




@@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,

bool preempt)

   rq = this_rq();

  again:
+ /* optimistic test to avoid taking locks */
+ if (!__yield_to_candidate(curr, p))
+ goto out_irq;
+


So add something like:

/* Optimistic, if we 'raced' with another yield_to(), don't bother */
if (p_rq->cfs_rq->skip)
goto out_irq;




   p_rq = task_rq(p);
   double_rq_lock(rq, p_rq);




But I do have a question on this optimization though,.. Why do we check
p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?

That is, I'd like to see this thing explained a little better.

Does it go something like: p_rq is the runqueue of the task we'd like to
yield to, rq is our own, they might be the same. If we have a ->skip,
there's nothing we can do about it, OTOH p_rq having a ->skip and
failing the yield_to() simply means us picking the next VCPU thread,
which might be running on an entirely different cpu (rq) and could
succeed?


Here's two new versions, both include a __yield_to_candidate(): "v3"
uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.



Andrew, yes, that is what I had. I think there was a misunderstanding. 
My intention was: if a directed yield already happened in a runqueue 
(say rqA), do not bother to directed-yield to it again. But unfortunately, as 
PeterZ pointed out, that would have resulted in setting the next buddy of a 
different run queue than rqA.
So we can drop this "skip" idea. Pondering more over what to do... can we 
use the next buddy itself? Thinking..




RE: [PATCH v2] kvm/fpu: Enable fully eager restore kvm FPU

2012-09-10 Thread Hao, Xudong
> -Original Message-
> From: Avi Kivity [mailto:a...@redhat.com]
> Sent: Monday, September 10, 2012 4:07 PM
> To: Hao, Xudong
> Cc: kvm@vger.kernel.org; Zhang, Xiantao; joerg.roe...@amd.com
> Subject: Re: [PATCH v2] kvm/fpu: Enable fully eager restore kvm FPU
> 
> >
> > Avi, I'm not sure if I fully understand of you. Do you mean enter guest 
> > with a
> fpu_active=0 and then fpu does not restore?
> 
> Yes.
> 
> > If so, I will add fpu_active=1 in the no-lazy case.
> >
> > -   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> > +   if (kvm_read_cr4_bits(vcpu, X86_CR4_OSXSAVE) &&
> > +   (vcpu->arch.xcr0 & ~((u64)KVM_XSTATE_LAZY))) {
> > +   kvm_x86_ops->fpu_activate(vcpu);
> > +   vcpu->fpu_active=1;
> > +   }
> > +   else
> > +   kvm_make_request(KVM_REQ_DEACTIVATE_FPU, vcpu);
> >
> 
> It doesn't help here.
> 
>   1 guest boot
>   2 kvm_userspace_exit (deactivates fpu)
>   3 XSETBV exit that sets xcr0.new_bit
>   4 kvm_enter
> 
> There is no call to kvm_put_guest_fpu() between 3 and 4, you need
> something in __kvm_set_xcr() to activate the fpu.
> 

Yes, that's the code path when xsave is enabled in the guest. I'll add fpu activation there and 
keep the v2 patch change in kvm_put_guest_fpu().

@@ -554,6 +554,8 @@ int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
xcr0 = xcr;
if (kvm_x86_ops->get_cpl(vcpu) != 0)
return 1;
+   if (xcr0 & ~((u64)KVM_XSTATE_LAZY))
+   kvm_x86_ops->fpu_activate(vcpu);
if (!(xcr0 & XSTATE_FP))
return 1;
if ((xcr0 & XSTATE_YMM) && !(xcr0 & XSTATE_SSE))

> Note you also need to consider writes to xcr0 and cr4 that happen in the
> reverse order due to live migration.
> 

I'm confused by this: isn't cr4 set first and then xcr0?
Do you mean current live migration uses the reverse order, or that it would be a reverse 
order with my eager restore patch?
