Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING
> Il 17/01/2014 09:29, Christian Borntraeger ha scritto:
>> Michael, do you have a quick way to check if srcu has a noticeable
>> impact on interrupt injection on your systems? I am happy with either
>> v2 or v3 of the patch, but synchronize_srcu_expedited() seems to have
>> less latency impact on the full system than
>> synchronize_rcu_expedited(). This might give Paolo a hint which of
>> the patches is the right way to go.
>
> Hi all,
>
> I've asked Andrew Theurer to run network tests on a 10G connection
> (TCP request/response to check for performance, TCP streaming for
> host CPU utilization).

I am hoping to have some results some time tomorrow (Friday).

-Andrew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote:
> On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> > On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> > > On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> > > > On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> > > > > This series replaces the existing paravirtualized spinlock
> > > > > mechanism with a paravirtualized ticketlock mechanism. The
> > > > > series provides implementation for both Xen and KVM.
> > > > >
> > > > > Changes in V9:
> > > > > - Changed spin_threshold to 32k to avoid excess halt exits
> > > > >   that are causing undercommit degradation (after PLE handler
> > > > >   improvement).
> > > > > - Added kvm_irq_delivery_to_apic (suggested by Gleb)
> > > > > - Optimized halt exit path to use PLE handler
> > > > >
> > > > > V8 of PVspinlock was posted last year. After Avi's suggestions
> > > > > to look at PLE handler's improvements, various optimizations
> > > > > in PLE handling have been tried.
> > > >
> > > > Sorry for not posting this sooner. I have tested the v9
> > > > pv-ticketlock patches in 1x and 2x over-commit with 10-vcpu and
> > > > 20-vcpu VMs. I have tested these patches with and without PLE,
> > > > as PLE is still not scalable with large VMs.
> > >
> > > Hi Andrew,
> > >
> > > Thanks for testing.
> > >
> > > > System: x3850X5, 40 cores, 80 threads
> > > >
> > > > 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> > > > ----------------------------------------------------------
> > > > Configuration          Total Throughput (MB/s)  Notes
> > > > 3.10-default-ple_on    22945   5% CPU in host kernel, 2% spin_lock in guests
> > > > 3.10-default-ple_off   23184   5% CPU in host kernel, 2% spin_lock in guests
> > > > 3.10-pvticket-ple_on   22895   5% CPU in host kernel, 2% spin_lock in guests
> > > > 3.10-pvticket-ple_off  23051   5% CPU in host kernel, 2% spin_lock in guests
> > > > [all 1x results look good here]
> > >
> > > Yes. The 1x results look too close.
> > >
> > > > 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> > > > -----------------------------------------------------------
> > > > Configuration          Total Throughput (MB/s)  Notes
> > > > 3.10-default-ple_on     6287  55% CPU in host kernel, 17% spin_lock in guests
> > > > 3.10-default-ple_off    1849   2% CPU in host kernel, 95% spin_lock in guests
> > > > 3.10-pvticket-ple_on    6691  50% CPU in host kernel, 15% spin_lock in guests
> > > > 3.10-pvticket-ple_off  16464   8% CPU in host kernel, 33% spin_lock in guests
> > >
> > > I see 6.426% improvement with ple_on and 161.87% improvement with
> > > ple_off. I think this is a very good sign for the patches.
> > >
> > > > [PLE hinders pv-ticket improvements, but even with PLE off,
> > > > we are still off from ideal throughput (somewhere >2)]
> > >
> > > Okay. The ideal throughput you are referring to is getting at
> > > least 80% of 1x throughput for over-commit. Yes, we are still far
> > > away from there.
> > >
> > > > 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
> > > > ----------------------------------------------------------
> > > > Configuration          Total Throughput (MB/s)  Notes
> > > > 3.10-default-ple_on    22736   6% CPU in host kernel, 3% spin_lock in guests
> > > > 3.10-default-ple_off
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism. The series provides
> implementation for both Xen and KVM.
>
> Changes in V9:
> - Changed spin_threshold to 32k to avoid excess halt exits that are
>   causing undercommit degradation (after PLE handler improvement).
> - Added kvm_irq_delivery_to_apic (suggested by Gleb)
> - Optimized halt exit path to use PLE handler
>
> V8 of PVspinlock was posted last year. After Avi's suggestions to
> look at PLE handler's improvements, various optimizations in PLE
> handling have been tried.

Sorry for not posting this sooner. I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have
tested these patches with and without PLE, as PLE is still not
scalable with large VMs.

System: x3850X5, 40 cores, 80 threads

1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
----------------------------------------------------------
Configuration          Total Throughput (MB/s)  Notes
3.10-default-ple_on    22945   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off   23184   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on   22895   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off  23051   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]

2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
-----------------------------------------------------------
Configuration          Total Throughput (MB/s)  Notes
3.10-default-ple_on     6287  55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off    1849   2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on    6691  50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off  16464   8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off, we are
still off from ideal throughput (somewhere >2)]

1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
----------------------------------------------------------
Configuration          Total Throughput (MB/s)  Notes
3.10-default-ple_on    22736   6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off   23377   5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on   22471   6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off  23445   5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]

2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
----------------------------------------------------------
Configuration          Total Throughput (MB/s)  Notes
3.10-default-ple_on     1965  70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off     226   2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on    1942  70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off   8003  11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off is the best so far.
Still quite a bit off from ideal throughput]

In summary, I would state that the pv-ticket is an overall win, but
the current PLE handler tends to "get in the way" on these larger
guests.

-Andrew
Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
ment in 4x
>
> ebizzy (records/sec), higher is better
> +-----------+----------+-----------+----------+--------------+
> |   base    |  stdev   |  patched  |  stdev   | %improvement |
> +-----------+----------+-----------+----------+--------------+
> | 5574.9000 | 237.4997 |  523.7000 |   1.4181 |   -90.60611  |
> | 2741.5000 | 561.3090 |  597.8000 |  34.9755 |   -78.19442  |
> | 2146.2500 | 216.7718 |  902.6667 |  82.4228 |   -57.94215  |
> | 1663.0000 | 141.9235 | 1245.0000 |  67.2989 |   -25.13530  |
> +-----------+----------+-----------+----------+--------------+
>
> dbench (Throughput), higher is better
> +------------+----------+-----------+----------+--------------+
> |    base    |  stdev   |  patched  |  stdev   | %improvement |
> +------------+----------+-----------+----------+--------------+
> | 14111.5600 | 754.4525 |  884.9051 |  24.4723 |   -93.72922  |
> |  2481.6270 |  71.2665 | 2383.5700 | 333.2435 |    -3.95132  |
> |  1510.2483 |  31.8634 | 1477.7358 |  50.5126 |    -2.15279  |
> |  1029.4875 |  16.9166 | 1075.9225 |  13.9911 |     4.51050  |
> +------------+----------+-----------+----------+--------------+
>
> IMO hash based timeout is worth a try further.
> I think a little more tuning will get better results.

The problem I see (especially for dbench) is that we are still way off
what I would consider the goal. IMO, the 2x over-commit result should
be only a bit lower than 50% of the 1x result (to account for
switching overhead and less cache warmth). We are at about 17.5% for
2x. I am thinking we need a completely different approach to get
there, but of course I do not know what that is yet :) I am testing
your patches now, and hopefully with some analysis data we can better
understand what's going on.

> Jiannan, when you start working on this, I can also help to get the
> best out of the preemptable lock idea if you wish, and share the
> patches I tried.

-Andrew Theurer
Re: Preemptable Ticket Spinlock
3532] [] vfs_stat+0x16/0x20
> [ 2144.673534] [] sys_newstat+0x1f/0x50
> [ 2144.673538] [] ? __audit_syscall_exit+0x246/0x2f0
> [ 2144.673541] [] ? __audit_syscall_entry+0x8c/0xf0
> [ 2144.673543] [] system_call_fastpath+0x16/0x1b

This is on a 40-core / 80-thread Westmere-EX with 16 VMs, each VM
having 20 vCPUs (so 4x over-commit). All VMs run dbench in tmpfs,
which is a pretty good test for spinlock preemption problems. I had
PLE enabled for the test.

When you re-base your patches I will try it again.

Thanks,
-Andrew Theurer
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
> > > > > }
> > > > >
> > > > > -out:
> > > > > +out_unlock:
> > > > > 	double_rq_unlock(rq, p_rq);
> > > > > +out_irq:
> > > > > 	local_irq_restore(flags);
> > > > >
> > > > > -	if (yielded)
> > > > > +	if (yielded > 0)
> > > > > 		schedule();
> > > > >
> > > > > 	return yielded;
> > > >
> > > > Acked-by: Andrew Jones
> > >
> > > Thank you Drew.
> > >
> > > Marcelo, Gleb: please let me know if you have comments / concerns
> > > on the patches.
> > >
> > > Andrew, Vinod: IMO, the patch set looks good for undercommit
> > > scenarios, especially for large guests where we do have the
> > > overhead of vcpu iteration in the PLE handler.
> >
> > Thanks Raghu. Will try to get this latest patch set evaluated and
> > get back to you.
>
> Hi Raghu,
>
> Here is some preliminary data with your latest set of PLE patches
> (and also with Andrew's throttled yield_to() change).
>
> Ran a single guest on an 80-core Westmere platform. [Note: host and
> guest had the latest kernel from kvm.git, and also the latest qemu
> from qemu.git as of yesterday morning.]
>
> The guest was running an AIM7 high_systime workload. (Note:
> high_systime is a kernel-intensive micro-benchmark, but in this case
> it was run just as a workload in the guest to trigger spinlock etc.
> contention in the guest OS and hence PLE; i.e. this is not a real
> benchmark run.) I have run this workload with a constant # (i.e.
> 2000) of users with 100 jobs per user. The numbers below represent
> the # of jobs per minute (JPM); higher is better.
>
>                                40 vCPU   60 vCPU   80 vCPU
> a) 3.7.0-rc6+ w/ ple_gap=0      ~102K     ~88K      ~81K
> b) 3.7.0-rc6+                    ~53K     ~25K      ~18-20K
> c) 3.7.0-rc6+ w/ PLE patches    ~100K     ~81K      ~48K-69K  <- lot of
>    variation from run to run
> d) 3.7.0-rc6+ w/ throttled      ~101K     ~87K      ~78K
>    yield_to() change

FYI, here's the latest throttled yield_to() patch (the one Vinod
tested).
Signed-off-by: Andrew Theurer

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ecc5543..61d12ea 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,6 +192,7 @@ struct kvm_vcpu {
 	int mode;
 	unsigned long requests;
 	unsigned long guest_debug;
+	unsigned long last_yield_to;
 
 	struct mutex mutex;
 	struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..987a339 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -222,6 +223,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 	vcpu->kvm = kvm;
 	vcpu->vcpu_id = id;
 	vcpu->pid = NULL;
+	vcpu->last_yield_to = 0;
 	init_waitqueue_head(&vcpu->wq);
 	kvm_async_pf_vcpu_init(vcpu);
@@ -1708,29 +1710,38 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
+	 * A yield_to() can be quite expensive, so we try to limit
+	 * its use to just 1 per jiffie.
+	 */
+	if (me->last_yield_to == jiffies)
+		yield();
+	else {
+		/*
 	 * We boost the priority of a VCPU that is runnable but not
 	 * currently running, because it got preempted by something
 	 * else and called schedule in __vcpu_run.  Hopefully that
 	 * VCPU is holding the lock that we need and will release it.
 	 * We approximate round-robin by starting at the last boosted VCPU.
 	 */
-	for (pass = 0; pass < 2 && !yielded; pass++) {
-		kvm_for_each_vcpu(i, vcpu, kvm) {
-			if (!pass && i <= last_boosted_vcpu) {
-				i = last_boosted_vcpu;
-				continue;
-			} else if (pass && i > last_boosted_vcpu)
-				break;
-			if (vcpu == me)
-				continue;
-			if (waitqueue_active(&vcpu->wq))
-				continue;
-			if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
-				continue;
-			if (kvm_vcpu_yield_to(vcpu)) {
-				kvm->last_boosted_vcpu = i;
-				yielded = 1;
-				break;
+		for (pass = 0; pass < 2 && !yielded; pass++) {
+			kvm_for_each_vcpu(i, vcpu, kvm) {
+				if (!pass && i <= last_boosted_vcpu) {
+					i = last_boosted_vcpu;
+					continue;
+				} else if (pass && i > last_boosted_vcpu)
+					break;
+				if (vcpu == me)
+					continue;
+				if (waitqueue_active(&vcpu->wq))
+					continue;
+				if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+					continue;
+				if (kvm_vcpu_yield_to(vcpu)) {
+					kvm->last_boosted_vcpu = i;
+					me->last_yield_to = jiffies;
+					yielded = 1;
+					break;
+				}
 			}
 		}
 	}
Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task
On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:
> On 11/26/2012 07:05 PM, Andrew Jones wrote:
> > On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
> >> From: Peter Zijlstra
> >>
> >> In case of undercommitted scenarios, especially in large guests,
> >> yield_to overhead is significantly high. When the run queue length
> >> of source and target is one, take the opportunity to bail out and
> >> return -ESRCH. This return condition can be further exploited to
> >> quickly come out of the PLE handler.
> >>
> >> (History: Raghavendra initially worked on breaking out of the kvm
> >> ple handler upon seeing source runqueue length = 1, but it had to
> >> export rq length). Peter came up with the elegant idea of
> >> returning -ESRCH in the scheduler core.
> >>
> >> Signed-off-by: Peter Zijlstra
> >> Raghavendra, Checking the rq length of target vcpu condition
> >> added. (thanks Avi)
> >> Reviewed-by: Srikar Dronamraju
> >> Signed-off-by: Raghavendra K T
> >> ---
> >>
> >>  kernel/sched/core.c | 25 +++--
> >>  1 file changed, 19 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8927f..fc219a5 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >>   * It's the caller's job to ensure that the target task struct
> >>   * can't go away on us before we can do any checks.
> >>   *
> >> - * Returns true if we indeed boosted the target task.
> >> + * Returns:
> >> + *	true (>0) if we indeed boosted the target task.
> >> + *	false (0) if we failed to boost the target.
> >> + *	-ESRCH if there's no task to yield to.
> >>   */
> >>  bool __sched yield_to(struct task_struct *p, bool preempt)
> >>  {
> >> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
> >>
> >>  again:
> >>  	p_rq = task_rq(p);
> >> +	/*
> >> +	 * If we're the only runnable task on the rq and target rq also
> >> +	 * has only one task, there's absolutely no point in yielding.
> >> +	 */
> >> +	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> >> +		yielded = -ESRCH;
> >> +		goto out_irq;
> >> +	}
> >> +
> >>  	double_rq_lock(rq, p_rq);
> >>  	while (task_rq(p) != p_rq) {
> >>  		double_rq_unlock(rq, p_rq);
> >> @@ -4310,13 +4322,13 @@ again:
> >>  	}
> >>
> >>  	if (!curr->sched_class->yield_to_task)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>  	if (curr->sched_class != p->sched_class)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>  	if (task_running(p_rq, p) || p->state)
> >> -		goto out;
> >> +		goto out_unlock;
> >>
> >>  	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >>  	if (yielded) {
> >> @@ -4329,11 +4341,12 @@
> >>  		resched_task(p_rq->curr);
> >>  	}
> >>
> >> -out:
> >> +out_unlock:
> >>  	double_rq_unlock(rq, p_rq);
> >> +out_irq:
> >>  	local_irq_restore(flags);
> >>
> >> -	if (yielded)
> >> +	if (yielded > 0)
> >>  		schedule();
> >>
> >>  	return yielded;
> >>
> >
> > Acked-by: Andrew Jones
>
> Thank you Drew.
>
> Marcelo, Gleb: please let me know if you have comments / concerns on
> the patches.
>
> Andrew, Vinod: IMO, the patch set looks good for undercommit
> scenarios, especially for large guests where we do have the overhead
> of vcpu iteration in the PLE handler.

I agree, looks fine for undercommit scenarios. I do wonder what
happens with 1.5x overcommit, where we might see 1/2 the host cpus
with a runqueue length of 2 and 1/2 of the host cpus with a runqueue
length of 1. Even with this change that scenario still might be fine,
but it would be nice to see a comparison.

-Andrew
Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios
Handle yield_to failure return for potential undercommit case
> Check system load and handle different commit cases accordingly
>
> Please let me know your comments and suggestions.
>
> Link for V1:
> https://lkml.org/lkml/2012/9/21/168
>
>  kernel/sched/core.c | 25 +++--
>  virt/kvm/kvm_main.c | 56 ++--
>  2 files changed, 65 insertions(+), 16 deletions(-)

-Andrew Theurer
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is
> >>>>> that it has just terrible scalability to begin with. I do not
> >>>>> think we should try to optimize such a bad workload.
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to
> >>>> ebizzy. Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on.
> >>> This eliminates all IO. I use a 300 second run time and the
> >>> number of threads is equal to the number of vcpus. All of the VMs
> >>> of course need to have a synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench,
> >>> where the dcache scalability is much improved. Without any
> >>> lock-holder preemption, the time in spin_lock should be very low:
> >>>
> >>>> 21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
> >>>>  3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
> >>>>  2.81%  10176  dbench  dbench             [.] child_run
> >>>>  2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
> >>>>  2.33%   8423  dbench  dbench             [.] next_token
> >>>>  2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
> >>>>  1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
> >>>>  1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
> >>>>  1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
> >>>>  1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
> >>>>  1.38%   5009  dbench  libc-2.12.so       [.] memmove
> >>>>  1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
> >>>>  1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
> >>
> >> Hi Andrew,
> >> I ran the test with dbench with tmpfs. I do not see any
> >> improvements in dbench for a 16k ple window.
> >>
> >> So it seems apart from ebizzy no workload benefited by that, and I
> >> agree that it may not be good to optimize for ebizzy. I shall drop
> >> changing to a 16k default window and continue with the other
> >> original patch series. Need to experiment with the latest kernel.
> >
> > Thanks for running this again. I do believe there are some
> > workloads, when run at 1x overcommit, that would benefit from a
> > larger ple_window [with the current ple handling code], but I do
> > not also want to potentially degrade >1x with a larger window. I
> > do, however, think there may be another option. I have not fully
> > worked this out, but I think I am on to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to().
> > My motivation was that yield_to() [for large VMs] is like a dog
> > chasing its tail, round and round we go.... Just yield(), in
> > particular a yield() which results in yielding to something -other-
> > than the current VM's vcpus, helps synchronize the execution of
> > sibling vcpus by deferring them until the lock holder vcpu is
> > running again. The more we can do to get all vcpus running at the
> > same time, the far less we deal with the preemption problem. The
> > othe
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>> * Avi Kivity [2012-10-04 17:00:28]:
> >>>>
> >>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>
> [...]
> >>> A big concern I have (if this is 1x overcommit) for ebizzy is
> >>> that it has just terrible scalability to begin with. I do not
> >>> think we should try to optimize such a bad workload.
> >>
> >> I think my way of running dbench has some flaw, so I went to
> >> ebizzy. Could you let me know how you generally run dbench?
> >
> > I mount a tmpfs and then specify that mount for dbench to run on.
> > This eliminates all IO. I use a 300 second run time and the number
> > of threads is equal to the number of vcpus. All of the VMs of
> > course need to have a synchronized start.
> >
> > I would also make sure you are using a recent kernel for dbench,
> > where the dcache scalability is much improved. Without any
> > lock-holder preemption, the time in spin_lock should be very low:
> >
> >> 21.54%  78016  dbench  [kernel.kallsyms]  [k] copy_user_generic_unrolled
> >>  3.51%  12723  dbench  libc-2.12.so       [.] __strchr_sse42
> >>  2.81%  10176  dbench  dbench             [.] child_run
> >>  2.54%   9203  dbench  [kernel.kallsyms]  [k] _raw_spin_lock
> >>  2.33%   8423  dbench  dbench             [.] next_token
> >>  2.02%   7335  dbench  [kernel.kallsyms]  [k] __d_lookup_rcu
> >>  1.89%   6850  dbench  libc-2.12.so       [.] __strstr_sse42
> >>  1.53%   5537  dbench  libc-2.12.so       [.] __memset_sse2
> >>  1.47%   5337  dbench  [kernel.kallsyms]  [k] link_path_walk
> >>  1.40%   5084  dbench  [kernel.kallsyms]  [k] kmem_cache_alloc
> >>  1.38%   5009  dbench  libc-2.12.so       [.] memmove
> >>  1.24%   4496  dbench  libc-2.12.so       [.] vfprintf
> >>  1.15%   4169  dbench  [kernel.kallsyms]  [k] __audit_syscall_exit
>
> Hi Andrew,
> I ran the test with dbench with tmpfs. I do not see any improvements
> in dbench for a 16k ple window.
>
> So it seems apart from ebizzy no workload benefited by that, and I
> agree that it may not be good to optimize for ebizzy. I shall drop
> changing to a 16k default window and continue with the other original
> patch series. Need to experiment with the latest kernel.

Thanks for running this again. I do believe there are some workloads,
when run at 1x overcommit, that would benefit from a larger ple_window
[with the current ple handling code], but I do not also want to
potentially degrade >1x with a larger window. I do, however, think
there may be another option. I have not fully worked this out, but I
think I am on to something.

I decided to revert back to just a yield() instead of a yield_to().
My motivation was that yield_to() [for large VMs] is like a dog
chasing its tail, round and round we go.... Just yield(), in
particular a yield() which results in yielding to something -other-
than the current VM's vcpus, helps synchronize the execution of
sibling vcpus by deferring them until the lock holder vcpu is running
again. The more we can do to get all vcpus running at the same time,
the far less we deal with the preemption problem. The other benefit
is that yield() is far, far lower overhead than yield_to().

This does assume that vcpus from the same VM do not share runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger
VMs in the same way that yield_to() is not. My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue. I do believe we could implement an initial
placement and load balance policy to strive for this restriction
(making it purely optional, but I bet it could also help user apps
which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to(). The problem with the unneeded exits in this
case has been the overhead in r
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >> * Avi Kivity [2012-10-04 17:00:28]:
> >>
> >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>
> >>>>> Again the numbers are ridiculously high for
> >>>>> arch_local_irq_restore. Maybe there's a bad perf/kvm
> >>>>> interaction when we're injecting an interrupt, I can't believe
> >>>>> we're spending 84% of the time running the popf instruction.
> >>>>
> >>>> Smells like a software fallback that doesn't do NMI, hrtimer
> >>>> based sampling typically hits popf where we re-enable
> >>>> interrupts.
> >>>
> >>> Good nose, that's probably it. Raghavendra, can you ensure that
> >>> the PMU is properly exposed? 'dmesg' in the guest will tell. If
> >>> it isn't, -cpu host will expose it (and a good idea anyway to get
> >>> best performance).
> >>
> >> Hi Avi, you are right. The SandyBridge machine result was not
> >> proper. I cleaned up the services, enabled the PMU, and re-ran all
> >> the tests again.
> >>
> >> Here is the summary:
> >> We do get a good benefit by increasing the ple window. Though we
> >> don't see a good benefit for kernbench and sysbench, for ebizzy we
> >> get a huge improvement for the 1x scenario (almost 2/3rd of the
> >> ple disabled case).
> >>
> >> Let me know if you think we can increase the default ple_window
> >> itself to 16k.
> >>
> >> I am experimenting with the V2 version of the undercommit
> >> improvement (this) patch series, but I think if you wish to go for
> >> an increase of the default ple_window, then we would have to
> >> measure the benefit of the patches when ple_window = 16k.
> >>
> >> I can respin the whole series including this default ple_window
> >> change.
> >>
> >> I also have the perf kvm top result for both ebizzy and kernbench.
> >> I think they are in expected lines now.
> >>
> >> Improvements
> >> ============
> >>
> >> 16 core PLE machine with 16 vcpu guest
> >>
> >> base = 3.6.0-rc5 + ple handler optimization patches
> >> base_pleopt_16k = base + ple_window = 16k
> >> base_pleopt_32k = base + ple_window = 32k
> >> base_pleopt_nople = base + ple_gap = 0
> >>
> >> kernbench, hackbench, sysbench (time in sec, lower is better)
> >> ebizzy (rec/sec, higher is better)
> >>
> >> % improvements w.r.t base (ple_window = 4k)
> >> --------------+-----------------+-----------------+-------------------+
> >>               | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
> >> --------------+-----------------+-----------------+-------------------+
> >> kernbench_1x  |         0.42371 |         1.15164 |           0.09320 |
> >> kernbench_2x  |        -1.40981 |       -17.48282 |        -570.77053 |
> >> --------------+-----------------+-----------------+-------------------+
> >> sysbench_1x   |        -0.92367 |         0.24241 |          -0.27027 |
> >> sysbench_2x   |        -2.22706 |        -0.30896 |          -1.27573 |
> >> sysbench_3x   |        -0.75509 |         0.09444 |          -2.97756 |
> >> --------------+-----------------+-----------------+-------------------+
> >> ebizzy_1x     |        54.99976 |        67.29460 |          74.14076 |
> >> ebizzy_2x     |        -8.83386 |       -27.38403 |         -96.22066 |
> >> --------------+-----------------+-----------------+-------------------+
> >>
> >> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k
> >> window)
>
> Is the perf data for 1x overcommit?

Yes, 16vcpu guest on 16 core.

> >> pleopt ple_gap=0
> >>
> >> ebizzy : 18131 records/s
> >> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
> >>  5.65%  [guest.kernel]  [g] smp_call_function_many
> >>  3.12%  [guest.kernel]  [g] clear_page
> >>  3.02%  [guest.kernel]  [g] down_read_trylock
> >>  1.85%  [guest.kernel]  [g] async_page_fault
> >>  1.81%  [guest.kernel]  [g] up_read
> >>  1.76%  [guest.kernel]  [g] native_apic_mem_write
> >>  1.70%  [guest.kernel]  [g] find_v
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> > I ran 'perf sched map' on the dbench workload for medium and large
> > VMs, and I thought I would share some of the results. I think it
> > helps to visualize what's going on regarding the yielding.
> >
> > These files are png bitmaps, generated from processing output from
> > 'perf sched map' (and perf data generated from 'perf sched
> > record'). The Y axis is the host cpus, each row being 10 pixels
> > high. For these tests, there are 80 host cpus, so the total height
> > is 800 pixels. The X axis is time (in microseconds), with each
> > pixel representing 1 microsecond. Each bitmap plots 30,000
> > microseconds. The bitmaps are quite wide obviously, and zooming
> > in/out while viewing is recommended.
> >
> > Each row (each host cpu) is assigned a color based on what thread
> > is running. vCPUs of the same VM are assigned a common color (like
> > red, blue, magenta, etc), and each vCPU has a unique brightness for
> > that color. There are a maximum of 12 assignable colors, so in any
> > test with more than 12 VMs the rest revert to a vCPU color of gray.
> > I would use more colors, but it becomes harder to distinguish one
> > color from another. The white color represents missing data from
> > perf, and the black color represents any thread which is not a
> > vCPU.
> >
> > For the following tests, VMs were pinned to host NUMA nodes and to
> > specific cpus to help with consistency and operate within the
> > constraints of the last test (gang scheduler).
> >
> > Here is a good example of PLE. These are 10-way VMs, 16 of them (as
> > described above, only 12 of the VMs have a color, the rest are
> > gray).
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>
> This looks very nice to visualize what is happening. The beginning of
> the graph looks a little messy, but later it is clear.
>
> > If you zoom out and look at the whole bitmap, you may notice the
> > 4ms intervals of the scheduler. They are pretty well aligned across
> > all cpus. Normally, for cpu-bound workloads, we would expect to see
> > each thread run for 4 ms, then something else getting to run, and
> > so on. That is mostly true in this test. We have 2x over-commit and
> > we generally see the switching of threads at 4ms. One thing to note
> > is that not all vCPU threads for the same VM run at exactly the
> > same time, and that is expected and the whole reason for
> > lock-holder preemption. Now, if you zoom in on the bitmap, you
> > should notice within the 4ms intervals there is some task switching
> > going on. This is most likely because of the yield_to initiated by
> > the PLE handler. In this case there is not that much yielding to
> > do. It's quite clean, and the performance is quite good.
> >
> > Below is an example of PLE, but this time with 20-way VMs, 8 of
> > them. CPU over-commit is still 2x.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
>
> I think this link is still the 10x16 one. Could you paste the link
> again?

Oops.

https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ

> > This one looks quite different. In short, it's a mess. The
> > switching between tasks can be lower than 10 microseconds. It
> > basically never recovers. There is constant yielding all the time.
> >
> > Below is again 8 x 20-way VMs, but this time I tried out Nikunj's
> > gang scheduling patches. While I am not recommending gang
> > scheduling, I think it's a good data point. The performance is
> > 3.88x the PLE result.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
> >
> > Note that the task switching intervals of 4ms are quite obvious
> > again, and this time all vCPUs from the same VM run at the same
> > time. It represents the best possible outcome.
> >
> > Anyway, I thought the bitmaps might help better visualize what's
> > going on.
> >
> > -Andrew
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
I ran 'perf sched map' on the dbench workload for medium and large VMs, and I thought I would share some of the results. I think it helps to visualize what's going on regarding the yielding. These files are png bitmaps, generated from processing output from 'perf sched map' (and perf data generated from 'perf sched record'). The Y axis is the host cpus, each row being 10 pixels high. For these tests, there are 80 host cpus, so the total height is 800 pixels. The X axis is time (in microseconds), with each pixel representing 1 microsecond. Each bitmap plots 30,000 microseconds. The bitmaps are quite wide obviously, and zooming in/out while viewing is recommended. Each row (each host cpu) is assigned a color based on what thread is running. vCPUs of the same VM are assigned a common color (like red, blue, magenta, etc), and each vCPU has a unique brightness for that color. There are a maximum of 12 assignable colors, so in any VMs >12 revert to vCPU color of gray. I would use more colors, but it becomes harder to distinguish one color from another. The white color represents missing data from perf, and black color represents any thread which is not a vCPU. For the following tests, VMs were pinned to host NUMA nodes and to specific cpus to help with consistency and operate within the constraints of the last test (gang scheduler). Here is a good example of PLE. These are 10-way VMs, 16 of them (as described above only 12 of the VMs have a color, rest are gray). https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU If you zoom out and look at the whole bitmap, you may notice the 4ms intervals of the scheduler. They are pretty well aligned across all cpus. Normally, for cpu bound workloads, we would expect to see each thread to run for 4 ms, then something else getting to run, and so on. That is mostly true in this test. We have 2x over-commit and we generally see the switching of threads at 4ms. 
One thing to note is that not all vCPU threads for the same VM run at exactly the same time, and that is expected and the whole reason for lock-holder preemption. Now, if you zoom in on the bitmap, you should notice within the 4ms intervals there is some task switching going on. This is most likely because of the yield_to initiated by the PLE handler. In this case there is not that much yielding to do. It's quite clean, and the performance is quite good. Below is an example of PLE, but this time with 20-way VMs, 8 of them. CPU over-commit is still 2x. https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU This one looks quite different. In short, it's a mess. The switching between tasks can be lower than 10 microseconds. It basically never recovers. There is constant yielding all the time. Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang scheduling patches. While I am not recommending gang scheduling, I think it's a good data point. The performance is 3.88x the PLE result. https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M Note that the task switching intervals of 4ms are quite obvious again, and this time all vCPUs from same VM run at the same time. It represents the best possible outcome. Anyway, I thought the bitmaps might help better visualize what's going on. -Andrew
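The rendering recipe described above (one 10-pixel row per host cpu, one pixel per microsecond, a shared hue per VM with per-vCPU brightness, gray for VMs past the twelfth, white for missing perf data, black for non-vCPU threads) can be sketched roughly as follows. This is a hypothetical reconstruction, not the script actually used: it assumes the `perf sched map` output has already been digested into (cpu, start_us, end_us, vm, vcpu) spans, and it writes a plain-text PPM instead of a PNG. The color table, brightness formula, and parameter names are all mine.

```python
# Hypothetical sketch of the bitmap generation described above; parsing
# `perf sched record` / `perf sched map` output into spans is not shown.

BASE_COLORS = [  # 12 assignable hues; VMs beyond the 12th fall back to gray
    (255, 0, 0), (0, 0, 255), (255, 0, 255), (0, 255, 0),
    (0, 255, 255), (255, 255, 0), (255, 128, 0), (128, 0, 255),
    (0, 128, 255), (255, 0, 128), (128, 255, 0), (0, 255, 128),
]

def color_for(vm, vcpu, vcpus_per_vm=10):
    if vm is None:
        return (0, 0, 0)                  # non-vCPU thread: black
    if vm >= len(BASE_COLORS):
        return (128, 128, 128)            # more VMs than colors: gray
    base = BASE_COLORS[vm]
    # give each vCPU of the same VM a unique brightness of the VM's hue
    scale = 0.4 + 0.6 * (vcpu + 1) / vcpus_per_vm
    return tuple(int(c * scale) for c in base)

def render(spans, path, num_cpus=80, row_h=10, width=30000):
    # spans: iterable of (cpu, start_us, end_us, vm, vcpu); vm=None marks a
    # non-vCPU thread.  White pixels are microseconds with no perf data.
    rows = [[(255, 255, 255)] * width for _ in range(num_cpus * row_h)]
    for cpu, start, end, vm, vcpu in spans:
        rgb = color_for(vm, vcpu)
        for y in range(cpu * row_h, (cpu + 1) * row_h):
            for x in range(max(start, 0), min(end, width)):
                rows[y][x] = rgb
    with open(path, "w") as f:            # plain PPM; convert to PNG separately
        f.write(f"P3\n{width} {num_cpus * row_h}\n255\n")
        for row in rows:
            f.write(" ".join(f"{r} {g} {b}" for r, g, b in row) + "\n")
```

At full scale (80 cpus x 10 px rows x 30,000 px wide) the text PPM gets very large, so the real tooling presumably wrote compressed PNGs; the geometry and color scheme here just follow the email's description.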
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote: > * Avi Kivity [2012-10-04 17:00:28]: > > > On 10/04/2012 03:07 PM, Peter Zijlstra wrote: > > > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote: > > >> > > >> Again the numbers are ridiculously high for arch_local_irq_restore. > > >> Maybe there's a bad perf/kvm interaction when we're injecting an > > >> interrupt, I can't believe we're spending 84% of the time running the > > >> popf instruction. > > > > > > Smells like a software fallback that doesn't do NMI, hrtimer based > > > sampling typically hits popf where we re-enable interrupts. > > > > Good nose, that's probably it. Raghavendra, can you ensure that the PMU > > is properly exposed? 'dmesg' in the guest will tell. If it isn't, -cpu > > host will expose it (and a good idea anyway to get best performance). > > > > Hi Avi, you are right. SandyBridge machine result was not proper. > I cleaned up the services, enabled PMU, re-ran all the test again. > > Here is the summary: > We do get good benefit by increasing ple window. Though we don't > see good benefit for kernbench and sysbench, for ebizzy, we get huge > improvement for 1x scenario. (almost 2/3rd of ple disabled case). > > Let me know if you think we can increase the default ple_window > itself to 16k. > > I am experimenting with V2 version of undercommit improvement(this) patch > series, But I think if you wish to go for increase of > default ple_window, then we would have to measure the benefit of patches > when ple_window = 16k. > > I can respin the whole series including this default ple_window change. > > I also have the perf kvm top result for both ebizzy and kernbench. > I think they are in expected lines now. 
> Improvements
>
> 16 core PLE machine with 16 vcpu guest
>
> base              = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k   = base + ple_window = 16k
> base_pleopt_32k   = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
>
> kernbench, hackbench, sysbench (time in sec, lower is better)
> ebizzy (rec/sec, higher is better)
>
> % improvements w.r.t base (ple_window = 4k)
> -------------+-----------------+-----------------+-------------------+
>              | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
> -------------+-----------------+-----------------+-------------------+
> kernbench_1x |         0.42371 |         1.15164 |           0.09320 |
> kernbench_2x |        -1.40981 |       -17.48282 |        -570.77053 |
> -------------+-----------------+-----------------+-------------------+
> sysbench_1x  |        -0.92367 |         0.24241 |          -0.27027 |
> sysbench_2x  |        -2.22706 |        -0.30896 |          -1.27573 |
> sysbench_3x  |        -0.75509 |         0.09444 |          -2.97756 |
> -------------+-----------------+-----------------+-------------------+
> ebizzy_1x    |        54.99976 |        67.29460 |          74.14076 |
> ebizzy_2x    |        -8.83386 |       -27.38403 |         -96.22066 |
> -------------+-----------------+-----------------+-------------------+
>
> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)

Is the perf data for 1x overcommit?

> pleopt ple_gap=0
> ----------------
> ebizzy : 18131 records/s
> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
>  5.65%  [guest.kernel]  [g] smp_call_function_many
>  3.12%  [guest.kernel]  [g] clear_page
>  3.02%  [guest.kernel]  [g] down_read_trylock
>  1.85%  [guest.kernel]  [g] async_page_fault
>  1.81%  [guest.kernel]  [g] up_read
>  1.76%  [guest.kernel]  [g] native_apic_mem_write
>  1.70%  [guest.kernel]  [g] find_vma

Does 'perf kvm top' not give host samples at the same time? Would be nice to see the host overhead as a function of varying ple window. I would expect that to be the major difference between 4/16/32k window sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it has just terrible scalability to begin with. I do not think we should try to optimize such a bad workload.
> kernbench : Elapsed Time 29.4933 (27.6007)
>  5.72%  [guest.kernel]  [g] async_page_fault
>  3.48%  [guest.kernel]  [g] pvclock_clocksource_read
>  2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
>  2.58%  [guest.kernel]  [g] clear_page
>  2.09%  [guest.kernel]  [g] page_cache_get_speculative
>  2.00%  [guest.kernel]  [g] do_raw_spin_lock
>  1.78%  [guest.kernel]  [g] unmap_single_vma
>  1.74%  [guest.kernel]  [g] kmem_cache_alloc
>
> pleopt ple_window = 4k
> ----------------------
> ebizzy: 10176 records/s
> 69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
>  3.34%  [guest.kernel]  [g] clear_page
>  2.16%  [guest.kernel]  [g] down_read_trylock
>  1.94%  [guest.kernel]  [g] async_page_fault
>  1.89%  [guest.kernel]  [g] native_apic_mem_write
>  1.63%  [guest.kernel]  [g] smp_cal
Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> > On 10/03/2012 10:35 PM, Avi Kivity wrote:
> >> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
> So I think it's worth trying again with ple_window of 2-4.
> >>>
> >>> Hi Avi,
> >>>
> >>> I ran different benchmarks increasing ple_window, and the results do
> >>> not seem to be encouraging for increasing ple_window.
> >>
> >> Thanks for testing! Comments below.
> >>
> >>> Results:
> >>> 16 core PLE machine with 16 vcpu guest.
> >>>
> >>> base kernel     = 3.6-rc5 + ple handler optimization patch
> >>> base_pleopt_8k  = base kernel + ple window = 8k
> >>> base_pleopt_16k = base kernel + ple window = 16k
> >>> base_pleopt_32k = base kernel + ple window = 32k
> >>>
> >>> Percentage improvements of benchmarks w.r.t base_pleopt with
> >>> ple_window = 4096
> >>>
> >>>               base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
> >>> -------------------------------------------------------------
> >>> kernbench_1x        -5.54915        -15.94529        -44.31562
> >>> kernbench_2x        -7.89399        -17.75039        -37.73498
> >>
> >> So, 44% degradation even with no overcommit? That's surprising.
> >
> > Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> > spending 8 times the original ple_window cycles for 16 vcpus
> > significant?
>
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather than look for candidates on the host. In fact
> when we benchmark we often disable PLE completely.

Agreed. However, I really do not understand why the kernbench regressed with bigger ple_window. It should stay the same or improve. Raghu, do you have perf data for the kernbench runs?
> >>
> >>> Ebizzy run for 4k ple_window
> >>> -  87.20%  [kernel]  [k] arch_local_irq_restore
> >>>    - arch_local_irq_restore
> >>>       - 100.00% _raw_spin_unlock_irqrestore
> >>>          + 52.89% release_pages
> >>>          + 47.10% pagevec_lru_move_fn
> >>> -   5.71%  [kernel]  [k] arch_local_irq_restore
> >>>    - arch_local_irq_restore
> >>>       + 86.03% default_send_IPI_mask_allbutself_phys
> >>>       + 13.96% default_send_IPI_mask_sequence_phys
> >>> -   3.10%  [kernel]  [k] smp_call_function_many
> >>>      smp_call_function_many
> >>>
> >>> Ebizzy run for 32k ple_window
> >>> -  91.40%  [kernel]  [k] arch_local_irq_restore
> >>>    - arch_local_irq_restore
> >>>       - 100.00% _raw_spin_unlock_irqrestore
> >>>          + 53.13% release_pages
> >>>          + 46.86% pagevec_lru_move_fn
> >>> -   4.38%  [kernel]  [k] smp_call_function_many
> >>>      smp_call_function_many
> >>> -   2.51%  [kernel]  [k] arch_local_irq_restore
> >>>    - arch_local_irq_restore
> >>>       + 90.76% default_send_IPI_mask_allbutself_phys
> >>>       +  9.24% default_send_IPI_mask_sequence_phys
> >>
> >> Both the 4k and the 32k results are crazy. Why is
> >> arch_local_irq_restore() so prominent? Do you have a very high
> >> interrupt rate in the guest?
> >
> > How to measure if I have high interrupt rate in guest?
> > From /proc/interrupt numbers I am not able to judge :(
>
> 'vmstat 1'
>
> > I went back and got the results on a 32 core machine with 32 vcpu guest.
> > Strangely, I got results supporting the claim that increasing ple_window
> > helps for the non-overcommitted scenario.
> >
> > 32 core 32 vcpu guest 1x scenarios.
> >
> > ple_gap = 0
> > kernbench: Elapsed Time 38.61
> > ebizzy: 7463 records/s
> >
> > ple_window = 4k
> > kernbench: Elapsed Time 43.5067
> > ebizzy: 2528 records/s
> >
> > ple_window = 32k
> > kernbench: Elapsed Time 39.4133
> > ebizzy: 7196 records/s
>
> So maybe something was wrong with the first measurement.

OK, this is more in line with what I expected for kernbench.
FWIW, in order to show an improvement for a larger ple_window, we really need a workload which we know has a longer lock holding time (without factoring in LHP). We have noticed this on IO based locks mostly. We saw it with a massive disk IO test (qla2xxx lock), and also with a large web serving test (some vfs related lock, but I forget what exactly it was).

> > perf top for ebizzy for above:
> > ple_gap = 0
> > -  84.74%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       - 100.00% _raw_spin_unlock_irqrestore
> >          + 50.96% release_pages
> >          + 49.02% pagevec_lru_move_fn
> > -   6.57%  [kernel]  [k] arch_local_irq_restore
> >    - arch_local_irq_restore
> >       + 92.54% default_send_IPI_mask_allbutself_phys
> >       +  7.46% default_send_IPI_mask_sequence_phys
> > -   1.54%  [kernel]  [k] smp_call_function_many
> >      smp_call_function_man
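Avi's 'vmstat 1' suggestion above is the simple way to judge the guest's interrupt rate: the `in` column reports interrupts per second for each one-second sample. A small, hypothetical helper for pulling that column out of captured vmstat output (the sample numbers below are made up for illustration):

```python
# Extract the "in" (interrupts/sec) column from `vmstat 1` output, as a way
# to check for the high guest interrupt rate Avi asks about.

def interrupt_rates(vmstat_output):
    """Return the interrupts/sec values from captured vmstat text."""
    rates = []
    in_idx = None
    for line in vmstat_output.strip().splitlines():
        tokens = line.split()
        if "in" in tokens and "cs" in tokens:
            in_idx = tokens.index("in")   # column header row locates "in"
            continue
        if in_idx is not None and tokens and tokens[0].isdigit():
            rates.append(int(tokens[in_idx]))
    return rates

SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 4  0      0 812340  20480 512000    0    0     0     8 9123 14210 35 12 53  0  0
 5  0      0 812100  20480 512004    0    0     0     0 8877 13995 36 11 53  0  0
"""

print(interrupt_rates(SAMPLE))   # -> [9123, 8877]
```

In a live guest one would feed this the output of `vmstat 1` itself; tens of thousands of interrupts per second would go a long way toward explaining the prominence of arch_local_irq_restore() in the profiles above.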
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote: > On 09/27/2012 05:33 PM, Avi Kivity wrote: > > On 09/27/2012 01:23 PM, Raghavendra K T wrote: > >>> > >>> This gives us a good case for tracking preemption on a per-vm basis. As > >>> long as we aren't preempted, we can keep the PLE window high, and also > >>> return immediately from the handler without looking for candidates. > >> > >> 1) So do you think, deferring preemption patch ( Vatsa was mentioning > >> long back) is also another thing worth trying, so we reduce the chance > >> of LHP. > > > > Yes, we have to keep it in mind. It will be useful for fine grained > > locks, not so much so coarse locks or IPIs. > > > > Agree. > > > I would still of course prefer a PLE solution, but if we can't get it to > > work we can consider preemption deferral. > > > > Okay. > > >> > >> IIRC, with defer preemption : > >> we will have hook in spinlock/unlock path to measure depth of lock held, > >> and shared with host scheduler (may be via MSRs now). > >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather > >> give say one chance. > > > > A downside is that we have to do that even when undercommitted. Hopefully vcpu preemption is very rare when undercommitted, so it should not happen much at all. > > > > Also there may be a lot of false positives (deferred preemptions even > > when there is no contention). It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]? > > Yes. That is a worry. > > > > >> > >> 2) looking at the result (comparing A & C) , I do feel we have > >> significant in iterating over vcpus (when compared to even vmexit) > >> so We still would need undercommit fix sugested by PeterZ (improving by > >> 140%). ? > > > > Looking only at the current runqueue? 
> > My worry is that it misses a lot of cases. Maybe try the current
> > runqueue first and then others.
> >
> > Or were you referring to something else?
>
> No. I was referring to the same thing.
>
> However, I had tried the following also (which works well to check the
> undercommitted scenario). But I am thinking of using it only for yielding
> in case of overcommit (yield in overcommit suggested by Rik) and keeping
> the undercommit patch as suggested by PeterZ.
>
> [ patch is not in proper diff form, I suppose ]
>
> Will test them.
>
> Peter, can I post your patch with your From/SoB in V2?
> Please let me know..
>
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 28f00bc..9ed3759 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
>  	return eligible;
>  }
>  #endif
> +
> +bool kvm_overcommitted()
> +{
> +	unsigned long load;
> +
> +	load = avenrun[0] + FIXED_1/200;
> +	load = load >> FSHIFT;
> +	load = (load << 7) / num_online_cpus();
> +
> +	if (load > 128)
> +		return true;
> +
> +	return false;
> +}
> +
>  void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  {
>  	struct kvm *kvm = me->kvm;
> @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>  	int pass;
>  	int i;
>  
> +	if (!kvm_overcommitted())
> +		return;
> +
>  	kvm_vcpu_set_in_spin_loop(me, true);
>  	/*
>  	 * We boost the priority of a VCPU that is runnable but not
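For reference, the load-average check in the patch above is easy to play with from userspace. The fixed-point arithmetic below mirrors the patch (FSHIFT is 11 in the kernel's avenrun fixed point); the `host_overcommitted()` wrapper and all names are mine, not part of the patch. Note one quirk the arithmetic implies: the right shift truncates, so the check effectively asks whether the integer part of the 1-minute load average exceeds the number of online cpus.

```python
# Userspace rendition of the kvm_overcommitted() heuristic quoted above.
import os

FSHIFT = 11                  # kernel fixed-point shift for avenrun[]
FIXED_1 = 1 << FSHIFT

def overcommitted(avenrun0, ncpus):
    """Mirror the patch arithmetic on a fixed-point 1-minute load average."""
    load = avenrun0 + FIXED_1 // 200   # same small rounding bias as the patch
    load >>= FSHIFT                    # truncates to the integer part
    load = (load << 7) // ncpus        # 128 == one runnable task per cpu
    return load > 128

def host_overcommitted():
    """Hypothetical helper: apply the check to this host's load average."""
    one_min = os.getloadavg()[0]
    return overcommitted(int(one_min * FIXED_1), os.cpu_count() or 1)

print(overcommitted(int(2.5 * FIXED_1), 1))   # True: floor(2.5) > 1 cpu
print(overcommitted(int(3.0 * FIXED_1), 4))   # False: floor(3.0) <= 4 cpus
```

So a load of, say, 1.9 on a single cpu still counts as undercommitted here, which fits the intent: only bail out of kvm_vcpu_on_spin() when there clearly are more runnable tasks than cpus.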
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote: > On 09/27/2012 01:23 PM, Raghavendra K T wrote: > >> > >> This gives us a good case for tracking preemption on a per-vm basis. As > >> long as we aren't preempted, we can keep the PLE window high, and also > >> return immediately from the handler without looking for candidates. > > > > 1) So do you think, deferring preemption patch ( Vatsa was mentioning > > long back) is also another thing worth trying, so we reduce the chance > > of LHP. > > Yes, we have to keep it in mind. It will be useful for fine grained > locks, not so much so coarse locks or IPIs. > > I would still of course prefer a PLE solution, but if we can't get it to > work we can consider preemption deferral. > > > > > IIRC, with defer preemption : > > we will have hook in spinlock/unlock path to measure depth of lock held, > > and shared with host scheduler (may be via MSRs now). > > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather > > give say one chance. > > A downside is that we have to do that even when undercommitted. > > Also there may be a lot of false positives (deferred preemptions even > when there is no contention). > > > > > 2) looking at the result (comparing A & C) , I do feel we have > > significant in iterating over vcpus (when compared to even vmexit) > > so We still would need undercommit fix sugested by PeterZ (improving by > > 140%). ? > > Looking only at the current runqueue? My worry is that it misses a lot > of cases. Maybe try the current runqueue first and then others. > > Or were you referring to something else? > > > > > So looking back at threads/ discussions so far, I am trying to > > summarize, the discussions so far. 
> > I feel, at least here are the few potential candidates to go in:
> >
> > 1) Avoiding double runqueue lock overhead (Andrew Theurer/ PeterZ)
> > 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> > 3) preempt_notify handler to identify preempted VCPUs (Avi)
> > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> > 6) Pv spinlock
> > 7) Jiannan's proposed improvements
> > 8) Defer preemption patches
> >
> > Did we miss anything (or added extra?)
> >
> > So here are my action items:
> > - I plan to repost this series with what PeterZ, Rik suggested with
> >   performance analysis.
> > - I'll go back and explore on (3) and (6) ..
> >
> > Please let me know..
>
> Undoubtedly we'll think of more stuff. But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like exceptions
(/me runs and hides from scheduler folks)

-Andrew Theurer
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote: > On 09/14/2012 12:30 AM, Andrew Theurer wrote: > > > The concern I have is that even though we have gone through changes to > > help reduce the candidate vcpus we yield to, we still have a very poor > > idea of which vcpu really needs to run. The result is high cpu usage in > > the get_pid_task and still some contention in the double runqueue lock. > > To make this scalable, we either need to significantly reduce the > > occurrence of the lock-holder preemption, or do a much better job of > > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus > > which do not need to run). > > > > On reducing the occurrence: The worst case for lock-holder preemption > > is having vcpus of same VM on the same runqueue. This guarantees the > > situation of 1 vcpu running while another [of the same VM] is not. To > > prove the point, I ran the same test, but with vcpus restricted to a > > range of host cpus, such that any single VM's vcpus can never be on the > > same runqueue. In this case, all 10 VMs' vcpu-0's are on host cpus 0-4, > > vcpu-1's are on host cpus 5-9, and so on. Here is the result: > > > > kvm_cpu_spin, and all > > yield_to changes, plus > > restricted vcpu placement: 8823 +/- 3.20% much, much better > > > > On picking a better vcpu to yield to: I really hesitate to rely on > > paravirt hint [telling us which vcpu is holding a lock], but I am not > > sure how else to reduce the candidate vcpus to yield to. I suspect we > > are yielding to way more vcpus than are prempted lock-holders, and that > > IMO is just work accomplishing nothing. Trying to think of way to > > further reduce candidate vcpus > > I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing. > That other vcpu gets work done (unless it is in pause loop itself) and > the yielding vcpu gets put to sleep for a while, so it doesn't spend > cycles spinning. 
While we haven't fixed the problem at least the guest > is accomplishing work, and meanwhile the real lock holder may get > naturally scheduled and clear the lock. OK, yes, if the other thread gets useful work done, then it is not wasteful. I was thinking of the worst case scenario, where any other vcpu would likely spin as well, and the host side cpu-time for switching vcpu threads was not all that productive. Well, I suppose it does help eliminate potential lock holding vcpus; it just seems to be not that efficient or fast enough. > The main problem with this theory is that the experiments don't seem to > bear it out. Granted, my test case is quite brutal. It's nothing but over-committed VMs which always have some spin lock activity. However, we really should try to fix the worst case scenario. > So maybe one of the assumptions is wrong - the yielding > vcpu gets scheduled early. That could be the case if the two vcpus are > on different runqueues - you could be changing the relative priority of > vcpus on the target runqueue, but still remain on top yourself. Is this > possible with the current code? > > Maybe we should prefer vcpus on the same runqueue as yield_to targets, > and only fall back to remote vcpus when we see it didn't help. > > Let's examine a few cases: > > 1. spinner on cpu 0, lock holder on cpu 0 > > win! > > 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0 > > Spinner gets put to sleep, random vcpus get to work, low lock contention > (no double_rq_lock), by the time spinner gets scheduled we might have won > > 3. spinner on cpu 0, another spinner on cpu 0 > > Worst case, we'll just spin some more. Need to detect this case and > migrate something in. Well, we can certainly experiment and see what we get. IMO, the key to getting this working really well on the large VMs is finding the lock-holding cpu -quickly-. What I think is happening is that we go through a relatively long process to get to that one right vcpu. 
I guess I need to find a faster way to get there. > 4. spinner on cpu 0, alone > > Similar > > > It seems we need to tie in to the load balancer. > > Would changing the priority of the task while it is spinning help the > load balancer? Not sure. -Andrew
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote: > * Andrew Theurer [2012-09-11 13:27:41]: > > > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > > > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct > > > >>>> task_struct *p) > > > >>>> +{ > > > >>>> + if (!curr->sched_class->yield_to_task) > > > >>>> + return false; > > > >>>> + > > > >>>> + if (curr->sched_class != p->sched_class) > > > >>>> + return false; > > > >>> > > > >>> > > > >>> Peter, > > > >>> > > > >>> Should we also add a check if the runq has a skip buddy (as pointed > > > >>> out > > > >>> by Raghu) and return if the skip buddy is already set. > > > >> > > > >> Oh right, I missed that suggestion.. the performance improvement went > > > >> from 81% to 139% using this, right? > > > >> > > > >> It might make more sense to keep that separate, outside of this > > > >> function, since its not a strict prerequisite. > > > >> > > > >>>> > > > >>>> + if (task_running(p_rq, p) || p->state) > > > >>>> + return false; > > > >>>> + > > > >>>> + return true; > > > >>>> +} > > > >> > > > >> > > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > > >>> bool preempt) > > > >>>>rq = this_rq(); > > > >>>> > > > >>>> again: > > > >>>> + /* optimistic test to avoid taking locks */ > > > >>>> + if (!__yield_to_candidate(curr, p)) > > > >>>> + goto out_irq; > > > >>>> + > > > >> > > > >> So add something like: > > > >> > > > >>/* Optimistic, if we 'raced' with another yield_to(), don't > > > >> bother */ > > > >>if (p_rq->cfs_rq->skip) > > > >>goto out_irq; > > > >>> > > > >>> > > > >>>>p_rq = task_rq(p); > > > >>>>double_rq_lock(rq, p_rq); > > > >>> > > > >>> > > > >> But I do have a question on this optimization though,.. 
Why do we check > > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > > >> > > > >> That is, I'd like to see this thing explained a little better. > > > >> > > > >> Does it go something like: p_rq is the runqueue of the task we'd like > > > >> to > > > >> yield to, rq is our own, they might be the same. If we have a ->skip, > > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > > > >> failing the yield_to() simply means us picking the next VCPU thread, > > > >> which might be running on an entirely different cpu (rq) and could > > > >> succeed? > > > > > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > > > skip check. Raghu, I am not sure if this is exactly what you want > > > > implemented in v4. > > > > > > > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > > > My intention was to if there is a directed_yield happened in runqueue > > > (say rqA), do not bother to directed yield to that. But unfortunately as > > > PeterZ pointed that would have resulted in setting next buddy of a > > > different run queue than rqA. > > > So we can drop this "skip" idea. Pondering more over what to do? can we > > > use next buddy itself ... thinking.. > > > > As I mentioned earlier today, I did not have your changes from kvm.git > > tree when I tested my changes. Here are yo
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct > >>>> task_struct *p) > >>>> +{ > >>>> + if (!curr->sched_class->yield_to_task) > >>>> + return false; > >>>> + > >>>> + if (curr->sched_class != p->sched_class) > >>>> + return false; > >>> > >>> > >>> Peter, > >>> > >>> Should we also add a check if the runq has a skip buddy (as pointed out > >>> by Raghu) and return if the skip buddy is already set. > >> > >> Oh right, I missed that suggestion.. the performance improvement went > >> from 81% to 139% using this, right? > >> > >> It might make more sense to keep that separate, outside of this > >> function, since its not a strict prerequisite. > >> > >>>> > >>>> + if (task_running(p_rq, p) || p->state) > >>>> + return false; > >>>> + > >>>> + return true; > >>>> +} > >> > >> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > >>> bool preempt) > >>>>rq = this_rq(); > >>>> > >>>> again: > >>>> + /* optimistic test to avoid taking locks */ > >>>> + if (!__yield_to_candidate(curr, p)) > >>>> + goto out_irq; > >>>> + > >> > >> So add something like: > >> > >>/* Optimistic, if we 'raced' with another yield_to(), don't bother */ > >>if (p_rq->cfs_rq->skip) > >>goto out_irq; > >>> > >>> > >>>>p_rq = task_rq(p); > >>>>double_rq_lock(rq, p_rq); > >>> > >>> > >> But I do have a question on this optimization though,.. Why do we check > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > >> > >> That is, I'd like to see this thing explained a little better. > >> > >> Does it go something like: p_rq is the runqueue of the task we'd like to > >> yield to, rq is our own, they might be the same. 
If we have a ->skip, > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > >> failing the yield_to() simply means us picking the next VCPU thread, > >> which might be running on an entirely different cpu (rq) and could > >> succeed? > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > skip check. Raghu, I am not sure if this is exactly what you want > > implemented in v4. > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > My intention was to if there is a directed_yield happened in runqueue > (say rqA), do not bother to directed yield to that. But unfortunately as > PeterZ pointed that would have resulted in setting next buddy of a > different run queue than rqA. > So we can drop this "skip" idea. Pondering more over what to do? can we > use next buddy itself ... thinking.. As I mentioned earlier today, I did not have your changes from kvm.git tree when I tested my changes. Here are your changes and my changes compared: throughput in MB/sec kvm_vcpu_on_spin changes: 4636 +/- 15.74% yield_to changes: 4515 +/- 12.73% I would be inclined to stick with your changes which are kept in kvm code. I did try both combined, and did not get good results: both changes: 4074 +/- 19.12% So, having both is probably not a good idea. However, I feel like there's more work to be done. With no over-commit (10 VMs), total throughput is 23427 +/- 2.76%. A 2x over-commit will no doubt have some overhead, but a reduction to ~4500 is still terrible. By contrast, 8-way VMs with 2x over-commit have a total throughput roughly 10% less than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread host). We still have what appears to be scalability problems, but now it's not so much in runqueue locks for yield_to(), but now get_pid_task(): perf on host: 32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task 11.60% 115686 qemu-s
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote: > On 09/11/2012 01:42 AM, Andrew Theurer wrote: > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct > >>>> task_struct *p) > >>>> +{ > >>>> + if (!curr->sched_class->yield_to_task) > >>>> + return false; > >>>> + > >>>> + if (curr->sched_class != p->sched_class) > >>>> + return false; > >>> > >>> > >>> Peter, > >>> > >>> Should we also add a check if the runq has a skip buddy (as pointed out > >>> by Raghu) and return if the skip buddy is already set. > >> > >> Oh right, I missed that suggestion.. the performance improvement went > >> from 81% to 139% using this, right? > >> > >> It might make more sense to keep that separate, outside of this > >> function, since its not a strict prerequisite. > >> > >>>> > >>>> + if (task_running(p_rq, p) || p->state) > >>>> + return false; > >>>> + > >>>> + return true; > >>>> +} > >> > >> > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > >>> bool preempt) > >>>>rq = this_rq(); > >>>> > >>>> again: > >>>> + /* optimistic test to avoid taking locks */ > >>>> + if (!__yield_to_candidate(curr, p)) > >>>> + goto out_irq; > >>>> + > >> > >> So add something like: > >> > >>/* Optimistic, if we 'raced' with another yield_to(), don't bother */ > >>if (p_rq->cfs_rq->skip) > >>goto out_irq; > >>> > >>> > >>>>p_rq = task_rq(p); > >>>>double_rq_lock(rq, p_rq); > >>> > >>> > >> But I do have a question on this optimization though,.. Why do we check > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > >> > >> That is, I'd like to see this thing explained a little better. > >> > >> Does it go something like: p_rq is the runqueue of the task we'd like to > >> yield to, rq is our own, they might be the same. 
If we have a ->skip, > >> there's nothing we can do about it, OTOH p_rq having a ->skip and > >> failing the yield_to() simply means us picking the next VCPU thread, > >> which might be running on an entirely different cpu (rq) and could > >> succeed? > > > > Here's two new versions, both include a __yield_to_candidate(): "v3" > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq > > skip check. Raghu, I am not sure if this is exactly what you want > > implemented in v4. > > > > Andrew, Yes that is what I had. I think there was a mis-understanding. > My intention was to if there is a directed_yield happened in runqueue > (say rqA), do not bother to directed yield to that. But unfortunately as > PeterZ pointed that would have resulted in setting next buddy of a > different run queue than rqA. > So we can drop this "skip" idea. Pondering more over what to do? can we > use next buddy itself ... thinking.. FYI, I regretfully forgot include your recent changes to kvm_vcpu_on_spin in my tests (found in kvm.git/next branch), so I am going to get some results for that before I experiment any more on 3.6-rc. -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote: > On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote: > > > +static bool __yield_to_candidate(struct task_struct *curr, struct > > > task_struct *p) > > > +{ > > > + if (!curr->sched_class->yield_to_task) > > > + return false; > > > + > > > + if (curr->sched_class != p->sched_class) > > > + return false; > > > > > > Peter, > > > > Should we also add a check if the runq has a skip buddy (as pointed out > > by Raghu) and return if the skip buddy is already set. > > Oh right, I missed that suggestion.. the performance improvement went > from 81% to 139% using this, right? > > It might make more sense to keep that separate, outside of this > function, since its not a strict prerequisite. > > > > > > > + if (task_running(p_rq, p) || p->state) > > > + return false; > > > + > > > + return true; > > > +} > > > > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p, > > bool preempt) > > > rq = this_rq(); > > > > > > again: > > > + /* optimistic test to avoid taking locks */ > > > + if (!__yield_to_candidate(curr, p)) > > > + goto out_irq; > > > + > > So add something like: > > /* Optimistic, if we 'raced' with another yield_to(), don't bother */ > if (p_rq->cfs_rq->skip) > goto out_irq; > > > > > > > p_rq = task_rq(p); > > > double_rq_lock(rq, p_rq); > > > > > But I do have a question on this optimization though,.. Why do we check > p_rq->cfs_rq->skip and not rq->cfs_rq->skip ? > > That is, I'd like to see this thing explained a little better. > > Does it go something like: p_rq is the runqueue of the task we'd like to > yield to, rq is our own, they might be the same. If we have a ->skip, > there's nothing we can do about it, OTOH p_rq having a ->skip and > failing the yield_to() simply means us picking the next VCPU thread, > which might be running on an entirely different cpu (rq) and could > succeed? 
Here's two new versions, both include a __yield_to_candidate(): "v3" uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq skip check. Raghu, I am not sure if this is exactly what you want implemented in v4. Results: > ple on: 2552 +/- .70% > ple on: w/fixv1: 4621 +/- 2.12% (81% improvement) > ple on: w/fixv2: 6115 (139% improvement) v3: 5735 (124% improvement) v4: 4524 ( 3% regression) Both patches included below -Andrew diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..0d98a67 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4820,6 +4820,23 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/* + * Tests preconditions required for sched_class::yield_to(). + */ +static bool __yield_to_candidate(struct task_struct *curr, struct task_struct *p, struct rq *p_rq) +{ + if (!curr->sched_class->yield_to_task) + return false; + + if (curr->sched_class != p->sched_class) + return false; + + if (task_running(p_rq, p) || p->state) + return false; + + return true; +} + /** * yield_to - yield the current processor to another thread in * your thread group, or accelerate that thread toward the @@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + + /* optimistic test to avoid taking locks */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_irq; + + /* +* if the target task is not running, then only yield if the +* current task is in guest mode +*/ + if (!(p_rq->curr->flags & PF_VCPU)) + goto out_irq; + double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); goto again; } - if (!curr->sched_class->yield_to_task) - goto out; - - if (curr->sched_class != p->sched_class) - goto out; - - if (task_running(p_rq, p) || p->state) - goto out; + /* validate state, holding p_rq ensures p's state cannot change */ + if (!__yield_to_candidate(curr, p, p_rq)) + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ 
-4877,8 +4901,9 @@ again: rq->skip_clock_update = 0; } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); if (yielded) v4: diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..2bec2ed 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4820,6 +4820,23 @@ void __sched yield(void) } EXPORT_SYMBOL(yield); +/* + * Tests preconditions required for sched_class::yield_to(). + */ +static bool __yield_to_candida
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote: > > > > signed-off-by: Andrew Theurer > > > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > > index fbf1fd0..c767915 100644 > > --- a/kernel/sched/core.c > > +++ b/kernel/sched/core.c > > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool > > preempt) > > > > again: > > p_rq = task_rq(p); > > + if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags & > > PF_VCPU)) { > > + goto out_no_unlock; > > + } > > double_rq_lock(rq, p_rq); > > while (task_rq(p) != p_rq) { > > double_rq_unlock(rq, p_rq); > > @@ -4856,8 +4859,6 @@ again: > > if (curr->sched_class != p->sched_class) > > goto out; > > > > - if (task_running(p_rq, p) || p->state) > > - goto out; > > Is it possible that by this time the current thread takes double rq > lock, thread p could actually be running? i.e. is there merit to keep > this check around even with your similar check above? I think that's a good idea. I'll add that back in. > > > > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > if (yielded) { > > @@ -4879,6 +4880,7 @@ again: > > > > out: > > double_rq_unlock(rq, p_rq); > > +out_no_unlock: > > local_irq_restore(flags); > > > > if (yielded) > > > > > -Andrew Theurer
Re: [RFC][PATCH] Improving directed yield scalability for PLE handler
On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote: > CCing PeterZ also. > > On 09/07/2012 06:41 PM, Andrew Theurer wrote: > > I have noticed recently that PLE/yield_to() is still not that scalable > > for really large guests, sometimes even with no CPU over-commit. I have > > a small change that make a very big difference. > > > > First, let me explain what I saw: > > > > Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80 > > thread Westmere-EX system: 645 seconds! > > > > Host cpu: ~98% in kernel, nearly all of it in spin_lock from double > > runqueue lock for yield_to() > > > > So, I added some schedstats to yield_to(), one to count when we failed > > this test in yield_to() > > > > if (task_running(p_rq, p) || p->state) > > > > and one when we pass all the conditions and get to actually yield: > > > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > > > > > > And during boot up of this guest, I saw: > > > > > > failed yield_to() because task is running: 8368810426 > > successful yield_to(): 13077658 > >0.156022% of yield_to calls > >1 out of 640 yield_to calls > > > > Obviously, we have a problem. Every exit causes a loop over 80 vcpus, > > each one trying to get two locks. This is happening on all [but one] > > vcpus at around the same time. Not going to work well. > > > > True and interesting. I had once thought of reducing overall O(n^2) > iteration to O(n log(n)) iterations by reducing number of candidates > to search to O(log(n)) instead of current O(n). May be I have to get > back to my experiment modes. > > > So, since the check for a running task is nearly always true, I moved > > that -before- the double runqueue lock, so 99.84% of the attempts do not > > take the locks. Now, I do not know is this [not getting the locks] is a > > problem. However, I'd rather have a little inaccurate test for a > > running vcpu than burning 98% of CPU in host kernel. 
With the change > > the VM boot time went to: 100 seconds, an 85% reduction in time. > > > > I also wanted to check to see this did not affect truly over-committed > > situations, so I first started with smaller VMs at 2x cpu over-commit: > > > > 16 VMs, 8-way each, all running dbench (2x cpu over-commmit) > > throughput +/- stddev > > - - > > ple off:2281 +/- 7.32% (really bad as expected) > > ple on:19796 +/- 1.36% > > ple on: w/fix: 19796 +/- 1.37% (no degrade at all) > > > > In this case the VMs are small enough, that we do not loop through > > enough vcpus to trigger the problem. host CPU is very low (3-4% range) > > for both default ple and with yield_to() fix. > > > > So I went on to a bigger VM: > > > > 10 VMs, 16-way each, all running dbench (2x cpu over-commit) > > throughput +/- stddev > > - - > > ple on: 2552 +/- .70% > > ple on: w/fix: 4621 +/- 2.12% (81% improvement!) > > > > This is where we start seeing a major difference. Without the fix, host > > cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and > > guest went from 30 to 40%). I believe this is on the right track to > > reduce the spin lock contention, still get proper directed yield, and > > therefore improve the guest CPU available and its performance. > > > > However, we still have lock contention, and I think we can reduce it > > even more. We have eliminated some attempts at double runqueue lock > > acquire because the check for the target vcpu is running is now before > > the lock. However, even if the target-to-yield-to vcpu [for the same > > guest upon we PLE exited] is not running, the physical > > processor/runqueue that target-to-yield-to vcpu is located on could be > > running a different VM's vcpu -and- going through a directed yield, > > therefore that run queue lock may already acquired. We do not want to > > just spin and wait, we want to move to the next candidate vcpu. We need > > a check to see if the smp processor/runqueue is already in a directed > > yield. 
Or, perhaps we just check if that cpu is not in guest mode, and > > if so, we skip that yield attempt for that vcpu and move to the next > > candidate vcpu. So, my question is: given a runqueue, what's the best > > way to check if that corresponding phys cpu is not in guest mode? > > > > We are indeed avoid
[RFC][PATCH] Improving directed yield scalability for PLE handler
I have noticed recently that PLE/yield_to() is still not that scalable for really large guests, sometimes even with no CPU over-commit. I have a small change that makes a very big difference. First, let me explain what I saw: Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80 thread Westmere-EX system: 645 seconds! Host cpu: ~98% in kernel, nearly all of it in spin_lock from double runqueue lock for yield_to() So, I added some schedstats to yield_to(), one to count when we failed this test in yield_to() if (task_running(p_rq, p) || p->state) and one when we pass all the conditions and get to actually yield: yielded = curr->sched_class->yield_to_task(rq, p, preempt); And during boot up of this guest, I saw: failed yield_to() because task is running: 8368810426 successful yield_to(): 13077658 0.156022% of yield_to calls 1 out of 640 yield_to calls Obviously, we have a problem. Every exit causes a loop over 80 vcpus, each one trying to get two locks. This is happening on all [but one] vcpus at around the same time. Not going to work well. So, since the check for a running task is nearly always true, I moved that -before- the double runqueue lock, so 99.84% of the attempts do not take the locks. Now, I do not know if this [not getting the locks] is a problem. However, I'd rather have a slightly inaccurate test for a running vcpu than burn 98% of CPU in the host kernel. With the change the VM boot time went to: 100 seconds, an 85% reduction in time. I also wanted to check that this did not affect truly over-committed situations, so I first started with smaller VMs at 2x cpu over-commit: 16 VMs, 8-way each, all running dbench (2x cpu over-commit) throughput +/- stddev - - ple off:2281 +/- 7.32% (really bad as expected) ple on:19796 +/- 1.36% ple on: w/fix: 19796 +/- 1.37% (no degrade at all) In this case the VMs are small enough that we do not loop through enough vcpus to trigger the problem.
host CPU is very low (3-4% range) for both default ple and with yield_to() fix. So I went on to a bigger VM: 10 VMs, 16-way each, all running dbench (2x cpu over-commit) throughput +/- stddev - - ple on: 2552 +/- .70% ple on: w/fix: 4621 +/- 2.12% (81% improvement!) This is where we start seeing a major difference. Without the fix, host cpu was around 70%, mostly in spin_lock. That was reduced to 60% (and guest went from 30 to 40%). I believe this is on the right track to reduce the spin lock contention, still get proper directed yield, and therefore improve the CPU available to the guest and its performance. However, we still have lock contention, and I think we can reduce it even more. We have eliminated some attempts at double runqueue lock acquire because the check for whether the target vcpu is running is now before the lock. However, even if the target-to-yield-to vcpu [for the same guest on which we PLE-exited] is not running, the physical processor/runqueue that the target-to-yield-to vcpu is located on could be running a different VM's vcpu -and- going through a directed yield, therefore that run queue lock may already be acquired. We do not want to just spin and wait, we want to move to the next candidate vcpu. We need a check to see if the smp processor/runqueue is already in a directed yield. Or, perhaps we just check if that cpu is not in guest mode, and if so, we skip that yield attempt for that vcpu and move to the next candidate vcpu. So, my question is: given a runqueue, what's the best way to check if that corresponding phys cpu is not in guest mode?
Here are the changes so far (schedstat changes not included here): signed-off-by: Andrew Theurer diff --git a/kernel/sched/core.c b/kernel/sched/core.c index fbf1fd0..f8eff8c 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool preempt) again: p_rq = task_rq(p); + if (task_running(p_rq, p) || p->state) { + goto out_no_unlock; + } double_rq_lock(rq, p_rq); while (task_rq(p) != p_rq) { double_rq_unlock(rq, p_rq); @@ -4856,8 +4859,6 @@ again: if (curr->sched_class != p->sched_class) goto out; - if (task_running(p_rq, p) || p->state) - goto out; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4879,6 +4880,7 @@ again: out: double_rq_unlock(rq, p_rq); +out_no_unlock: local_irq_restore(flags); if (yielded)
pagemapscan-numa: find out where your multi-node VM's memory is (and a question)
We've started running multi-node VMs lately, and we want to track where the VM memory actually resides in the host. Our ability to manually pin certain parts of memory is limited, so we want to see whether those pinnings are effective. We also want to see what things like autoNUMA are doing. Below is the program if anyone else wants to track memory placement. One question I have is about the memory allocation made by Qemu for the VM. This program assumes that there is a single mapping for VM memory, and that if the VM has N NUMA nodes, this mapping contains all the memory for all nodes, and the memory for nodes 0...N is ordered, such that node 0 is at the beginning and node N is at the end of the mapping. Is this a correct assumption? Or even close? If not, what can we do to extract this info from Qemu? We assume that the memory per node is equal, which I can see could be wrong, since the user could specify different memory sizes per node. But at this point I am willing to limit my tests to equal sized NUMA nodes. Here are some examples of the output: This is a 320GB, 80 vcpu, 4 node VM just booted on a 4 x WST-EX host. There is no pinning/numactl, and no autoNUMA, etc. > [root@at-host ~]# ./pagemapscan-numa 31021 4 > pid is 31021 > Processing /proc/zoneinfo for start_pfn's > host node 0 start pfn: 1048576 > host node 1 start pfn: 67633152 > host node 2 start pfn: 134742016 > host node 3 start pfn: 201850880 > > Processing /proc/31021/maps and /proc/31021/pagemap > > Found mapping 7F8033E0-7FD033E0 that is >2GiB (327680 MiB, > 83886080 pages) > Pages present for this mapping: > vNUMA-node-IDnode00node01node02node03 > 00 15872 8704 8192573952 > 01 5120 2560 1024512512 > 02 0 0 0508416 > 03 0 0 512508416 > 2145280/8380 total pages/MiB present for this process Next one is a 139GB, 16 vcpu, 2 node VM, on a 2 x WST-EP host. The VM uses PCI device assignment, so all the memory is allocated at VM creation. We also used hugetlbfs, and we used a mempolicy to prefer node 0.
We allocated just enough huge pages so that once Qemu allocated vNode0's memory, host node0's hugepages were depleted, and the remainder of the VM's memory (vNode1) was fulfilled by huge pages in host node1. At least that was our theory, and we wanted to confirm it worked: > pid is 26899 > Processing /proc/zoneinfo for start_pfn's > host node 0 start pfn: 1048576 > host node 1 start pfn: 19398656 > > processing /proc/26899/maps /proc/26899/pagemap > > Found mapping 7F713700-7F93F700 that is >2GiB (142336 MiB, > 36438016 pages) > Pages present for this mapping: > vNUMA-node-IDnode00node01 > 00 18219008 0 > 01 0 18219008 > 36438016/142336 total pages/MiB present for this process We suspect this might be useful in testing autoNUMA and other NUMA related tests. Thanks, -Andrew Theurer /* pagemapscan-numa.c v0.01 * * Copyright (c) 2012 IBM * * Author: Andrew Theurer * * This software is licensed to you under the GNU General Public License, * version 2 (GPLv2). There is NO WARRANTY for this software, express or * implied, including the implied warranties of MERCHANTABILITY or FITNESS * FOR A PARTICULAR PURPOSE. You should have received a copy of GPLv2 * along with this software; if not, see * http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt. * * pagemapscan-numa: This program will take a Qemu PID and a value * equal to the number of NUMA nodes that the VM should have and process * /proc/<pid>/pagemap, first finding the mapping that is for the VM memory, * and then finding where each page physically resides (which NUMA node) * on the host. This is only useful if you have a multi-node NUMA * topology on your host, and you have a multi-node NUMA topology * in your guest, and you want to know where the VM's memory maps to * on the host.
*/ #define _LARGEFILE64_SOURCE #include <stdio.h> #include <stdlib.h> #include <string.h> #include <fcntl.h> #include <unistd.h> #define MAX_NODES 256 #define MAX_LENGTH 256 #define PAGE_SIZE 4096 int main(int argc, char* argv[]) { int pagemapfile = -1; int nr_vm_nodes; FILE *mapsfile = NULL; FILE *zonefile = NULL; char pagemappath[MAX_LENGTH]; char mapspath[MAX_LENGTH]; char mapsline[MAX_LENGTH]; char zoneline[MAX_LENGTH]; long file_offset, offset; unsigned long start_addr, end_addr, size_b, size_mib, nr_pages, page, present_pages, present_mib, start_pfn[MAX_NODES]; int i, node, nr_host_nodes, find_node; if (argc != 3) { printf("You must provide two arguments, the Qemu PID and the number
Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
On Tue, 2012-07-10 at 17:24 +0530, Raghavendra K T wrote: > On 07/10/2012 03:17 AM, Andrew Theurer wrote: > > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote: > >> Currently Pause Loop Exit (PLE) handler is doing directed yield to a > >> random VCPU on PL exit. Though we already have filtering while choosing > >> the candidate to yield_to, we can do better. > > > > Hi, Raghu. > > > [...] > > > > Can you briefly explain the 1x and 2x configs? This of course is highly > > dependent whether or not HT is enabled... > > > > Sorry if I had not made very clear in earlier threads. Have you applied > Rik's following patch for base. without this you could see some > inconsistent results perhaps. > > https://lkml.org/lkml/2012/6/19/401 Yes, I do have that applied with your patch and in my baseline. -Andrew Theurer
Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler
On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote: > Currently Pause Loop Exit (PLE) handler is doing directed yield to a > random VCPU on PL exit. Though we already have filtering while choosing > the candidate to yield_to, we can do better. Hi, Raghu. > Problem is, for large vcpu guests, we have more probability of yielding > to a bad vcpu. We are not able to prevent directed yield to same guy who > has done PL exit recently, who perhaps spins again and wastes CPU. > > Fix that by keeping track of who has done PL exit. So The Algorithm in series > give chance to a VCPU which has: > > (a) Not done PLE exit at all (probably he is preempted lock-holder) > > (b) VCPU skipped in last iteration because it did PL exit, and probably > has become eligible now (next eligible lock holder) > > Future enhancements: > (1) Currently we have a boolean to decide on eligibility of vcpu. It > would be nice if I get feedback on guest (>32 vcpu) whether we can > improve better with integer counter. (with counter = say f(log n )). > > (2) We have not considered system load during iteration of vcpu. With >that information we can limit the scan and also decide whether schedule() >is better. [ I am able to use #kicked vcpus to decide on this But may >be there are better ideas like information from global loadavg.] > > (3) We can exploit this further with PV patches since it also knows about >next eligible lock-holder. > > Summary: There is a huge improvement for moderate / no overcommit scenario > for kvm based guest on PLE machine (which is difficult ;) ). > > Result: > Base : kernel 3.5.0-rc5 with Rik's Ple handler fix > > Machine : Intel(R) Xeon(R) CPU X7560 @ 2.27GHz, 4 numa node, 256GB RAM, > 32 core machine Is this with HT enabled, therefore 64 CPU threads? > Host: enterprise linux gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC) > with test kernels > > Guest: fedora 16 with 32 vcpus 8GB memory. Can you briefly explain the 1x and 2x configs?
This of course is highly dependent whether or not HT is enabled... FWIW, I started testing what I would call "0.5x", where I have one 40 vcpu guest running on a host with 40 cores and 80 CPU threads total (HT enabled, no extra load on the system). For ebizzy, the results are quite erratic from run to run, so I am inclined to discard it as a workload, but maybe I should try "1x" and "2x" cpu over-commit as well. >From initial observations, at least for the ebizzy workload, the percentage of exits that result in a yield_to() are very low, around 1%, before these patches. So, I am concerned that at least for this test, reducing that number even more has diminishing returns. I am however still concerned about the scalability problem with yield_to(), which shows like this for me (perf): > 63.56% 282095 qemu-kvm [kernel.kallsyms][k] > _raw_spin_lock > 5.42% 24420 qemu-kvm [kvm][k] > kvm_vcpu_yield_to > 5.33% 26481 qemu-kvm [kernel.kallsyms][k] get_pid_task > > 4.35% 20049 qemu-kvm [kernel.kallsyms][k] yield_to > > 2.74% 15652 qemu-kvm [kvm][k] > kvm_apic_present > 1.70% 8657 qemu-kvm [kvm][k] > kvm_vcpu_on_spin > 1.45% 7889 qemu-kvm [kvm][k] > vcpu_enter_guest For the cpu threads in the host that are actually active (in this case 1/2 of them), ~50% of their time is in kernel and ~43% in guest. This is for a no-IO workload, so that's just incredible to see so much cpu wasted. I feel that 2 important areas to tackle are a more scalable yield_to() and reducing the number of pause exits itself (hopefully by just tuning ple_window for the latter). Honestly, I not confident addressing this problem will improve the ebizzy score. That workload is so erratic for me, that I do not trust the results at all. I have however seen consistent improvements in disabling PLE for a http guest workload and a very high IOPS guest workload, both with much time spent in host in the double runqueue lock for yield_to(), so that's why I still gravitate toward that issue. 
-Andrew Theurer
Re: [PATCH] add PLE stats to kvmstat
On Sat, 2012-07-07 at 01:40 +0800, Xiao Guangrong wrote: > On 07/06/2012 09:22 PM, Andrew Theurer wrote: > > On Fri, 2012-07-06 at 15:42 +0800, Xiao Guangrong wrote: > >> On 07/06/2012 05:50 AM, Andrew Theurer wrote: > >>> I, and I expect others, have a keen interest in knowing how often we > >>> exit for PLE, and also how often that includes a yielding to another > >>> vcpu. The following adds two more counters to kvmstat to track the > >>> exits and the vcpu yields. This in no way changes PLE behavior, just > >>> helps us track what's going on. > >>> > >> > >> Tracepoint is a better choice than the counters you used. :) > > > > Xiao, is kvmstat considered to be deprecated? Or are debug stats like > > this just generally favored to be processed via something like perf > > instead of debugfs? > > Andrew, please refer to Documentation/feature-removal-schedule.txt, > it says: > > What: KVM debugfs statistics > When: 2013 > Why:KVM tracepoints provide mostly equivalent information in a much more > flexible fashion. > > You can use tracepoints instead of your debugfs-counters in this patch. Great, thanks. I will work on a tracepoint based approach. > > > Should we be removing kvmstat? > > > > Some months ago, i implemented 'perf kvm-events' which can analyse kvm > events more smartly, the patchset can be found at: > https://lkml.org/lkml/2012/3/6/86 I will take a look. > > Avi said it may replace kvmstat, but i am too busy to update this > patchset. :) > -Andrew
Re: [PATCH] add PLE stats to kvmstat
On Fri, 2012-07-06 at 15:42 +0800, Xiao Guangrong wrote: > On 07/06/2012 05:50 AM, Andrew Theurer wrote: > > I, and I expect others, have a keen interest in knowing how often we > > exit for PLE, and also how often that includes a yielding to another > > vcpu. The following adds two more counters to kvmstat to track the > > exits and the vcpu yields. This in no way changes PLE behavior, just > > helps us track what's going on. > > > > Tracepoint is a better choice than the counters you used. :) Xiao, is kvmstat considered to be deprecated? Or are debug stats like this just generally favored to be processed via something like perf instead of debugfs? Should we be removing kvmstat? Thanks, -Andrew Theurer
[PATCH] add PLE stats to kvmstat
I, and I expect others, have a keen interest in knowing how often we exit for PLE, and also how often that includes a yield to another vcpu. The following adds two more counters to kvmstat to track the exits and the vcpu yields. This in no way changes PLE behavior; it just helps us track what's going on.

-Andrew Theurer

Signed-off-by: Andrew Theurer

 arch/x86/include/asm/kvm_host.h |    2 ++
 arch/x86/kvm/svm.c              |    1 +
 arch/x86/kvm/vmx.c              |    1 +
 arch/x86/kvm/x86.c              |    2 ++
 virt/kvm/kvm_main.c             |    1 +
 5 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24b7647..aebba8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -593,6 +593,8 @@ struct kvm_vcpu_stat {
 	u32 hypercalls;
 	u32 irq_injections;
 	u32 nmi_injections;
+	u32 pause_exits;
+	u32 vcpu_yield_to;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7a41878..1c1b81e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm *svm)
 
 static int pause_interception(struct vcpu_svm *svm)
 {
+	++svm->vcpu.stat.pause_exits;
 	kvm_vcpu_on_spin(&(svm->vcpu));
 	return 1;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eeeb4a2..1309578 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5004,6 +5004,7 @@ out:
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+	++vcpu->stat.pause_exits;
 	skip_emulated_instruction(vcpu);
 	kvm_vcpu_on_spin(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8eacb2e..ad85403 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -143,6 +143,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
 	{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
 	{ "irq_injections", VCPU_STAT(irq_injections) },
 	{ "nmi_injections", VCPU_STAT(nmi_injections) },
+	{ "pause_exits", VCPU_STAT(pause_exits) },
+	{ "vcpu_yield_to", VCPU_STAT(vcpu_yield_to) },
 	{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
 	{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
 	{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 636bd08..d80b6cd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1610,6 +1610,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 			if (kvm_vcpu_yield_to(vcpu)) {
 				kvm->last_boosted_vcpu = i;
 				yielded = 1;
+				++vcpu->stat.vcpu_yield_to;
 				break;
 			}
 		}
Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case
On Mon, 2012-07-02 at 10:49 -0400, Rik van Riel wrote: > On 06/28/2012 06:55 PM, Vinod, Chegu wrote: > > Hello, > > > > I am just catching up on this email thread... > > > > Perhaps one of you may be able to help answer this query.. preferably along > > with some data. [BTW, I do understand the basic intent behind PLE in a > > typical [sweet spot] use case where there is over subscription etc. and the > > need to optimize the PLE handler in the host etc. ] > > > > In a use case where the host has fewer but much larger guests (say 40VCPUs > > and higher) and there is no over subscription (i.e. # of vcpus across > > guests<= physical cpus in the host and perhaps each guest has their vcpu's > > pinned to specific physical cpus for other reasons), I would like to > > understand if/how the PLE really helps ? For these use cases would it be > > ok to turn PLE off (ple_gap=0) since is no real need to take an exit and > > find some other VCPU to yield to ? > > Yes, that should be ok. > > On a related note, I wonder if we should increase the ple_gap > significantly. > > After all, 4096 cycles of spinning is not that much, when you > consider how much time is spent doing the subsequent vmexit, > scanning the other VCPU's status (200 cycles per cache miss), > deciding what to do, maybe poking another CPU, and eventually > a vmenter. > > A factor 4 increase in ple_gap might be what it takes to > get the amount of time spent spinning equal to the amount of > time spent on the host side doing KVM stuff... I was recently thinking the same thing as I have observed over 180,000 exits/sec from a 40-way VM on a 80-way host, where there should be no cpu overcommit. Also, the number of directed yields for this was only 1800/sec, so we have a 1% usefulness for our exits. I am wondering if the ple_window should be similar to the host scheduler task switching granularity, and not what we think a typical max cycles should be for holding a lock. 
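The "1% usefulness" figure above follows directly from the two measured rates. As a quick sketch of the arithmetic (the sample numbers are the ones measured above: 180,000 exits/sec and 1,800 directed yields/sec):

```shell
# Sketch: what fraction of PLE exits produced a useful directed yield?
pause_exits=180000
vcpu_yield_to=1800
awk -v e="$pause_exits" -v y="$vcpu_yield_to" \
    'BEGIN { printf "%.1f%% of PLE exits led to a yield\n", 100 * y / e }'
# → 1.0% of PLE exits led to a yield
```

With per-exit counters available in kvmstat (as in the patch posted in this thread), the same ratio can be computed from counter deltas over any sampling interval.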
BTW, I have a patch to add a couple PLE stats to kvmstat which I will send out shortly. -Andrew
Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr
On 05/22/2012 04:28 AM, Liu ping fan wrote: On Sat, May 19, 2012 at 12:14 AM, Shirley Ma wrote: On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:

Currently, the guest cannot know the NUMA info of the vcpu, which will result in a performance drawback. This was discovered and experimented with by Shirley Ma, Krishna Kumar, and Tom Lendacky. Refer to http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html, where we can see the big performance gap between NUMA-aware and NUMA-unaware. Enlightened by their discovery, I think we can do more work -- that is, to export the NUMA info of the host to the guest. There are three problems we've found:

1. KVM doesn't support a NUMA load balancer. Even when there are no other workloads in the system, and the number of vcpus in the guest is smaller than the number of cpus per node, the vcpus could be scheduled on different nodes.

Someone is working on an in-kernel solution. Andrew Theurer has a working user-space NUMA-aware VM balancer; it requires libvirt and cgroups (which is the default for RHEL6 systems).

Interesting, and I found that "sched/numa: Introduce sys_numa_{t,m}bind()" committed by Peter and Ingo may help. But I think from the guest's view, it cannot tell whether two vcpus are on the same host node. For example, vcpu-a in node-A is not vcpu-b in node-B, so the guest load balancer will pay more if it does pull_task from vcpu-a and chooses vcpu-b to push to. And my idea is to export such info to the guest; still working on it.

The long term solution is two-fold: 1) Guests that are quite large (in that they cannot fit in a host NUMA node) must have a static multi-node NUMA topology implemented by Qemu. That is here today, but we do not do it automatically, which is probably going to be a VM management responsibility. 2) The host scheduler and NUMA code must be enhanced to get better placement of Qemu memory and threads. For single-node vNUMA guests this is easy: put it all in one node.
For multi-node vNUMA guests, the host must understand that some Qemu memory belongs with certain vCPU threads (which make up one of the guest's vNUMA nodes), and then place that memory/threads in a specific host node (and continue for other memory/threads for each Qemu vNUMA node). Note that even if a guest's memory/threads for a vNUMA node are relocated to another host node (which will be necessary), the NUMA characteristics of the guest are still maintained (as all those vCPUs and memory are still "close" to each other).

The problem with exposing the host's NUMA info directly to the guest is that (1) vCPUs will get relocated, so their topology info in the guest will have to change over time. IMO that is a bad idea. We have a hard enough time getting applications to work with static NUMA info; getting applications to react to a changing NUMA topology is not going to turn out well. (2) Every single guest would have to have the same number of NUMA nodes defined as the host. That is overkill, especially for small guests.

2. The host scheduler is not aware of the relationship between guest vCPUs and vhost. So it's possible for the host scheduler to schedule the per-device vhost thread on the same cpu on which the vCPU kicked a TX packet, or to schedule the vhost thread on a different node than the vCPU; for RX it's possible for vhost to deliver the RX packet to a vCPU running on a different node too.

Yes. I noticed this point in your original patch.

3. The per-device vhost thread does not scale. What about the scalability of per-VM * host_NUMA_NODE? When we take advantage of multi-core, we create multiple vcpu threads for one VM. So what about the emulated device? Is it acceptable to scale it to take advantage of the host NUMA attributes? After all, how many nodes the VM can run on is under the user's control. It is a balance of scalability and performance.

So the problems are in host scheduling and vhost thread scalability. I am not sure how much help exposing NUMA info from host to guest would be.
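The static multi-node topology mentioned in (1) is expressed on the qemu command line. A hedged sketch of what such an invocation might look like for a two-node guest (cpu ranges and memory sizes in MB are illustrative; whether `-numa` accepts exactly these suboptions depends on the qemu version — check `qemu-system-x86_64 -help` before relying on it):

```shell
# Illustrative only: a 16-vCPU, 32 GB guest split into two vNUMA cells.
qemu-system-x86_64 -smp 16 -m 32768 \
    -numa node,cpus=0-7,mem=16384 \
    -numa node,cpus=8-15,mem=16384
```

With this in place, the guest kernel sees two NUMA nodes and can make locality-aware scheduling and allocation decisions, independent of where the host actually places the backing memory.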
Have you tested these patches? How much performance gain is there?

Sorry, not yet. As you have mentioned, the vhost thread scalability is a big problem, so I want to see others' opinions before going on. Thanks and regards, pingfan

Thanks Shirley

So here comes the idea: 1. Export host NUMA info through the guest's sched domains to its scheduler. Export the vcpu's NUMA info to the guest scheduler (I think the memory NUMA problem has been handled by the host), so the guest's load balancer will consider the cost. I am still working on this, and my original idea is to export this info through "static struct sched_domain_topology_level *sched_domain_topology" to the guest. 2. Do a better emulation of the virtual machine exported to the guest. In the real world, devices are limited for various reasons in owning a NUMA property. But in Qemu, the device is emulated by a thread, which inherits the NUMA attribute naturally. We can implement the d
Re: gettimeofday() vsyscall for kvm-clock?
On 05/21/2012 03:36 PM, Marcelo Tosatti wrote: On Mon, May 21, 2012 at 03:26:54PM -0500, Andrew Theurer wrote:

Wondering if a user-space gettimeofday() for kvm-clock has been considered before. I am seeing a pretty large difference in performance between tsc and kvm-clock. I have to assume at least some of this is due to the mode switch for kvm-clock. Here are the results (this is a 16 vCPU VM on a 16 thread 2S Nehalem-EP host, looped gettimeofday() calls on all vCPUs): tsc: .0645 usec per call, kvm-clock: .4222 usec per call (6.54x). -Andrew Theurer

https://bugzilla.redhat.com/show_bug.cgi?id=679207 "model name : Intel(R) Xeon(R) CPU E5540 @ 2.53GHz native, gettimeofday (vsyscall): 45ns guest, kvmclock (syscall): 198ns" But this was before commit 489fb490dbf8dab0249ad82b56688ae3842a79e8 Author: Glauber Costa Date: Tue May 11 12:17:40 2010 -0400 x86, paravirt: Add a global synchronization point for pvclock (see the full changelog for details). Can you try disabling the global variable, to see if that makes a difference (should not be enabled in production)? Untested patch (against guest kernel) below.

The following was re-done on a 3.4 guest kernel (previously a RHEL kernel):

1-way:  tsc: .0315  kvm-clock: .2112 (6.7x)
16-way: tsc: .0432  kvm-clock: .4825 (11.1x)

Now with the global var disabled: 16-way: kvm-clock: .4628

Does not look like much of a difference. -Andrew
gettimeofday() vsyscall for kvm-clock?
Wondering if a user-space gettimeofday() for kvm-clock has been considered before. I am seeing a pretty large difference in performance between tsc and kvm-clock. I have to assume at least some of this is due to the mode switch for kvm-clock. Here are the results (this is a 16 vCPU VM on a 16 thread 2S Nehalem-EP host, looped gettimeofday() calls on all vCPUs): tsc: .0645 usec per call, kvm-clock: .4222 usec per call (6.54x). -Andrew Theurer
Re: perf stat to collect the performance statistics of KVM process
On 05/13/2012 10:56 AM, Hailong Yang wrote:

Dear all, I am running perf stat to collect the performance statistics of an already-running KVM guest VM process, and I have a shell script to execute perf stat as a daemon process. But when I use the 'kill' command to stop perf stat, there is no output redirected to the file. What I would like to do is collect performance counters of the guest VM process for a certain period and redirect the output to a log file, but without user interaction (such as using CTRL + C to stop perf stat):

[root@dell06 ~]# (perf stat -p 7473 -x ,) 2> perftest &
[1] 15086
[root@dell06 ~]# kill 15086
[root@dell06 ~]#
[1]+  Terminated    ( perf stat -p 7473 -x , ) 2> perftest
[root@dell06 ~]# cat perftest
[root@dell06 ~]#

Any clue? Best Regards, Hailong

Can you please try "kill -s INT <pid>"? -Andrew Theurer
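One way to avoid interaction entirely is to let `timeout` deliver the SIGINT: unlike the default SIGTERM from a bare `kill`, SIGINT gives perf stat the chance to print its summary before exiting. A sketch (the perf command is illustrative and `$GUEST_PID` is a placeholder):

```shell
# Collect counters for a fixed 60 s window, no Ctrl-C needed:
#   timeout -s INT 60 perf stat -p "$GUEST_PID" -x , 2> perftest
# The mechanism itself, demonstrated on a stand-in long-running process
# ('timeout' exits with status 124 when its time limit fires):
rc=0
timeout -s INT 1 sleep 5 || rc=$?
echo "exit status: $rc"   # → exit status: 124
```

The same pattern works from cron or a monitoring script, since no terminal is required.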
Re: How to determine the backing host physical memory for a given guest ?
On 05/09/2012 08:46 AM, Avi Kivity wrote: On 05/09/2012 04:05 PM, Chegu Vinod wrote:

Hello, on an 8 socket Westmere host I am attempting to run a single guest and characterize the virtualization overhead for a system-intensive workload (AIM7-high_systime) as the size of the guest scales (10way/64G, 20way/128G, ... 80way/512G), to do some comparisons between the native vs. guest runs. I have been using "numactl" to control the cpu node & memory node bindings for the qemu instance. For larger guest sizes I end up binding across multiple localities, e.g. for a 40 way guest:

numactl --cpunodebind=0,1,2,3 --membind=0,1,2,3 \
    qemu-system-x86_64 -smp 40 -m 262144 \ <>

I understand that the actual mappings from a guest virtual address to a host physical address could change. Is there a way to determine [at a given instant] which host NUMA node is providing the backing physical memory for the active guest's kernel and also for the apps actively running in the guest? Guessing that there is a better way (some tool available?) than just diff'ing the per-node memory usage from the before and after output of "numactl --hardware" on the host.

Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.

You can look at /proc/<pid>/numa_maps and see all the mappings for the qemu process. There should be one really large mapping for the guest memory, and on that line a dirty-page count, potentially for each NUMA node. This will tell you how much comes from each node, but not specifically "which page is mapped where". Keep in mind that with the current numactl you are using, you will likely not get the benefits of the NUMA enhancements found in the linux kernel from your guest (or host). There are a couple of reasons: (1) your guest does not have a NUMA topology defined (based on what I see from the qemu command above), so it will not do anything special based on the host topology.
Also, things that are broken down per NUMA node, like some spin-locks and sched-domains, are now system-wide/flat. This is a big deal for the scheduler and other things like kmem allocation. With a single 80-way VM with no NUMA, you will likely have massive spin-lock contention on some workloads. (2) Once the VM does have a NUMA topology (via qemu -numa), one still cannot manually set a mempolicy for the portion of the VM memory that represents each NUMA node in the VM (or have this done automatically with something like autoNUMA). Therefore, it's difficult to forcefully map each of the VM's nodes' memory to the corresponding host node.

There are some things you can do to mitigate this. Definitely define the VM to match the NUMA topology found on the host. That will at least allow good scaling wrt locks and the scheduler in the guest. As for getting memory placement close (a page in VM node x actually resides in host node x), you have to rely on vcpu pinning + guest NUMA topology, combined with default mempolicy in the guest and host. As pages are faulted in the guest, the hope is that the vcpu which did the faulting is running in the right node (guest and host), its guest OS mempolicy ensures this page is allocated in the guest-local node, and that allocation causes a fault in qemu, which is -also- running on the -host- node X. The vcpu pinning is critical to get qemu to fault that memory to the correct node. Make sure you do not use numactl for any of this. I would suggest using libvirt and defining the vcpu pinning and the numa topology in the XML. -Andrew Theurer
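The per-node counts in numa_maps are the `N<node>=<pages>` fields on each mapping's line, so totaling them for the large guest mapping gives the placement at that instant. A sketch against a fabricated sample line (the mapping address and page counts here are made up; on a real system, read `/proc/<qemu-pid>/numa_maps` instead):

```shell
# Fake numa_maps excerpt standing in for /proc/<qemu-pid>/numa_maps:
cat <<'EOF' > /tmp/numa_maps.sample
7f2c40000000 default anon=4194304 dirty=4194304 N0=2097152 N1=2097152
EOF
# Print the pages resident on each NUMA node for the mapping:
awk '{ for (i = 1; i <= NF; i++)
         if ($i ~ /^N[0-9]+=/) { split($i, f, "="); print f[1], f[2] } }' \
    /tmp/numa_maps.sample
# → N0 2097152
# → N1 2097152
```

Multiply the page counts by the page size (note hugepage mappings report in hugepage units) to get bytes per node.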
Re: performance of virtual functions compared to virtio
On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote: > > On 04/25/11 13:29, Alex Williamson wrote: > > So we're effectively getting host-host latency/throughput for the VF, > > it's just that in the 82576 implementation of SR-IOV, the VF takes a > > latency hit that puts it pretty close to virtio. Unfortunate. I think > > For host-to-VM using VFs is worse than virtio which is counterintuitive. > > > you'll find that passing the PF to the guests should be pretty close to > > that 185us latency. I would assume (hope) the higher end NICs reduce > > About that 185usec: do you know where the bottleneck is? It seems as if > the packet is held in some queue waiting for an event/timeout before it > is transmitted.

You might want to check the VF driver. I know versions of the ixgbevf driver have a throttled-interrupt option which will increase latency with some settings. I don't remember if the igbvf driver has the same feature. If it does, you will want to turn this option off for best latency.

> > David > > > > this, but it seems to be a hardware limitation, so it's hard to predict. > > Thanks, > > > > Alex

-Andrew
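For reference, on ixgbevf the throttling knob has historically been the InterruptThrottleRate module parameter (0 disables throttling); whether igbvf has an equivalent should be checked with modinfo. A hedged sketch of the modprobe configuration — the parameter name may differ by driver version, so verify with `modinfo ixgbevf` before using it:

```shell
# Sketch of /etc/modprobe.d/ixgbevf.conf contents (confirm the parameter
# name with 'modinfo ixgbevf' for your driver version):
options ixgbevf InterruptThrottleRate=0
```

The driver must be reloaded (disruptive to traffic on those VFs) for the setting to take effect.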
Re: Network performance with small packets
On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote: > On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote: > > I've finally read this thread... I think we need to get more serious > > with our stats gathering to diagnose these kind of performance issues. > > > > This is a start; it should tell us what is actually happening to the > > virtio ring(s) without significant performance impact... > > Should we also add similar stat on vhost vq as well for monitoring > vhost_signal & vhost_notify? Tom L has started using Rusty's patches and found some interesting results, sent yesterday: http://marc.info/?l=kvm&m=129953710930124&w=2 -Andrew > > Shirley
Re: [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net
On Mon, 2011-02-28 at 12:04 +0530, Krishna Kumar wrote: > This patch series is a continuation of an earlier one that > implemented guest MQ TX functionality. This new patchset > implements both RX and TX MQ. Qemu changes are not being > included at this time solely to aid in easier review. > Compatibility testing with old/new combinations of qemu/guest > and vhost was done without any issues. > > Some early TCP/UDP test results are at the bottom of this > post, I plan to submit more test results in the coming days. > > Please review and provide feedback on what can improve. > > Thanks! > > Signed-off-by: Krishna Kumar > --- > > > Test configuration: > Host: 8 Intel Xeon, 8 GB memory > Guest: 4 cpus, 2 GB memory > > Each test case runs for 60 secs, results below are average over > two runs. Bandwidth numbers are in gbps. I have used default > netperf, and no testing/system tuning other than taskset each > vhost to 0xf (cpus 0-3). Comparison is testing original kernel > vs new kernel with #txqs=8 ("#" refers to number of netperf > sessions). > > ___ > TCP: Guest -> Local Host (TCP_STREAM) > TCP: Local Host -> Guest (TCP_MAERTS) > UDP: Local Host -> Guest (UDP_STREAM)

Any reason why the tests don't include guest-to-guest on the same host, or on different hosts? Seems like those would be a lot more common than guest-to/from-localhost. Thanks, -Andrew
Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote: > On 08/31/2010 03:54 PM, Andrew Theurer wrote: > > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote: > > > >> On 08/23/2010 04:16 PM, Andre Przywara wrote: > >> > >>> Anthony Liguori wrote: > >>> > >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote: > >>>> > >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote: > >>>>> > >>>>>> According to the user-provided assignment bind the respective part > >>>>>> of the guest's memory to the given host node. This uses Linux' > >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the > >>>>>> pinning right after the allocation. > >>>>>> Failures are not fatal, but produce a warning. > >>>>>> > >>>>>> Signed-off-by: Andre Przywara > >>>>>> ... > >>>>>> > >>>>> Why is it not possible (or perhaps not desired) to change the binding > >>>>> after the guest is started? > >>>>> > >>>>> Sounds unflexible. > >>>>> > >>> The solution is to introduce a monitor interface to later adjust the > >>> pinning, allowing both changing the affinity only (only valid for > >>> future fault-ins) and actually copying the memory (more costly). > >>> > >> This is just duplicating numactl. > >> > >> > >>> Actually this is the next item on my list, but I wanted to bring up > >>> the basics first to avoid recoding parts afterwards. Also I am not > >>> (yet) familiar with the QMP protocol. > >>> > >>>> We really need a solution that lets a user use a tool like numactl > >>>> outside of the QEMU instance. > >>>> > >>> I fear that is not how it's meant to work with the Linux' NUMA API. In > >>> opposite to the VCPU threads, which are externally visible entities > >>> (PIDs), the memory should be private to the QEMU process. While you > >>> can change the NUMA allocation policy of the _whole_ process, there is > >>> no way to externally distinguish parts of the process' memory. 
> >>> Although you could later (and externally) migrate already faulted > >>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you > >>> would let an external tool interfere with QEMUs internal memory > >>> management. Take for instance the change of the allocation policy > >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to > >>> either track such changes or you simply could not change such things > >>> in QEMU. > >>> > >> It's extremely likely that if you're doing NUMA pinning, you're also > >> doing large pages via hugetlbfs. numactl can already set policies for > >> files in hugetlbfs so all you need to do is have a separate hugetlbfs > >> file for each numa node. > >> > > Why would we resort to hugetlbfs when we have transparent hugepages? > > > > If you care about NUMA pinning, I can't believe you don't want > guaranteed large page allocation which THP does not provide. I personally want a more automatic approach to placing VMs in NUMA nodes (not directed by the qemu process itself), but I'd also like to support a user's desire to pin and place cpus and memory, especially for large VMs that need to be defined as multi-node. For user defined pinning, libhugetlbfs will probably be fine, but for most VMs, I'd like to ensure we can do things like ballooning well, and I am not so sure that will be easy with libhugetlbfs. > The general point though is that we should find a way to partition > memory in qemu such that an external process can control the actual NUMA > placement. This gives us maximum flexibility. > > Otherwise, what do we implement in QEMU? Direct pinning of memory to > nodes? Can we migrate memory between nodes? Should we support > interleaving memory between two virtual nodes? Why pick and choose when > we can have it all. If there were a better way to do this than hugetlbfs, then I don't think I would shy away from this. Is there another way to change NUMA policies on mappings from a user tool? 
We can already inspect with /proc/<pid>/numa_maps. Is this something that could be a
Re: [PATCH 4/4] NUMA: realize NUMA memory pinning
On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote: > On 08/23/2010 04:16 PM, Andre Przywara wrote: > > Anthony Liguori wrote: > >> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote: > >>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote: > >>>> According to the user-provided assignment bind the respective part > >>>> of the guest's memory to the given host node. This uses Linux' > >>>> mbind syscall (which is wrapped only in libnuma) to realize the > >>>> pinning right after the allocation. > >>>> Failures are not fatal, but produce a warning. > >>>> > >>>> Signed-off-by: Andre Przywara > > >>> ... > >>> Why is it not possible (or perhaps not desired) to change the binding > >>> after the guest is started? > >>> > >>> Sounds unflexible. > > The solution is to introduce a monitor interface to later adjust the > > pinning, allowing both changing the affinity only (only valid for > > future fault-ins) and actually copying the memory (more costly). > > This is just duplicating numactl. > > > Actually this is the next item on my list, but I wanted to bring up > > the basics first to avoid recoding parts afterwards. Also I am not > > (yet) familiar with the QMP protocol. > >> > >> We really need a solution that lets a user use a tool like numactl > >> outside of the QEMU instance. > > I fear that is not how it's meant to work with the Linux' NUMA API. In > > opposite to the VCPU threads, which are externally visible entities > > (PIDs), the memory should be private to the QEMU process. While you > > can change the NUMA allocation policy of the _whole_ process, there is > > no way to externally distinguish parts of the process' memory. > > Although you could later (and externally) migrate already faulted > > pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you > > would let an external tool interfere with QEMUs internal memory > > management. Take for instance the change of the allocation policy > > regarding the 1MB and 3.5-4GB holes. 
An external tool would have to > > either track such changes or you simply could not change such things > > in QEMU. > > It's extremely likely that if you're doing NUMA pinning, you're also > doing large pages via hugetlbfs. numactl can already set policies for > files in hugetlbfs so all you need to do is have a separate hugetlbfs > file for each numa node. Why would we resort to hugetlbfs when we have transparent hugepages? FWIW, large apps like databases have set a precedent for managing their own NUMA policies. I don't see why qemu should be any different. Numactl is great for small apps that need to be pinned in one node, or spread evenly on all nodes. Having to get hugetlbfs involved just to workaround a shortcoming of numactl just seems like a bad idea. > > Then you have all the flexibility of numactl and you can implement node > migration external to QEMU if you so desire. > > > So what is wrong with keeping that code in QEMU, which knows best > > about the internals and already has flexible and mighty ways (command > > line and QMP) of manipulating its behavior? > > NUMA is a last-mile optimization. For the audience that cares about > this level of optimization, only providing an interface that allows a > small set of those optimizations to be used is unacceptable. > > There's a very simple way to do this right and that's by adding > interfaces to QEMU that let's us work with existing tooling instead of > inventing new interfaces. > > Regards, > > Anthony Liguori > > > Regards, > > Andre. -Andrew Theurer
windows workload: many ept_violation and mmio exits
I am running a windows workload which has 26 windows VMs running many instances of a J2EE workload. There are 13 pairs of an application server VM and a database server VM. There seem to be quite a lot of vm_exits, and it looks like over a third of them are mmio_exits:

efer_relo 0
exits 337139
fpu_reloa 247321
halt_exit 19092
halt_wake 18611
host_stat 247332
hypercall 0
insn_emul 184265
insn_emul 184265
invlpg 0
io_exits 69184
irq_exits 52953
irq_injec 48115
irq_windo 2411
largepage 19
mmio_exit 123554
mmu_cache 0
mmu_flood 0
mmu_pde_z 0
mmu_pte_u 0
mmu_pte_w 0
mmu_recyc 0
mmu_shado 0
mmu_unsyn 0
nmi_injec 0
nmi_windo 0
pf_fixed 19
pf_guest 0
remote_tl 0
request_i 0
signal_ex 0
tlb_flush 0

I collected a kvmtrace, and below is a very small portion of that. Is there a way I can figure out what device the mmio's are for? Also, is it normal to have lots of ept_violations? This is a 2 socket Nehalem system with SMT on.

qemu-system-x86-19673 [014] 213577.939614: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939624: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 213577.939624: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939627: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19673 [014] 213577.939629: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f214d
qemu-system-x86-19673 [014] 213577.939631: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939633: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 213577.939634: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939636: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19332 [008] 213577.939637: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939638: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f24e2
qemu-system-x86-19673 [014] 213577.939640: kvm_entry: vcpu 0
qemu-system-x86-19211 [010] 213577.939663: kvm_set_irq: gsi 11 level 1 source 0
qemu-system-x86-19211 [010] 213577.939664: kvm_pic_set_irq: chip 1 pin 3 (level|masked)
qemu-system-x86-19211 [010] 213577.939665: kvm_apic_accept_irq: apicid 0 vec 130 (LowPrio|level)
qemu-system-x86-19211 [010] 213577.939666: kvm_ioapic_set_irq: pin 11 dst 1 vec=130 (LowPrio|logical|level)
qemu-system-x86-19673 [014] 213577.939692: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 213577.939693: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939696: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19332 [008] 213577.939699: kvm_exit: reason ept_violation rip 0xf80001b3af8e
qemu-system-x86-19332 [008] 213577.939700: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939702: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f3da6
qemu-system-x86-19563 [010] 213577.939702: kvm_set_irq: gsi 11 level 1 source 0
qemu-system-x86-19563 [010] 213577.939703: kvm_pic_set_irq: chip 1 pin 3 (level|masked)
qemu-system-x86-19673 [014] 213577.939704: kvm_entry: vcpu 0
qemu-system-x86-19563 [010] 213577.939705: kvm_apic_accept_irq: apicid 0 vec 130 (LowPrio|level)
qemu-system-x86-19332 [008] 213577.939706: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19563 [010] 213577.939707: kvm_ioapic_set_irq: pin 11 dst 1 vec=130 (LowPrio|logical|level)
qemu-system-x86-19332 [008] 213577.939713: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0x29a105de
qemu-system-x86-19332 [008] 213577.939715: kvm_entry: vcpu 0
qemu-system-x86-19201 [011] 213577.939716: kvm_exit: reason exception rip 0x1162412
qemu-system-x86-19332 [008] 213577.939717: kvm_exit: reason halt rip 0xfa6000fae7a1
qemu-system-x86-19201 [011] 213577.939717: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939761: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 213577.939762: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939766: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19673 [014] 213577.939772: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f58dd
qemu-system-x86-19673 [014] 213577.939774: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939776: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 213577.939776: kvm_page_fault: address fed000f0 error_code 181
qemu-system-x86-19673 [014] 213577.939779: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19673 [014] 213577.939782: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f5d09
qemu-system-x86-19673 [014] 213577.939784: kvm_entry: vcpu 0
qemu-system-x86-19673 [014] 213577.939791: kvm_exit: reason ept_violation rip 0xf8000160ef8e
qemu-system-x86-19673 [014] 21
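To answer the "which device" question from a trace like this, one approach is to aggregate the kvm_mmio gpa values and compare the hot addresses against the guest's physical memory map (e.g. /proc/iomem or its equivalent inside the guest). For what it's worth, 0xfed000f0 falls in the 0xfed00000 region conventionally assigned to the HPET on PC platforms, which would fit the repeated 4-byte counter reads seen here, though the guest's own memory map is the authority. A sketch of the aggregation, run over a few sample records copied from the trace above:

```shell
# Sample of kvm_mmio records (stand-in for a real trace file):
cat <<'EOF' > /tmp/kvmtrace.sample
qemu-system-x86-19673 [014] 213577.939627: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
qemu-system-x86-19673 [014] 213577.939629: kvm_mmio: mmio read len 4 gpa 0xfed000f0 val 0xfb8f214d
qemu-system-x86-19332 [008] 213577.939706: kvm_mmio: mmio unsatisfied-read len 4 gpa 0xfed000f0 val 0x0
EOF
# Count mmio accesses per guest physical address, busiest first:
grep kvm_mmio /tmp/kvmtrace.sample \
    | awk '{ for (i = 1; i <= NF; i++) if ($i == "gpa") print $(i + 1) }' \
    | sort | uniq -c | sort -rn
# prints a count of 3 for 0xfed000f0 (every access hits one address)
```

Pointing the same pipeline at a full trace quickly shows whether the mmio_exit volume is dominated by one device register, as it appears to be here.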
Re: kernel bug in kvm_intel
On Sun, 2009-11-29 at 16:46 +0200, Avi Kivity wrote: > On 11/26/2009 03:35 AM, Andrew Theurer wrote: > > I just tried testing tip of kvm.git, but unfortunately I think I might > > be hitting a different problem, where processes run 100% in kernel > > mode. In my case, cpus 9 and 13 were stuck, running qemu processes. > > A stack backtrace for both cpus are below. FWIW, kernel.org > > 2.6.32-rc7 does not have this problem, or the original problem. > > I just posted a patch fixing this, titled "[PATCH tip:x86/entry] core: > fix user return notifier on fork()". > Thank you, Avi. I am running on this patch and am not seeing this problem anymore. I'll be testing for the previous issue next. -Andrew
Re: kernel bug in kvm_intel
Avi Kivity wrote:
On 11/26/2009 03:35 AM, Andrew Theurer wrote:

NMI backtrace for cpu 9
CPU 9:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1 -[7947AC1]-
RIP: 0010:[] [] fire_user_return_notifiers+0x31/0x36
RSP: 0018:88095024df08 EFLAGS: 0246
RAX: RBX: 0800 RCX: 88095024c000 RDX: 88002834
RSI: RDI: 88095024df58 RBP: 88095024df18
R08: R09: 0001 R10: 00caf1fff62d R11: 8805b584de40
R12: 7fffae48e0f0 R13: R14: 0001 R15:
FS: 7f45c69d57c0() GS:88002834() knlGS:
CS: 0010 DS: ES: CR0: 8005003b CR2: f9800121056e CR3: 000953d36000 CR4: 26e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Call Trace:
<#DB[1]> <>
Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
Call Trace:
[] ? show_regs+0x44/0x49
[] nmi_watchdog_tick+0xc2/0x1b9
[] do_nmi+0xb0/0x252
[] nmi+0x20/0x30
[] ? fire_user_return_notifiers+0x31/0x36
<>
[] do_notify_resume+0x62/0x69
[] ? int_check_syscall_exit_work+0x9/0x3d
[] int_signal+0x12/0x17

That's a bug with the new user return notifiers. Is your host kernel preemptible?

preempt is off.

I think I saw this once but I'm not sure. I can't reproduce with a host kernel build, some silly guest workload, and 'perf top' to generate an nmi load.
-Andrew
Re: kernel bug in kvm_intel
Tejun Heo wrote:
Hello,

11/01/2009 08:31 PM, Avi Kivity wrote:
Here is the code in question:

3ae7: 75 05          jne 3aee
3ae9: 0f 01 c2       vmlaunch
3aec: eb 03          jmp 3af1
3aee: 0f 01 c3       vmresume
3af1: 48 87 0c 24    xchg %rcx,(%rsp)
                     ^^^ fault, but not at (%rsp)

Can you please post the full oops (including kernel debug messages during boot) or give me a pointer to the original message?

http://www.mail-archive.com/kvm@vger.kernel.org/msg23458.html

Also, does the faulting address coincide with any symbol?

No (at least, not in System.map).

Has there been any progress? Is kvm + oprofile still broken?

I just tried testing tip of kvm.git, but unfortunately I think I might be hitting a different problem, where processes run 100% in kernel mode. In my case, cpus 9 and 13 were stuck, running qemu processes. Stack backtraces for both cpus are below. FWIW, kernel.org 2.6.32-rc7 does not have this problem, or the original problem.

NMI backtrace for cpu 9
CPU 9:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1 -[7947AC1]-
RIP: 0010:[] [] fire_user_return_notifiers+0x31/0x36
RSP: 0018:88095024df08 EFLAGS: 0246
RAX: RBX: 0800 RCX: 88095024c000 RDX: 88002834
RSI: RDI: 88095024df58 RBP: 88095024df18
R08: R09: 0001 R10: 00caf1fff62d R11: 8805b584de40
R12: 7fffae48e0f0 R13: R14: 0001 R15:
FS: 7f45c69d57c0() GS:88002834() knlGS:
CS: 0010 DS: ES: CR0: 8005003b CR2: f9800121056e CR3: 000953d36000 CR4: 26e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Call Trace:
<#DB[1]> <>
Pid: 5687, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
Call Trace:
[] ? show_regs+0x44/0x49
[] nmi_watchdog_tick+0xc2/0x1b9
[] do_nmi+0xb0/0x252
[] nmi+0x20/0x30
[] ? fire_user_return_notifiers+0x31/0x36
<>
[] do_notify_resume+0x62/0x69
[] ? int_check_syscall_exit_work+0x9/0x3d
[] int_signal+0x12/0x17

NMI backtrace for cpu 13
CPU 13:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5792, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1 -[7947AC1]-
RIP: 0010:[] [] int_restore_rest+0x1d/0x3d
RSP: 0018:88124f491f58 EFLAGS: 0292
RAX: 0800 RBX: 7fff9df852e0 RCX: 88124f49 RDX: 88099ff4
RSI: RDI: fe2e RBP: 7fff9df85260
R08: 88124f49 R09: R10: 0005 R11: 880954971da0
R12: 7fff9df851e0 R13: R14: 0001 R15:
FS: 7f73b5b1d7c0() GS:88099ff4() knlGS:
CS: 0010 DS: ES: CR0: 8005003b CR2: 7f8d5a8de9d0 CR3: 000eb34d7000 CR4: 26e0
DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
Call Trace:
<#DB[1]> <>
Pid: 5792, comm: qemu-system-x86 Not tainted 2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
Call Trace:
[] ? show_regs+0x44/0x49
[] nmi_watchdog_tick+0xc2/0x1b9
[] do_nmi+0xb0/0x252
[] nmi+0x20/0x30
[] ? int_restore_rest+0x1d/0x3d
<>

-Andrew
Re: kernel bug in kvm_intel
Avi Kivity wrote:
On 10/30/2009 08:07 PM, Andrew Theurer wrote:
I have finally bisected and isolated this to the following commit: ada3fa15057205b7d3f727bba5cd26b5912e350f
http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ada3fa15057205b7d3f727bba5cd26b5912e350f

Merge branch 'for-linus' of git://git./linux/kernel/git/tj/percpu

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
  powerpc64: convert to dynamic percpu allocator
  sparc64: use embedding percpu first chunk allocator
  percpu: kill lpage first chunk allocator
  x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
  percpu: update embedding first chunk allocator to handle sparse units
  percpu: use group information to allocate vmap areas sparsely
  vmalloc: implement pcpu_get_vm_areas()
  vmalloc: separate out insert_vmalloc_vm()
  percpu: add chunk->base_addr
  percpu: add pcpu_unit_offsets[]
  percpu: introduce pcpu_alloc_info and pcpu_group_info
  percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
  percpu: add @align to pcpu_fc_alloc_fn_t
  percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
  percpu: drop @static_size from first chunk allocators
  percpu: generalize first chunk allocator selection
  percpu: build first chunk allocators selectively
  percpu: rename 4k first chunk allocator to page
  percpu: improve boot messages
  percpu: fix pcpu_reclaim() locking

The previous commit (5579fd7e6aed8860ea0c8e3f11897493153b10ad) does not have this problem. FYI, this problem only occurs when oprofile is active. Any idea what in this commit might be the issue?

5579 is not the preceding commit, it is the merged branch:

commit ada3fa15057205b7d3f727bba5cd26b5912e350f
Merge: 2f82af0 5579fd7
Author: Linus Torvalds
Date: Tue Sep 15 09:39:44 2009 -0700

    Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu

What happens with 2f82af0?
2f82af0 is:

Nicolas Pitre has a new email address

Due to problems at cam.org, my n...@cam.org email address is no longer valid. From now on, n...@fluxnic.net should be used instead.

I have not tested that, but it doesn't seem likely that it would have anything to do with the problem. Or maybe I am misunderstanding the impact of this commit? FWIW, here is the bisect log:

git bisect start
# good: [227423904c709a8e60245c97081bbeb4fb500655] Merge branch 'x86-pat-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
git bisect good 227423904c709a8e60245c97081bbeb4fb500655
# bad: [0f29f5871c165e346409f62d903f97cfad3894c5] Staging: rtl8192su: remove RTL8192SU ifdefs
git bisect bad 0f29f5871c165e346409f62d903f97cfad3894c5
# bad: [ada3fa15057205b7d3f727bba5cd26b5912e350f] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
git bisect bad ada3fa15057205b7d3f727bba5cd26b5912e350f
# bad: [ada3fa15057205b7d3f727bba5cd26b5912e350f] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
git bisect bad ada3fa15057205b7d3f727bba5cd26b5912e350f
# good: [decee2e8a9538ae5476e6cb3f4b7714c92a04a2b] V4L/DVB (12485): zl10353: correct implementation of FE_READ_UNCORRECTED_BLOCKS
git bisect good decee2e8a9538ae5476e6cb3f4b7714c92a04a2b
# good: [0ee7e4d6d4f58c3b2d9f0ca8ad8f63abda8694b1] V4L/DVB (12694): gspca - vc032x: Change the start exchanges of the sensor hv7131r.
git bisect good 0ee7e4d6d4f58c3b2d9f0ca8ad8f63abda8694b1
# good: [f58dc01ba2ca9fe3ab2ba4ca43d9c8a735cf62d8] percpu: generalize first chunk allocator selection
git bisect good f58dc01ba2ca9fe3ab2ba4ca43d9c8a735cf62d8
# good: [2f82af08fcc7dc01a7e98a49a5995a77e32a2925] Nicolas Pitre has a new email address
git bisect good 2f82af08fcc7dc01a7e98a49a5995a77e32a2925
# good: [cf88c79006bd6a09ad725ba0b34c0e23db20b19e] vmalloc: separate out insert_vmalloc_vm()
git bisect good cf88c79006bd6a09ad725ba0b34c0e23db20b19e
# good: [4518e6a0c038b98be4c480e6f4481e8676bd15dd] x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
git bisect good 4518e6a0c038b98be4c480e6f4481e8676bd15dd
# good: [bcb2107fdbecef3de55d597d23453747af81ba88] sparc64: use embedding percpu first chunk allocator
git bisect good bcb2107fdbecef3de55d597d23453747af81ba88
# good: [5579fd7e6aed8860ea0c8e3f11897493153b10ad] Merge branch 'for-next' into for-linus
git bisect good 5579fd7e6aed8860ea0c8e3f11897493153b10ad

Oh, wait, that commit was tested, in the middle of the log above.

-Andrew
Re: kernel bug in kvm_intel
On Thu, 2009-10-15 at 15:18 -0500, Andrew Theurer wrote: > On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote: > > On 10/13/2009 11:04 PM, Andrew Theurer wrote: > > > > > >> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the > > >> surrounding code. > > >> > > >> Thinking about it, it probably _is_ what you showed, due to module page > > >> alignment. But please verify this; I can't reconcile the fault address > > >> (9fe9a2b) with %rsp at the time of the fault. > > >> > > > Here is the start of the function: > > > > > > > > >> 3884: > > >> 3884: 55 push %rbp > > >> 3885: 48 89 e5mov%rsp,%rbp > > >> > > > and 0x26d later is 0x3af1: > > > > > > > > >> 3ad2: 4c 8b b1 88 01 00 00mov0x188(%rcx),%r14 > > >> 3ad9: 4c 8b b9 90 01 00 00mov0x190(%rcx),%r15 > > >> 3ae0: 48 8b 89 20 01 00 00mov0x120(%rcx),%rcx > > >> 3ae7: 75 05 jne3aee > > >> 3ae9: 0f 01 c2vmlaunch > > >> 3aec: eb 03 jmp3af1 > > >> 3aee: 0f 01 c3vmresume > > >> 3af1: 48 87 0c 24 xchg %rcx,(%rsp) > > >> 3af5: 48 89 81 18 01 00 00mov%rax,0x118(%rcx) > > >> 3afc: 48 89 99 30 01 00 00mov%rbx,0x130(%rcx) > > >> 3b03: ff 34 24pushq (%rsp) > > >> 3b06: 8f 81 20 01 00 00 popq 0x120(%rcx) > > >> > > > > > > > Ok. So it faults on the xchg instruction, rsp is 8806369ffc80 but > > the fault address is 9fe9a2b4. So it looks like the IDT is > > corrupted. 
I have finally bisected and isolated this to the following commit: ada3fa15057205b7d3f727bba5cd26b5912e350f
http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ada3fa15057205b7d3f727bba5cd26b5912e350f

> Merge branch 'for-linus' of git://git./linux/kernel/git/tj/percpu
>
> * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)
>   powerpc64: convert to dynamic percpu allocator
>   sparc64: use embedding percpu first chunk allocator
>   percpu: kill lpage first chunk allocator
>   x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
>   percpu: update embedding first chunk allocator to handle sparse units
>   percpu: use group information to allocate vmap areas sparsely
>   vmalloc: implement pcpu_get_vm_areas()
>   vmalloc: separate out insert_vmalloc_vm()
>   percpu: add chunk->base_addr
>   percpu: add pcpu_unit_offsets[]
>   percpu: introduce pcpu_alloc_info and pcpu_group_info
>   percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
>   percpu: add @align to pcpu_fc_alloc_fn_t
>   percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
>   percpu: drop @static_size from first chunk allocators
>   percpu: generalize first chunk allocator selection
>   percpu: build first chunk allocators selectively
>   percpu: rename 4k first chunk allocator to page
>   percpu: improve boot messages
>   percpu: fix pcpu_reclaim() locking

The previous commit (5579fd7e6aed8860ea0c8e3f11897493153b10ad) does not have this problem. FYI, this problem only occurs when oprofile is active. Any idea what in this commit might be the issue?

-Andrew
Re: [PATCH 1/3] introduce VMSTATE_U64
> On Tue, Oct 20, 2009 at 08:40:26AM +0900, Avi Kivity wrote:
> > On 10/17/2009 04:27 AM, Glauber Costa wrote:
> >> This is a patch actually written by Juan, which, according to him,
> >> he plans on posting to qemu.git. Problem is that linux defines
> >> u64 in a way that is type-uncompatible with uint64_t.
> >>
> >> I am including it here, because it is a dependency to my patch series
> >> that follows.
> >>
> >
> > Why can't we store these values in qemu as uint64_ts?
> Because then we have to redefine the whole structure in qemu.
>
> the proposal is to simply pick the structures directly from linux. I believe
> it is much easier, and the versioning scheme in vmstate will help us get
> around any changes they might suffer in the future.

I get build errors with this. Is there something extra I need to do? I am currently running a rhel54 2.6.18 kernel.

  CC    qdev-properties.o
savevm.c: In function ‘get_u64’:
savevm.c:856: error: ‘__u64’ undeclared (first use in this function)
savevm.c:856: error: (Each undeclared identifier is reported only once
savevm.c:856: error: for each function it appears in.)
savevm.c:856: error: ‘v’ undeclared (first use in this function)
savevm.c: In function ‘put_u64’:
savevm.c:863: error: ‘__u64’ undeclared (first use in this function)
savevm.c:863: error: ‘v’ undeclared (first use in this function)
make[1]: *** [savevm.o] Error 1
make[1]: *** Waiting for unfinished jobs
make: *** [build-all] Error 2

Thanks,
-Andrew
Re: kernel bug in kvm_intel
On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote:
> On 10/13/2009 11:04 PM, Andrew Theurer wrote:
> >
> >> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the
> >> surrounding code.
> >>
> >> Thinking about it, it probably _is_ what you showed, due to module page
> >> alignment. But please verify this; I can't reconcile the fault address
> >> (9fe9a2b) with %rsp at the time of the fault.
> >>
> > Here is the start of the function:
> >
> >> 3884 :
> >> 3884:  55                      push   %rbp
> >> 3885:  48 89 e5                mov    %rsp,%rbp
> >>
> > and 0x26d later is 0x3af1:
> >
> >> 3ad2:  4c 8b b1 88 01 00 00    mov    0x188(%rcx),%r14
> >> 3ad9:  4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> >> 3ae0:  48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> >> 3ae7:  75 05                   jne    3aee
> >> 3ae9:  0f 01 c2                vmlaunch
> >> 3aec:  eb 03                   jmp    3af1
> >> 3aee:  0f 01 c3                vmresume
> >> 3af1:  48 87 0c 24             xchg   %rcx,(%rsp)
> >> 3af5:  48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> >> 3afc:  48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> >> 3b03:  ff 34 24                pushq  (%rsp)
> >> 3b06:  8f 81 20 01 00 00       popq   0x120(%rcx)
> >>
>
> Ok. So it faults on the xchg instruction, rsp is 8806369ffc80 but
> the fault address is 9fe9a2b4. So it looks like the IDT is
> corrupted.
>
> Can you check what's around 9fe9a2b4 in System.map?

85d85b24 B __bss_stop
85d86000 B __brk_base
85d96000 b .brk.dmi_alloc
85da6000 B __brk_limit
ff60 T vgettimeofday
ff600100 t vread_tsc
ff600130 t vread_hpet
ff600140 D __vsyscall_gtod_data
ff600400 T vtime

-Andrew
Re: kernel bug in kvm_intel
On Tue, 2009-10-13 at 08:50 +0200, Avi Kivity wrote:
> On 10/12/2009 08:42 PM, Andrew Theurer wrote:
> > On Sun, 2009-10-11 at 07:19 +0200, Avi Kivity wrote:
> >
> >> On 10/09/2009 10:04 PM, Andrew Theurer wrote:
> >>
> >>> This is on latest master branch on kvm.git and qemu-kvm.git, running
> >>> 12 Windows Server2008 VMs, and using oprofile. I ran again without
> >>> oprofile and did not get the BUG. I am wondering if anyone else is
> >>> seeing this.
> >>>
> >>> Thanks,
> >>>
> >>> -Andrew
> >>>
> >>>> Oct 9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel
> >>>> paging request at 9fe9a2b4
> >>>> Oct 9 11:55:13 virtvictory-eth0 kernel: IP: []
> >>>> vmx_vcpu_run+0x26d/0x64f [kvm_intel]
> >>>>
> >> Can you run this through objdump or gdb to see what source this
> >> corresponds to?
> >>
> > Somewhere here I think (?)
> >
> > objdump -d
> >
> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the
> surrounding code.
>
> Thinking about it, it probably _is_ what you showed, due to module page
> alignment. But please verify this; I can't reconcile the fault address
> (9fe9a2b) with %rsp at the time of the fault.

Here is the start of the function:

> 3884 :
> 3884:  55                      push   %rbp
> 3885:  48 89 e5                mov    %rsp,%rbp

and 0x26d later is 0x3af1:

> 3ad2:  4c 8b b1 88 01 00 00    mov    0x188(%rcx),%r14
> 3ad9:  4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> 3ae0:  48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> 3ae7:  75 05                   jne    3aee
> 3ae9:  0f 01 c2                vmlaunch
> 3aec:  eb 03                   jmp    3af1
> 3aee:  0f 01 c3                vmresume
> 3af1:  48 87 0c 24             xchg   %rcx,(%rsp)
> 3af5:  48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> 3afc:  48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> 3b03:  ff 34 24                pushq  (%rsp)
> 3b06:  8f 81 20 01 00 00       popq   0x120(%rcx)

-Andrew
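The arithmetic behind this exchange is easy to check directly: the function starts at file offset 0x3884 in the module's disassembly, and the oops reports the faulting RIP as vmx_vcpu_run+0x26d. A quick sketch:

```shell
# Add the symbol's start offset from objdump (0x3884) to the offset
# reported in the oops (0x26d) to find the faulting instruction.
printf '0x%x\n' $((0x3884 + 0x26d))
```

This prints 0x3af1, which lands exactly on the xchg %rcx,(%rsp) immediately after the vmlaunch/vmresume pair.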
Re: kernel bug in kvm_intel
On Sun, 2009-10-11 at 07:19 +0200, Avi Kivity wrote:
> On 10/09/2009 10:04 PM, Andrew Theurer wrote:
> > This is on latest master branch on kvm.git and qemu-kvm.git, running
> > 12 Windows Server2008 VMs, and using oprofile. I ran again without
> > oprofile and did not get the BUG. I am wondering if anyone else is
> > seeing this.
> >
> > Thanks,
> >
> > -Andrew
> >
> >> Oct 9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel
> >> paging request at 9fe9a2b4
> >> Oct 9 11:55:13 virtvictory-eth0 kernel: IP: []
> >> vmx_vcpu_run+0x26d/0x64f [kvm_intel]
>
> Can you run this through objdump or gdb to see what source this
> corresponds to?
>

Somewhere here I think (?)

objdump -d

> 3ad9:  4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> 3ae0:  48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> 3ae7:  75 05                   jne    3aee
> 3ae9:  0f 01 c2                vmlaunch
> 3aec:  eb 03                   jmp    3af1
> 3aee:  0f 01 c3                vmresume
> 3af1:  48 87 0c 24             xchg   %rcx,(%rsp)
> 3af5:  48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> 3afc:  48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> 3b03:  ff 34 24                pushq  (%rsp)
> 3b06:  8f 81 20 01 00 00       popq   0x120(%rcx)
> 3b0c:  48 89 91 28 01 00 00    mov    %rdx,0x128(%rcx)

objdump -S

> /* Enter guest mode */
> "jne .Llaunched \n\t"
> __ex(ASM_VMX_VMLAUNCH) "\n\t"
> "jmp .Lkvm_vmx_return \n\t"
> ".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
> ".Lkvm_vmx_return: "
> /* Save guest registers, load host registers, keep flags */
> "xchg %0, (%%"R"sp) \n\t"
> "mov %%"R"ax, %c[rax](%0) \n\t"
> "mov %%"R"bx, %c[rbx](%0) \n\t"
> "push"Q" (%%"R"sp); pop"Q" %c[rcx](%0) \n\t"
> "mov %%"R"dx, %c[rdx](%0) \n\t"
> "mov %%"R"si, %c[rsi](%0) \n\t"
kernel bug in kvm_intel
This is on latest master branch on kvm.git and qemu-kvm.git, running 12 Windows Server2008 VMs, and using oprofile. I ran again without oprofile and did not get the BUG. I am wondering if anyone else is seeing this.

Thanks,

-Andrew

Oct 9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel paging request at 9fe9a2b4
Oct 9 11:55:13 virtvictory-eth0 kernel: IP: [] vmx_vcpu_run+0x26d/0x64f [kvm_intel]
Oct 9 11:55:13 virtvictory-eth0 kernel: PGD 1003067 PUD 1007063 PMD 0
Oct 9 11:55:13 virtvictory-eth0 kernel: Oops: [#5] SMP
Oct 9 11:55:13 virtvictory-eth0 kernel: last sysfs file: /sys/devices/virtual/net/br4/bridge/topology_change_detected
Oct 9 11:55:13 virtvictory-eth0 kernel: CPU 6
Oct 9 11:55:13 virtvictory-eth0 kernel: Modules linked in: oprofile tun hidp l2cap crc16 bluetooth rfkill lockd sunrpc bridge stp af_packet ipv6 binfmt_misc dm_multipath scsi_dh video output sbs sbshc pci_slot fan container battery ac parport_pc lp parport kvm_intel kvm joydev sr_mod cdrom sg cdc_ether usbnet mii usbhid hid serio_raw rtc_cmos rtc_core rtc_lib button thermal thermal_sys hwmon pata_acpi bnx2 i2c_i801 ide_pci_generic iTCO_wdt i2c_core ata_generic iTCO_vendor_support ioatdma dca pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ide_gd_mod ide_core usb_storage ata_piix libata shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod crc_t10dif scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: oprofile]
Oct 9 11:55:13 virtvictory-eth0 kernel: Pid: 6495, comm: qemu-system-x86 Tainted: G D2.6.32-rc3-autokern1 #1 IBM System x -[7947AC1]-
Oct 9 11:55:13 virtvictory-eth0 kernel: RIP: 0010:[] [] vmx_vcpu_run+0x26d/0x64f [kvm_intel]
Oct 9 11:55:13 virtvictory-eth0 kernel: RSP: 0018:8806369ffc80 EFLAGS: 00010002
Oct 9 11:55:13 virtvictory-eth0 kernel: RAX: 0004001f RBX: 0200 RCX: 0001
Oct 9 11:55:13 virtvictory-eth0 kernel: RDX: RSI: RDI: 8000
Oct 9 11:55:13 virtvictory-eth0 kernel: RBP: R08: fa80025180a8 R09: f800016ca4f0
Oct 9 11:55:13 virtvictory-eth0 kernel: R10: 7797003630747070 R11: fa60039039b8 R12: a003
Oct 9 11:55:13 virtvictory-eth0 kernel: R13: R14: R15:
Oct 9 11:55:13 virtvictory-eth0 kernel: FS: 40ae6940() GS:88099fe8() knlGS:ffe66000
Oct 9 11:55:13 virtvictory-eth0 kernel: CS: 0010 DS: 0018 ES: 0018 CR0: 80050033
Oct 9 11:55:13 virtvictory-eth0 kernel: CR2: 9fe9a2b4 CR3: 0006375ec000 CR4: 26f0
Oct 9 11:55:13 virtvictory-eth0 kernel: DR0: DR1: DR2:
Oct 9 11:55:13 virtvictory-eth0 kernel: DR3: DR6: 0ff0 DR7: 0400
Oct 9 11:55:13 virtvictory-eth0 kernel: Process qemu-system-x86 (pid: 6495, threadinfo 8806369fe000, task 880632056480)
Oct 9 11:55:13 virtvictory-eth0 kernel: Stack:
Oct 9 11:55:13 virtvictory-eth0 kernel: 8806320916c0 8806369ffd88 6c14 8806369ffca8
Oct 9 11:55:13 virtvictory-eth0 kernel: <0> 8806320916c0 8806369ffce8 a0293bfa 8806369ffee8
Oct 9 11:55:13 virtvictory-eth0 kernel: <0> 0001 0300 8806320916c0 002c
Oct 9 11:55:13 virtvictory-eth0 kernel: Call Trace:
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? emulate_instruction+0x28a/0x2bc [kvm]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? handle_apic_access+0x20/0x4b [kvm_intel]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? vmx_handle_exit+0xe1/0x48b [kvm_intel]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? save_msrs+0x39/0x50 [kvm_intel]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? apic_update_ppr+0x23/0x51 [kvm]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? __up_read+0x8f/0x97
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? kvm_arch_vcpu_ioctl_run+0x6b6/0xa92 [kvm]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? kvm_vcpu_ioctl+0xf6/0x5c0 [kvm]
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? lapic_next_event+0x18/0x1c
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? clockevents_program_event+0x73/0x7c
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? tick_dev_program_event+0x2a/0x9c
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? vfs_ioctl+0x2a/0x77
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? do_vfs_ioctl+0x445/0x496
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? sys_futex+0x111/0x12f
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? sys_ioctl+0x57/0x7a
Oct 9 11:55:13 virtvictory-eth0 kernel: [] ? system_call_fastpath+0x16/0x1b
Re: kvm scaling question
On Mon, 2009-09-14 at 17:19 -0600, Bruce Rogers wrote:
> On 9/11/2009 at 3:53 PM, Marcelo Tosatti wrote:
> > On Fri, Sep 11, 2009 at 09:36:10AM -0600, Bruce Rogers wrote:
> >> I am wondering if anyone has investigated how well kvm scales when
> >> supporting many guests, or many vcpus or both.
> >>
> >> I'll do some investigations into the per vm memory overhead and
> >> play with bumping the max vcpu limit way beyond 16, but hopefully
> >> someone can comment on issues such as locking problems that are known
> >> to exist and needing to be addressed to increase parallelism,
> >> general overhead percentages which can help provide consolidation
> >> expectations, etc.
> >
> > I suppose it depends on the guest and workload. With an EPT host and
> > 16-way Linux guest doing kernel compilations, on recent kernel, i see:
> >
> > # Samples: 98703304
> > #
> > # Overhead  Command  Shared Object  Symbol
> > # ........  .......  .............  ......
> > #
> >   97.15%  sh  [kernel]  [k] vmx_vcpu_run
> >    0.27%  sh  [kernel]  [k] kvm_arch_vcpu_ioctl_
> >    0.12%  sh  [kernel]  [k] default_send_IPI_mas
> >    0.09%  sh  [kernel]  [k] _spin_lock_irq
> >
> > Which is pretty good. Without EPT/NPT the mmu_lock seems to be the major
> > bottleneck to parallelism.
> >
> >> Also, when I did a simple experiment with vcpu overcommitment, I was
> >> surprised how quickly performance suffered (just bringing a Linux vm
> >> up), since I would have assumed the additional vcpus would have been
> >> halted the vast majority of the time. On a 2 proc box, overcommitment
> >> to 8 vcpus in a guest (I know this isn't a good usage scenario, but
> >> does provide some insights) caused the boot time to increase to almost
> >> exponential levels. At 16 vcpus, it took hours to just reach the gui
> >> login prompt.
> >
> > One probable reason for that are vcpus which hold spinlocks in the guest
> > are scheduled out in favour of vcpus which spin on that same lock.
>
> I suspected it might be a whole lot of spinning happening.
> That does seem most likely. I was just surprised how bad the behavior was.

I have collected lock_stat info on a similar vcpu over-commit configuration, but with an EPT system, and saw a very significant amount of spinning. However, if you don't have EPT or NPT, I would bet that's the first problem. IMO, I am a little surprised simply booting is such a problem. It would be interesting to see what lock_stat shows on your guest after booting with 16 vcpus.

I have observed that shortening the time between vcpus being scheduled can help mitigate the problem with lock holder preemption (presumably because the spinning vcpu is de-scheduled earlier and the vcpu holding the lock is scheduled sooner), but I imagine there are other unwanted side-effects like lower cache hits.

-Andrew

> Bruce
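For reference, the lock_stat data mentioned above comes from procfs on a guest kernel built with CONFIG_LOCK_STAT=y; this is a sketch of the collection sequence (a config-style fragment, run as root in the guest, not tied to any particular workload):

```shell
# Sketch: collect guest lock contention data (assumes CONFIG_LOCK_STAT=y).
echo 0 > /proc/lock_stat                # clear any existing statistics
echo 1 > /proc/sys/kernel/lock_stat     # enable collection
# ... boot/run the workload being measured ...
head -40 /proc/lock_stat                # entries are sorted by contention
echo 0 > /proc/sys/kernel/lock_stat     # disable collection again
```

On an over-committed guest, heavily contended spinlocks show up near the top of this output, which is how the spinning described above can be confirmed.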
Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data
Brian Jackson wrote:
On Friday 04 September 2009 09:48:17 am Andrew Theurer wrote:

Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way? (OK, I am not running SMT now, but eventually we will be) A long time ago, we tested P4's with HT, and a polling idle in one thread always negatively impacted performance in the sibling thread. FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads). I have been running into a problem with the virtIO stor driver for windows on anything much newer than kvm-87. I compiled the driver from the new git tree, installed OK, but still had the same error. Finally, I removed the serial number feature in the virtio-blk in qemu, and I can now get the driver to work in Windows.

What were the symptoms you were seeing (i.e. define "a problem").

Device manager reports "a problem code 10" occurred, and the driver cannot initialize. Vadim Rozenfeld informed me: There is a sanity check in the code, which checks the I/O range and fails if it is not equal to 40h. Recent virtio-blk devices have I/O range equal to 0x400 (serial number feature). So, our signed viostor driver will fail on the latest KVMs. This problem was fixed and committed to SVN some time ago.

I assumed the fix was to the virtio windows driver, but I could not get the driver I compiled from latest git to work either (only on qemu-kvm-87). So, I just backed out the serial number feature in qemu, and it worked. FWIW, the linux virtio-blk driver never had a problem. So, not really any good news on performance with latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user   nice   system   irq    softirq   guest   idle    iowait
5.79   0.00   9.28     0.08   1.00      20.81   58.78   4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user   nice   system   irq    softirq   guest   idle    iowait
6.57   0.00   10.86    0.08   1.02      21.35   55.90   4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user   nice   system   irq    softirq   guest   idle    iowait
6.60   0.00   10.91    0.09   1.03      21.35   55.71   4.31
total busy: 39.98

diff of profiles, p1=qemu-kvm-87, p2=qemu-master: 18x more samples for gfn_to_memslot_unali*, 37x for emulator_read_emula*, and more CPU time in guest mode.

One other thing I decided to try was some cpu binding. I know this is not practical for production, but I wanted to see if there's any benefit at all. One reason was that a coworker here tried binding the qemu thread for the vcpu and the qemu IO thread to the same cpu. On a networking test, guest->local-host, throughput was up about 2x. Obviously there was a nice effect of being on the same cache. I wondered, even without full bore throughput tests, could we see any benefit here. So, I bound each pair of VMs to a dedicated core. What I saw was about a 6% improvement in performance. For a system which has pretty incredible memory performance and is not that busy, I was surprised that I got 6%. I am not advocating binding, but what I do wonder: on 1-way VMs, if we keep all the qemu threads together on the same CPU, but still allowing the scheduler to move them (all of them at once) to different cpus over time, would we see the same benefit?

One other thing: So far I have not been using preadv/pwritev. I assume I need a more recent glibc (on 2.5 now) for qemu to take advantage of this?

Getting p(read|write)v working almost doubled my virtio-net throughput in a Linux guest. Not quite as much in Windows guests. Yes you need glibc-2.10. I think some distros might have backported it to 2.9. You will also need some support for it in your system includes.
Thanks, I will try a newer glibc, or maybe just move to a newer Linux installation which happens to have a newer glibc. -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data
On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> > user  nice  system  irq   softirq  guest  idle   iowait
> > 5.79  0.00  9.28    0.08  1.00     20.81  58.78  4.26
> > total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> > user  nice  system  irq   softirq  guest  idle   iowait
> > 5.81  0.00  9.48    0.08  1.04     21.32  57.86  4.41
> > total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%
>
> That was an easy fruit to pick. Too bad it was a regression that we introduced.
>
> > new oprofile:
> >
> >> samples  %       app name            symbol name
> >> 876648 54.1555  kvm-intel.ko        vmx_vcpu_run
> >> 37595   2.3225  qemu-system-x86_64  cpu_physical_memory_rw
> >> 35623   2.2006  qemu-system-x86_64  phys_page_find_alloc
> >> 24874   1.5366  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_write_msr_safe
> >> 17710   1.0940  libc-2.5.so         memcpy
> >> 14664   0.9059  kvm.ko              kvm_arch_vcpu_ioctl_run
> >> 14577   0.9005  qemu-system-x86_64  qemu_get_ram_ptr
> >> 12528   0.7739  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_read_msr_safe
> >> 10979   0.6782  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  copy_user_generic_string
> >> 9979    0.6165  qemu-system-x86_64  virtqueue_get_head
> >> 9371    0.5789  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  schedule
> >> 8333    0.5148  qemu-system-x86_64  virtqueue_avail_bytes
> >> 7899    0.4880  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fget_light
> >> 7289    0.4503  qemu-system-x86_64  main_loop_wait
> >> 7217    0.4458  qemu-system-x86_64  lduw_phys
>
> This is almost entirely host virtio. I can reduce native_write_msr_safe by a bit, but not much.
> > >> 6821 0.4214 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 > >> audit_syscall_exit > >> 6749 0.4169 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 do_select > >> 5919 0.3657 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 > >> audit_syscall_entry > >> 5466 0.3377 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 kfree > >> 4887 0.3019 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fput > >> 4689 0.2897 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __switch_to > >> 4636 0.2864 > >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 mwait_idle > >> > > Still not idle=poll, it may shave off 0.2%. Won't this affect SMT in a negative way? (OK, I am not running SMT now, but eventually we will be) A long time ago, we tested P4's with HT, and a polling idle in one thread always negatively impacted performance in the sibling thread. FWIW, I did try idle=halt, and it was slightly worse. I did get a chance to try the latest qemu (master and next heads). I have been running into a problem with virtIO stor driver for windows on anything much newer than kvm-87. I compiled the driver from the new git tree, installed OK, but still had the same error. Finally, I removed the serial number feature in the virtio-blk in qemu, and I can now get the driver to work in Windows. So, not really any good news on performance with latest qemu builds. 
Performance is slightly worse:

qemu-kvm-87
user  nice  system  irq   softirq  guest  idle   iowait
5.79  0.00  9.28    0.08  1.00     20.81  58.78  4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user  nice  system  irq   softirq  guest  idle   iowait
6.57  0.00  10.86   0.08  1.02     21.35  55.90  4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user  nice  system  irq   softirq  guest  idle   iowait
6.60  0.00  10.91   0.09  1.03     21.35  55.71  4.31
total busy: 39.98

diff of profiles, p1=qemu-kvm-87, p2=qemu-master
> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
> mask of 0x00 (No unit mask) count 1000
> total samples (ts1) for profi
Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data
On Tue, 2009-09-01 at 12:47 +0300, Avi Kivity wrote:
> On 09/01/2009 12:44 PM, Avi Kivity wrote:
> > Instead of saving the debug registers from the processor to a kvm data
> > structure, rely on the debug registers stored in the thread structure.
> > This allows us not to save dr6 and dr7.
> >
> > Reduces lightweight vmexit cost by 350 cycles, or 11 percent.
>
> Andrew, this is now available as the 'debugreg' branch of kvm.git.
> Given the massive performance improvement, it will be interesting to see
> how the test results change.
>
> Marcelo, please queue this for 2.6.32, and I think it's even suitable
> for -stable.

Here's a run from branch debugreg with thread debugreg storage + conditionally reload dr6:

user  nice  system  irq   softirq  guest  idle   iowait
5.79  0.00  9.28    0.08  1.00     20.81  58.78  4.26
total busy: 36.97

Previous run that had avoided calling adjust_vmx_controls twice:

user  nice  system  irq   softirq  guest  idle   iowait
5.81  0.00  9.48    0.08  1.04     21.32  57.86  4.41
total busy: 37.73

A relative reduction in CPU cycles of 2%

new oprofile:
> samples  %       app name            symbol name
> 876648 54.1555  kvm-intel.ko        vmx_vcpu_run
> 37595   2.3225  qemu-system-x86_64  cpu_physical_memory_rw
> 35623   2.2006  qemu-system-x86_64  phys_page_find_alloc
> 24874   1.5366  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_write_msr_safe
> 17710   1.0940  libc-2.5.so         memcpy
> 14664   0.9059  kvm.ko              kvm_arch_vcpu_ioctl_run
> 14577   0.9005  qemu-system-x86_64  qemu_get_ram_ptr
> 12528   0.7739  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  native_read_msr_safe
> 10979   0.6782  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  copy_user_generic_string
> 9979    0.6165  qemu-system-x86_64  virtqueue_get_head
> 9371    0.5789  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  schedule
> 8333    0.5148  qemu-system-x86_64  virtqueue_avail_bytes
> 7899    0.4880  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fget_light
> 7289    0.4503
qemu-system-x86_64  main_loop_wait
> 7217    0.4458  qemu-system-x86_64  lduw_phys
> 6821    0.4214  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_exit
> 6749    0.4169  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  do_select
> 5919    0.3657  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  audit_syscall_entry
> 5466    0.3377  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  kfree
> 4887    0.3019  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  fput
> 4689    0.2897  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __switch_to
> 4636    0.2864  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  mwait_idle
> 4505    0.2783  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  getnstimeofday
> 4453    0.2751  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  system_call
> 4403    0.2720  kvm.ko              kvm_load_guest_fpu
> 4285    0.2647  kvm.ko              kvm_put_guest_fpu
> 4241    0.2620  libpthread-2.5.so   pthread_mutex_lock
> 4172    0.2577  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  unroll_tree_refs
> 4100    0.2533  qemu-system-x86_64  kvm_run
> 4044    0.2498  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __down_read
> 3978    0.2457  qemu-system-x86_64  ldl_phys
> 3669    0.2267  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  do_vfs_ioctl
> 3655    0.2258  vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1  __up_read
> A diff of this and previous run's oprofile:
> profile1 is [./oprofile.before]
> profile2 is [./oprofile.after]
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
> mask of 0x00 (No unit mask) count 1000
> total samples (ts1) for profile1 is 1661542
> total samples (ts2) for profile2 is 1618760 (includes multiplier of 1.00)
> functions which have a abs(pct2-pct1) < 0.02 are not displayed
>
>                        pct2:    pct1:
>                        100*     100*     pct2
> s1     s2     s2/s1    s2/ts1   s1/ts1   -pct1   symbol               bin
> -
------ ------ ------- ------- ------- ------- -------------------- -------
> 1559   2747   1.76/1  0.165   0.094   0.071  dput                 vmlinux
> 34764  35623  1.02/1  2.144   2.092   0.052  phys_page_find_alloc qemu
> 5170   5919   1.14/1  0.356   0.311   0.045  audit_syscall_entry  vmlinux
> 3593   4172   1.16/1
Re: [PATCH] don't call adjust_vmx_controls() second time
Avi Kivity wrote: On 08/27/2009 11:42 PM, Andrew Theurer wrote: On Thu, 2009-08-27 at 19:21 +0300, Avi Kivity wrote: On 08/27/2009 06:41 PM, Gleb Natapov wrote:

Don't call adjust_vmx_controls() two times for the same control. It restores options that were dropped earlier.

Applied, thanks. Andrew, if you rerun your benchmark atop kvm.git 'next' branch, I believe you will see dramatically better results.

Yes! CPU is much lower:

user  nice  system  irq   softirq  guest  idle   iowait
5.81  0.00  9.48    0.08  1.04     21.32  57.86  4.41

previous CPU:

user  nice  system  irq   softirq  guest  idle   iowait
5.67  0.00  11.64   0.09  1.05     31.90  46.06  3.59

How does it compare to the other hypervisor now?

My original results for the other hypervisor were a little inaccurate; they mistakenly used 2-vcpu guests. New runs with 1-vcpu guests (as used in kvm) have slightly lower CPU utilization. Anyway, here's the breakdown:

                        CPU    percent more CPU
kvm-master/qemu-kvm-87  50.15  78%
kvm-next/qemu-kvm-87    37.73  34%

new oprofile:

samples  %       app name            symbol name
885444 53.2905  kvm-intel.ko        vmx_vcpu_run

guest mode = good

38090   2.2924  qemu-system-x86_64  cpu_physical_memory_rw
34764   2.0923  qemu-system-x86_64  phys_page_find_alloc
14730   0.8865  qemu-system-x86_64  qemu_get_ram_ptr
10814   0.6508  vmlinux-2.6.31-rc5-autokern1  copy_user_generic_string
10871   0.6543  qemu-system-x86_64  virtqueue_get_head
8557    0.5150  qemu-system-x86_64  virtqueue_avail_bytes
7173    0.4317  qemu-system-x86_64  lduw_phys
4122    0.2481  qemu-system-x86_64  ldl_phys
3339    0.2010  qemu-system-x86_64  virtqueue_num_heads
4129    0.2485  libpthread-2.5.so   pthread_mutex_lock

virtio and related qemu overhead: 8.2%.

25278   1.5214  vmlinux-2.6.31-rc5-autokern1  native_write_msr_safe
12278   0.7390  vmlinux-2.6.31-rc5-autokern1  native_read_msr_safe

This will be reduced if we move virtio to kernel context.

Are there plans to move that to kernel for disk, too?
12380   0.7451  vmlinux-2.6.31-rc5-autokern1  native_set_debugreg
3550    0.2137  vmlinux-2.6.31-rc5-autokern1  native_get_debugreg

A lot less than before, but still annoying.

4631    0.2787  vmlinux-2.6.31-rc5-autokern1  mwait_idle

idle=halt may improve this, mwait is slow.

I can try idle=halt on the host. I actually assumed it would be using that, but I'll check. Thanks, -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] don't call adjust_vmx_controls() second time
On Thu, 2009-08-27 at 19:21 +0300, Avi Kivity wrote:
> On 08/27/2009 06:41 PM, Gleb Natapov wrote:
> > Don't call adjust_vmx_controls() two times for the same control.
> > It restores options that were dropped earlier.
>
> Applied, thanks. Andrew, if you rerun your benchmark atop kvm.git
> 'next' branch, I believe you will see dramatically better results.

Yes! CPU is much lower:

user  nice  system  irq   softirq  guest  idle   iowait
5.81  0.00  9.48    0.08  1.04     21.32  57.86  4.41

previous CPU:

user  nice  system  irq   softirq  guest  idle   iowait
5.67  0.00  11.64   0.09  1.05     31.90  46.06  3.59

new oprofile:
> samples  %       app name            symbol name
> 885444 53.2905  kvm-intel.ko        vmx_vcpu_run
> 38090   2.2924  qemu-system-x86_64  cpu_physical_memory_rw
> 34764   2.0923  qemu-system-x86_64  phys_page_find_alloc
> 25278   1.5214  vmlinux-2.6.31-rc5-autokern1  native_write_msr_safe
> 18205   1.0957  libc-2.5.so         memcpy
> 14730   0.8865  qemu-system-x86_64  qemu_get_ram_ptr
> 14189   0.8540  kvm.ko              kvm_arch_vcpu_ioctl_run
> 12380   0.7451  vmlinux-2.6.31-rc5-autokern1  native_set_debugreg
> 12278   0.7390  vmlinux-2.6.31-rc5-autokern1  native_read_msr_safe
> 10871   0.6543  qemu-system-x86_64  virtqueue_get_head
> 10814   0.6508  vmlinux-2.6.31-rc5-autokern1  copy_user_generic_string
> 9080    0.5465  vmlinux-2.6.31-rc5-autokern1  fget_light
> 9015    0.5426  vmlinux-2.6.31-rc5-autokern1  schedule
> 8557    0.5150  qemu-system-x86_64  virtqueue_avail_bytes
> 7805    0.4697  vmlinux-2.6.31-rc5-autokern1  do_select
> 7173    0.4317  qemu-system-x86_64  lduw_phys
> 7019    0.4224  qemu-system-x86_64  main_loop_wait
> 6979    0.4200  vmlinux-2.6.31-rc5-autokern1  audit_syscall_exit
> 5571    0.3353  vmlinux-2.6.31-rc5-autokern1  kfree
> 5170    0.3112  vmlinux-2.6.31-rc5-autokern1  audit_syscall_entry
> 5086    0.3061  vmlinux-2.6.31-rc5-autokern1  fput
> 4631    0.2787  vmlinux-2.6.31-rc5-autokern1  mwait_idle
> 4584    0.2759  kvm.ko              kvm_load_guest_fpu
> 4491    0.2703  vmlinux-2.6.31-rc5-autokern1  system_call
> 4461    0.2685  vmlinux-2.6.31-rc5-autokern1  __switch_to
> 4431    0.2667  kvm.ko              kvm_put_guest_fpu
> 4371
0.2631  vmlinux-2.6.31-rc5-autokern1  __down_read
> 4290    0.2582  qemu-system-x86_64  kvm_run
> 4218    0.2539  vmlinux-2.6.31-rc5-autokern1  getnstimeofday
> 4129    0.2485  libpthread-2.5.so   pthread_mutex_lock
> 4122    0.2481  qemu-system-x86_64  ldl_phys
> 4100    0.2468  vmlinux-2.6.31-rc5-autokern1  do_vfs_ioctl
> 3811    0.2294  kvm.ko              find_highest_vector
> 3593    0.2162  vmlinux-2.6.31-rc5-autokern1  unroll_tree_refs
> 3560    0.2143  vmlinux-2.6.31-rc5-autokern1  try_to_wake_up
> 3550    0.2137  vmlinux-2.6.31-rc5-autokern1  native_get_debugreg
> 3506    0.2110  kvm-intel.ko        vmcs_writel
> 3487    0.2099  vmlinux-2.6.31-rc5-autokern1  task_rq_lock
> 3434    0.2067  vmlinux-2.6.31-rc5-autokern1  __up_read
> 3368    0.2027  librt-2.5.so        clock_gettime
> 3339    0.2010  qemu-system-x86_64  virtqueue_num_heads

Thanks very much for the fix! -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Performance data when running Windows VMs
On Wed, 2009-08-26 at 11:27 -0500, Brian Jackson wrote: > On Wednesday 26 August 2009 11:14:57 am Andrew Theurer wrote: > > > > > > > > I/O on the host was not what I would call very high: outbound network > > > > averaged at 163 Mbit/s inbound was 8 Mbit/s, while disk read ops was > > > > 243/sec and write ops was 561/sec > > > > > > What was the disk bandwidth used? Presumably, direct access to the > > > volume with cache=off? > > > > 2.4 MB/sec write, 0.6MB/sec read, cache=none > > The VMs' boot disks are IDE, but apps use their second disk which is > > virtio. > > > In my testing, I got better performance from IDE than the new virtio block > driver for windows. There appears to be some optimization left to do on them. Thanks Brian. I will try IDE on both VM disks to see how it compares. -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
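For reference, the two disk attachment modes being compared here map to the qemu -drive option roughly as below. This is a sketch of kvm-87-era syntax; the image paths are placeholders, and cache=none matches the setting mentioned elsewhere in the thread.

```shell
# Boot disk on emulated IDE (what the Windows VMs' boot disks use here):
qemu-system-x86_64 -drive file=/path/to/boot.img,if=ide,cache=none ...

# Data disk on paravirtual virtio-blk (needs the viostor driver in the guest):
qemu-system-x86_64 -drive file=/path/to/data.img,if=virtio,cache=none ...
```

Switching both disks to if=ide is a one-line change per drive, which is what makes the comparison Brian suggests easy to run.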
Re: Performance data when running Windows VMs
On Wed, 2009-08-26 at 19:26 +0300, Avi Kivity wrote:
> On 08/26/2009 07:14 PM, Andrew Theurer wrote:
> > On Wed, 2009-08-26 at 18:44 +0300, Avi Kivity wrote:
> >> On 08/26/2009 05:57 PM, Andrew Theurer wrote:
> >>> I recently gathered some performance data when running Windows Server
> >>> 2008 VMs, and I wanted to share it here. There are 12 Windows
> >>> Server 2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent
> >>> execution of 6 J2EE type benchmarks. Each benchmark needs an App VM and
> >>> a Database VM. The benchmark clients inject a fixed rate of requests
> >>> which yields X% CPU utilization on the host. A different hypervisor was
> >>> compared; KVM used about 60% more CPU cycles to complete the same amount
> >>> of work. Both had their hypervisor specific paravirt IO drivers in the
> >>> VMs.
> >>>
> >>> Server is a 2 socket Core/i7, SMT off, with 72 GB memory
> >>
> >> Did you use large pages?
> >
> > Yes.
>
> The stats show 'largepage = 12'. Something's wrong. There's a commit
> (7736d680) that's supposed to fix largepage support for kvm-87, maybe
> it's incomplete.

How strange. /proc/meminfo showed that almost all of the pages were used:

HugePages_Total:   12556
HugePages_Free:      220
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

I just assumed they were used properly. Maybe not.

> >>> I/O on the host was not what I would call very high: outbound network
> >>> averaged at 163 Mbit/s, inbound was 8 Mbit/s, while disk read ops was
> >>> 243/sec and write ops was 561/sec
> >>
> >> What was the disk bandwidth used? Presumably, direct access to the
> >> volume with cache=off?
> >
> > 2.4 MB/sec write, 0.6 MB/sec read, cache=none
> > The VMs' boot disks are IDE, but apps use their second disk which is
> > virtio.
>
> Chickenfeed.
>
> Do the network stats include interguest traffic? I presume *all* of the
> traffic was interguest.
Sar network data:

>          IFACE    rxpck/s  txpck/s   rxkB/s   txkB/s
> Average: lo          0.00     0.00     0.00     0.00
> Average: usb0        0.39     0.19     0.02     0.01
> Average: eth0     2968.83  5093.02   340.13  6966.64
> Average: eth1     2992.92  5124.08   342.75  7008.53
> Average: eth2     1455.53  2500.63   167.45  3421.64
> Average: eth3     1500.59  2574.36   171.98  3524.82
> Average: br0         2.41     0.95     0.32     0.13
> Average: br1         1.52     0.00     0.20     0.00
> Average: br2         1.52     0.00     0.20     0.00
> Average: br3         1.52     0.00     0.20     0.00
> Average: br4         0.00     0.00     0.00     0.00
> Average: tap3      669.38   708.07   290.89   140.81
> Average: tap109    678.53   723.58   294.07   143.31
> Average: tap215    673.20   711.47   291.99   141.78
> Average: tap321    675.26   719.33   293.01   142.37
> Average: tap27     679.23   729.90   293.86   143.60
> Average: tap133    680.17   734.08   294.33   143.85
> Average: tap2     1002.24  2214.19  3458.54   457.95
> Average: tap108   1021.85  2246.53  3491.02   463.48
> Average: tap214   1002.81  2195.22  3411.80   457.28
> Average: tap320   1017.43  2241.49  3508.20   462.54
> Average: tap26    1028.52  2237.98  3483.84   462.53
> Average: tap132   1034.05  2240.89  3493.37   463.32

tap0-99 go to eth0, 100-199 to eth1, 200-299 to eth2, 300-399 to eth3. There is some inter-guest traffic between VM pairs (like taps 2&3, 108&109, etc.) but not that significant.

> >> linux-aio should help reduce cpu usage.
> >
> > I assume this is in a newer version of Qemu?
>
> No, posted and awaiting merge.
>
> >> Could it be that Windows uses the debug registers? Maybe we're
> >> incorrectly deciding to switch them.
> >
> > I was wondering about that. I was thinking of just backing out the
> > support for debugregs and see what happens.
> >
> > Did the up/down_read seem kind of high? Are we doing a lot of locking?
>
> It is. We do. Marcelo made some threats to remove this lock.

Thanks, -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
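A largepage counter stuck at 12 suggests guest RAM was not actually hugepage-backed despite the full reserved pool. For kvm-87-era qemu, guest memory only comes from the hugepage pool when qemu is pointed at a hugetlbfs mount; a rough sketch of that setup (pool size and mount path are illustrative, taken from the meminfo figures above):

```shell
# Reserve 2 MB hugepages and expose them via hugetlbfs:
echo 12556 > /proc/sys/vm/nr_hugepages
mkdir -p /hugepages
mount -t hugetlbfs none /hugepages
grep Huge /proc/meminfo      # HugePages_Free should drop as guests start

# Back a guest's RAM with hugepages via -mem-path:
qemu-system-x86_64 -m 2048 -mem-path /hugepages ...
```

If HugePages_Free drops but kvm_stat's largepage counter stays near zero, the pages are allocated from the pool yet not mapped as large sptes, which is consistent with the broken-largepage-support theory above.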
Re: Performance data when running Windows VMs
On Wed, 2009-08-26 at 18:44 +0300, Avi Kivity wrote:
> On 08/26/2009 05:57 PM, Andrew Theurer wrote:
> > I recently gathered some performance data when running Windows Server
> > 2008 VMs, and I wanted to share it here. There are 12 Windows
> > Server 2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent
> > execution of 6 J2EE type benchmarks. Each benchmark needs an App VM and
> > a Database VM. The benchmark clients inject a fixed rate of requests
> > which yields X% CPU utilization on the host. A different hypervisor was
> > compared; KVM used about 60% more CPU cycles to complete the same amount
> > of work. Both had their hypervisor specific paravirt IO drivers in the
> > VMs.
> >
> > Server is a 2 socket Core/i7, SMT off, with 72 GB memory
>
> Did you use large pages?

Yes.

> > Host kernel used was kvm.git v2.6.31-rc3-3419-g6df4865
> > Qemu was kvm-87. I tried a few newer versions of Qemu; none of them
> > worked with the RedHat virtIO Windows drivers. I tried:
> >
> > f3600c589a9ee5ea4c0fec74ed4e06a15b461d52
> > 0.11.0-rc1
> > 0.10.6
> > kvm-88
> >
> > All but 0.10.6 had "Problem code 10" driver error in the VM. 0.10.6 had
> > "a disk read error occurred" very early in the booting of the VM.
>
> Yan?
>
> > I/O on the host was not what I would call very high: outbound network
> > averaged at 163 Mbit/s, inbound was 8 Mbit/s, while disk read ops was
> > 243/sec and write ops was 561/sec
>
> What was the disk bandwidth used? Presumably, direct access to the
> volume with cache=off?

2.4 MB/sec write, 0.6 MB/sec read, cache=none. The VMs' boot disks are IDE, but apps use their second disk, which is virtio.

> linux-aio should help reduce cpu usage.

I assume this is in a newer version of Qemu?

> > Host CPU breakdown was the following:
> >
> > user  nice  system  irq   softirq  guest  idle   iowait
> > 5.67  0.00  11.64   0.09  1.05     31.90  46.06  3.59
> >
> > The amount of kernel time had me concerned.
Here is oprofile:
>
> user+system is about 55% of guest time, and it's all overhead.
>
> >> samples  %       app name            symbol name
> >> 1163422 52.3744 kvm-intel.ko        vmx_vcpu_run
> >> 103996   4.6816 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_set_debugreg
> >> 81036    3.6480 kvm.ko              kvm_arch_vcpu_ioctl_run
> >> 37913    1.7068 qemu-system-x86_64  cpu_physical_memory_rw
> >> 34720    1.5630 qemu-system-x86_64  phys_page_find_alloc
>
> We should really optimize these two.
>
> >> 23234    1.0459 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_write_msr_safe
> >> 20964    0.9437 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_get_debugreg
> >> 17628    0.7936 libc-2.5.so         memcpy
> >> 16587    0.7467 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  __down_read
> >> 15681    0.7059 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  __up_read
> >> 15466    0.6962 kvm.ko              find_highest_vector
> >> 14611    0.6578 qemu-system-x86_64  qemu_get_ram_ptr
> >> 11254    0.5066 kvm-intel.ko        vmcs_writel
> >> 11133    0.5012 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  copy_user_generic_string
> >> 10917    0.4915 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_read_msr_safe
> >> 10760    0.4844 qemu-system-x86_64  virtqueue_get_head
> >> 9025     0.4063 kvm-intel.ko        vmx_handle_exit
> >> 8953     0.4030 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  schedule
> >> 8753     0.3940 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  fget_light
> >> 8465     0.3811 qemu-system-x86_64  virtqueue_avail_bytes
> >> 8185     0.3685 kvm-intel.ko        handle_cr
> >> 8069     0.3632 kvm.ko              kvm_set_irq
> >> 7697     0.3465 kvm.ko              kvm_lapic_sync_from_vapic
> >> 7586     0.3415 qemu-system-x86_64  main_loop_wait
> >> 7480     0.3367 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  do_select
> >> 7121     0.3206 qemu-system-x86_64  lduw_phys
> >> 7003     0.3153 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  audit_syscall_
Performance data when running Windows VMs
I recently gathered some performance data when running Windows Server 2008 VMs, and I wanted to share it here. There are 12 Windows Server 2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent execution of 6 J2EE type benchmarks. Each benchmark needs an App VM and a Database VM. The benchmark clients inject a fixed rate of requests which yields X% CPU utilization on the host. A different hypervisor was compared; KVM used about 60% more CPU cycles to complete the same amount of work. Both had their hypervisor specific paravirt IO drivers in the VMs.

Server is a 2 socket Core/i7, SMT off, with 72 GB memory

Host kernel used was kvm.git v2.6.31-rc3-3419-g6df4865. Qemu was kvm-87. I tried a few newer versions of Qemu; none of them worked with the RedHat virtIO Windows drivers. I tried:

f3600c589a9ee5ea4c0fec74ed4e06a15b461d52
0.11.0-rc1
0.10.6
kvm-88

All but 0.10.6 had "Problem code 10" driver error in the VM. 0.10.6 had "a disk read error occurred" very early in the booting of the VM.

I/O on the host was not what I would call very high: outbound network averaged at 163 Mbit/s, inbound was 8 Mbit/s, while disk read ops was 243/sec and write ops was 561/sec.

Host CPU breakdown was the following:

user  nice  system  irq   softirq  guest  idle   iowait
5.67  0.00  11.64   0.09  1.05     31.90  46.06  3.59

The amount of kernel time had me concerned.
Here is oprofile:
> samples  %       app name            symbol name
> 1163422 52.3744 kvm-intel.ko        vmx_vcpu_run
> 103996   4.6816 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_set_debugreg
> 81036    3.6480 kvm.ko              kvm_arch_vcpu_ioctl_run
> 37913    1.7068 qemu-system-x86_64  cpu_physical_memory_rw
> 34720    1.5630 qemu-system-x86_64  phys_page_find_alloc
> 23234    1.0459 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_write_msr_safe
> 20964    0.9437 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_get_debugreg
> 17628    0.7936 libc-2.5.so         memcpy
> 16587    0.7467 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  __down_read
> 15681    0.7059 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  __up_read
> 15466    0.6962 kvm.ko              find_highest_vector
> 14611    0.6578 qemu-system-x86_64  qemu_get_ram_ptr
> 11254    0.5066 kvm-intel.ko        vmcs_writel
> 11133    0.5012 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  copy_user_generic_string
> 10917    0.4915 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  native_read_msr_safe
> 10760    0.4844 qemu-system-x86_64  virtqueue_get_head
> 9025     0.4063 kvm-intel.ko        vmx_handle_exit
> 8953     0.4030 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  schedule
> 8753     0.3940 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  fget_light
> 8465     0.3811 qemu-system-x86_64  virtqueue_avail_bytes
> 8185     0.3685 kvm-intel.ko        handle_cr
> 8069     0.3632 kvm.ko              kvm_set_irq
> 7697     0.3465 kvm.ko              kvm_lapic_sync_from_vapic
> 7586     0.3415 qemu-system-x86_64  main_loop_wait
> 7480     0.3367 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  do_select
> 7121     0.3206 qemu-system-x86_64  lduw_phys
> 7003     0.3153 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  audit_syscall_exit
> 6062     0.2729 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  kfree
> 5477     0.2466 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  fput
> 5454     0.2455 kvm.ko              kvm_lapic_get_cr8
> 5096     0.2294 kvm.ko              kvm_load_guest_fpu
> 5057     0.2277 kvm.ko
apic_update_ppr
> 4929     0.2219 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  up_read
> 4900     0.2206 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  audit_syscall_entry
> 4866     0.2191 kvm.ko              kvm_apic_has_interrupt
> 4670     0.2102 kvm-intel.ko        skip_emulated_instruction
> 4644     0.2091 kvm.ko              kvm_cpu_has_interrupt
> 4548     0.2047 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  __switch_to
> 4328     0.1948 kvm.ko              kvm_apic_accept_pic_intr
> 4303     0.1937 libpthread-2.5.so   pthread_mutex_lock
> 4235     0.1906 vmlinux-2.6.31-rc5_v2.6.31-rc3-3419-g6df4865-autokern1  system_call
> 4175     0.1879 kvm.ko              kvm_put_guest_fpu
> 4170     0.1877 qemu-system-x86_64  ldl_phys
> 4098     0.1845 kvm-intel.ko        vmx_set_interrupt_shadow
> 4003     0.1802 qemu-system-x86_64  kvm_run

I was wondering why the get/set debugreg was so high. I don't recall seeing this much with Linux VMs. Here is an average of kvm_stat:
> efer_relo 0
> exits 1262814
> fpu_reloa 103842
> halt_exit 9918
> halt_wak
Re: Windows Server 2008 VM performance
Avi Kivity wrote: Andrew Theurer wrote: Is there a virtio_block driver to test? There is, but it isn't available yet. OK. Can I assume a better virtio_net driver is in the works as well? Can we find the root cause of the exits (is there a way to get stack dump or something that can show where there are coming from)? Marcelo is working on a super-duper easy to use kvm trace which can show what's going on. The old one is reasonably easy though it exports less data. If you can generate some traces, I'll have a look at them. Thanks Avi. I'll try out kvm-86 and see if I can generate some kvm trace data. -Andrew -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Windows Server 2008 VM performance
I've been looking at how KVM handles Windows guests, and I am a little concerned with the CPU overhead. My test case is as follows: I am running 4 instances of a J2EE benchmark. Each instance needs one application server and one DB server; 8 VMs in total are used. I have the same App and DB software for Linux and Windows (and same versions) so I can compare between Linux and Windows. I also have another hypervisor on which I can test both Windows and Linux VMs. The host has EPT capable processors. VMs in KVM are backed with large pages.

Test results:

Config                                            CPU utilization
------                                            ---------------
KVM-85            Windows Server 2008 64-bit VMs  44.84
                  RedHat 5.3 w/ 2.6.29 64-bit VMs 24.56
Other-Hypervisor  Windows Server 2008 64-bit VMs  30.63
                  RedHat 5.3 w/ 2.6.18 64-bit VMs 27.13

- KVM running Windows VMs uses 46% more CPU than the Other-Hypervisor
- The Other-Hypervisor provides an optimized virtual network driver
- KVM results listed above did not use virtio_net or virtio_disk for Windows, but do for Linux
- One extra KVM run (not listed above) was made with virtio_net for Windows VMs but only reduced CPU by 2%
- Most of the CPU overhead could be attributed to the DB VMs, where there is about 5 MB/sec writes per VM
- I don't have a virtio_block driver for Windows to test. Does one exist?
- All tests above had 2 vCPUs per VM

Here's a comparison of kvm_stat between Windows (run1) and Linux (run2):

             run1     run2     run1/run2
efer_relo:   0        0        1
exits:       1206880  121916   9.899
fpu_reloa:   210969   20863    10.112
halt_exit:   15092    13222    1.141
halt_wake:   14466    9294     1.556
host_stat:   211066   45117    4.678
hypercall:   0        0        1
insn_emul:   119582   38126    3.136
insn_emul:   0        0        1
invlpg:      0        0        1
io_exits:    131051   26349    4.974
irq_exits:   8128     12937    0.628
irq_injec:   29955    21825    1.373
irq_windo:   2504     2022     1.238
kvm_reque:   0        0        1
largepage:   164               0.009
mmio_exit:   59224    0        Inf
mmu_cache:   0        3        0.000
mmu_flood:   0        0        1
mmu_pde_z:   0        0        1
mmu_pte_u:   0        0        1
mmu_pte_w:   0        0        1
mmu_recyc:   0        0        1
mmu_shado:   0        0        1
mmu_unsyn:   0        0        1
mmu_unsyn:   0        0        1
nmi_injec:   0        0        1
nmi_windo:   0        0        1
pf_fixed:    167              0.009
pf_guest:    0        0        1
remote_tl:   0        0        1
request_n:   0        0        1
signal_ex:   0        0        1
tlb_flush:   220      14037    0.016

10x the number of exits, a problem? I happened to try just one vCPU per VM for KVM/Windows VMs, and I was surprised how much of a difference it made:

Config                                                 CPU utilization
------                                                 ---------------
KVM-85  Windows Server 2008 64-bit VMs, 2 vCPU per VM  44.84
        Windows Server 2008 64-bit VMs, 1 vCPU per VM  36.44

A 19% reduction in CPU utilization vs KVM/Windows-2vCPU! Does not explain all the overhead (vs Other-Hypervisor, 2 vCPUs per VM) but, that sure seems like a lot between 1 to 2 vCPUs for KVM/Windows-VMs. I have not run with 1 vCPU per VM with Other-Hypervisor, but I will soon. Anyway, I also collected kvm_stat for the 1 vCPU case, and here it is compared to KVM/Linux VMs with 2 vCPUs:

             run1     run2     run1/run2
efer_relo:   0        0        1
exits:       1184471  121916   9.715
fpu_reloa:   192766   20863    9.240
halt_exit:   4697     13222    0.355
halt_wake:   4360     9294     0.469
host_stat:   192828   45117    4.274
hypercall:   0        0        1
insn_emul:   130487   38126    3.422
insn_emul:   0        0        1
invlpg:      0        0        1
io_exits:    114430   26349    4.343
irq_exits:   7075     12937    0.547
irq_injec:   29930    21825    1.371
irq_windo:   2391     2022     1.182
kvm_reque:   0        0        1
largepage:   064              0.001
mmio_exit:   69028    0        Inf
mmu_cache:   0
Re: KVM performance vs. Xen
Here are the SMT off results. This workload is designed to not over-saturate the CPU, so you have to pick a number of server sets to ensure that. With SMT on, 4 sets was enough for KVM, but 5 was too much (start seeing response time errors). For SMT off, I tried to size the load as high as we can go without running into these errors. For KVM, that's 3 (18 guests) and for Xen, that's 4 (24 guests). The throughput has a fairly linear relationship to the number of server sets used, but has a bit of wiggle room (mostly affected by response times getting longer and longer, but not exceeding the requirement set forth). Anyway, the relative throughput for these is "1.0" for KVM and "1.34" for Xen. CPU utilization is 78.71% for KVM; for Xen it is 87.83%. If we normalize to CPU utilization, Xen is doing 20% more throughput.

Avi Kivity wrote: Anthony Liguori wrote:

Previously, the block API only exposed non-vector interfaces and bounced vectored operations to a linear buffer. That's been eliminated now though, so we need to update the linux-aio patch to implement a vectored backend interface. However, it is an apples to apples comparison in terms of copying, since the same is true with the thread pool. My take away was that the thread pool overhead isn't the major source of issues.

If the overhead is dominated by copying, then you won't see the difference. Once the copying is eliminated, the comparison may yield different results. We should certainly see a difference in context switches.

I would like to test this the proper way. What do I need to do to ensure these copies are eliminated? I am on a 2.6.27 kernel, am I missing anything there? Anthony, would you be willing to provide a patch to support the changes in the block API?

One cause of context switches won't be eliminated: the non-saturating workload causes us to switch to the idle thread, which incurs a heavyweight exit. This doesn't matter since we're idle anyway, but when we switch back, we incur a heavyweight entry.
I have not looked at the schedstat or ftrace yet, but will soon. Maybe it will tell us a little more about the context switches. Here's a sample of the kvm_stat log (one row per sample, one column per counter):

efer_relo exits fpu_reloa halt_exit halt_wake host_stat hypercall insn_emul insn_emul invlpg io_exits irq_exits irq_injec irq_windo kvm_reque largepage mmio_exit mmu_cache mmu_flood mmu_pde_z mmu_pte_u mmu_pte_w mmu_recyc mmu_shado mmu_unsyn mmu_unsyn nmi_injec nmi_windo pf_fixed pf_guest remote_tl request_n signal_ex tlb_flush
0 233866 53994 20353 16209 119812 0 48879 0 0 75666 44917 34772 3984 0 187 0 10 0 0 0 0 0 0 0 0 0 0 202 0 0 0 0 17698
0 244556 67321 15570 12364 116226 0 49865 0 0 69357 56131 32860 4449 0 -1895 0 19 0 0 0 0 21 21 0 0 0 0 1117 0 0 0 0 21586
0 230788 71382 10619 7920 109151 0 44354 0 0 62561 60074 28322 4841 0 103 0 13 0 0 0 0 0 0 0 0 0 0 122 0 0 0 0 22702
0 275259 82605 14326 11148 127293 0 53738 0 0 73438 70707 34724 5373 0 859 0 15 0 0 0 0 21 21 0 0 0 0 874 0 0 0 0 26723
0 250576 58760 20368 16476 128296 0 50936 0 0 80439 51219 36329 4621 0 -1170 0 8 0 0 0 0 22 22 0 0 0 0 1333 0 0 0 0 18508
0 244746 59650 19480 15657 122721 0 49882 0 0 76011 50453 35352 4523 0 201 0 11 0 0 0 0 21 21 0 0 0 0 212 0 0 0 0 19163
0 251724 71715 14049 10920 117255 0 49924 0 0 70173 58040 32328 5058
Re: KVM performance vs. Xen
Avi Kivity wrote: Anthony Liguori wrote: Avi Kivity wrote: 1) I'm seeing about 2.3% in scheduler functions [that I recognize]. Does that seem a bit excessive?

Yes, it is. If there is a lot of I/O, this might be due to the thread pool used for I/O.

This is why I wrote the linux-aio patch. It only reduced CPU consumption by about 2%, although I'm not sure if that's absolute or relative. Andrew?

If I recall correctly, it was 2.4% and relative.

But with 2.3% in scheduler functions, that's what I expected. Was that before or after the entire path was made copyless?

If this is referring to the preadv/writev support, no, I have not tested with that.

-Andrew
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM performance vs. Xen
Avi Kivity wrote: Andrew Theurer wrote: Avi Kivity wrote: What's the typical I/O load (disk and network bandwidth) while the tests are running?

This is average throughput:
network: Tx: 79 MB/sec, Rx: 5 MB/sec

MB as in byte or Mb as in bit?

Byte. There are 4 x 1 Gb adapters, each handling about 20 MB/sec or 160 Mbit/sec.

disk: read: 17 MB/sec, write: 40 MB/sec

This could definitely cause the extra load, especially if it's many small requests (compared to a few large ones).

I don't have the request sizes at my fingertips, but we have to use a lot of disks to support this I/O, so I think it's safe to assume there are a lot more requests than a simple large sequential read/write.

The host hardware: a 2-socket, 8-core Nehalem with SMT and EPT enabled, lots of disks, 4 x 1 Gb Ethernet

CPU time measurements with SMT can vary wildly if the system is not fully loaded. If the scheduler happens to schedule two threads on a single core, both of these threads will generate less work compared to if they were scheduled on different cores.

Understood. Even if at low loads the scheduler does the right thing and spreads out to all the cores first, once it goes beyond 50% util, the CPU util can climb at a much higher rate (compared to a linear increase in work) because it then starts scheduling 2 threads per core, and each thread can do less work. I have always wanted something which could more accurately show the utilization of a processor core, but I guess we have to use what we have today. I will run again with SMT off to see what we get.

On the other hand, without SMT you will get to overcommit much faster, so you'll have scheduling artifacts. Unfortunately there's no good answer here (except to improve the SMT scheduler).

Yes, it is. If there is a lot of I/O, this might be due to the thread pool used for I/O.

I have an older patch which makes a small change to posix_aio_thread.c by trying to keep the thread pool size a bit lower than it is today.
I will dust that off and see if it helps. Really, I think linux-aio support can help here.

Yes, I think that would work for real block devices, but would that help for files? I am using real block devices right now, but it would be nice to also see a benefit for files in a filesystem. Or maybe I am misunderstanding this, and linux-aio can be used on files?

-Andrew

Yes, there is a scheduler tracer, though I have no idea how to operate it. Do you have kvm_stat logs?

Sorry, I don't, but I'll run that next time. BTW, I did not notice a batch/log mode the last time I ran kvm_stat. Or maybe it was not obvious to me. Is there an ideal way to run kvm_stat without a curses-like output?

You're probably using an ancient version:

$ kvm_stat --help
Usage: kvm_stat [options]

Options:
  -h, --help            show this help message and exit
  -1, --once, --batch   run in batch mode for one second
  -l, --log             run in logging mode (like vmstat)
  -f FIELDS, --fields=FIELDS
                        fields to display (regex)
Re: KVM performance vs. Xen
Avi Kivity wrote: Andrew Theurer wrote: I wanted to share some performance data for KVM and Xen. I thought it would be interesting to share some performance results especially compared to Xen, using a more complex situation like heterogeneous server consolidation.

The Workload: The workload is one that simulates a consolidation of servers on to a single host. There are 3 server types: web, imap, and app (j2ee). In addition, there are other "helper" servers which are also consolidated: a db server, which helps out with the app server, and an nfs server, which helps out with the web server (a portion of the docroot is nfs mounted). There is also one other server that is simply idle. All 6 servers make up one set. The first 3 server types are sent requests, which in turn may send requests to the db and nfs helper servers. The request rate is throttled to produce a fixed amount of work. In order to increase utilization on the host, more sets of these servers are used. The clients which send requests also have a response time requirement which is monitored. The following results have passed the response time requirements.

What's the typical I/O load (disk and network bandwidth) while the tests are running?

This is average throughput:
network: Tx: 79 MB/sec, Rx: 5 MB/sec
disk: read: 17 MB/sec, write: 40 MB/sec

The host hardware: a 2-socket, 8-core Nehalem with SMT and EPT enabled, lots of disks, 4 x 1 Gb Ethernet

CPU time measurements with SMT can vary wildly if the system is not fully loaded. If the scheduler happens to schedule two threads on a single core, both of these threads will generate less work compared to if they were scheduled on different cores.

Understood. Even if at low loads the scheduler does the right thing and spreads out to all the cores first, once it goes beyond 50% util, the CPU util can climb at a much higher rate (compared to a linear increase in work) because it then starts scheduling 2 threads per core, and each thread can do less work.
I have always wanted something which could more accurately show the utilization of a processor core, but I guess we have to use what we have today. I will run again with SMT off to see what we get.

Test Results: The throughput is equal in these tests, as the clients throttle the work (this is assuming you don't run out of a resource on the host). What's telling is the CPU used to do the same amount of work:

Xen: 52.85%
KVM: 66.93%

So, KVM requires 66.93/52.85 = 26.6% more CPU to do the same amount of work. Here's the breakdown:

total   user   nice   system   irq    softirq   guest
66.90   7.20   0.00   12.94    0.35   3.39      43.02

Comparing guest time to all other busy time, that's a 23.88/43.02 = 55% overhead for virtualization. I certainly don't expect it to be 0, but 55% seems a bit high. So, what's the reason for this overhead? At the bottom is oprofile output of top functions for KVM. Some observations:

1) I'm seeing about 2.3% in scheduler functions [that I recognize]. Does that seem a bit excessive?

Yes, it is. If there is a lot of I/O, this might be due to the thread pool used for I/O.

I have an older patch which makes a small change to posix_aio_thread.c by trying to keep the thread pool size a bit lower than it is today. I will dust that off and see if it helps.

2) cpu_physical_memory_rw due to not using preadv/pwritev?

I think both virtio-net and virtio-blk use memcpy().

3) vmx_[save|load]_host_state: I take it this is from guest switches?

These are called when you context-switch from a guest, and, much more frequently, when you enter qemu.

We have 180,000 context switches a second. Is this more than expected?

Way more.

Across 16 logical cpus, this is >10,000 cs/sec/cpu. I wonder if schedstats can show why we context switch (need to let someone else run, yielded, waiting on io, etc).

Yes, there is a scheduler tracer, though I have no idea how to operate it. Do you have kvm_stat logs?

Sorry, I don't, but I'll run that next time.
BTW, I did not notice a batch/log mode the last time I ran kvm_stat. Or maybe it was not obvious to me. Is there an ideal way to run kvm_stat without a curses-like output?

-Andrew
Re: KVM performance vs. Xen
Nakajima, Jun wrote: On 4/29/2009 7:41:50 AM, Andrew Theurer wrote: I wanted to share some performance data for KVM and Xen. I thought it would be interesting to share some performance results especially compared to Xen, using a more complex situation like heterogeneous server consolidation.

The Workload: The workload is one that simulates a consolidation of servers on to a single host. There are 3 server types: web, imap, and app (j2ee). In addition, there are other "helper" servers which are also consolidated: a db server, which helps out with the app server, and an nfs server, which helps out with the web server (a portion of the docroot is nfs mounted). There is also one other server that is simply idle. All 6 servers make up one set. The first 3 server types are sent requests, which in turn may send requests to the db and nfs helper servers. The request rate is throttled to produce a fixed amount of work. In order to increase utilization on the host, more sets of these servers are used. The clients which send requests also have a response time requirement which is monitored. The following results have passed the response time requirements.

The host hardware: a 2-socket, 8-core Nehalem with SMT and EPT enabled, lots of disks, 4 x 1 Gb Ethernet

The host software: Both Xen and KVM use the same host Linux OS, SLES11. KVM uses the 2.6.27.19-5-default kernel and Xen uses the 2.6.27.19-5-xen kernel. I have tried 2.6.29 for KVM, but results are actually worse. KVM modules are rebuilt with kvm-85. Qemu is also from kvm-85. Xen version is "3.3.1_18546_12-3.1".

The guest software: All guests are RedHat 5.3. The same disk images are used but different kernels. Xen uses the RedHat Xen kernel and KVM uses 2.6.29 with all paravirt build options enabled. Both use PV I/O drivers. Software used: Apache, PHP, Java, Glassfish, Postgresql, and Dovecot.

Just for clarification. So are you using PV (Xen) Linux on Xen, not HVM? Is that 32-bit or 64-bit?

PV, 64-bit.
-Andrew
KVM performance vs. Xen
_avail_bytes
1651070 0.2623 vmlinux-2.6.27.19-5-default do_select
1643139 0.2611 vmlinux-2.6.27.19-5-default update_curr
1640495 0.2606 vmlinux-2.6.27.19-5-default kmem_cache_free
1606493 0.2552 libpthread-2.9.so pthread_mutex_lock
1549536 0.2462 qemu-system-x86_64 lduw_phys
1535539 0.2440 vmlinux-2.6.27.19-5-default tg_shares_up
1438468 0.2285 vmlinux-2.6.27.19-5-default mwait_idle
1316461 0.2092 vmlinux-2.6.27.19-5-default __down_read
1282486 0.2038 vmlinux-2.6.27.19-5-default native_read_tsc
1226069 0.1948 oprofiled odb_update_node
1224551 0.1946 vmlinux-2.6.27.19-5-default sched_clock_cpu
1222684 0.1943 tun.ko tun_chr_aio_read
1194034 0.1897 vmlinux-2.6.27.19-5-default task_rq_lock
1186129 0.1884 kvm.ko x86_decode_insn
1131644 0.1798 bnx2.ko bnx2_start_xmit
1115575 0.1772 vmlinux-2.6.27.19-5-default enqueue_hrtimer
1044329 0.1659 vmlinux-2.6.27.19-5-default native_sched_clock
988546 0.1571 vmlinux-2.6.27.19-5-default fput
980615 0.1558 vmlinux-2.6.27.19-5-default __up_read
942270 0.1497 qemu-system-x86_64 kvm_run
925076 0.1470 kvm-intel.ko vmcs_writel
889220 0.1413 vmlinux-2.6.27.19-5-default dev_queue_xmit
884786 0.1406 kvm.ko kvm_apic_has_interrupt
880421 0.1399 librt-2.9.so /lib64/librt-2.9.so
880306 0.1399 vmlinux-2.6.27.19-5-default nf_iterate

-Andrew Theurer
boot problems with if=virtio
I know there have been a couple other threads here about booting with if=virtio, but I think this might be a different problem, not sure: I am using kvm.git (41b76d8d0487c26d6d4d3fe53c1ff59b3236f096) and qemu-kvm.git (8f7a30dbc40a1d4c09275566f9ed9647ed1ee50f) and linux 2.6.20-rc3. It appears to build fine. I am trying to run the following command:

name=newcastle-xmailt01
dev1=/dev/disk/by-id/scsi-3600a0b8f1eb1069748d8c230
dev2=/dev/disk/by-id/scsi-3600a0b8f1eb106dc48f45432
macaddr=00:50:56:00:00:06
tap=tap6
cpus=1
mem=1024
/usr/local/bin/qemu-system-x86_64 -name $name \
  -drive file=$dev1,if=virtio,boot=on,cache=none \
  -drive file=$dev2,if=virtio,boot=off,cache=none \
  -m $mem -net nic,model=virtio,vlan=0,macaddr=$macaddr \
  -net tap,vlan=0,ifname=$tap,script=/etc/qemu-ifup \
  -vnc 127.0.0.1:6 -smp $cpus -daemonize

...and I get "Boot failed: could not read the boot disk". This did work with the kvm-userspace.git (kvm-85rc6). I can get this to work with a windows vm, using ide. Was there a recent change to the -drive options that I am missing?

Thanks,
-Andrew
Re: patch for virtual machine oriented scheduling(1)
alex wrote: The following patches provide an extra control (besides the control of the Linux scheduler) over the execution of vcpu threads. In this patch, Xen's credit scheduler (http://wiki.xensource.com/xenwiki/CreditScheduler) is used. The user can use the "cat" and "echo" commands to view and control a guest OS's credit, e.g.,

[r...@localhost ~]# echo "weight=500" > /proc/kvm/12345

will change the credit of the guest whose qemu process has the pid 12345 to 500. The patch consists of 3 parts: 1. modification to the standard KVM 2. modification to the Xen scheduler 3. helper functions

Just wondering, was it not possible to introduce a new scheduling class in the current scheduler? My impression was that the current scheduler was fairly modular and should allow this.

-Andrew

However, some are unnecessary in the latest Linux kernel. The difficulties in the port lie in: 1. Linux does not provide a timer mechanism where the timer function is bound to a dedicated CPU: in case one cpu receives another cpu's schedule-timer expiration, an IPI is used to relay it. 2. Before Linux 2.6.27, smp_call_function_xxx() cannot be re-entered: if kvm is sending an IPI at the time of relaying timer-expiration information, a deadlock would occur in kernel versions below 2.6.27. In my implementation, tasklets are used to run the scheduling function, and a kernel thread is used to send the IPIs (in kernels at or above 2.6.27, this is unnecessary). Originally, this code was developed against the release version of KVM-83. In order to post it, I ported it to the latest .git tree. As a result, modifications to files like external-module-compat-comm.h are omitted. NOTE: 1. Because I do not have an AMD machine, only Intel platforms have been tested. 2.
Because sched_setaffinity() is used (and Linux does not export this symbol), the way of loading the kvm modules is changed to ./myins
Re: EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present
Sheng Yang wrote: Oops... Thanks very much for reporting! I can't believe we weren't aware of that... Could you please try the attached patch? Thanks!

Tested and works great. Thanks!

-Andrew

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..8d6465b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1195,15 +1195,6 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 		vmx_capability.ept, vmx_capability.vpid);
 	}
 
-	if (!cpu_has_vmx_vpid())
-		enable_vpid = 0;
-
-	if (!cpu_has_vmx_ept())
-		enable_ept = 0;
-
-	if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
-		flexpriority_enabled = 0;
-
 	min = 0;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -1307,6 +1298,15 @@ static __init int hardware_setup(void)
 	if (boot_cpu_has(X86_FEATURE_NX))
 		kvm_enable_efer_bits(EFER_NX);
 
+	if (!cpu_has_vmx_vpid())
+		enable_vpid = 0;
+
+	if (!cpu_has_vmx_ept())
+		enable_ept = 0;
+
+	if (!(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
+		flexpriority_enabled = 0;
+
 	return alloc_kvm_area();
 }
EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present
I cannot get EPT support to work on commit 21f65ab2c582594a69dcb1484afa9f88b3414b4f, "KVM: VMX: Zero ept module parameter if ept is not present". I see tons of pf_guest from kvm_stat, whereas the previous commit has none. I am using the "ept=1" module option for kvm-intel. This is on Nehalem processors.

-Andrew

commit diff:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8b1b9b8..96a19f8 100644 (file)
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -265,7 +265,7 @@ static inline int cpu_has_vmx_ept(void)
 
 static inline int vm_need_ept(void)
 {
-	return (cpu_has_vmx_ept() && enable_ept);
+	return enable_ept;
 }
 
 static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
@@ -1205,6 +1205,9 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf)
 	if (!cpu_has_vmx_vpid())
 		enable_vpid = 0;
 
+	if (!cpu_has_vmx_ept())
+		enable_ept = 0;
+
 	min = 0;
 #ifdef CONFIG_X86_64
 	min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v3)
Avi Kivity wrote: KVM currently flushes the tlbs on all cpus when emulating invlpg. This is because at the time of invlpg we lose track of the page, and leaving stale tlb entries could cause the guest to access the page when it is later freed (say after being swapped out). However, we have a second chance to flush the tlbs, when an mmu notifier is called to let us know the host pte has been invalidated. We can safely defer the flush to this point, which occurs much less frequently. Of course, we still do a local tlb flush when emulating invlpg.

I should be able to run some performance comparisons with this in the next day or two.

-Andrew

Signed-off-by: Avi Kivity
---
Changes from v2:
- dropped remote flushes from guest pagetable write protect paths
- fixed up memory barriers
- use existing local tlb flush in invlpg, no need to add another one

 arch/x86/kvm/mmu.c         |  3 +--
 arch/x86/kvm/paging_tmpl.h |  5 +
 include/linux/kvm_host.h   |  2 ++
 virt/kvm/kvm_main.c        | 17 +++--
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a36f7f..f0ea56c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
 	for_each_sp(pages, sp, parents, i)
 		protected |= rmap_write_protect(vcpu->kvm, sp->gfn);
 
-	if (protected)
-		kvm_flush_remote_tlbs(vcpu->kvm);
+	kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);
 
 	for_each_sp(pages, sp, parents, i) {
 		kvm_sync_page(vcpu, sp);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 855eb71..2273b26 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -445,7 +445,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 	gpa_t pte_gpa = -1;
 	int level;
 	u64 *sptep;
-	int need_flush = 0;
 
 	spin_lock(&vcpu->kvm->mmu_lock);
 
@@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 			rmap_remove(vcpu->kvm, sptep);
 			if (is_large_pte(*sptep))
 				--vcpu->kvm->stat.lpages;
-			need_flush = 1;
+			vcpu->kvm->remote_tlbs_dirty = true;
 		}
 		set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
 		break;
@@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
 		break;
 	}
 
-	if (need_flush)
-		kvm_flush_remote_tlbs(vcpu->kvm);
 	spin_unlock(&vcpu->kvm->mmu_lock);
 
 	if (pte_gpa == -1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 11eb702..b779c57 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -125,6 +125,7 @@ struct kvm_kernel_irq_routing_entry {
 struct kvm {
 	struct mutex lock; /* protects the vcpus array and APIC accesses */
 	spinlock_t mmu_lock;
+	bool remote_tlbs_dirty;
 	struct rw_semaphore slots_lock;
 	struct mm_struct *mm; /* userspace tied to this vm */
 	int nmemslots;
@@ -235,6 +236,7 @@ void kvm_resched(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_flush_remote_tlbs(struct kvm *kvm);
+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond);
 void kvm_reload_remote_mmus(struct kvm *kvm);
 
 long kvm_arch_dev_ioctl(struct file *filp,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68b217e..12afa50 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, unsigned int req)
 
 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
+	kvm->remote_tlbs_dirty = false;
+	smp_wmb();
 	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
 		++kvm->stat.remote_tlb_flush;
 }
 
+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
+{
+	if (cond || kvm->remote_tlbs_dirty)
+		kvm_flush_remote_tlbs(kvm);
+}
+
 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
 	make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
@@ -841,8 +849,7 @@ static void kvm_mmu_notifier_invalidate_page(struct mmu_notifier *mn,
 	spin_unlock(&kvm->mmu_lock);
 
 	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush)
-		kvm_flush_remote_tlbs(kvm);
+	kvm_flush_remote_tlbs_cond(kvm, need_tlb_flush);
 }
 
@@ -866,8 +873,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
 	spin_unlock(&kvm->mmu_lock);
 
 	/* we've to flush the tlb before the pages can be freed */
-	if (need_tlb_flush)
-		kvm_flush_