Re: [PATCHv2/RFC] kvm/irqchip: Speed up KVM_SET_GSI_ROUTING

2014-02-20 Thread Andrew Theurer

> Il 17/01/2014 09:29, Christian Borntraeger ha scritto:
> > Michael,
> > do you have a quick way to check if srcu has a noticeable impact on int
> > injection on your systems? I am happy with either v2 or v3 of the patch,
> > but synchronize_srcu_expedited seems to have less latency impact on the
> > full system than synchronize_rcu_expedited. This might give Paolo a hint
> > which of the patches is the right way to go.
> 
> Hi all,
> 
> I've asked Andrew Theurer to run network tests on a 10G connection (TCP
> request/response to check for performance, TCP streaming for host CPU
> utilization).
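For anyone not following the patch itself: the cost being measured is the
grace-period wait on the KVM_SET_GSI_ROUTING update path.  Very roughly, and
purely as a user-space illustration (this is not the actual
kvm_set_irq_routing() code; wait_for_readers() is just a stand-in for
synchronize_rcu() / synchronize_rcu_expedited() / synchronize_srcu_expedited(),
which is exactly the knob being compared):

#include <stdlib.h>
#include <unistd.h>

struct irq_routing_table { int nr_entries; /* ... */ };

static struct irq_routing_table *current_table;   /* read on the injection path */

/* Stand-in for the grace-period wait whose latency is being compared. */
static void wait_for_readers(void)
{
        usleep(1000);
}

static void set_routing(struct irq_routing_table *new_table)
{
        struct irq_routing_table *old = current_table;

        current_table = new_table;   /* rcu_assign_pointer() in the kernel */
        wait_for_readers();          /* every KVM_SET_GSI_ROUTING call pays this */
        free(old);                   /* old table is safe to free only now */
}

int main(void)
{
        for (int i = 0; i < 3; i++)
                set_routing(calloc(1, sizeof(struct irq_routing_table)));
        free(current_table);
        return 0;
}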

I am hoping to have some results some time tomorrow (Friday).

-Andrew



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-26 Thread Andrew Theurer
On Wed, 2013-06-26 at 15:52 +0300, Gleb Natapov wrote:
> On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
> > On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
> > > On 06/25/2013 08:20 PM, Andrew Theurer wrote:
> > > >On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> > > >>This series replaces the existing paravirtualized spinlock mechanism
> > > >>with a paravirtualized ticketlock mechanism. The series provides
> > > >>implementation for both Xen and KVM.
> > > >>
> > > >>Changes in V9:
> > > >>- Changed spin_threshold to 32k to avoid excess halt exits that are
> > > >>causing undercommit degradation (after PLE handler improvement).
> > > >>- Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> > > >>- Optimized halt exit path to use PLE handler
> > > >>
> > > >>V8 of PVspinlock was posted last year. After Avi's suggestions to look
> > > >>at PLE handler's improvements, various optimizations in PLE handling
> > > >>have been tried.
> > > >
> > > >Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
> > > >patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
> > > >tested these patches with and without PLE, as PLE is still not scalable
> > > >with large VMs.
> > > >
> > > 
> > > Hi Andrew,
> > > 
> > > Thanks for testing.
> > > 
> > > >System: x3850X5, 40 cores, 80 threads
> > > >
> > > >
> > > >1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
> > > >------------------------------------------------------------
> > > >                                   Total
> > > >Configuration            Throughput(MB/s)   Notes
> > > >
> > > >3.10-default-ple_on          22945   5% CPU in host kernel, 2% spin_lock in guests
> > > >3.10-default-ple_off         23184   5% CPU in host kernel, 2% spin_lock in guests
> > > >3.10-pvticket-ple_on         22895   5% CPU in host kernel, 2% spin_lock in guests
> > > >3.10-pvticket-ple_off        23051   5% CPU in host kernel, 2% spin_lock in guests
> > > >[all 1x results look good here]
> > > 
> > > Yes. The 1x results look too close
> > > 
> > > >
> > > >
> > > >2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
> > > >-------------------------------------------------------------
> > > >                                   Total
> > > >Configuration            Throughput(MB/s)   Notes
> > > >
> > > >3.10-default-ple_on           6287   55% CPU in host kernel, 17% spin_lock in guests
> > > >3.10-default-ple_off          1849   2% CPU in host kernel, 95% spin_lock in guests
> > > >3.10-pvticket-ple_on          6691   50% CPU in host kernel, 15% spin_lock in guests
> > > >3.10-pvticket-ple_off        16464   8% CPU in host kernel, 33% spin_lock in guests
> > > 
> > > I see 6.426% improvement with ple_on
> > > and 161.87% improvement with ple_off. I think this is a very good sign
> > >  for the patches
> > > 
> > > >[PLE hinders pv-ticket improvements, but even with PLE off,
> > > > we are still off from ideal throughput (somewhere >2)]
> > > >
> > > 
> > > Okay, the ideal throughput you are referring to is getting at least
> > > around 80% of the 1x throughput for over-commit. Yes, we are still far
> > > away from there.
> > > 
> > > >
> > > >1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
> > > >------------------------------------------------------------
> > > >                                   Total
> > > >Configuration            Throughput(MB/s)   Notes
> > > >
> > > >3.10-default-ple_on          22736   6% CPU in host kernel, 3% spin_lock in guests
> > > >3.10-default-ple_off  

Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-25 Thread Andrew Theurer
On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
> This series replaces the existing paravirtualized spinlock mechanism
> with a paravirtualized ticketlock mechanism. The series provides
> implementation for both Xen and KVM.
> 
> Changes in V9:
> - Changed spin_threshold to 32k to avoid excess halt exits that are
>causing undercommit degradation (after PLE handler improvement).
> - Added  kvm_irq_delivery_to_apic (suggested by Gleb)
> - Optimized halt exit path to use PLE handler
> 
> V8 of PVspinlock was posted last year. After Avi's suggestions to look
> at PLE handler's improvements, various optimizations in PLE handling
> have been tried.

Sorry for not posting this sooner.  I have tested the v9 pv-ticketlock
patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs.  I have
tested these patches with and without PLE, as PLE is still not scalable
with large VMs.

System: x3850X5, 40 cores, 80 threads


1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
------------------------------------------------------------
                                  Total
Configuration           Throughput(MB/s)   Notes

3.10-default-ple_on         22945   5% CPU in host kernel, 2% spin_lock in guests
3.10-default-ple_off        23184   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_on        22895   5% CPU in host kernel, 2% spin_lock in guests
3.10-pvticket-ple_off       23051   5% CPU in host kernel, 2% spin_lock in guests
[all 1x results look good here]


2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
-------------------------------------------------------------
                                  Total
Configuration           Throughput(MB/s)   Notes

3.10-default-ple_on          6287   55% CPU in host kernel, 17% spin_lock in guests
3.10-default-ple_off         1849   2% CPU in host kernel, 95% spin_lock in guests
3.10-pvticket-ple_on         6691   50% CPU in host kernel, 15% spin_lock in guests
3.10-pvticket-ple_off       16464   8% CPU in host kernel, 33% spin_lock in guests
[PLE hinders pv-ticket improvements, but even with PLE off,
 we are still off from ideal throughput (somewhere >2)]


1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
------------------------------------------------------------
                                  Total
Configuration           Throughput(MB/s)   Notes

3.10-default-ple_on         22736   6% CPU in host kernel, 3% spin_lock in guests
3.10-default-ple_off        23377   5% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_on        22471   6% CPU in host kernel, 3% spin_lock in guests
3.10-pvticket-ple_off       23445   5% CPU in host kernel, 3% spin_lock in guests
[1x looking fine here]


2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
------------------------------------------------------------
                                  Total
Configuration           Throughput(MB/s)   Notes

3.10-default-ple_on          1965   70% CPU in host kernel, 34% spin_lock in guests
3.10-default-ple_off          226   2% CPU in host kernel, 94% spin_lock in guests
3.10-pvticket-ple_on         1942   70% CPU in host kernel, 35% spin_lock in guests
3.10-pvticket-ple_off        8003   11% CPU in host kernel, 70% spin_lock in guests
[quite bad all around, but pv-tickets with PLE off the best so far.
 Still quite a bit off from ideal throughput]

In summary, I would state that the pv-ticket is an overall win, but the
current PLE handler tends to "get in the way" on these larger guests.

-Andrew



Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks

2013-06-07 Thread Andrew Theurer
ment in 4x
> 
> +-----------+-----------+-----------+-----------+---------------+
>                  ebizzy (records/sec) higher is better
> +-----------+-----------+-----------+-----------+---------------+
>     base        stdev      patched     stdev     %improvement
> +-----------+-----------+-----------+-----------+---------------+
>   5574.9000    237.4997    523.7000     1.4181     -90.60611
>   2741.5000    561.3090    597.8000    34.9755     -78.19442
>   2146.2500    216.7718    902.6667    82.4228     -57.94215
>   1663.0000    141.9235   1245.0000    67.2989     -25.13530
> +-----------+-----------+-----------+-----------+---------------+
> +-----------+-----------+-----------+-----------+---------------+
>                   dbench (Throughput) higher is better
> +-----------+-----------+-----------+-----------+---------------+
>     base        stdev      patched     stdev     %improvement
> +-----------+-----------+-----------+-----------+---------------+
>  14111.5600    754.4525    884.9051    24.4723     -93.72922
>   2481.6270     71.2665   2383.5700   333.2435      -3.95132
>   1510.2483     31.8634   1477.7358    50.5126      -2.15279
>   1029.4875     16.9166   1075.9225    13.9911       4.51050
> +-----------+-----------+-----------+-----------+---------------+
> 
> 
> IMO, the hash-based timeout is worth trying further.
> I think a little more tuning will get better results.

The problem I see (especially for dbench) is that we are still way off
what I would consider the goal.  IMO, the 2x over-commit result should be
only a bit lower than 50% of the 1x result (to account for switching
overhead and less cache warmth).  We are at about 17.5% for 2x
(2481.6 / 14111.6 in the dbench numbers above).  I am thinking we need a
completely different approach to get there, but of course I do not know
what that is yet :)

I am testing your patches now and hopefully with some analysis data we
can better understand what's going on.
> 
> Jiannan, When you start working on this, I can also help
> to get best of preemptable lock idea if you wish and share
> the patches I tried.

-Andrew Theurer



Re: Preemptable Ticket Spinlock

2013-04-26 Thread Andrew Theurer
3532]  [] vfs_stat+0x16/0x20
> [ 2144.673534]  [] sys_newstat+0x1f/0x50
> [ 2144.673538]  [] ? __audit_syscall_exit+0x246/0x2f0
> [ 2144.673541]  [] ? __audit_syscall_entry+0x8c/0xf0
> [ 2144.673543]  [] system_call_fastpath+0x16/0x1b

This is on a 40 core / 80 thread Westmere-EX with 16 VMs, each VM having
20 vCPUs (so 4x over-commit).  All VMs run dbench in tmpfs, which is a
pretty good test for spinlock preempt problems.  I had PLE enabled for
the test.

When you re-base your patches I will try it again.

Thanks,

-Andrew Theurer





Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-28 Thread Andrew Theurer
> > > > >   } 
> > > > > 
> > > > > -out: 
> > > > > +out_unlock: 
> > > > >   double_rq_unlock(rq, p_rq); 
> > > > > +out_irq: 
> > > > >   local_irq_restore(flags); 
> > > > > 
> > > > > -if (yielded) 
> > > > > +if (yielded > 0) 
> > > > >   schedule(); 
> > > > > 
> > > > >   return yielded; 
> > > > > 
> > > > 
> > > > Acked-by: Andrew Jones  
> > > > 
> > > 
> > > Thank you Drew. 
> > > 
> > > Marcelo Gleb.. Please let me know if you have comments / concerns
> > > on the patches.. 
> > > 
> > > Andrew, Vinod, IMO, the patch set looks good for undercommit
> > > scenarios 
> > > especially for large guests where we do have overhead of vcpu
> > > iteration 
> > > of ple handler.. 
> > > 
> > > . 
> > > 
> > Thanks Raghu. Will try to get this latest patch set evaluated and
> > get back to you. 
> > 
> > 
> Hi Raghu,
> 
> Here is some preliminary data with your latest set of  PLE patches (&
> also with Andrew's throttled yield_to() change).
> 
> Ran a single guest on an 80 core Westmere platform. [Note: Host and
> guest had the latest kernel from kvm.git and were also using the latest
> qemu from qemu.git as of yesterday morning.]
> 
> The guest was running an AIM7 high_systime workload. (Note:
> high_systime is a kernel-intensive micro-benchmark, but in this case it
> was run just as a workload in the guest to trigger spinlock etc.
> contention in the guest OS and hence PLE; i.e. this is not a real
> benchmark run.) I have run this workload with a constant number of users
> (2000) with 100 jobs per user. The numbers below represent the number of
> jobs per minute (JPM) - higher is better.
> 
>                                  40VCPU   60VCPU   80VCPU
> 
> a) 3.7.0-rc6+ w/ ple_gap=0       ~102K    ~88K     ~81K
> 
> b) 3.7.0-rc6+                    ~53K     ~25K     ~18-20K
> 
> c) 3.7.0-rc6+ w/ PLE patches     ~100K    ~81K     ~48K-69K  <- lot of variation
>                                                                 from run to run.
> 
> d) 3.7.0-rc6+ w/ throttled       ~101K    ~87K     ~78K
>    yield_to() change
> 

FYI here's the latest throttled yield_to() patch (the one Vinod tested).

Signed-off-by: Andrew Theurer 

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ecc5543..61d12ea 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -192,6 +192,7 @@ struct kvm_vcpu {
int mode;
unsigned long requests;
unsigned long guest_debug;
+   unsigned long last_yield_to;
 
struct mutex mutex;
struct kvm_run *run;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index be70035..987a339 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -49,6 +49,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -222,6 +223,7 @@ int kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, 
unsigned id)
vcpu->kvm = kvm;
vcpu->vcpu_id = id;
vcpu->pid = NULL;
+   vcpu->last_yield_to = 0;
init_waitqueue_head(&vcpu->wq);
kvm_async_pf_vcpu_init(vcpu);
 
@@ -1708,29 +1710,38 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
 
kvm_vcpu_set_in_spin_loop(me, true);
/*
+* A yield_to() can be quite expensive, so we try to limit
+* its use to just 1 per jiffie.
+*/
+   if (me->last_yield_to == jiffies)
+   yield();
+   else {
+   /*
 * We boost the priority of a VCPU that is runnable but not
 * currently running, because it got preempted by something
 * else and called schedule in __vcpu_run.  Hopefully that
 * VCPU is holding the lock that we need and will release it.
 * We approximate round-robin by starting at the last boosted VCPU.
 */
-   for (pass = 0; pass < 2 && !yielded; pass++) {
-   kvm_for_each_vcpu(i, vcpu, kvm) {
-   if (!pass && i <= last_boosted_vcpu) {
-   i = last_boosted_vcpu;
-   continue;
-   } else if (pass && i > last_boosted_vcpu)
-   break;
-   if (vcpu == me)
-   continue;
-   if (waitqueue_active(&vcpu->wq))
-   continue;
-   if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
-   continue;
-   if (kvm_vcpu_yield_to(vcpu)) {
-   kvm->last_boosted_vcpu = i;
-   yielded = 1;
-   break;
+   for (pass = 0; pass < 2 && !yielded; pass++) {
+   kvm_for_each_vcpu(i, vcpu, kvm) {
+   if (!pass && i <= last_boosted_vcpu) {
+   i = last_boosted_vcpu;
+   continue;
+   } else if (pass && i > last_boosted_vcpu)
+   break;
+   if (vcpu == me)
+   continue;
+   if (waitqueue_active(&vcpu->wq))
+   continue;
+   if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
+   continue;
+   if (kvm_vcpu_yield_to(vcpu)) {
+   kvm->last_boosted_vcpu = i;
+   me->last_yield_to = jiffies;
+   yielded = 1;
+   break;
+   }
}
}
}
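To make the intent of the jiffies check clear outside of kernel context, here
is a user-space analogue of the same throttling pattern (illustrative only --
the names and the HZ value are made up; in the patch the timestamp is jiffies,
the cheap path is a plain yield(), and the expensive path is the directed
yield_to() scan):

#include <stdio.h>
#include <time.h>

#define HZ 250                              /* stand-in tick rate */

static long last_expensive_tick = -1;

static long current_tick(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * HZ + ts.tv_nsec / (1000000000L / HZ);
}

static void on_spin_detected(void)
{
        long now = current_tick();

        if (now == last_expensive_tick) {
                puts("cheap path (plain yield)");           /* rest of this tick   */
        } else {
                last_expensive_tick = now;
                puts("expensive path (directed yield_to)"); /* at most once a tick */
        }
}

int main(void)
{
        for (int i = 0; i < 5; i++)
                on_spin_detected();
        return 0;
}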





Re: [PATCH V3 RFC 1/2] sched: Bail out of yield_to when source and target runqueue has one task

2012-11-27 Thread Andrew Theurer
On Tue, 2012-11-27 at 16:00 +0530, Raghavendra K T wrote:
> On 11/26/2012 07:05 PM, Andrew Jones wrote:
> > On Mon, Nov 26, 2012 at 05:37:54PM +0530, Raghavendra K T wrote:
> >> From: Peter Zijlstra 
> >>
> >> In case of undercomitted scenarios, especially in large guests
> >> yield_to overhead is significantly high. when run queue length of
> >> source and target is one, take an opportunity to bail out and return
> >> -ESRCH. This return condition can be further exploited to quickly come
> >> out of PLE handler.
> >>
> >> (History: Raghavendra initially worked on break out of kvm ple handler upon
> >>   seeing source runqueue length = 1, but it had to export rq length).
> >>   Peter came up with the elegant idea of return -ESRCH in scheduler core.
> >>
> >> Signed-off-by: Peter Zijlstra 
> >> Raghavendra, Checking the rq length of target vcpu condition added.(thanks 
> >> Avi)
> >> Reviewed-by: Srikar Dronamraju 
> >> Signed-off-by: Raghavendra K T 
> >> ---
> >>
> >>   kernel/sched/core.c |   25 +++--
> >>   1 file changed, 19 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8927f..fc219a5 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -4289,7 +4289,10 @@ EXPORT_SYMBOL(yield);
> >>* It's the caller's job to ensure that the target task struct
> >>* can't go away on us before we can do any checks.
> >>*
> >> - * Returns true if we indeed boosted the target task.
> >> + * Returns:
> >> + *true (>0) if we indeed boosted the target task.
> >> + *false (0) if we failed to boost the target.
> >> + *-ESRCH if there's no task to yield to.
> >>*/
> >>   bool __sched yield_to(struct task_struct *p, bool preempt)
> >>   {
> >> @@ -4303,6 +4306,15 @@ bool __sched yield_to(struct task_struct *p, bool 
> >> preempt)
> >>
> >>   again:
> >>p_rq = task_rq(p);
> >> +  /*
> >> +   * If we're the only runnable task on the rq and target rq also
> >> +   * has only one task, there's absolutely no point in yielding.
> >> +   */
> >> +  if (rq->nr_running == 1 && p_rq->nr_running == 1) {
> >> +  yielded = -ESRCH;
> >> +  goto out_irq;
> >> +  }
> >> +
> >>double_rq_lock(rq, p_rq);
> >>while (task_rq(p) != p_rq) {
> >>double_rq_unlock(rq, p_rq);
> >> @@ -4310,13 +4322,13 @@ again:
> >>}
> >>
> >>if (!curr->sched_class->yield_to_task)
> >> -  goto out;
> >> +  goto out_unlock;
> >>
> >>if (curr->sched_class != p->sched_class)
> >> -  goto out;
> >> +  goto out_unlock;
> >>
> >>if (task_running(p_rq, p) || p->state)
> >> -  goto out;
> >> +  goto out_unlock;
> >>
> >>yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >>if (yielded) {
> >> @@ -4329,11 +4341,12 @@ again:
> >>resched_task(p_rq->curr);
> >>}
> >>
> >> -out:
> >> +out_unlock:
> >>double_rq_unlock(rq, p_rq);
> >> +out_irq:
> >>local_irq_restore(flags);
> >>
> >> -  if (yielded)
> >> +  if (yielded > 0)
> >>schedule();
> >>
> >>return yielded;
> >>
> >
> > Acked-by: Andrew Jones 
> >
> 
> Thank you Drew.
> 
> Marcelo Gleb.. Please let me know if you have comments / concerns on the 
> patches..
> 
> Andrew, Vinod, IMO, the patch set looks good for undercommit scenarios
> especially for large guests where we do have overhead of vcpu iteration
> of ple handler..

I agree, looks fine for undercommit scenarios.  I do wonder what happens
with 1.5x overcommit, where we might see 1/2 of the host cpus with a
runqueue length of 2 and the other 1/2 with a runqueue length of 1.  Even
with this change that scenario still might be fine, but it would be nice
to see a comparison.

-Andrew




Re: [PATCH V2 RFC 0/3] kvm: Improving undercommit,overcommit scenarios

2012-10-30 Thread Andrew Theurer
   Handle yield_to failure return for potential undercommit case
>   Check system load and handle different commit cases accordingly
> 
>  Please let me know your comments and suggestions.
> 
>  Link for V1:
>  https://lkml.org/lkml/2012/9/21/168
> 
>  kernel/sched/core.c | 25 +++--
>  virt/kvm/kvm_main.c | 56 
> ++--
>  2 files changed, 65 insertions(+), 16 deletions(-)

-Andrew Theurer




Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-19 Thread Andrew Theurer
On Fri, 2012-10-19 at 14:00 +0530, Raghavendra K T wrote:
> On 10/15/2012 08:04 PM, Andrew Theurer wrote:
> > On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> >> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >>>> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>>>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>>>> * Avi Kivity  [2012-10-04 17:00:28]:
> >>>>>>
> >>>>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>>>
> >> [...]
> >>>>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>>>> has just terrible scalability to begin with.  I do not think we should
> >>>>> try to optimize such a bad workload.
> >>>>>
> >>>>
> >>>> I think my way of running dbench has some flaw, so I went to ebizzy.
> >>>> Could you let me know how you generally run dbench?
> >>>
> >>> I mount a tmpfs and then specify that mount for dbench to run on.  This
> >>> eliminates all IO.  I use a 300 second run time and number of threads is
> >>> equal to number of vcpus.  All of the VMs of course need to have a
> >>> synchronized start.
> >>>
> >>> I would also make sure you are using a recent kernel for dbench, where
> >>> the dcache scalability is much improved.  Without any lock-holder
> >>> preemption, the time in spin_lock should be very low:
> >>>
> >>>
> >>>>   21.54%  78016 dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>>>    3.51%  12723 dbench  libc-2.12.so        [.] __strchr_sse42
> >>>>    2.81%  10176 dbench  dbench              [.] child_run
> >>>>    2.54%   9203 dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>>>    2.33%   8423 dbench  dbench              [.] next_token
> >>>>    2.02%   7335 dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>>>    1.89%   6850 dbench  libc-2.12.so        [.] __strstr_sse42
> >>>>    1.53%   5537 dbench  libc-2.12.so        [.] __memset_sse2
> >>>>    1.47%   5337 dbench  [kernel.kallsyms]   [k] link_path_walk
> >>>>    1.40%   5084 dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>>>    1.38%   5009 dbench  libc-2.12.so        [.] memmove
> >>>>    1.24%   4496 dbench  libc-2.12.so        [.] vfprintf
> >>>>    1.15%   4169 dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >>>
> >>
> >> Hi Andrew,
> >> I ran the test with dbench with tmpfs. I do not see any improvements in
> >> dbench for 16k ple window.
> >>
> >> So it seems apart from ebizzy no workload benefited by that. and I
> >> agree that, it may not be good to optimize for ebizzy.
> >> I shall drop changing to 16k default window and continue with other
> >> original patch series. Need to experiment with latest kernel.
> >
> > Thanks for running this again.  I do believe there are some workloads,
> > when run at 1x overcommit, would benefit from a larger ple_window [with
> > he current ple handling code], but I do not also want to potentially
> > degrade >1x with a larger window.  I do, however, think there may be a
> > another option.  I have not fully worked this out, but I think I am on
> > to something.
> >
> > I decided to revert back to just a yield() instead of a yield_to().  My
> > motivation was that yield_to() [for large VMs] is like a dog chasing its
> > tail, round and round we go   Just yield(), in particular a yield()
> > which results in yielding to something -other- than the current VM's
> > vcpus, helps synchronize the execution of sibling vcpus by deferring
> > them until the lock holder vcpu is running again.  The more we can do to
> > get all vcpus running at the same time, the far less we deal with the
> > preemption problem.  The othe

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-15 Thread Andrew Theurer
On Mon, 2012-10-15 at 17:40 +0530, Raghavendra K T wrote:
> On 10/11/2012 01:06 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> >> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> >>> On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >>>> * Avi Kivity  [2012-10-04 17:00:28]:
> >>>>
> >>>>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>>>
> [...]
> >>> A big concern I have (if this is 1x overcommit) for ebizzy is that it
> >>> has just terrible scalability to begin with.  I do not think we should
> >>> try to optimize such a bad workload.
> >>>
> >>
> >> I think my way of running dbench has some flaw, so I went to ebizzy.
> >> Could you let me know how you generally run dbench?
> >
> > I mount a tmpfs and then specify that mount for dbench to run on.  This
> > eliminates all IO.  I use a 300 second run time and number of threads is
> > equal to number of vcpus.  All of the VMs of course need to have a
> > synchronized start.
> >
> > I would also make sure you are using a recent kernel for dbench, where
> > the dcache scalability is much improved.  Without any lock-holder
> > preemption, the time in spin_lock should be very low:
> >
> >
> >>  21.54%  78016 dbench  [kernel.kallsyms]   [k] copy_user_generic_unrolled
> >>   3.51%  12723 dbench  libc-2.12.so        [.] __strchr_sse42
> >>   2.81%  10176 dbench  dbench              [.] child_run
> >>   2.54%   9203 dbench  [kernel.kallsyms]   [k] _raw_spin_lock
> >>   2.33%   8423 dbench  dbench              [.] next_token
> >>   2.02%   7335 dbench  [kernel.kallsyms]   [k] __d_lookup_rcu
> >>   1.89%   6850 dbench  libc-2.12.so        [.] __strstr_sse42
> >>   1.53%   5537 dbench  libc-2.12.so        [.] __memset_sse2
> >>   1.47%   5337 dbench  [kernel.kallsyms]   [k] link_path_walk
> >>   1.40%   5084 dbench  [kernel.kallsyms]   [k] kmem_cache_alloc
> >>   1.38%   5009 dbench  libc-2.12.so        [.] memmove
> >>   1.24%   4496 dbench  libc-2.12.so        [.] vfprintf
> >>   1.15%   4169 dbench  [kernel.kallsyms]   [k] __audit_syscall_exit
> >
> 
> Hi Andrew,
> I ran the test with dbench with tmpfs. I do not see any improvements in
> dbench for 16k ple window.
> 
> So it seems apart from ebizzy no workload benefited by that. and I
> agree that, it may not be good to optimize for ebizzy.
> I shall drop changing to 16k default window and continue with other
> original patch series. Need to experiment with latest kernel.

Thanks for running this again.  I do believe there are some workloads
that, when run at 1x overcommit, would benefit from a larger ple_window
[with the current ple handling code], but I also do not want to
potentially degrade >1x with a larger window.  I do, however, think there
may be another option.  I have not fully worked this out, but I think I
am on to something.

I decided to revert back to just a yield() instead of a yield_to().  My
motivation was that yield_to() [for large VMs] is like a dog chasing its
tail: round and round we go.  Just yield(), in particular a yield()
which results in yielding to something -other- than the current VM's
vcpus, helps synchronize the execution of sibling vcpus by deferring
them until the lock holder vcpu is running again.  The more we can do to
get all vcpus running at the same time, the less we have to deal with the
preemption problem.  The other benefit is that yield() is far, far lower
overhead than yield_to().

This does assume that vcpus from the same VM do not share runqueues.
Yielding to a sibling vcpu with yield() is not productive for larger VMs,
in the same way that yield_to() is not.  My recent results include
restricting vcpu placement so that sibling vcpus do not get to run on
the same runqueue.  I do believe we could implement an initial placement
and load balance policy to strive for this restriction (making it purely
optional, but I bet it could also help user apps which use spin locks).

For 1x VMs which still vm_exit due to PLE, I believe we could probably
just leave the ple_window alone, as long as we mostly use yield()
instead of yield_to().  The problem with the unneeded exits in this case
has been the overhead in r

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer
On Wed, 2012-10-10 at 23:24 +0530, Raghavendra K T wrote:
> On 10/10/2012 08:29 AM, Andrew Theurer wrote:
> > On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> >> * Avi Kivity  [2012-10-04 17:00:28]:
> >>
> >>> On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> >>>> On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> >>>>>
> >>>>> Again the numbers are ridiculously high for arch_local_irq_restore.
> >>>>> Maybe there's a bad perf/kvm interaction when we're injecting an
> >>>>> interrupt, I can't believe we're spending 84% of the time running the
> >>>>> popf instruction.
> >>>>
> >>>> Smells like a software fallback that doesn't do NMI, hrtimer based
> >>>> sampling typically hits popf where we re-enable interrupts.
> >>>
> >>> Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
> >>> is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
> >>> host will expose it (and a good idea anyway to get best performance).
> >>>
> >>
> >> Hi Avi, you are right. SandyBridge machine result was not proper.
> >> I cleaned up the services, enabled PMU, re-ran all the test again.
> >>
> >> Here is the summary:
> >> We do get good benefit by increasing ple window. Though we don't
> >> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> >> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> >>
> >> Let me know if you think we can increase the default ple_window
> >> itself to 16k.
> >>
> >> I am experimenting with V2 version of undercommit improvement(this) patch
> >> series, But I think if you wish  to go for increase of
> >> default ple_window, then we would have to measure the benefit of patches
> >> when ple_window = 16k.
> >>
> >> I can respin the whole series including this default ple_window change.
> >>
> >> I also have the perf kvm top result for both ebizzy and kernbench.
> >> I think they are in expected lines now.
> >>
> >> Improvements
> >> 
> >>
> >> 16 core PLE machine with 16 vcpu guest
> >>
> >> base = 3.6.0-rc5 + ple handler optimization patches
> >> base_pleopt_16k = base + ple_window = 16k
> >> base_pleopt_32k = base + ple_window = 32k
> >> base_pleopt_nople = base + ple_gap = 0
> >> kernbench, hackbench, sysbench (time in sec lower is better)
> >> ebizzy (rec/sec higher is better)
> >>
> >> % improvements w.r.t base (ple_window = 4k)
> >> ---+---+-+---+
> >> |base_pleopt_16k| base_pleopt_32k | base_pleopt_nople |
> >> ---+---+-+---+
> >> kernbench_1x   |  0.42371  |  1.15164|   0.09320 |
> >> kernbench_2x   | -1.40981  | -17.48282   |  -570.77053   |
> >> ---+---+-+---+
> >> sysbench_1x| -0.92367  | 0.24241 | -0.27027  |
> >> sysbench_2x| -2.22706  |-0.30896 | -1.27573  |
> >> sysbench_3x| -0.75509  | 0.09444 | -2.97756  |
> >> ---+---+-+---+
> >> ebizzy_1x  | 54.99976  | 67.29460|  74.14076 |
> >> ebizzy_2x  | -8.83386  |-27.38403| -96.22066 |
> >> ---+---+-+---+
> >>
> >> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window)
> >> 
> >
> > Is the perf data for 1x overcommit?
> 
> Yes, 16vcpu guest on 16 core
> 
> >
> >> pleopt   ple_gap=0
> >> 
> >> ebizzy : 18131 records/s
> >> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
> >>  5.65%  [guest.kernel]  [g] smp_call_function_many
> >>  3.12%  [guest.kernel]  [g] clear_page
> >>  3.02%  [guest.kernel]  [g] down_read_trylock
> >>  1.85%  [guest.kernel]  [g] async_page_fault
> >>  1.81%  [guest.kernel]  [g] up_read
> >>  1.76%  [guest.kernel]  [g] native_apic_mem_write
> >>  1.70%  [guest.kernel]  [g] find_v

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer
On Wed, 2012-10-10 at 23:13 +0530, Raghavendra K T wrote:
> On 10/10/2012 07:54 PM, Andrew Theurer wrote:
> > I ran 'perf sched map' on the dbench workload for medium and large VMs,
> > and I thought I would share some of the results.  I think it helps to
> > visualize what's going on regarding the yielding.
> >
> > These files are png bitmaps, generated from processing output from 'perf
> > sched map' (and perf data generated from 'perf sched record').  The Y
> > axis is the host cpus, each row being 10 pixels high.  For these tests,
> > there are 80 host cpus, so the total height is 800 pixels.  The X axis
> > is time (in microseconds), with each pixel representing 1 microsecond.
> > Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
> > obviously, and zooming in/out while viewing is recommended.
> >
> > Each row (each host cpu) is assigned a color based on what thread is
> > running.  vCPUs of the same VM are assigned a common color (like red,
> > blue, magenta, etc), and each vCPU has a unique brightness for that
> > color.  There are a maximum of 12 assignable colors, so in any VMs >12
> > revert to vCPU color of gray. I would use more colors, but it becomes
> > harder to distinguish one color from another.  The white color
> > represents missing data from perf, and black color represents any thread
> > which is not a vCPU.
> >
> > For the following tests, VMs were pinned to host NUMA nodes and to
> > specific cpus to help with consistency and operate within the
> > constraints of the last test (gang scheduler).
> >
> > Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
> > described above only 12 of the VMs have a color, rest are gray).
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
> 
> This looks very nice to visualize what is happening. Beginning of the 
> graph looks little messy but later it is clear.
> 
> >
> > If you zoom out and look at the whole bitmap, you may notice the 4ms
> > intervals of the scheduler.  They are pretty well aligned across all
> > cpus.  Normally, for cpu bound workloads, we would expect to see each
> > thread to run for 4 ms, then something else getting to run, and so on.
> > That is mostly true in this test.  We have 2x over-commit and we
> > generally see the switching of threads at 4ms.  One thing to note is
> > that not all vCPU threads for the same VM run at exactly the same time,
> > and that is expected and the whole reason for lock-holder preemption.
> > Now, if you zoom in on the bitmap, you should notice within the 4ms
> > intervals there is some task switching going on.  This is most likely
> > because of the yield_to initiated by the PLE handler.  In this case
> > there is not that much yielding to do.   It's quite clean, and the
> > performance is quite good.
> >
> > Below is an example of PLE, but this time with 20-way VMs, 8 of them.
> > CPU over-commit is still 2x.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU
> 
> I think this link still 10x16. Could you paste the link again?

Oops
https://docs.google.com/open?id=0B6tfUNlZ-14wSGtYYzZtRTcyVjQ

> 
> >
> > This one looks quite different.  In short, it's a mess.  The switching
> > between tasks can be lower than 10 microseconds.  It basically never
> > recovers.  There is constant yielding all the time.
> >
> > Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
> > scheduling patches.  While I am not recommending gang scheduling, I
> > think it's a good data point.  The performance is 3.88x the PLE result.
> >
> > https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M
> >
> > Note that the task switching intervals of 4ms are quite obvious again,
> > and this time all vCPUs from same VM run at the same time.  It
> > represents the best possible outcome.
> >
> >
> > Anyway, I thought the bitmaps might help better visualize what's going
> > on.
> >
> > -Andrew
> >
> >
> >
> >
> 




Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-10 Thread Andrew Theurer
I ran 'perf sched map' on the dbench workload for medium and large VMs,
and I thought I would share some of the results.  I think it helps to
visualize what's going on regarding the yielding.

These files are png bitmaps, generated from processing output from 'perf
sched map' (and perf data generated from 'perf sched record').  The Y
axis is the host cpus, each row being 10 pixels high.  For these tests,
there are 80 host cpus, so the total height is 800 pixels.  The X axis
is time (in microseconds), with each pixel representing 1 microsecond.
Each bitmap plots 30,000 microseconds.  The bitmaps are quite wide
obviously, and zooming in/out while viewing is recommended.

Each row (each host cpu) is assigned a color based on what thread is
running.  vCPUs of the same VM are assigned a common color (like red,
blue, magenta, etc.), and each vCPU has a unique brightness for that
color.  There are a maximum of 12 assignable colors, so vCPUs of any VMs
beyond the first 12 revert to gray.  I would use more colors, but it
becomes harder to distinguish one color from another.  White represents
missing data from perf, and black represents any thread which is not a
vCPU.
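For anyone who wants to reproduce the rendering, the idea is roughly the
following (a stand-alone sketch, not the exact script I used; the sample[][]
arrays would be filled by parsing the 'perf sched map' output, and the palette
and brightness scaling here are just placeholders for what is described above):

#include <stdio.h>
#include <string.h>

#define CPUS   80          /* host cpus: one 10-pixel-tall row each          */
#define USECS  30000       /* 30,000 microseconds: one pixel per microsecond */
#define ROW_H  10

struct px { unsigned char r, g, b; };

/* sample[c][t]: VM id (0..), -1 = non-vcpu thread, -2 = no data from perf */
static signed char sample[CPUS][USECS];
static unsigned char vcpu_of[CPUS][USECS];   /* vcpu index within that VM */

static const struct px palette[12] = {
        {255,0,0},   {0,0,255},   {255,0,255}, {0,255,0},
        {255,255,0}, {0,255,255}, {255,128,0}, {128,0,255},
        {0,128,255}, {128,255,0}, {255,0,128}, {128,64,0},
};

static struct px color(int c, int t)
{
        signed char vm = sample[c][t];
        struct px p;
        int bright;

        if (vm == -2)
                return (struct px){255, 255, 255};   /* missing perf data: white */
        if (vm == -1)
                return (struct px){0, 0, 0};         /* not a vcpu: black        */
        if (vm > 11)
                return (struct px){128, 128, 128};   /* ran out of colors: gray  */
        p = palette[vm];
        bright = 100 + 155 * (vcpu_of[c][t] % 20) / 20;  /* per-vcpu brightness */
        p.r = p.r * bright / 255;
        p.g = p.g * bright / 255;
        p.b = p.b * bright / 255;
        return p;
}

int main(void)
{
        memset(sample, -2, sizeof(sample));  /* everything "missing" until filled in */

        printf("P6\n%d %d\n255\n", USECS, CPUS * ROW_H);   /* binary PPM header */
        for (int c = 0; c < CPUS; c++)
                for (int y = 0; y < ROW_H; y++)
                        for (int t = 0; t < USECS; t++) {
                                struct px p = color(c, t);
                                fwrite(&p, 3, 1, stdout);
                        }
        return 0;
}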

For the following tests, VMs were pinned to host NUMA nodes and to
specific cpus to help with consistency and operate within the
constraints of the last test (gang scheduler).

Here is a good example of PLE.  These are 10-way VMs, 16 of them (as
described above only 12 of the VMs have a color, rest are gray).

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

If you zoom out and look at the whole bitmap, you may notice the 4ms
intervals of the scheduler.  They are pretty well aligned across all
cpus.  Normally, for cpu bound workloads, we would expect to see each
thread to run for 4 ms, then something else getting to run, and so on.
That is mostly true in this test.  We have 2x over-commit and we
generally see the switching of threads at 4ms.  One thing to note is
that not all vCPU threads for the same VM run at exactly the same time,
and that is expected and the whole reason for lock-holder preemption.
Now, if you zoom in on the bitmap, you should notice within the 4ms
intervals there is some task switching going on.  This is most likely
because of the yield_to initiated by the PLE handler.  In this case
there is not that much yielding to do.   It's quite clean, and the
performance is quite good.

Below is an example of PLE, but this time with 20-way VMs, 8 of them.
CPU over-commit is still 2x.

https://docs.google.com/open?id=0B6tfUNlZ-14wdmFqUmE5QjJHMFU

This one looks quite different.  In short, it's a mess.  The switching
between tasks can be lower than 10 microseconds.  It basically never
recovers.  There is constant yielding all the time.  

Below is again 8 x 20-way VMs, but this time I tried out Nikunj's gang
scheduling patches.  While I am not recommending gang scheduling, I
think it's a good data point.  The performance is 3.88x the PLE result.

https://docs.google.com/open?id=0B6tfUNlZ-14wWXdscWcwNTVEY3M

Note that the task switching intervals of 4ms are quite obvious again,
and this time all vCPUs from same VM run at the same time.  It
represents the best possible outcome.


Anyway, I thought the bitmaps might help better visualize what's going
on.

-Andrew





Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-09 Thread Andrew Theurer
On Wed, 2012-10-10 at 00:21 +0530, Raghavendra K T wrote:
> * Avi Kivity  [2012-10-04 17:00:28]:
> 
> > On 10/04/2012 03:07 PM, Peter Zijlstra wrote:
> > > On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> > >> 
> > >> Again the numbers are ridiculously high for arch_local_irq_restore.
> > >> Maybe there's a bad perf/kvm interaction when we're injecting an
> > >> interrupt, I can't believe we're spending 84% of the time running the
> > >> popf instruction. 
> > > 
> > > Smells like a software fallback that doesn't do NMI, hrtimer based
> > > sampling typically hits popf where we re-enable interrupts.
> > 
> > Good nose, that's probably it.  Raghavendra, can you ensure that the PMU
> > is properly exposed?  'dmesg' in the guest will tell.  If it isn't, -cpu
> > host will expose it (and a good idea anyway to get best performance).
> > 
> 
> Hi Avi, you are right. SandyBridge machine result was not proper.
> I cleaned up the services, enabled PMU, re-ran all the test again.
> 
> Here is the summary:
> We do get good benefit by increasing ple window. Though we don't
> see good benefit for kernbench and sysbench, for ebizzy, we get huge
> improvement for 1x scenario. (almost 2/3rd of ple disabled case).
> 
> Let me know if you think we can increase the default ple_window
> itself to 16k.
> 
> I am experimenting with V2 version of undercommit improvement(this) patch
> series, But I think if you wish  to go for increase of
> default ple_window, then we would have to measure the benefit of patches
> when ple_window = 16k.
> 
> I can respin the whole series including this default ple_window change.
> 
> I also have the perf kvm top result for both ebizzy and kernbench.
> I think they are in expected lines now.
> 
> Improvements
> 
> 
> 16 core PLE machine with 16 vcpu guest
> 
> base = 3.6.0-rc5 + ple handler optimization patches
> base_pleopt_16k = base + ple_window = 16k
> base_pleopt_32k = base + ple_window = 32k
> base_pleopt_nople = base + ple_gap = 0
> kernbench, hackbench, sysbench (time in sec lower is better)
> ebizzy (rec/sec higher is better)
> 
> % improvements w.r.t base (ple_window = 4k)
> ---------------+-----------------+-----------------+-------------------+
>                | base_pleopt_16k | base_pleopt_32k | base_pleopt_nople |
> ---------------+-----------------+-----------------+-------------------+
> kernbench_1x   |     0.42371     |     1.15164     |       0.09320     |
> kernbench_2x   |    -1.40981     |   -17.48282     |    -570.77053     |
> ---------------+-----------------+-----------------+-------------------+
> sysbench_1x    |    -0.92367     |     0.24241     |      -0.27027     |
> sysbench_2x    |    -2.22706     |    -0.30896     |      -1.27573     |
> sysbench_3x    |    -0.75509     |     0.09444     |      -2.97756     |
> ---------------+-----------------+-----------------+-------------------+
> ebizzy_1x      |    54.99976     |    67.29460     |      74.14076     |
> ebizzy_2x      |    -8.83386     |   -27.38403     |     -96.22066     |
> ---------------+-----------------+-----------------+-------------------+
> 
> perf kvm top observation for kernbench and ebizzy (nople, 4k, 32k window) 
> 

Is the perf data for 1x overcommit?

> pleopt   ple_gap=0
> 
> ebizzy : 18131 records/s
> 63.78%  [guest.kernel]  [g] _raw_spin_lock_irqsave
> 5.65%  [guest.kernel]  [g] smp_call_function_many
> 3.12%  [guest.kernel]  [g] clear_page
> 3.02%  [guest.kernel]  [g] down_read_trylock
> 1.85%  [guest.kernel]  [g] async_page_fault
> 1.81%  [guest.kernel]  [g] up_read
> 1.76%  [guest.kernel]  [g] native_apic_mem_write
> 1.70%  [guest.kernel]  [g] find_vma

Does 'perf kvm top' not give host samples at the same time?  Would be
nice to see the host overhead as a function of varying ple window.  I
would expect that to be the major difference between 4/16/32k window
sizes.

A big concern I have (if this is 1x overcommit) for ebizzy is that it
has just terrible scalability to begin with.  I do not think we should
try to optimize such a bad workload.

> kernbench :Elapsed Time 29.4933 (27.6007)
>5.72%  [guest.kernel]  [g] async_page_fault
> 3.48%  [guest.kernel]  [g] pvclock_clocksource_read
> 2.68%  [guest.kernel]  [g] copy_user_generic_unrolled
> 2.58%  [guest.kernel]  [g] clear_page
> 2.09%  [guest.kernel]  [g] page_cache_get_speculative
> 2.00%  [guest.kernel]  [g] do_raw_spin_lock
> 1.78%  [guest.kernel]  [g] unmap_single_vma
> 1.74%  [guest.kernel]  [g] kmem_cache_alloc

> 
> pleopt ple_window = 4k
> ---
> ebizzy: 10176 records/s
>69.17%  [guest.kernel]  [g] _raw_spin_lock_irqsave
> 3.34%  [guest.kernel]  [g] clear_page
> 2.16%  [guest.kernel]  [g] down_read_trylock
> 1.94%  [guest.kernel]  [g] async_page_fault
> 1.89%  [guest.kernel]  [g] native_apic_mem_write
> 1.63%  [guest.kernel]  [g] smp_cal

Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler

2012-10-04 Thread Andrew Theurer
On Thu, 2012-10-04 at 14:41 +0200, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
> > On 10/03/2012 10:35 PM, Avi Kivity wrote:
> >> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>  So I think it's worth trying again with ple_window of 2-4.
> 
> >>>
> >>> Hi Avi,
> >>>
> >>> I ran different benchmarks increasing ple_window, and results does not
> >>> seem to be encouraging for increasing ple_window.
> >>
> >> Thanks for testing! Comments below.
> >>
> >>> Results:
> >>> 16 core PLE machine with 16 vcpu guest.
> >>>
> >>> base kernel = 3.6-rc5 + ple handler optimization patch
> >>> base_pleopt_8k = base kernel + ple window = 8k
> >>> base_pleopt_16k = base kernel + ple window = 16k
> >>> base_pleopt_32k = base kernel + ple window = 32k
> >>>
> >>>
> >>> Percentage improvements of benchmarks w.r.t base_pleopt with
> >>> ple_window = 4096
> >>>
> >>>                 base_pleopt_8k   base_pleopt_16k   base_pleopt_32k
> >>> -----------------------------------------------------------------
> >>>
> >>> kernbench_1x      -5.54915         -15.94529         -44.31562
> >>> kernbench_2x      -7.89399         -17.75039         -37.73498
> >>
> >> So, 44% degradation even with no overcommit?  That's surprising.
> > 
> > Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is it
> > spending 8 times the original ple_window cycles for 16 vcpus
> > significant?
> 
> A PLE exit when not overcommitted cannot do any good, it is better to
> spin in the guest rather that look for candidates on the host.  In fact
> when we benchmark we often disable PLE completely.

Agreed.  However, I really do not understand why the kernbench regressed
with bigger ple_window.  It should stay the same or improve.  Raghu, do
you have perf data for the kernbench runs?
> 
> > 
> >>
> >>> I also got perf top output to analyse the difference. Difference comes
> >>> because of flushtlb (and also spinlock).
> >>
> >> That's in the guest, yes?
> > 
> > Yes. Perf is in guest.
> > 
> >>
> >>>
> >>> Ebizzy run for 4k ple_window
> >>> -  87.20%  [kernel]  [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>>- 100.00% _raw_spin_unlock_irqrestore
> >>>   + 52.89% release_pages
> >>>   + 47.10% pagevec_lru_move_fn
> >>> -   5.71%  [kernel]  [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>>+ 86.03% default_send_IPI_mask_allbutself_phys
> >>>+ 13.96% default_send_IPI_mask_sequence_phys
> >>> -   3.10%  [kernel]  [k] smp_call_function_many
> >>>   smp_call_function_many
> >>>
> >>>
> >>> Ebizzy run for 32k ple_window
> >>>
> >>> -  91.40%  [kernel]  [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>>- 100.00% _raw_spin_unlock_irqrestore
> >>>   + 53.13% release_pages
> >>>   + 46.86% pagevec_lru_move_fn
> >>> -   4.38%  [kernel]  [k] smp_call_function_many
> >>>   smp_call_function_many
> >>> -   2.51%  [kernel]  [k] arch_local_irq_restore
> >>> - arch_local_irq_restore
> >>>+ 90.76% default_send_IPI_mask_allbutself_phys
> >>>+ 9.24% default_send_IPI_mask_sequence_phys
> >>>
> >>
> >> Both the 4k and the 32k results are crazy.  Why is
> >> arch_local_irq_restore() so prominent?  Do you have a very high
> >> interrupt rate in the guest?
> > 
> > How to measure if I have high interrupt rate in guest?
> > From /proc/interrupt numbers I am not able to judge :(
> 
> 'vmstat 1'
> 
> > 
> > I went back and got the results on a 32 core machine with 32 vcpu guest.
> > Strangely, I got result supporting the claim that increasing ple_window
> > helps for non-overcommitted scenario.
> > 
> > 32 core 32 vcpu guest 1x scenarios.
> > 
> > ple_gap = 0
> > kernbench: Elapsed Time 38.61
> > ebizzy: 7463 records/s
> > 
> > ple_window = 4k
> > kernbench: Elapsed Time 43.5067
> > ebizzy:2528 records/s
> > 
> > ple_window = 32k
> > kernebench : Elapsed Time 39.4133
> > ebizzy: 7196 records/s
> 
> So maybe something was wrong with the first measurement.

OK, this is more in line with what I expected for kernbench.  FWIW, in
order to show an improvement for a larger ple_window, we really need a
workload which we know has a longer lock holding time (without factoring
in LHP).  We have noticed this on IO based locks mostly.  We saw it with
a massive disk IO test (qla2xxx lock), and also with a large web serving
test (some vfs related lock, but I forget what exactly it was).
> 
> > 
> > 
> > perf top for ebizzy for above:
> > ple_gap = 0
> > -  84.74%  [kernel]  [k] arch_local_irq_restore
> >- arch_local_irq_restore
> >   - 100.00% _raw_spin_unlock_irqrestore
> >  + 50.96% release_pages
> >  + 49.02% pagevec_lru_move_fn
> > -   6.57%  [kernel]  [k] arch_local_irq_restore
> >- arch_local_irq_restore
> >   + 92.54% default_send_IPI_mask_allbutself_phys
> >   + 7.46% default_send_IPI_mask_sequence_phys
> > -   1.54%  [kernel]  [k] smp_call_function_many
> >  smp_call_function_man

Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-28 Thread Andrew Theurer
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote:
> On 09/27/2012 05:33 PM, Avi Kivity wrote:
> > On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>>
> >>> This gives us a good case for tracking preemption on a per-vm basis.  As
> >>> long as we aren't preempted, we can keep the PLE window high, and also
> >>> return immediately from the handler without looking for candidates.
> >>
> >> 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> >> long back)  is also another thing worth trying, so we reduce the chance
> >> of LHP.
> >
> > Yes, we have to keep it in mind.  It will be useful for fine grained
> > locks, not so much so coarse locks or IPIs.
> >
> 
> Agree.
> 
> > I would still of course prefer a PLE solution, but if we can't get it to
> > work we can consider preemption deferral.
> >
> 
> Okay.
> 
> >>
> >> IIRC, with defer preemption :
> >> we will have hook in spinlock/unlock path to measure depth of lock held,
> >> and shared with host scheduler (may be via MSRs now).
> >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> >> give say one chance.
> >
> > A downside is that we have to do that even when undercommitted.

Hopefully vcpu preemption is very rare when undercommitted, so it should
not happen much at all.
> >
> > Also there may be a lot of false positives (deferred preemptions even
> > when there is no contention).

It will be interesting to see how this behaves with very high lock
activity in a guest.  Once the scheduler defers preemption, is it for a
fixed amount of time, or does it know to cut the deferral short as soon
as the lock depth is reduced [by x]?
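To make the question concrete, the sort of hook I am picturing looks roughly
like this (purely an illustration; the names are made up, and the real counter
would live in the guest's spinlock path and be shared with the host scheduler,
e.g. via an MSR or a shared page):

#include <stdio.h>

static __thread int lock_depth;        /* bumped by the guest's lock/unlock path */

static void guest_spin_lock(void)   { lock_depth++; /* ... acquire ... */ }
static void guest_spin_unlock(void) { /* ... release ... */ lock_depth--; }

/* Host-side check at preemption time: defer at most 'budget' times,
 * and stop deferring as soon as the depth drops back to zero. */
static int host_should_defer(int *budget)
{
        if (lock_depth > 0 && (*budget)-- > 0)
                return 1;    /* lock(s) still held: give it one more chance */
        return 0;            /* no locks held (or budget spent): preempt now */
}

int main(void)
{
        int budget = 1;

        guest_spin_lock();
        printf("defer? %d\n", host_should_defer(&budget));  /* 1: lock held       */
        guest_spin_unlock();
        printf("defer? %d\n", host_should_defer(&budget));  /* 0: safe to preempt */
        return 0;
}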
> 
> Yes. That is a worry.
> 
> >
> >>
> >> 2) looking at the result (comparing A & C) , I do feel we have
> >> significant in iterating over vcpus (when compared to even vmexit)
> >> so We still would need undercommit fix sugested by PeterZ (improving by
> >> 140%). ?
> >
> > Looking only at the current runqueue?  My worry is that it misses a lot
> > of cases.  Maybe try the current runqueue first and then others.
> >
> > Or were you referring to something else?
> 
> No. I was referring to the same thing.
> 
> However. I had tried following also (which works well to check 
> undercommited scenario). But thinking to use only for yielding in case
> of overcommit (yield in overcommit suggested by Rik) and keep 
> undercommit patch as suggested by PeterZ
> 
> [ patch is not in proper diff I suppose ].
> 
> Will test them.
> 
> Peter, Can I post your patch with your from/sob.. in V2?
> Please let me know..
> 
> ---
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 28f00bc..9ed3759 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct 
> kvm_vcpu *vcpu)
>   return eligible;
>   }
>   #endif
> +
> +bool kvm_overcommitted()
> +{
> + unsigned long load;
> +
> + load = avenrun[0] + FIXED_1/200;
> + load = load >> FSHIFT;
> + load = (load << 7) / num_online_cpus();
> +
> + if (load > 128)
> + return true;
> +
> + return false;
> +}
> +
>   void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   {
>   struct kvm *kvm = me->kvm;
> @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
>   int pass;
>   int i;
> 
> + if (!kvm_overcommitted())
> + return;
> +
>   kvm_vcpu_set_in_spin_loop(me, true);
>   /*
>* We boost the priority of a VCPU that is runnable but not
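For reference, the fixed-point check in kvm_overcommitted() above boils down
to "the integer part of the 1-minute load average is greater than the number
of online cpus".  A stand-alone illustration (assuming the kernel's FSHIFT=11 /
FIXED_1 encoding of avenrun[]; the sample values are made up):

#include <stdio.h>

#define FSHIFT  11
#define FIXED_1 (1 << FSHIFT)

static int overcommitted(unsigned long avenrun0, int online_cpus)
{
        unsigned long load = avenrun0 + FIXED_1 / 200;  /* round like /proc/loadavg */

        load >>= FSHIFT;                  /* integer part of the load average      */
        load = (load << 7) / online_cpus; /* scale so 128 == one runnable task/cpu */
        return load > 128;
}

int main(void)
{
        /* 16 online cpus: load 12.0 is not overcommitted, load 24.0 is */
        printf("%d\n", overcommitted(12 * FIXED_1, 16));   /* 0 */
        printf("%d\n", overcommitted(24 * FIXED_1, 16));   /* 1 */
        return 0;
}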




Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler

2012-09-27 Thread Andrew Theurer
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote:
> On 09/27/2012 01:23 PM, Raghavendra K T wrote:
> >>
> >> This gives us a good case for tracking preemption on a per-vm basis.  As
> >> long as we aren't preempted, we can keep the PLE window high, and also
> >> return immediately from the handler without looking for candidates.
> > 
> > 1) So do you think, deferring preemption patch ( Vatsa was mentioning
> > long back)  is also another thing worth trying, so we reduce the chance
> > of LHP.
> 
> Yes, we have to keep it in mind.  It will be useful for fine grained
> locks, not so much so coarse locks or IPIs.
> 
> I would still of course prefer a PLE solution, but if we can't get it to
> work we can consider preemption deferral.
> 
> > 
> > IIRC, with defer preemption :
> > we will have hook in spinlock/unlock path to measure depth of lock held,
> > and shared with host scheduler (may be via MSRs now).
> > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather
> > give say one chance.
> 
> A downside is that we have to do that even when undercommitted.
> 
> Also there may be a lot of false positives (deferred preemptions even
> when there is no contention).
> 
> > 
> > 2) looking at the result (comparing A & C) , I do feel we have
> > significant in iterating over vcpus (when compared to even vmexit)
> > so We still would need undercommit fix sugested by PeterZ (improving by
> > 140%). ?
> 
> Looking only at the current runqueue?  My worry is that it misses a lot
> of cases.  Maybe try the current runqueue first and then others.
> 
> Or were you referring to something else?
> 
> > 
> > So looking back at threads/ discussions so far, I am trying to
> > summarize, the discussions so far. I feel, at least here are the few
> > potential candidates to go in:
> > 
> > 1) Avoiding double runqueue lock overhead  (Andrew Theurer/ PeterZ)
> > 2) Dynamically changing PLE window (Avi/Andrew/Chegu)
> > 3) preempt_notify handler to identify preempted VCPUs (Avi)
> > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ)
> > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik)
> > 6) Pv spinlock
> > 7) Jiannan's proposed improvements
> > 8) Defer preemption patches
> > 
> > Did we miss anything (or added extra?)
> > 
> > So here are my action items:
> > - I plan to repost this series with what PeterZ, Rik suggested with
> > performance analysis.
> > - I ll go back and explore on (3) and (6) ..
> > 
> > Please Let me know..
> 
> Undoubtedly we'll think of more stuff.  But this looks like a good start.

9) lazy gang-like scheduling with PLE to cover the non-gang-like
exceptions  (/me runs and hides from scheduler folks)

-Andrew Theurer



Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-17 Thread Andrew Theurer
On Sun, 2012-09-16 at 11:55 +0300, Avi Kivity wrote:
> On 09/14/2012 12:30 AM, Andrew Theurer wrote:
> 
> > The concern I have is that even though we have gone through changes to
> > help reduce the candidate vcpus we yield to, we still have a very poor
> > idea of which vcpu really needs to run.  The result is high cpu usage in
> > the get_pid_task and still some contention in the double runqueue lock.
> > To make this scalable, we either need to significantly reduce the
> > occurrence of the lock-holder preemption, or do a much better job of
> > knowing which vcpu needs to run (and not unnecessarily yielding to vcpus
> > which do not need to run).
> > 
> > On reducing the occurrence:  The worst case for lock-holder preemption
> > is having vcpus of same VM on the same runqueue.  This guarantees the
> > situation of 1 vcpu running while another [of the same VM] is not.  To
> > prove the point, I ran the same test, but with vcpus restricted to a
> > range of host cpus, such that any single VM's vcpus can never be on the
> > same runqueue.  In this case, all 10 VMs' vcpu-0's are on host cpus 0-4,
> > vcpu-1's are on host cpus 5-9, and so on.  Here is the result:
> > 
> > kvm_cpu_spin, and all
> > yield_to changes, plus
> > restricted vcpu placement:  8823 +/- 3.20%   much, much better
> > 
> > On picking a better vcpu to yield to:  I really hesitate to rely on
> > paravirt hint [telling us which vcpu is holding a lock], but I am not
> > sure how else to reduce the candidate vcpus to yield to.  I suspect we
> > are yielding to way more vcpus than are preempted lock-holders, and that
> > IMO is just work accomplishing nothing.  Trying to think of a way to
> > further reduce candidate vcpus
> 
> I wouldn't say that yielding to the "wrong" vcpu accomplishes nothing.
> That other vcpu gets work done (unless it is in pause loop itself) and
> the yielding vcpu gets put to sleep for a while, so it doesn't spend
> cycles spinning.  While we haven't fixed the problem at least the guest
> is accomplishing work, and meanwhile the real lock holder may get
> naturally scheduled and clear the lock.

OK, yes, if the other thread gets useful work done, then it is not
wasteful.  I was thinking of the worst case scenario, where any other
vcpu would likely spin as well, and the host side cpu-time for switching
vcpu threads was not all that productive.  Well, I suppose it does help
eliminate potential lock holding vcpus; it just seems to be not that
efficient or fast enough.

> The main problem with this theory is that the experiments don't seem to
> bear it out.

Granted, my test case is quite brutal.  It's nothing but over-committed
VMs which always have some spin lock activity.  However, we really
should try to fix the worst case scenario.

>   So maybe one of the assumptions is wrong - the yielding
> vcpu gets scheduled early.  That could be the case if the two vcpus are
> on different runqueues - you could be changing the relative priority of
> vcpus on the target runqueue, but still remain on top yourself.  Is this
> possible with the current code?
> 
> Maybe we should prefer vcpus on the same runqueue as yield_to targets,
> and only fall back to remote vcpus when we see it didn't help.
> 
> Let's examine a few cases:
> 
> 1. spinner on cpu 0, lock holder on cpu 0
> 
> win!
> 
> 2. spinner on cpu 0, random vcpu(s) (or normal processes) on cpu 0
> 
> Spinner gets put to sleep, random vcpus get to work, low lock contention
> (no double_rq_lock), by the time spinner gets scheduled we might have won
> 
> 3. spinner on cpu 0, another spinner on cpu 0
> 
> Worst case, we'll just spin some more.  Need to detect this case and
> migrate something in.

Well, we can certainly experiment and see what we get.

IMO, the key to getting this working really well on the large VMs is
finding the lock-holding cpu -quickly-.  What I think is happening is
that we go through a relatively long process to get to that one right
vcpu.  I guess I need to find a faster way to get there.

> 4. spinner on cpu 0, alone
> 
> Similar
> 
> 
> It seems we need to tie in to the load balancer.
> 
> Would changing the priority of the task while it is spinning help the
> load balancer?

Not sure.

-Andrew








Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-13 Thread Andrew Theurer
On Thu, 2012-09-13 at 17:18 +0530, Raghavendra K T wrote:
> * Andrew Theurer  [2012-09-11 13:27:41]:
> 
> > On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> > > On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > > > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> > > >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct 
> > > >>>> task_struct *p)
> > > >>>> +{
> > > >>>> + if (!curr->sched_class->yield_to_task)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + if (curr->sched_class != p->sched_class)
> > > >>>> + return false;
> > > >>>
> > > >>>
> > > >>> Peter,
> > > >>>
> > > >>> Should we also add a check if the runq has a skip buddy (as pointed 
> > > >>> out
> > > >>> by Raghu) and return if the skip buddy is already set.
> > > >>
> > > >> Oh right, I missed that suggestion.. the performance improvement went
> > > >> from 81% to 139% using this, right?
> > > >>
> > > >> It might make more sense to keep that separate, outside of this
> > > >> function, since its not a strict prerequisite.
> > > >>
> > > >>>>
> > > >>>> + if (task_running(p_rq, p) || p->state)
> > > >>>> + return false;
> > > >>>> +
> > > >>>> + return true;
> > > >>>> +}
> > > >>
> > > >>
> > > >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > > >>> bool preempt)
> > > >>>>rq = this_rq();
> > > >>>>
> > > >>>>   again:
> > > >>>> + /* optimistic test to avoid taking locks */
> > > >>>> + if (!__yield_to_candidate(curr, p))
> > > >>>> + goto out_irq;
> > > >>>> +
> > > >>
> > > >> So add something like:
> > > >>
> > > >>/* Optimistic, if we 'raced' with another yield_to(), don't 
> > > >> bother */
> > > >>if (p_rq->cfs_rq->skip)
> > > >>goto out_irq;
> > > >>>
> > > >>>
> > > >>>>p_rq = task_rq(p);
> > > >>>>double_rq_lock(rq, p_rq);
> > > >>>
> > > >>>
> > > >> But I do have a question on this optimization though,.. Why do we check
> > > >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> > > >>
> > > >> That is, I'd like to see this thing explained a little better.
> > > >>
> > > >> Does it go something like: p_rq is the runqueue of the task we'd like 
> > > >> to
> > > >> yield to, rq is our own, they might be the same. If we have a ->skip,
> > > >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> > > >> failing the yield_to() simply means us picking the next VCPU thread,
> > > >> which might be running on an entirely different cpu (rq) and could
> > > >> succeed?
> > > >
> > > > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > > > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > > > skip check.  Raghu, I am not sure if this is exactly what you want
> > > > implemented in v4.
> > > >
> > > 
> > > Andrew, yes, that is what I had. I think there was a misunderstanding. 
> > > My intention was that if a directed_yield happened in a runqueue 
> > > (say rqA), do not bother to directed-yield to that one again. But unfortunately, as 
> > > PeterZ pointed out, that would have resulted in setting the next buddy of a 
> > > different run queue than rqA.
> > > So we can drop this "skip" idea. Pondering more over what to do... can we 
> > > use the next buddy itself? ... thinking..
> > 
> > As I mentioned earlier today, I did not have your changes from kvm.git
> > tree when I tested my changes.  Here are yo

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-11 Thread Andrew Theurer
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct 
> >>>> task_struct *p)
> >>>> +{
> >>>> + if (!curr->sched_class->yield_to_task)
> >>>> + return false;
> >>>> +
> >>>> + if (curr->sched_class != p->sched_class)
> >>>> + return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> + if (task_running(p_rq, p) || p->state)
> >>>> + return false;
> >>>> +
> >>>> + return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> >>> bool preempt)
> >>>>rq = this_rq();
> >>>>
> >>>>   again:
> >>>> + /* optimistic test to avoid taking locks */
> >>>> + if (!__yield_to_candidate(curr, p))
> >>>> + goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >>/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >>if (p_rq->cfs_rq->skip)
> >>goto out_irq;
> >>>
> >>>
> >>>>p_rq = task_rq(p);
> >>>>double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
> 
> Andrew, yes, that is what I had. I think there was a misunderstanding. 
> My intention was that if a directed_yield happened in a runqueue 
> (say rqA), do not bother to directed-yield to that one again. But unfortunately, as 
> PeterZ pointed out, that would have resulted in setting the next buddy of a 
> different run queue than rqA.
> So we can drop this "skip" idea. Pondering more over what to do... can we 
> use the next buddy itself? ... thinking..

As I mentioned earlier today, I did not have your changes from kvm.git
tree when I tested my changes.  Here are your changes and my changes
compared:

  throughput in MB/sec

kvm_vcpu_on_spin changes:  4636 +/- 15.74%
yield_to changes:  4515 +/- 12.73%

I would be inclined to stick with your changes which are kept in kvm
code.  I did try both combined, and did not get good results:

both changes:  4074 +/- 19.12%

So, having both is probably not a good idea.  However, I feel like
there's more work to be done.  With no over-commit (10 VMs), total
throughput is 23427 +/- 2.76%.  A 2x over-commit will no doubt have some
overhead, but a reduction to ~4500 is still terrible.  By contrast,
8-way VMs with 2x over-commit have a total throughput roughly 10% less
than 8-way VMs with no overcommit (20 vs 10 8-way VMs on 80 cpu-thread
host).  We still have what appear to be scalability problems, but now
it's not so much in the runqueue locks for yield_to() as in
get_pid_task():

perf on host:

32.10% 320131 qemu-system-x86 [kernel.kallsyms] [k] get_pid_task
11.60% 115686 qemu-s

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-11 Thread Andrew Theurer
On Tue, 2012-09-11 at 11:38 +0530, Raghavendra K T wrote:
> On 09/11/2012 01:42 AM, Andrew Theurer wrote:
> > On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> >> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> >>>> +static bool __yield_to_candidate(struct task_struct *curr, struct 
> >>>> task_struct *p)
> >>>> +{
> >>>> + if (!curr->sched_class->yield_to_task)
> >>>> + return false;
> >>>> +
> >>>> + if (curr->sched_class != p->sched_class)
> >>>> + return false;
> >>>
> >>>
> >>> Peter,
> >>>
> >>> Should we also add a check if the runq has a skip buddy (as pointed out
> >>> by Raghu) and return if the skip buddy is already set.
> >>
> >> Oh right, I missed that suggestion.. the performance improvement went
> >> from 81% to 139% using this, right?
> >>
> >> It might make more sense to keep that separate, outside of this
> >> function, since its not a strict prerequisite.
> >>
> >>>>
> >>>> + if (task_running(p_rq, p) || p->state)
> >>>> + return false;
> >>>> +
> >>>> + return true;
> >>>> +}
> >>
> >>
> >>>> @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> >>> bool preempt)
> >>>>rq = this_rq();
> >>>>
> >>>>   again:
> >>>> + /* optimistic test to avoid taking locks */
> >>>> + if (!__yield_to_candidate(curr, p))
> >>>> + goto out_irq;
> >>>> +
> >>
> >> So add something like:
> >>
> >>/* Optimistic, if we 'raced' with another yield_to(), don't bother */
> >>if (p_rq->cfs_rq->skip)
> >>goto out_irq;
> >>>
> >>>
> >>>>p_rq = task_rq(p);
> >>>>double_rq_lock(rq, p_rq);
> >>>
> >>>
> >> But I do have a question on this optimization though,.. Why do we check
> >> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> >>
> >> That is, I'd like to see this thing explained a little better.
> >>
> >> Does it go something like: p_rq is the runqueue of the task we'd like to
> >> yield to, rq is our own, they might be the same. If we have a ->skip,
> >> there's nothing we can do about it, OTOH p_rq having a ->skip and
> >> failing the yield_to() simply means us picking the next VCPU thread,
> >> which might be running on an entirely different cpu (rq) and could
> >> succeed?
> >
> > Here's two new versions, both include a __yield_to_candidate(): "v3"
> > uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
> > skip check.  Raghu, I am not sure if this is exactly what you want
> > implemented in v4.
> >
> 
> Andrew, yes, that is what I had. I think there was a misunderstanding. 
> My intention was that if a directed_yield happened in a runqueue 
> (say rqA), do not bother to directed-yield to that one again. But unfortunately, as 
> PeterZ pointed out, that would have resulted in setting the next buddy of a 
> different run queue than rqA.
> So we can drop this "skip" idea. Pondering more over what to do... can we 
> use the next buddy itself? ... thinking..

FYI, I regretfully forgot to include your recent changes to
kvm_vcpu_on_spin in my tests (found in kvm.git/next branch), so I am
going to get some results for that before I experiment any more on
3.6-rc.

-Andrew




Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Mon, 2012-09-10 at 19:12 +0200, Peter Zijlstra wrote:
> On Mon, 2012-09-10 at 22:26 +0530, Srikar Dronamraju wrote:
> > > +static bool __yield_to_candidate(struct task_struct *curr, struct 
> > > task_struct *p)
> > > +{
> > > + if (!curr->sched_class->yield_to_task)
> > > + return false;
> > > +
> > > + if (curr->sched_class != p->sched_class)
> > > + return false;
> > 
> > 
> > Peter, 
> > 
> > Should we also add a check if the runq has a skip buddy (as pointed out
> > by Raghu) and return if the skip buddy is already set. 
> 
> Oh right, I missed that suggestion.. the performance improvement went
> from 81% to 139% using this, right?
> 
> It might make more sense to keep that separate, outside of this
> function, since its not a strict prerequisite.
> 
> > > 
> > > + if (task_running(p_rq, p) || p->state)
> > > + return false;
> > > +
> > > + return true;
> > > +} 
> 
> 
> > > @@ -4323,6 +4340,10 @@ bool __sched yield_to(struct task_struct *p,
> > bool preempt)
> > >   rq = this_rq();
> > >  
> > >  again:
> > > + /* optimistic test to avoid taking locks */
> > > + if (!__yield_to_candidate(curr, p))
> > > + goto out_irq;
> > > +
> 
> So add something like:
> 
>   /* Optimistic, if we 'raced' with another yield_to(), don't bother */
>   if (p_rq->cfs_rq->skip)
>   goto out_irq;
> > 
> > 
> > >   p_rq = task_rq(p);
> > >   double_rq_lock(rq, p_rq);
> > 
> > 
> But I do have a question on this optimization though,.. Why do we check
> p_rq->cfs_rq->skip and not rq->cfs_rq->skip ?
> 
> That is, I'd like to see this thing explained a little better.
> 
> Does it go something like: p_rq is the runqueue of the task we'd like to
> yield to, rq is our own, they might be the same. If we have a ->skip,
> there's nothing we can do about it, OTOH p_rq having a ->skip and
> failing the yield_to() simply means us picking the next VCPU thread,
> which might be running on an entirely different cpu (rq) and could
> succeed?

Here's two new versions, both include a __yield_to_candidate(): "v3"
uses the check for p_rq->curr in guest mode, and "v4" uses the cfs_rq
skip check.  Raghu, I am not sure if this is exactly what you want
implemented in v4.

Results:
> ple on:   2552 +/- .70%
> ple on: w/fixv1:  4621 +/- 2.12%  (81% improvement)
> ple on: w/fixv2:  6115   (139% improvement)
   v3:  5735   (124% improvement)
   v4:  4524   (  3% regression)

Both patches included below

-Andrew

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..0d98a67 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candidate(struct task_struct *curr, struct task_struct 
*p, struct rq *p_rq)
+{
+   if (!curr->sched_class->yield_to_task)
+   return false;
+
+   if (curr->sched_class != p->sched_class)
+   return false;
+
+   if (task_running(p_rq, p) || p->state)
+   return false;
+
+   return true;
+}
+
 /**
  * yield_to - yield the current processor to another thread in
  * your thread group, or accelerate that thread toward the
@@ -4844,20 +4861,27 @@ bool __sched yield_to(struct task_struct *p, bool 
preempt)
 
 again:
p_rq = task_rq(p);
+
+   /* optimistic test to avoid taking locks */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_irq;
+
+   /*
+* if the target task is not running, then only yield if the
+* current task is in guest mode
+*/
+   if (!(p_rq->curr->flags & PF_VCPU))
+   goto out_irq;
+
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
goto again;
}
 
-   if (!curr->sched_class->yield_to_task)
-   goto out;
-
-   if (curr->sched_class != p->sched_class)
-   goto out;
-
-   if (task_running(p_rq, p) || p->state)
-   goto out;
+   /* validate state, holding p_rq ensures p's state cannot change */
+   if (!__yield_to_candidate(curr, p, p_rq))
+   goto out_unlock;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4877,8 +4901,9 @@ again:
rq->skip_clock_update = 0;
}
 
-out:
+out_unlock:
double_rq_unlock(rq, p_rq);
+out_irq:
local_irq_restore(flags);
 
if (yielded)




v4:


diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..2bec2ed 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4820,6 +4820,23 @@ void __sched yield(void)
 }
 EXPORT_SYMBOL(yield);
 
+/*
+ * Tests preconditions required for sched_class::yield_to().
+ */
+static bool __yield_to_candida

Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-10 Thread Andrew Theurer
On Sat, 2012-09-08 at 14:13 +0530, Srikar Dronamraju wrote:
> > 
> > signed-off-by: Andrew Theurer 
> > 
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index fbf1fd0..c767915 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
> > preempt)
> > 
> >  again:
> > p_rq = task_rq(p);
> > +   if (task_running(p_rq, p) || p->state || !(p_rq->curr->flags &
> > PF_VCPU)) {
> > +   goto out_no_unlock;
> > +   }
> > double_rq_lock(rq, p_rq);
> > while (task_rq(p) != p_rq) {
> > double_rq_unlock(rq, p_rq);
> > @@ -4856,8 +4859,6 @@ again:
> > if (curr->sched_class != p->sched_class)
> > goto out;
> > 
> > -   if (task_running(p_rq, p) || p->state)
> > -   goto out;
> 
> Is it possible that by this time the current thread takes double rq
> lock, thread p could actually be running?  i.e. is there merit to keep
> this check around even with your similar check above?

I think that's a good idea.  I'll add that back in.
> 
> > 
> > yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> > if (yielded) {
> > @@ -4879,6 +4880,7 @@ again:
> > 
> >  out:
> > double_rq_unlock(rq, p_rq);
> > +out_no_unlock:
> > local_irq_restore(flags);
> > 
> > if (yielded)
> > 
> > 
> 

-Andrew Theurer




Re: [RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-07 Thread Andrew Theurer
On Fri, 2012-09-07 at 23:36 +0530, Raghavendra K T wrote:
> CCing PeterZ also.
> 
> On 09/07/2012 06:41 PM, Andrew Theurer wrote:
> > I have noticed recently that PLE/yield_to() is still not that scalable
> > for really large guests, sometimes even with no CPU over-commit.  I have
> > a small change that makes a very big difference.
> >
> > First, let me explain what I saw:
> >
> > Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
> > thread Westmere-EX system:  645 seconds!
> >
> > Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
> > runqueue lock for yield_to()
> >
> > So, I added some schedstats to yield_to(), one to count when we failed
> > this test in yield_to()
> >
> >  if (task_running(p_rq, p) || p->state)
> >
> > and one when we pass all the conditions and get to actually yield:
> >
> >   yielded = curr->sched_class->yield_to_task(rq, p, preempt);
> >
> >
> > And during boot up of this guest, I saw:
> >
> >
> > failed yield_to() because task is running: 8368810426
> > successful yield_to(): 13077658
> >0.156022% of yield_to calls
> >1 out of 640 yield_to calls
> >
> > Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
> > each one trying to get two locks.  This is happening on all [but one]
> > vcpus at around the same time.  Not going to work well.
> >
> 
> True and interesting. I had once thought of reducing overall O(n^2)
> iteration to O(n log(n)) iterations by reducing number of candidates
> to search to O(log(n)) instead of the current O(n). Maybe I have to get 
> back to my experiment mode.
> 
> > So, since the check for a running task is nearly always true, I moved
> > that -before- the double runqueue lock, so 99.84% of the attempts do not
> > take the locks.  Now, I do not know if this [not getting the locks] is a
> > problem.  However, I'd rather have a slightly inaccurate test for a
> > running vcpu than burn 98% of CPU in the host kernel.  With the change
> > the VM boot time went to:  100 seconds, an 85% reduction in time.
> >
> > I also wanted to check to see this did not affect truly over-committed
> > situations, so I first started with smaller VMs at 2x cpu over-commit:
> >
> > 16 VMs, 8-way each, all running dbench (2x cpu over-commit)
> > throughput +/- stddev
> > - -
> > ple off:2281 +/- 7.32%  (really bad as expected)
> > ple on:19796 +/- 1.36%
> > ple on: w/fix: 19796 +/- 1.37%  (no degrade at all)
> >
> > In this case the VMs are small enough, that we do not loop through
> > enough vcpus to trigger the problem.  host CPU is very low (3-4% range)
> > for both default ple and with yield_to() fix.
> >
> > So I went on to a bigger VM:
> >
> > 10 VMs, 16-way each, all running dbench (2x cpu over-commit)
> > throughput +/- stddev
> > - -
> > ple on: 2552 +/- .70%
> > ple on: w/fix:  4621 +/- 2.12%  (81% improvement!)
> >
> > This is where we start seeing a major difference.  Without the fix, host
> > cpu was around 70%, mostly in spin_lock.  That was reduced to 60% (and
> > guest went from 30 to 40%).  I believe this is on the right track to
> > reduce the spin lock contention, still get proper directed yield, and
> > therefore improve the guest CPU available and its performance.
> >
> > However, we still have lock contention, and I think we can reduce it
> > even more.  We have eliminated some attempts at double runqueue lock
> > acquire because the check for whether the target vcpu is running is now before
> > the lock.  However, even if the target-to-yield-to vcpu [for the same
> > guest upon we PLE exited] is not running, the physical
> > processor/runqueue that target-to-yield-to vcpu is located on could be
> > running a different VM's vcpu -and- going through a directed yield,
> > therefore that run queue lock may already be acquired.  We do not want to
> > just spin and wait, we want to move to the next candidate vcpu.  We need
> > a check to see if the smp processor/runqueue is already in a directed
> > yield.  Or, perhaps we just check if that cpu is not in guest mode, and
> > if so, we skip that yield attempt for that vcpu and move to the next
> > candidate vcpu.  So, my question is:  given a runqueue, what's the best
> > way to check if that corresponding phys cpu is not in guest mode?
> >
> 
> We are indeed avoid

[RFC][PATCH] Improving directed yield scalability for PLE handler

2012-09-07 Thread Andrew Theurer
I have noticed recently that PLE/yield_to() is still not that scalable
for really large guests, sometimes even with no CPU over-commit.  I have
a small change that makes a very big difference.

First, let me explain what I saw:

Time to boot a 3.6-rc kernel in an 80-way VM on a 4 socket, 40 core, 80
thread Westmere-EX system:  645 seconds!

Host cpu: ~98% in kernel, nearly all of it in spin_lock from double
runqueue lock for yield_to()

So, I added some schedstats to yield_to(), one to count when we failed
this test in yield_to()

if (task_running(p_rq, p) || p->state)

and one when we pass all the conditions and get to actually yield:

 yielded = curr->sched_class->yield_to_task(rq, p, preempt);


And during boot up of this guest, I saw:


failed yield_to() because task is running: 8368810426
successful yield_to(): 13077658
  0.156022% of yield_to calls
  1 out of 640 yield_to calls

Obviously, we have a problem.  Every exit causes a loop over 80 vcpus,
each one trying to get two locks.  This is happening on all [but one]
vcpus at around the same time.  Not going to work well.
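
(For reference, the loop in question is the PLE handler, kvm_vcpu_on_spin() in
virt/kvm/kvm_main.c.  This is a trimmed sketch from memory, not an exact copy;
every candidate that survives the cheap filters ends up in
kvm_vcpu_yield_to() -> yield_to() and its double runqueue lock:)

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int last_boosted_vcpu = kvm->last_boosted_vcpu;
	int yielded = 0;
	int pass, i;

	/* two passes so we wrap around the vcpu array exactly once */
	for (pass = 0; pass < 2 && !yielded; pass++) {
		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!pass && i <= last_boosted_vcpu) {
				i = last_boosted_vcpu;
				continue;
			} else if (pass && i > last_boosted_vcpu)
				break;
			if (vcpu == me)
				continue;
			if (waitqueue_active(&vcpu->wq))
				continue;
			/* the expensive part: pid lookup + double_rq_lock */
			if (kvm_vcpu_yield_to(vcpu)) {
				kvm->last_boosted_vcpu = i;
				yielded = 1;
				break;
			}
		}
	}
}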

So, since the check for a running task is nearly always true, I moved
that -before- the double runqueue lock, so 99.84% of the attempts do not
take the locks.  Now, I do not know if this [not getting the locks] is a
problem.  However, I'd rather have a slightly inaccurate test for a
running vcpu than burn 98% of CPU in the host kernel.  With the change
the VM boot time went to:  100 seconds, an 85% reduction in time.

I also wanted to check to see this did not affect truly over-committed
situations, so I first started with smaller VMs at 2x cpu over-commit:

16 VMs, 8-way each, all running dbench (2x cpu over-commit)
   throughput +/- stddev
   - -
ple off:2281 +/- 7.32%  (really bad as expected)
ple on:19796 +/- 1.36%
ple on: w/fix: 19796 +/- 1.37%  (no degrade at all)

In this case the VMs are small enough, that we do not loop through
enough vcpus to trigger the problem.  host CPU is very low (3-4% range)
for both default ple and with yield_to() fix.

So I went on to a bigger VM:

10 VMs, 16-way each, all running dbench (2x cpu over-commit)
   throughput +/- stddev
   - -
ple on: 2552 +/- .70%
ple on: w/fix:  4621 +/- 2.12%  (81% improvement!)

This is where we start seeing a major difference.  Without the fix, host
cpu was around 70%, mostly in spin_lock.  That was reduced to 60% (and
guest went from 30 to 40%).  I believe this is on the right track to
reduce the spin lock contention, still get proper directed yield, and
therefore improve the guest CPU available and its performance.

However, we still have lock contention, and I think we can reduce it
even more.  We have eliminated some attempts at double runqueue lock
acquire because the check for whether the target vcpu is running is now before
the lock.  However, even if the target-to-yield-to vcpu [for the same
guest upon we PLE exited] is not running, the physical
processor/runqueue that target-to-yield-to vcpu is located on could be
running a different VM's vcpu -and- going through a directed yield,
therefore that run queue lock may already be acquired.  We do not want to
just spin and wait, we want to move to the next candidate vcpu.  We need
a check to see if the smp processor/runqueue is already in a directed
yield.  Or, perhaps we just check if that cpu is not in guest mode, and
if so, we skip that yield attempt for that vcpu and move to the next
candidate vcpu.  So, my question is:  given a runqueue, what's the best
way to check if that corresponding phys cpu is not in guest mode?
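
(For illustration only, and untested: PF_VCPU is set on a task only while it is
actually executing guest code, so in yield_to() terms one rough approximation
might be a check like this, next to the early task_running() test:)

	/* hypothetical: skip this candidate if the cpu running its runqueue
	 * is not currently in guest mode; let the caller try the next vcpu */
	if (!(p_rq->curr->flags & PF_VCPU))
		goto out_no_unlock;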

Here are the changes so far (schedstat changes not included here):

signed-off-by:  Andrew Theurer 

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fbf1fd0..f8eff8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4844,6 +4844,9 @@ bool __sched yield_to(struct task_struct *p, bool
preempt)
 
 again:
p_rq = task_rq(p);
+   if (task_running(p_rq, p) || p->state) {
+   goto out_no_unlock;
+   }
double_rq_lock(rq, p_rq);
while (task_rq(p) != p_rq) {
double_rq_unlock(rq, p_rq);
@@ -4856,8 +4859,6 @@ again:
if (curr->sched_class != p->sched_class)
goto out;
 
-   if (task_running(p_rq, p) || p->state)
-   goto out;
 
yielded = curr->sched_class->yield_to_task(rq, p, preempt);
if (yielded) {
@@ -4879,6 +4880,7 @@ again:
 
 out:
double_rq_unlock(rq, p_rq);
+out_no_unlock:
local_irq_restore(flags);
 
if (yielded)

  






pagemapscan-numa: find out where your multi-node VM's memory is (and a question)

2012-07-13 Thread Andrew Theurer
We've started running multi-node VMs lately, and we want to track where
the VM memory actually resides in the host.  Our ability to manually pin
certain parts of memory is limited, so we want to see if those pinnings are
effective.  We also want to see what things like autoNUMA are doing.
Below is the program if anyone else wants to track memory placement.

One question I have is about the memory allocation made by Qemu for the
VM.  This program assumes that there is a single mapping for VM memory,
and that if the VM has N NUMA nodes, this mapping contains all the
memory for all nodes, and memory for nodes 0...N are ordered, such that
node 0 is at the beginning and node N is at the end of the mapping.  Is
this a correct assumption?  Or even close?  If not, what can we do to
extract this info from Qemu?

We assume that the memory per node is equal, which I can see could be
wrong, since the user could specify different memory sizes per node.
But at this point I am  willing to limit my tests to equal sized NUMA
nodes.
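
(For anyone curious about the mechanics before reading the full source below:
the scan just decodes 8-byte entries from /proc/<pid>/pagemap -- the "present"
bit is bit 63 and the PFN sits in bits 0-54, per Documentation/vm/pagemap.txt --
and then matches the PFN against the per-node start_pfn values parsed from
/proc/zoneinfo.  A rough sketch of that core:)

#include <stdint.h>

#define PM_PRESENT	(1ULL << 63)
#define PM_PFN_MASK	((1ULL << 55) - 1)

/* return the host node backing one pagemap entry, or -1 if not present */
static int entry_to_host_node(uint64_t entry,
			      const unsigned long *start_pfn, int nr_nodes)
{
	uint64_t pfn;
	int node;

	if (!(entry & PM_PRESENT))
		return -1;
	pfn = entry & PM_PFN_MASK;
	/* start_pfn[] is sorted ascending, one entry per host node */
	for (node = nr_nodes - 1; node >= 0; node--)
		if (pfn >= start_pfn[node])
			return node;
	return -1;
}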


Here are some examples of the output:

This is a 320GB, 80 vcpu, 4 node VM just booted on a 4 x WST-EX host.
There is no pinning/numactl, and no autoNUMA, etc.

> [root@at-host ~]# ./pagemapscan-numa 31021 4
> pid is 31021
> Processing /proc/zoneinfo for start_pfn's
> host node 0 start pfn: 1048576
> host node 1 start pfn: 67633152
> host node 2 start pfn: 134742016
> host node 3 start pfn: 201850880
> 
> Processing /proc/31021/maps and /proc/31021/pagemap
> 
> Found mapping 7F8033E0-7FD033E0 that is >2GiB (327680 MiB, 
> 83886080 pages)
> Pages present for this mapping:
>  vNUMA-node-ID    node00    node01    node02    node03
>             00     15872      8704      8192    573952
>             01      5120      2560      1024    512512
>             02         0         0         0    508416
>             03         0         0       512    508416
> 2145280/8380 total pages/MiB present for this process

Next one is a 139GB, 16 vcpu, 2 node VM, on a 2 x WST-EP host. The VM
uses PCI device assignment, so all the memory is allocated at VM
creation.  We also used hugetlbfs, and we used a mempolicy to prefer
node 0.  We allocated just enough huge pages so that once Qemu allocated
vNode0's memory, host node0's hugepages were depleted, and the remainder
of the VM's memory (vNode1) was fulfilled by huge pages in host node1.
At least that was our theory, and we wanted to confirm it worked:

> pid is 26899
> Processing /proc/zoneinfo for start_pfn's
> host node 0 start pfn: 1048576
> host node 1 start pfn: 19398656
> 
> processing /proc/26899/maps /proc/26899/pagemap
> 
> Found mapping 7F713700-7F93F700 that is >2GiB (142336 MiB, 
> 36438016 pages)
> Pages present for this mapping:
>  vNUMA-node-ID      node00      node01
>             00    18219008           0
>             01           0    18219008
> 36438016/142336 total pages/MiB present for this process

We suspect this might be useful in testing autoNUMA and other NUMA
related tests.

Thanks,

-Andrew Theurer



/* pagemapscan-numa.c v0.01
 *
 * Copyright (c) 2012 IBM
 *
 * Author: Andrew Theurer
 *
 * This software is licensed to you under the GNU General Public License,
 * version 2 (GPLv2). There is NO WARRANTY for this software, express or
 * implied, including the implied warranties of MERCHANTABILITY or FITNESS
 * FOR A PARTICULAR PURPOSE. You should have received a copy of GPLv2
 * along with this software; if not, see
 * http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt.
 *
 * pagemapscan-numa:  This program will take a Qemu PID and a value
 * equal to the number of NUMA nodes that the VM should have and process
 * /proc//pagemap, first finding the mapping that is the for VM memory,
 * and then finding where each page physically resides (which NUMA node)
 * on the host.  This is only useful if you have a mulit-node NUMA
 * topology on your host, and you have a multi-node NUMA topology
 * in your guest, and you want to know where the VM's memory maps to
 * on the host.
 */

#define _LARGEFILE64_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>

#define MAX_NODES 256
#define MAX_LENGTH 256
#define PAGE_SIZE 4096

int main(int argc, char* argv[]) {

int pagemapfile = -1;
int nr_vm_nodes;
FILE *mapsfile = NULL;
FILE *zonefile = NULL;
char pagemappath[MAX_LENGTH];
char mapspath[MAX_LENGTH];
char mapsline[MAX_LENGTH];
char zoneline[MAX_LENGTH];
long file_offset, offset;
unsigned long start_addr, end_addr, size_b, size_mib, nr_pages,
  page, present_pages, present_mib, start_pfn[MAX_NODES];
int i, node, nr_host_nodes, find_node;

if (argc != 3) {
printf("You must provide two arguments, the Qemu PID and the 
number

Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-10 Thread Andrew Theurer
On Tue, 2012-07-10 at 17:24 +0530, Raghavendra K T wrote:
> On 07/10/2012 03:17 AM, Andrew Theurer wrote:
> > On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> >> Currently Pause Looop Exit (PLE) handler is doing directed yield to a
> >> random VCPU on PL exit. Though we already have filtering while choosing
> >> the candidate to yield_to, we can do better.
> >
> > Hi, Raghu.
> >
> [...]
> >
> > Can you briefly explain the 1x and 2x configs?  This of course is highly
> > dependent whether or not HT is enabled...
> >
> 
> Sorry if I had not made this very clear in earlier threads. Have you applied
> Rik's following patch as the base? Without it you could perhaps see some
> inconsistent results.
> 
> https://lkml.org/lkml/2012/6/19/401

Yes, I do have that applied with your patch and in my baseline.

-Andrew Theurer




Re: [PATCH RFC 0/2] kvm: Improving directed yield in PLE handler

2012-07-09 Thread Andrew Theurer
On Mon, 2012-07-09 at 11:50 +0530, Raghavendra K T wrote:
> Currently the Pause Loop Exit (PLE) handler is doing a directed yield to a
> random VCPU on PL exit. Though we already have filtering while choosing
> the candidate to yield_to, we can do better.

Hi, Raghu.

> Problem is, for large vcpu guests, we have more probability of yielding
> to a bad vcpu. We are not able to prevent directed yield to same guy who
> has done PL exit recently, who perhaps spins again and wastes CPU.
> 
> Fix that by keeping track of who has done a PL exit. So the algorithm in this series
> gives a chance to a VCPU which has:
> 
>  (a) Not done PLE exit at all (probably he is preempted lock-holder)
> 
>  (b) VCPU skipped in last iteration because it did PL exit, and probably
>  has become eligible now (next eligible lock holder)
> 
> Future enhancements:
>   (1) Currently we have a boolean to decide on eligibility of vcpu. It
> would be nice if I get feedback on guest (>32 vcpu) whether we can
> improve better with integer counter. (with counter = say f(log n )).
>   
>   (2) We have not considered system load during iteration of vcpu. With
>that information we can limit the scan and also decide whether schedule()
>is better. [ I am able to use #kicked vcpus to decide on this But may
>be there are better ideas like information from global loadavg.]
> 
>   (3) We can exploit this further with PV patches since it also knows about
>next eligible lock-holder.
> 
> Summary: There is a huge improvement for moderate / no overcommit scenario
>  for kvm based guest on PLE machine (which is difficult ;) ).
> 
> Result:
> Base : kernel 3.5.0-rc5 with Rik's Ple handler fix
> 
> Machine : Intel(R) Xeon(R) CPU X7560  @ 2.27GHz, 4 numa node, 256GB RAM,
>   32 core machine

Is this with HT enabled, therefore 64 CPU threads?

> Host: enterprise linux  gcc version 4.4.6 20120305 (Red Hat 4.4.6-4) (GCC)
>   with test kernels 
> 
> Guest: fedora 16 with 32 vcpus 8GB memory. 

Can you briefly explain the 1x and 2x configs?  This of course is highly
dependent whether or not HT is enabled...

FWIW, I started testing what I would call "0.5x", where I have one 40
vcpu guest running on a host with 40 cores and 80 CPU threads total (HT
enabled, no extra load on the system).  For ebizzy, the results are
quite erratic from run to run, so I am inclined to discard it as a
workload, but maybe I should try "1x" and "2x" cpu over-commit as well.

From initial observations, at least for the ebizzy workload, the
percentage of exits that result in a yield_to() are very low, around 1%,
before these patches.  So, I am concerned that at least for this test,
reducing that number even more has diminishing returns.  I am however
still concerned about the scalability problem with yield_to(), which
shows like this for me (perf):

> 63.56%  282095  qemu-kvm  [kernel.kallsyms]  [k] _raw_spin_lock
>  5.42%   24420  qemu-kvm  [kvm]              [k] kvm_vcpu_yield_to
>  5.33%   26481  qemu-kvm  [kernel.kallsyms]  [k] get_pid_task
>  4.35%   20049  qemu-kvm  [kernel.kallsyms]  [k] yield_to
>  2.74%   15652  qemu-kvm  [kvm]              [k] kvm_apic_present
>  1.70%    8657  qemu-kvm  [kvm]              [k] kvm_vcpu_on_spin
>  1.45%    7889  qemu-kvm  [kvm]              [k] vcpu_enter_guest
 
For the cpu threads in the host that are actually active (in this case
1/2 of them), ~50% of their time is in kernel and ~43% in guest.  This
is for a no-IO workload, so that's just incredible to see so much cpu
wasted.  I feel that 2 important areas to tackle are a more scalable
yield_to() and reducing the number of pause exits itself (hopefully by
just tuning ple_window for the latter).
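
(As an aside, the get_pid_task and kvm_vcpu_yield_to samples above come from the
vcpu task lookup the PLE handler performs for every yield attempt.  From memory,
that path in virt/kvm/kvm_main.c looks roughly like this -- a sketch, not an
exact copy:)

int kvm_vcpu_yield_to(struct kvm_vcpu *target)
{
	struct pid *pid;
	struct task_struct *task = NULL;

	/* resolve the target vcpu thread's task_struct from its pid */
	rcu_read_lock();
	pid = rcu_dereference(target->pid);
	if (pid)
		task = get_pid_task(pid, PIDTYPE_PID);
	rcu_read_unlock();
	if (!task)
		return 0;
	if (task->flags & PF_VCPU) {
		/* target is already running guest code; nothing to boost */
		put_task_struct(task);
		return 0;
	}
	if (yield_to(task, 1)) {	/* the double runqueue lock lives in here */
		put_task_struct(task);
		return 1;
	}
	put_task_struct(task);
	return 0;
}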

Honestly, I am not confident that addressing this problem will improve the
ebizzy score. That workload is so erratic for me that I do not trust
the results at all.  I have however seen consistent improvements in
disabling PLE for a http guest workload and a very high IOPS guest
workload, both with much time spent in host in the double runqueue lock
for yield_to(), so that's why I still gravitate toward that issue.


-Andrew Theurer



Re: [PATCH] add PLE stats to kvmstat

2012-07-06 Thread Andrew Theurer
On Sat, 2012-07-07 at 01:40 +0800, Xiao Guangrong wrote:
> On 07/06/2012 09:22 PM, Andrew Theurer wrote:
> > On Fri, 2012-07-06 at 15:42 +0800, Xiao Guangrong wrote:
> >> On 07/06/2012 05:50 AM, Andrew Theurer wrote:
> >>> I, and I expect others, have a keen interest in knowing how often we
> >>> exit for PLE, and also how often that includes a yielding to another
> >>> vcpu.  The following adds two more counters to kvmstat to track the
> >>> exits and the vcpu yields.  This in no way changes PLE behavior, just
> >>> helps us track what's going on.
> >>>
> >>
> >> Tracepoint is a better choice than the counters you used. :)
> > 
> > Xiao, is kvmstat considered to be deprecated?  Or are debug stats like
> > this just generally favored to be processed via something like perf
> > instead of debugfs?
> 
> Andrew, please refer to Documentation/feature-removal-schedule.txt,
> it says:
> 
> What:   KVM debugfs statistics
> When:   2013
> Why:KVM tracepoints provide mostly equivalent information in a much more
> flexible fashion.
> 
> You can use tracepoints instead of your debugfs-counters in this patch.

Great, thanks.  I will work on a tracepoint based approach.
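
(Something along these lines, perhaps -- an untested sketch modeled on the
existing entries in include/trace/events/kvm.h; the event and field names here
are just placeholders:)

TRACE_EVENT(kvm_vcpu_on_spin_yield,
	TP_PROTO(unsigned int vcpu_id, unsigned int target_id),
	TP_ARGS(vcpu_id, target_id),

	TP_STRUCT__entry(
		__field(unsigned int, vcpu_id)
		__field(unsigned int, target_id)
	),

	TP_fast_assign(
		__entry->vcpu_id   = vcpu_id;
		__entry->target_id = target_id;
	),

	TP_printk("vcpu %u yielded to vcpu %u",
		  __entry->vcpu_id, __entry->target_id)
);
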
> 
> >  Should we be removing kvmstat?
> > 
> 
> Some months ago, I implemented 'perf kvm-events' which can analyse kvm
> events more smartly, the patchset can be found at:
>   https://lkml.org/lkml/2012/3/6/86

I will take a look.
> 
> Avi said it may replace kvmstat, but I am too busy to update this
> patchset. :)
> 

-Andrew




Re: [PATCH] add PLE stats to kvmstat

2012-07-06 Thread Andrew Theurer
On Fri, 2012-07-06 at 15:42 +0800, Xiao Guangrong wrote:
> On 07/06/2012 05:50 AM, Andrew Theurer wrote:
> > I, and I expect others, have a keen interest in knowing how often we
> > exit for PLE, and also how often that includes a yielding to another
> > vcpu.  The following adds two more counters to kvmstat to track the
> > exits and the vcpu yields.  This in no way changes PLE behavior, just
> > helps us track what's going on.
> > 
> 
> Tracepoint is a better choice than the counters you used. :)

Xiao, is kvmstat considered to be deprecated?  Or are debug stats like
this just generally favored to be processed via something like perf
instead of debugfs?  Should we be removing kvmstat?

Thanks,

-Andrew Theurer



[PATCH] add PLE stats to kvmstat

2012-07-05 Thread Andrew Theurer
I, and I expect others, have a keen interest in knowing how often we
exit for PLE, and also how often that includes a yielding to another
vcpu.  The following adds two more counters to kvmstat to track the
exits and the vcpu yields.  This in no way changes PLE behavior, just
helps us track what's going on.

-Andrew Theurer

Signed-off-by: Andrew Theurer 

 arch/x86/include/asm/kvm_host.h |2 ++
 arch/x86/kvm/svm.c  |1 +
 arch/x86/kvm/vmx.c  |1 +
 arch/x86/kvm/x86.c  |2 ++
 virt/kvm/kvm_main.c |1 +
 5 files changed, 7 insertions(+), 0 deletions(-)


diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 24b7647..aebba8a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -593,6 +593,8 @@ struct kvm_vcpu_stat {
u32 hypercalls;
u32 irq_injections;
u32 nmi_injections;
+   u32 pause_exits;
+   u32 vcpu_yield_to;
 };
 
 struct x86_instruction_info;
diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 7a41878..1c1b81e 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -3264,6 +3264,7 @@ static int interrupt_window_interception(struct vcpu_svm 
*svm)
 
 static int pause_interception(struct vcpu_svm *svm)
 {
+   ++svm->vcpu.stat.pause_exits;
kvm_vcpu_on_spin(&(svm->vcpu));
return 1;
 }
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index eeeb4a2..1309578 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5004,6 +5004,7 @@ out:
  */
 static int handle_pause(struct kvm_vcpu *vcpu)
 {
+   ++vcpu->stat.pause_exits;
skip_emulated_instruction(vcpu);
kvm_vcpu_on_spin(vcpu);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8eacb2e..ad85403 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -143,6 +143,8 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "insn_emulation_fail", VCPU_STAT(insn_emulation_fail) },
{ "irq_injections", VCPU_STAT(irq_injections) },
{ "nmi_injections", VCPU_STAT(nmi_injections) },
+   { "pause_exits", VCPU_STAT(pause_exits) },
+   { "vcpu_yield_to", VCPU_STAT(vcpu_yield_to) },
{ "mmu_shadow_zapped", VM_STAT(mmu_shadow_zapped) },
{ "mmu_pte_write", VM_STAT(mmu_pte_write) },
{ "mmu_pte_updated", VM_STAT(mmu_pte_updated) },
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 636bd08..d80b6cd 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1610,6 +1610,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me)
if (kvm_vcpu_yield_to(vcpu)) {
kvm->last_boosted_vcpu = i;
yielded = 1;
+   ++vcpu->stat.vcpu_yield_to;
break;
}
}




Re: [PATCH] kvm: handle last_boosted_vcpu = 0 case

2012-07-05 Thread Andrew Theurer
On Mon, 2012-07-02 at 10:49 -0400, Rik van Riel wrote:
> On 06/28/2012 06:55 PM, Vinod, Chegu wrote:
> > Hello,
> >
> > I am just catching up on this email thread...
> >
> > Perhaps one of you may be able to help answer this query.. preferably along 
> > with some data.  [BTW, I do understand the basic intent behind PLE in a 
> > typical [sweet spot] use case where there is over subscription etc. and the 
> > need to optimize the PLE handler in the host etc. ]
> >
> > In a use case where the host has fewer but much larger guests (say 40VCPUs 
> > and higher) and there is no over subscription (i.e. # of vcpus across 
> > guests<= physical cpus in the host  and perhaps each guest has their vcpu's 
> > pinned to specific physical cpus for other reasons), I would like to 
> > understand if/how  the PLE really helps ?  For these use cases would it be 
> > ok to turn PLE off (ple_gap=0) since is no real need to take an exit and 
> > find some other VCPU to yield to ?
> 
> Yes, that should be ok.
> 
> On a related note, I wonder if we should increase the ple_gap
> significantly.
> 
> After all, 4096 cycles of spinning is not that much, when you
> consider how much time is spent doing the subsequent vmexit,
> scanning the other VCPU's status (200 cycles per cache miss),
> deciding what to do, maybe poking another CPU, and eventually
> a vmenter.
> 
> A factor 4 increase in ple_gap might be what it takes to
> get the amount of time spent spinning equal to the amount of
> time spent on the host side doing KVM stuff...

I was recently thinking the same thing, as I have observed over 180,000
exits/sec from a 40-way VM on an 80-way host, where there should be no
cpu overcommit.  Also, the number of directed yields for this was only
1800/sec, so we have a 1% usefulness for our exits.  I am wondering if
the ple_window should be similar to the host scheduler task switching
granularity, and not what we think a typical max cycles should be for
holding a lock.
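
(For reference -- from memory, and worth double-checking against the source --
both knobs are module parameters of kvm_intel, wired up in arch/x86/kvm/vmx.c
roughly like this, so they can be changed at module load time:)

#define KVM_VMX_DEFAULT_PLE_GAP		128
#define KVM_VMX_DEFAULT_PLE_WINDOW	4096

static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
module_param(ple_gap, int, S_IRUGO);

static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
module_param(ple_window, int, S_IRUGO);

	/* ... later, in the per-vcpu VMCS setup: */
	if (ple_gap) {
		vmcs_write32(PLE_GAP, ple_gap);
		vmcs_write32(PLE_WINDOW, ple_window);
	}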

BTW, I have a patch to add a couple PLE stats to kvmstat which I will
send out shortly.

-Andrew






Re: [RFC:kvm] export host NUMA info to guest & make emulated device NUMA attr

2012-05-23 Thread Andrew Theurer

On 05/22/2012 04:28 AM, Liu ping fan wrote:

On Sat, May 19, 2012 at 12:14 AM, Shirley Ma  wrote:

On Thu, 2012-05-17 at 17:20 +0800, Liu Ping Fan wrote:

Currently, the guest can not know the NUMA info of the vcpu, which
will result in a performance drawback.

This is the discovered and experiment by
 Shirley Ma
 Krishna Kumar
 Tom Lendacky
Refer to -
http://www.mail-archive.com/kvm@vger.kernel.org/msg69868.html
we can see the big perfermance gap between NUMA aware and unaware.

Enlightened by their discovery, I think, we can do more work -- that
is to
export NUMA info of host to guest.


There are three problems we've found:

1. KVM doesn't support a NUMA load balancer. Even when there are no other
workloads in the system, and the number of vcpus on the guest is smaller
than the number of cpus per node, the vcpus could be scheduled on
different nodes.

Someone is working on an in-kernel solution. Andrew Theurer has a working
user-space NUMA-aware VM balancer; it requires libvirt and cgroups
(which are default for RHEL6 systems).


Interesting, and I found that "sched/numa: Introduce
sys_numa_{t,m}bind()" committed by Peter and Ingo may help.
But I think from the guest view, it can not tell whether the two vcpus
are on the same host node. For example,
vcpu-a in node-A is not vcpu-b in node-B, the guest lb will be more
expensive if it pull_task from vcpu-a and
choose vcpu-b to push.  And my idea is to export such info to guest,
still working on it.


The long term solution is two-fold:
1) Guests that are quite large (in that they cannot fit in a host NUMA 
node) must have a static multi-node NUMA topology implemented by Qemu. 
That is here today, but we do not do it automatically, which is probably 
going to be a VM management responsibility.
2) The host scheduler and NUMA code must be enhanced to get better placement 
of Qemu memory and threads.  For single-node vNUMA guests, this is easy: 
put it all in one node.  For multi-node vNUMA guests, the host must 
understand that some Qemu memory belongs with certain vCPU threads 
(which make up one of the guest's vNUMA nodes), and then place that 
memory/threads in a specific host node (and continue for other 
memory/threads for each Qemu vNUMA node).


Note that even if a guest's memory/threads for a vNUMA node are 
relocated to another host node (which will be necessary) the NUMA 
characteristics of the guest are still maintained (as all those vCPUs and 
memory are still "close" to each other).


The problem with exposing the host's NUMA info directly to the guest is 
that (1) vCPUs will get relocated, so their topology info in the guest 
will have to change over time. IMO that is a bad idea.  We have a hard 
enough time getting applications to work with static NUMA info.  To 
get applications to react to changing NUMA topology is not going to turn 
out well. (2) Every single guest would have to have the same number of 
NUMA nodes defined as the host.  That is overkill, especially for small 
guests.




2. The host scheduler is not aware of the relationship between guest vCPUs
and vhost. So it's possible for the host scheduler to schedule the per-device
vhost thread on the same cpu on which the vCPU kicked a TX packet, or to
schedule the vhost thread on a different node than the vCPU; for RX it's
possible for vhost to deliver the packet to a vCPU running on a
different node too.


Yes. I notice this point in your original patch.


3. per-device vhost thread is not scaled.


What about the scalability of per-VM * host_NUMA_NODE? When we take
advantage of multi-core, we produce multiple vcpu threads for one VM.
So what about the emulated device? Is it acceptable to scale it to take
advantage of the host NUMA attributes?  After all, how many nodes the VM
can run on is under the user's control.  It is a balance of
scalability and performance.


So the problems are in host scheduling and vhost thread scalability. I
am not sure how much help from exposing NUMA info from host to guest.

Have you tested these patched? How much performance gain here?


Sorry, not yet.  As you have mentioned, the vhost thread scalability
is a big problem. So I want to see others' opinion before going on.

Thanks and regards,
pingfan



Thanks
Shirley


So here comes the idea:
1. export host numa info through guest's sched domain to its scheduler
   Export vcpu's NUMA info to guest scheduler(I think mem NUMA problem
   has been handled by host).  So the guest's lb will consider the
cost.
   I am still working on this, and my original idea is to export these
info
   through "static struct sched_domain_topology_level
*sched_domain_topology"
   to guest.

2. Do a better emulation of virt mach exported to guest.
   In real world, the devices are limited by kinds of reasons to own
the NUMA
   property. But as to Qemu, the device is emulated by thread, which
inherit
   the NUMA attr in nature.  We can implement the d

Re: gettimeofday() vsyscall for kvm-clock?

2012-05-21 Thread Andrew Theurer

On 05/21/2012 03:36 PM, Marcelo Tosatti wrote:

On Mon, May 21, 2012 at 03:26:54PM -0500, Andrew Theurer wrote:

Wondering if a user-space gettimeofday() for kvm-clock has been
considered before.  I am seeing a pretty large difference in
performance between tsc and kvm-clock.  I have to assume at least
some of this is due to the mode switch for kvm-clock.  Here are the
results:

(this is a 16 vCPU VM on a 16 thread 2S Nehalem-EP host, looped
gettimeofday() calls on all vCPUs)

tsc:        .0645 usec per call
kvm-clock:  .4222 usec per call (6.54x)


-Andrew Theurer


https://bugzilla.redhat.com/show_bug.cgi?id=679207

"model name : Intel(R) Xeon(R) CPU   E5540  @ 2.53GHz

native, gettimeofday (vsyscall): 45ns
guest, kvmclock (syscall): 198ns"

But this was before

commit 489fb490dbf8dab0249ad82b56688ae3842a79e8
Author: Glauber Costa
Date:   Tue May 11 12:17:40 2010 -0400

 x86, paravirt: Add a global synchronization point for pvclock

(see the full changelog for details).

Can you try disabling the global variable, to see if that makes
a difference (should not be enabled in production)? Untested patch
(against guest kernel) below


The following was re-done on a 3.4 guest kernel (previously RHEL kernel):

1-way:
  tsc:        .0315
  kvm-clock:  .2112 (6.7x)

16-way:
  tsc:        .0432
  kvm-clock:  .4825 (11.1x)

Now with global var disabled:

16-way:
  kvm-clock:  .4628

Does not look like much of a difference.

-Andrew



gettimeofday() vsyscall for kvm-clock?

2012-05-21 Thread Andrew Theurer
Wondering if a user-space gettimeofday() for kvm-clock has been 
considered before.  I am seeing a pretty large difference in performance 
between tsc and kvm-clock.  I have to assume at least some of this is 
due to the mode switch for kvm-clock.  Here are the results:


(this is a 16 vCPU VM on a 16 thread 2S Nehalem-EP host, looped 
gettimeofday() calls on all vCPUs)


tsc:        .0645 usec per call
kvm-clock:  .4222 usec per call (6.54x)
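
(The measurement loop itself is nothing fancy; a minimal sketch of the kind of
harness used -- not the exact one -- run once per vCPU, e.g. under taskset:)

#include <stdio.h>
#include <sys/time.h>

int main(void)
{
	struct timeval start, end, tv;
	const long iters = 50 * 1000 * 1000;
	long i;
	double elapsed_us;

	gettimeofday(&start, NULL);
	for (i = 0; i < iters; i++)
		gettimeofday(&tv, NULL);	/* the call being measured */
	gettimeofday(&end, NULL);

	elapsed_us = (end.tv_sec - start.tv_sec) * 1e6 +
		     (end.tv_usec - start.tv_usec);
	printf("%.4f usec per call\n", elapsed_us / iters);
	return 0;
}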


-Andrew Theurer



Re: perf stat to collect the performance statistics of KVM process

2012-05-14 Thread Andrew Theurer

On 05/13/2012 10:56 AM, Hailong Yang wrote:

Dear all,

I am running perf stat to collect the performance statistics of the already
running KVM guest VM process. And I have a shell script to execute the perf
stat as a daemon process. But when I use the 'kill' command to stop the perf
stat, there is no output redirected to a file. What I would like to do is
that to collect performance counters of the guest VM process for a certain
period and redirect the output to a log file, but without user interaction
(such as using CTRL + C to stop the perf stat)

[root@dell06 ~]# (perf stat -p 7473 -x ,) 2>  perftest&
[1] 15086
[root@dell06 ~]# kill 15086
[root@dell06 ~]#
[1]+  Terminated  ( perf stat -p 7473 -x , ) 2>  perftest
[root@dell06 ~]# cat perftest
[root@dell06 ~]#

Any clue?


Can you please try "kill -s INT "


Best Regards

Hailong



-Andrew Theurer



Re: How to determine the backing host physical memory for a given guest ?

2012-05-10 Thread Andrew Theurer

On 05/09/2012 08:46 AM, Avi Kivity wrote:

On 05/09/2012 04:05 PM, Chegu Vinod wrote:

Hello,

On an 8 socket Westmere host I am attempting to run a single guest and
characterize the virtualization overhead for a system intensive
workload (AIM7-high_systime) as the size of the guest scales (10way/64G,
20way/128G, ... 80way/512G).

To do some comparisons between the native vs. guest runs, I have
been using "numactl" to control the cpu node & memory node bindings for
the qemu instance.  For larger guest sizes I end up binding across multiple
localities, e.g. for a 40-way guest:

numactl --cpunodebind=0,1,2,3  --membind=0,1,2,3  \
qemu-system-x86_64 -smp 40 -m 262144 \
<>

I understand that actual mappings from a guest virtual address to host physical
address could change.

Is there a way to determine [at a given instant] which host's NUMA node is
providing the backing physical memory for the active guest's kernel and
also for the the apps actively running in the guest ?

Guessing that there is a better way (some tool available?) than just
diff'ng the per node memory usage...from the before and after output of
"numactl --hardware" on the host.



Not sure if that's what you want, but there's Documentation/vm/pagemap.txt.



You can look at /proc/<pid>/numa_maps and see all the mappings for the 
qemu process.  There should be one really large mapping for the guest 
memory, and that line includes per-NUMA-node page counts (the N<node>= 
fields) along with counts such as dirty=.  This will tell you how much 
memory comes from each node, but not specifically "which page is mapped 
where".
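
(If per-page placement is really needed, one option is move_pages(2) with
nodes == NULL, which only queries which node currently backs each page and
migrates nothing.  A hypothetical sketch, not something I have run against
qemu here; link with -lnuma:)

/* Print the NUMA node backing each page of a range in process "pid"
 * (pid 0 means the calling process).  Query only, no migration; negative
 * status values are per-page errors, e.g. pages not yet faulted in. */
#include <numaif.h>
#include <stdio.h>
#include <unistd.h>

static void show_backing_nodes(int pid, char *start, unsigned long npages)
{
	void *pages[npages];
	int status[npages];
	long page_size = sysconf(_SC_PAGESIZE);
	unsigned long i;

	for (i = 0; i < npages; i++)
		pages[i] = start + i * page_size;

	if (move_pages(pid, npages, pages, NULL, status, 0) == 0)
		for (i = 0; i < npages; i++)
			printf("page %lu: node %d\n", i, status[i]);
}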


Keep in mind with the current numactl you are using, you will likely not 
get the benefits of NUMA enhancements found in the linux kernel from 
your guest (or host).  There are a couple reasons: (1) your guest does 
not have a NUMA topology defined (based on what I see from the qemu 
command above), so it will not do anything special based on the host 
topology.  Also, things that are broken down per-NUMA-node like some 
spin-locks and sched-domains are now system-wide/flat.  This is a big 
deal for scheduler and other things like kmem allocation.  With a single 
80way VM with no NUMA, you will likely have massive spin-lock contention 
on some workloads. (2) Once the VM does have NUMA toplogy (via qemu 
-numa), one still cannot manually set mempolicy for a portion of the VM 
memory that represents each NUMA node in the VM (or have this done 
automatically with something like autoNUMA).  Therefore, it's difficult 
to forcefully map each of the VM's node's memory to the corresponding 
host node.


There are some things you can do to mitigate some of this.  Definitely 
define the VM to match the NUMA topology found on the host.  That will 
at least allow good scaling wrt locks and scheduler in the guest.  As 
for getting memory placement close (a page in VM node x actually resides 
in host node x), you have to rely on vcpu pinning + guest NUMA topology, 
combined with default mempolicy in the guest and host.  As pages are 
faulted in the guest, the hope is that the vcpu which did the faulting 
is running in the right node (guest and host), its guest OS mempolicy 
ensures this page is to be allocated in the guest local node, and that 
allocation causes a fault in qemu, which is -also- running on the -host- 
node X.  The vcpu pinning is critical to get qemu to fault that memory 
to the correct node.  Make sure you do not use numactl for any of this. 
 I would suggest using libvirt and defining the vcpu-pinning and the numa 
topology in the XML.


-Andrew Theurer



Re: performance of virtual functions compared to virtio

2011-04-25 Thread Andrew Theurer
On Mon, 2011-04-25 at 13:49 -0600, David Ahern wrote:
> 
> On 04/25/11 13:29, Alex Williamson wrote:
> > So we're effectively getting host-host latency/throughput for the VF,
> > it's just that in the 82576 implementation of SR-IOV, the VF takes a
> > latency hit that puts it pretty close to virtio.  Unfortunate.  I think
> 
> For host-to-VM using VFs is worse than virtio which is counterintuitive.
> 
> > you'll find that passing the PF to the guests should be pretty close to
> > that 185us latency.  I would assume (hope) the higher end NICs reduce
> 
> About that 185usec: do you know where the bottleneck is? It seems as if
> the packet is held in some queue waiting for an event/timeout before it
> is transmitted.

you might want to check the VF driver.  I know versions of the ixgbevf
driver have a throttled interrupt option which will increase latency
with some settings.  I don't remember if the igbvf driver has the same
feature.  If it does, you will want to turn this option off for best
latency.

> 
> David
> 
> 
> > this, but it seems to be a hardware limitation, so it's hard to predict.
> > Thanks,
> > 
> > Alex

-Andrew



Re: Network performance with small packets

2011-03-08 Thread Andrew Theurer
On Tue, 2011-03-08 at 13:57 -0800, Shirley Ma wrote:
> On Wed, 2011-02-09 at 11:07 +1030, Rusty Russell wrote:
> > I've finally read this thread... I think we need to get more serious
> > with our stats gathering to diagnose these kind of performance issues.
> > 
> > This is a start; it should tell us what is actually happening to the
> > virtio ring(s) without significant performance impact... 
> 
> Should we also add similar stat on vhost vq as well for monitoring
> vhost_signal & vhost_notify?

Tom L has started using Rusty's patches and found some interesting
results, sent yesterday:
http://marc.info/?l=kvm&m=129953710930124&w=2


-Andrew
> 
> Shirley
> 




Re: [PATCH 0/3] [RFC] Implement multiqueue (RX & TX) virtio-net

2011-03-03 Thread Andrew Theurer
On Mon, 2011-02-28 at 12:04 +0530, Krishna Kumar wrote:
> This patch series is a continuation of an earlier one that
> implemented guest MQ TX functionality.  This new patchset
> implements both RX and TX MQ.  Qemu changes are not being
> included at this time solely to aid in easier review.
> Compatibility testing with old/new combinations of qemu/guest
> and vhost was done without any issues.
> 
> Some early TCP/UDP test results are at the bottom of this
> post, I plan to submit more test results in the coming days.
> 
> Please review and provide feedback on what can improve.
> 
> Thanks!
> 
> Signed-off-by: Krishna Kumar 
> ---
> 
> 
> Test configuration:
>   Host:  8 Intel Xeon, 8 GB memory
>   Guest: 4 cpus, 2 GB memory
> 
> Each test case runs for 60 secs, results below are average over
> two runs.  Bandwidth numbers are in gbps.  I have used default
> netperf, and no testing/system tuning other than taskset each
> vhost to 0xf (cpus 0-3).  Comparison is testing original kernel
> vs new kernel with #txqs=8 ("#" refers to number of netperf
> sessions).

> 
> ___
> TCP: Guest -> Local Host (TCP_STREAM)
>  TCP: Local Host -> Guest (TCP_MAERTS)
> UDP: Local Host -> Guest (UDP_STREAM)


Any reason why the tests don't include a guest-to-guest on same host, or
on different hosts?  Seems like those would be a lot more common than
guest-to/from-localhost.

Thanks,

-Andrew




Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

2010-08-31 Thread Andrew Theurer
On Tue, 2010-08-31 at 17:03 -0500, Anthony Liguori wrote:
> On 08/31/2010 03:54 PM, Andrew Theurer wrote:
> > On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> >
> >> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> >>  
> >>> Anthony Liguori wrote:
> >>>
> >>>> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>>>  
> >>>>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>>>
> >>>>>> According to the user-provided assignment bind the respective part
> >>>>>> of the guest's memory to the given host node. This uses Linux'
> >>>>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>>>> pinning right after the allocation.
> >>>>>> Failures are not fatal, but produce a warning.
> >>>>>>
> >>>>>> Signed-off-by: Andre Przywara
> >>>>>> ...
> >>>>>>  
> >>>>> Why is it not possible (or perhaps not desired) to change the binding
> >>>>> after the guest is started?
> >>>>>
> >>>>> Sounds unflexible.
> >>>>>
> >>> The solution is to introduce a monitor interface to later adjust the
> >>> pinning, allowing both changing the affinity only (only valid for
> >>> future fault-ins) and actually copying the memory (more costly).
> >>>
> >> This is just duplicating numactl.
> >>
> >>  
> >>> Actually this is the next item on my list, but I wanted to bring up
> >>> the basics first to avoid recoding parts afterwards. Also I am not
> >>> (yet) familiar with the QMP protocol.
> >>>
> >>>> We really need a solution that lets a user use a tool like numactl
> >>>> outside of the QEMU instance.
> >>>>  
> >>> I fear that is not how it's meant to work with the Linux' NUMA API. In
> >>> opposite to the VCPU threads, which are externally visible entities
> >>> (PIDs), the memory should be private to the QEMU process. While you
> >>> can change the NUMA allocation policy of the _whole_ process, there is
> >>> no way to externally distinguish parts of the process' memory.
> >>> Although you could later (and externally) migrate already faulted
> >>> pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you
> >>> would let an external tool interfere with QEMUs internal memory
> >>> management. Take for instance the change of the allocation policy
> >>> regarding the 1MB and 3.5-4GB holes. An external tool would have to
> >>> either track such changes or you simply could not change such things
> >>> in QEMU.
> >>>
> >> It's extremely likely that if you're doing NUMA pinning, you're also
> >> doing large pages via hugetlbfs.  numactl can already set policies for
> >> files in hugetlbfs so all you need to do is have a separate hugetlbfs
> >> file for each numa node.
> >>  
> > Why would we resort to hugetlbfs when we have transparent hugepages?
> >
> 
> If you care about NUMA pinning, I can't believe you don't want 
> guaranteed large page allocation which THP does not provide.

I personally want a more automatic approach to placing VMs in NUMA nodes
(not directed by the qemu process itself), but I'd also like to support
a user's desire to pin and place cpus and memory, especially for large
VMs that need to be defined as multi-node.  For user defined pinning,
libhugetlbfs will probably be fine, but for most VMs, I'd like to ensure
we can do things like ballooning well, and I am not so sure that will be
easy with libhugetlbfs.  

> The general point though is that we should find a way to partition 
> memory in qemu such that an external process can control the actual NUMA 
> placement.  This gives us maximum flexibility.
> 
> Otherwise, what do we implement in QEMU?  Direct pinning of memory to 
> nodes?  Can we migrate memory between nodes?  Should we support 
> interleaving memory between two virtual nodes?  Why pick and choose when 
> we can have it all.

If there were a better way to do this than hugetlbfs, then I don't think
I would shy away from this.  Is there another way to change NUMA
policies on mappings from a user tool?  We can already inspect
with /proc/<pid>/numa_maps.  Is this something that could be a

Re: [PATCH 4/4] NUMA: realize NUMA memory pinning

2010-08-31 Thread Andrew Theurer
On Mon, 2010-08-23 at 16:27 -0500, Anthony Liguori wrote:
> On 08/23/2010 04:16 PM, Andre Przywara wrote:
> > Anthony Liguori wrote:
> >> On 08/23/2010 01:59 PM, Marcelo Tosatti wrote:
> >>> On Wed, Aug 11, 2010 at 03:52:18PM +0200, Andre Przywara wrote:
> >>>> According to the user-provided assignment bind the respective part
> >>>> of the guest's memory to the given host node. This uses Linux'
> >>>> mbind syscall (which is wrapped only in libnuma) to realize the
> >>>> pinning right after the allocation.
> >>>> Failures are not fatal, but produce a warning.
> >>>>
> >>>> Signed-off-by: Andre Przywara
> > >>> ...
> >>> Why is it not possible (or perhaps not desired) to change the binding
> >>> after the guest is started?
> >>>
> >>> Sounds unflexible.
> > The solution is to introduce a monitor interface to later adjust the 
> > pinning, allowing both changing the affinity only (only valid for 
> > future fault-ins) and actually copying the memory (more costly).
> 
> This is just duplicating numactl.
> 
> > Actually this is the next item on my list, but I wanted to bring up 
> > the basics first to avoid recoding parts afterwards. Also I am not 
> > (yet) familiar with the QMP protocol.
> >>
> >> We really need a solution that lets a user use a tool like numactl 
> >> outside of the QEMU instance.
> > I fear that is not how it's meant to work with the Linux' NUMA API. In 
> > opposite to the VCPU threads, which are externally visible entities 
> > (PIDs), the memory should be private to the QEMU process. While you 
> > can change the NUMA allocation policy of the _whole_ process, there is 
> > no way to externally distinguish parts of the process' memory. 
> > Although you could later (and externally) migrate already faulted 
> > pages (via move_pages(2) and by looking in /proc/$$/numa_maps), you 
> > would let an external tool interfere with QEMUs internal memory 
> > management. Take for instance the change of the allocation policy 
> > regarding the 1MB and 3.5-4GB holes. An external tool would have to 
> > either track such changes or you simply could not change such things 
> > in QEMU.
> 
> It's extremely likely that if you're doing NUMA pinning, you're also 
> doing large pages via hugetlbfs.  numactl can already set policies for 
> files in hugetlbfs so all you need to do is have a separate hugetlbfs 
> file for each numa node.

Why would we resort to hugetlbfs when we have transparent hugepages?

FWIW, large apps like databases have set a precedent for managing their
own NUMA policies.  I don't see why qemu should be any different.
Numactl is great for small apps that need to be pinned in one node, or
spread evenly on all nodes.  Having to get hugetlbfs involved just to
work around a shortcoming of numactl just seems like a bad idea.
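
(For reference, the mechanism the patch relies on boils down to an mbind(2)
call on the guest memory right after allocation.  A rough sketch, not the
patch itself; mbind is wrapped by libnuma, so link with -lnuma:)

/* Allocate anonymous memory and bind the range to one host node, so it
 * faults in there.  A failure only produces a warning, as in the patch. */
#include <numaif.h>
#include <stdio.h>
#include <sys/mman.h>

static void *alloc_on_node(size_t len, int node)
{
	unsigned long nodemask = 1UL << node;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0))
		perror("mbind");
	return p;
}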
> 
> Then you have all the flexibility of numactl and you can implement node 
> migration external to QEMU if you so desire.
> 
> > So what is wrong with keeping that code in QEMU, which knows best 
> > about the internals and already has flexible and mighty ways (command 
> > line and QMP) of manipulating its behavior?
> 
> NUMA is a last-mile optimization.  For the audience that cares about 
> this level of optimization, only providing an interface that allows a 
> small set of those optimizations to be used is unacceptable.
> 
> There's a very simple way to do this right and that's by adding 
> interfaces to QEMU that let's us work with existing tooling instead of 
> inventing new interfaces.
> 
> Regards,
> 
> Anthony Liguori
> 
> > Regards,
> > Andre.

-Andrew Theurer



windows workload: many ept_violation and mmio exits

2009-12-03 Thread Andrew Theurer
I am running a windows workload which has 26 windows VMs running many 
instances of a J2EE workload.  There are 13 pairs of an application 
server VM and database server VM.  There seem to be quite a few 
vm_exits, and it looks like over a third of them are mmio_exits:



efer_relo  0
exits  337139
fpu_reloa  247321
halt_exit  19092
halt_wake  18611
host_stat  247332
hypercall  0
insn_emul  184265
insn_emul  184265
invlpg 0
io_exits   69184
irq_exits  52953
irq_injec  48115
irq_windo  2411
largepage  19
mmio_exit  123554
mmu_cache  0
mmu_flood  0
mmu_pde_z  0
mmu_pte_u  0
mmu_pte_w  0
mmu_recyc  0
mmu_shado  0
mmu_unsyn  0
nmi_injec  0
nmi_windo  0
pf_fixed   19
pf_guest   0
remote_tl  0
request_i  0
signal_ex  0
tlb_flush  0


I collected a kvmtrace, and below is a very small portion of that.  Is 
there a way I can figure out what device the mmio's are for?  Also, is 
it normal to have lots of ept_violations?  This is a 2 socket Nehalem 
system with SMT on.




qemu-system-x86-19673 [014] 213577.939614: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939624: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 213577.939624: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939627: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19673 [014] 213577.939629: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0xfb8f214d
 qemu-system-x86-19673 [014] 213577.939631: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939633: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 213577.939634: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939636: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19332 [008] 213577.939637: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939638: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0xfb8f24e2
 qemu-system-x86-19673 [014] 213577.939640: kvm_entry: vcpu 0
 qemu-system-x86-19211 [010] 213577.939663: kvm_set_irq: gsi 11 level 1 source 0
 qemu-system-x86-19211 [010] 213577.939664: kvm_pic_set_irq: chip 1 pin 3 
(level|masked)
 qemu-system-x86-19211 [010] 213577.939665: kvm_apic_accept_irq: apicid 0 vec 
130 (LowPrio|level)
 qemu-system-x86-19211 [010] 213577.939666: kvm_ioapic_set_irq: pin 11 dst 1 
vec=130 (LowPrio|logical|level)
 qemu-system-x86-19673 [014] 213577.939692: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 213577.939693: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939696: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19332 [008] 213577.939699: kvm_exit: reason ept_violation rip 
0xf80001b3af8e
 qemu-system-x86-19332 [008] 213577.939700: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939702: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0xfb8f3da6
 qemu-system-x86-19563 [010] 213577.939702: kvm_set_irq: gsi 11 level 1 source 0
 qemu-system-x86-19563 [010] 213577.939703: kvm_pic_set_irq: chip 1 pin 3 
(level|masked)
 qemu-system-x86-19673 [014] 213577.939704: kvm_entry: vcpu 0
 qemu-system-x86-19563 [010] 213577.939705: kvm_apic_accept_irq: apicid 0 vec 
130 (LowPrio|level)
 qemu-system-x86-19332 [008] 213577.939706: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19563 [010] 213577.939707: kvm_ioapic_set_irq: pin 11 dst 1 
vec=130 (LowPrio|logical|level)
 qemu-system-x86-19332 [008] 213577.939713: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0x29a105de
 qemu-system-x86-19332 [008] 213577.939715: kvm_entry: vcpu 0
 qemu-system-x86-19201 [011] 213577.939716: kvm_exit: reason exception rip 
0x1162412
 qemu-system-x86-19332 [008] 213577.939717: kvm_exit: reason halt rip 
0xfa6000fae7a1
 qemu-system-x86-19201 [011] 213577.939717: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939761: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 213577.939762: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939766: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19673 [014] 213577.939772: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0xfb8f58dd
 qemu-system-x86-19673 [014] 213577.939774: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939776: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 213577.939776: kvm_page_fault: address fed000f0 
error_code 181
 qemu-system-x86-19673 [014] 213577.939779: kvm_mmio: mmio unsatisfied-read len 
4 gpa 0xfed000f0 val 0x0
 qemu-system-x86-19673 [014] 213577.939782: kvm_mmio: mmio read len 4 gpa 
0xfed000f0 val 0xfb8f5d09
 qemu-system-x86-19673 [014] 213577.939784: kvm_entry: vcpu 0
 qemu-system-x86-19673 [014] 213577.939791: kvm_exit: reason ept_violation rip 
0xf8000160ef8e
 qemu-system-x86-19673 [014] 21

Re: kernel bug in kvm_intel

2009-11-30 Thread Andrew Theurer
On Sun, 2009-11-29 at 16:46 +0200, Avi Kivity wrote:
> On 11/26/2009 03:35 AM, Andrew Theurer wrote:
> > I just tried testing tip of kvm.git, but unfortunately I think I might 
> > be hitting a different problem, where processes run 100% in kernel 
> > mode.  In my case, cpus 9 and 13 were stuck, running qemu processes.  
> > Stack backtraces for both cpus are below.  FWIW, kernel.org 
> > 2.6.32-rc7 does not have this problem, or the original problem.
> 
> I just posted a patch fixing this, titled "[PATCH tip:x86/entry] core: 
> fix user return notifier on fork()".
> 
Thank you, Avi.  I am running on this patch and am not seeing this
problem anymore.  I'll be testing for the previous issue next.

-Andrew



Re: kernel bug in kvm_intel

2009-11-26 Thread Andrew Theurer

Avi Kivity wrote:

On 11/26/2009 03:35 AM, Andrew Theurer wrote:



NMI backtrace for cpu 9
CPU 9:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc 
dm_mirror dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel 
kvm uinput sr_mod cdrom ata_generic pata_acpi ata_piix joydev libata 
ide_pci_generic usbhid ide_core hid serio_raw cdc_ether usbnet mii 
matroxfb_base matroxfb_DAC1064 matroxfb_accel matroxfb_Ti3026 
matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt i2c_i801 i2c_core 
pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core bnx2 
rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx 
scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd 
mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5687, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1  
-[7947AC1]-
RIP: 0010:[]  [] 
fire_user_return_notifiers+0x31/0x36

RSP: 0018:88095024df08  EFLAGS: 0246
RAX:  RBX: 0800 RCX: 88095024c000
RDX: 88002834 RSI:  RDI: 88095024df58
RBP: 88095024df18 R08:  R09: 0001
R10: 00caf1fff62d R11: 8805b584de40 R12: 7fffae48e0f0
R13:  R14: 0001 R15: 
FS:  7f45c69d57c0() GS:88002834() 
knlGS:

CS:  0010 DS:  ES:  CR0: 8005003b
CR2: f9800121056e CR3: 000953d36000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Call Trace:
<#DB[1]> <> Pid: 5687, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1

Call Trace:
  [] ? show_regs+0x44/0x49
 [] nmi_watchdog_tick+0xc2/0x1b9
 [] do_nmi+0xb0/0x252
 [] nmi+0x20/0x30
 [] ? fire_user_return_notifiers+0x31/0x36
<>  [] do_notify_resume+0x62/0x69
 [] ? int_check_syscall_exit_work+0x9/0x3d
 [] int_signal+0x12/0x17




That's a bug with the new user return notifiers.  Is your host kernel 
preemptible?


preempt is off.


I think I saw this once but I'm not sure.  I can't reproduce with a host 
kernel build, some silly guest workload, and 'perf top' to generate an 
nmi load.




-Andrew




Re: kernel bug in kvm_intel

2009-11-25 Thread Andrew Theurer

Tejun Heo wrote:

Hello,

11/01/2009 08:31 PM, Avi Kivity wrote:

Here is the code in question:


   3ae7:   75 05                   jne    3aee
   3ae9:   0f 01 c2                vmlaunch
   3aec:   eb 03                   jmp    3af1
   3aee:   0f 01 c3                vmresume
   3af1:   48 87 0c 24             xchg   %rcx,(%rsp)
   

^^^ fault, but not at (%rsp)
 

Can you please post the full oops (including kernel debug messages
during boot) or give me a pointer to the original message?

http://www.mail-archive.com/kvm@vger.kernel.org/msg23458.html


Also, does
the faulting address coincide with any symbol?
   

No (at least, not in System.map).


Has there been any progress?  Is kvm + oprofile still broken?



I just tried testing tip of kvm.git, but unfortunately I think I might 
be hitting a different problem, where processes run 100% in kernel mode. 
 In my case, cpus 9 and 13 were stuck, running qemu processes.  Stack 
backtraces for both cpus are below.  FWIW, kernel.org 2.6.32-rc7 does not 
have this problem, or the original problem.




NMI backtrace for cpu 9
CPU 9:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror 
dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod 
cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid 
ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 
matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt 
i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core 
bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx 
scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd 
ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5687, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1  -[7947AC1]-
RIP: 0010:[]  [] 
fire_user_return_notifiers+0x31/0x36
RSP: 0018:88095024df08  EFLAGS: 0246
RAX:  RBX: 0800 RCX: 88095024c000
RDX: 88002834 RSI:  RDI: 88095024df58
RBP: 88095024df18 R08:  R09: 0001
R10: 00caf1fff62d R11: 8805b584de40 R12: 7fffae48e0f0
R13:  R14: 0001 R15: 
FS:  7f45c69d57c0() GS:88002834() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: f9800121056e CR3: 000953d36000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Call Trace:
 <#DB[1]>  <> Pid: 5687, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
Call Trace:
   [] ? show_regs+0x44/0x49
 [] nmi_watchdog_tick+0xc2/0x1b9
 [] do_nmi+0xb0/0x252
 [] nmi+0x20/0x30
 [] ? fire_user_return_notifiers+0x31/0x36
 <>  [] do_notify_resume+0x62/0x69
 [] ? int_check_syscall_exit_work+0x9/0x3d
 [] int_signal+0x12/0x17



NMI backtrace for cpu 13
CPU 13:
Modules linked in: tun sunrpc af_packet bridge stp ipv6 binfmt_misc dm_mirror 
dm_region_hash dm_log dm_multipath scsi_dh dm_mod kvm_intel kvm uinput sr_mod 
cdrom ata_generic pata_acpi ata_piix joydev libata ide_pci_generic usbhid 
ide_core hid serio_raw cdc_ether usbnet mii matroxfb_base matroxfb_DAC1064 
matroxfb_accel matroxfb_Ti3026 matroxfb_g450 g450_pll matroxfb_misc iTCO_wdt 
i2c_i801 i2c_core pcspkr iTCO_vendor_support ioatdma thermal rtc_cmos rtc_core 
bnx2 rtc_lib dca thermal_sys hwmon sg button shpchp pci_hotplug qla2xxx 
scsi_transport_fc scsi_tgt sd_mod scsi_mod crc_t10dif ext3 jbd mbcache uhci_hcd 
ohci_hcd ehci_hcd usbcore [last unloaded: processor]
Pid: 5792, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1  -[7947AC1]-
RIP: 0010:[]  [] int_restore_rest+0x1d/0x3d
RSP: 0018:88124f491f58  EFLAGS: 0292
RAX: 0800 RBX: 7fff9df852e0 RCX: 88124f49
RDX: 88099ff4 RSI:  RDI: fe2e
RBP: 7fff9df85260 R08: 88124f49 R09: 
R10: 0005 R11: 880954971da0 R12: 7fff9df851e0
R13:  R14: 0001 R15: 
FS:  7f73b5b1d7c0() GS:88099ff4() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 7f8d5a8de9d0 CR3: 000eb34d7000 CR4: 26e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Call Trace:
 <#DB[1]>  <> Pid: 5792, comm: qemu-system-x86 Not tainted 
2.6.32-rc7-5e8cb552cb8b48244b6d07bff984b3c4080d4bc9-autokern1 #1
Call Trace:
   [] ? show_regs+0x44/0x49
 [] nmi_watchdog_tick+0xc2/0x1b9
 [] do_nmi+0xb0/0x252
 [] nmi+0x20/0x30
 [] ? int_restore_rest+0x1d/0x3d
 <> 



-Andrew



Re: kernel bug in kvm_intel

2009-10-31 Thread Andrew Theurer

Avi Kivity wrote:

On 10/30/2009 08:07 PM, Andrew Theurer wrote:


I have finally bisected and isolated this to the following commit:

ada3fa15057205b7d3f727bba5cd26b5912e350f
http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ada3fa15057205b7d3f727bba5cd26b5912e350f 

  

Merge branch 'for-linus' of git://git./linux/kernel/git/tj/percpu

* 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 commits)

   powerpc64: convert to dynamic percpu allocator
   sparc64: use embedding percpu first chunk allocator
   percpu: kill lpage first chunk allocator
   x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
   percpu: update embedding first chunk allocator to handle sparse units
   percpu: use group information to allocate vmap areas sparsely
   vmalloc: implement pcpu_get_vm_areas()
   vmalloc: separate out insert_vmalloc_vm()
   percpu: add chunk->base_addr
   percpu: add pcpu_unit_offsets[]
   percpu: introduce pcpu_alloc_info and pcpu_group_info
   percpu: move pcpu_lpage_build_unit_map() and 
pcpul_lpage_dump_cfg() upward

   percpu: add @align to pcpu_fc_alloc_fn_t
   percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
   percpu: drop @static_size from first chunk allocators
   percpu: generalize first chunk allocator selection
   percpu: build first chunk allocators selectively
   percpu: rename 4k first chunk allocator to page
   percpu: improve boot messages
   percpu: fix pcpu_reclaim() locking
 

The previous commit (5579fd7e6aed8860ea0c8e3f11897493153b10ad) does not have
this problem.  FYI, this problem only occurs when oprofile is active.

Any idea what in this commit might be the issue?

   


5579 is not the preceding commit, it is the merged branch:

commit ada3fa15057205b7d3f727bba5cd26b5912e350f
Merge: 2f82af0 5579fd7
Author: Linus Torvalds 
Date:   Tue Sep 15 09:39:44 2009 -0700

Merge branch 'for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu



What happens with 2f82af0?


2f82af0 is:


Nicolas Pitre has a new email address

Due to problems at cam.org, my n...@cam.org email address is no longer
valid.  FRom now on, n...@fluxnic.net should be used instead.


I have not tested that, but it doesn't seem likely that it would have 
anything to do with the problem.  Or maybe I am misunderstanding the 
impact of this commit?


FWIW, here is the bisect log:

git bisect start
# good: [227423904c709a8e60245c97081bbeb4fb500655] Merge branch 
'x86-pat-for-linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

git bisect good 227423904c709a8e60245c97081bbeb4fb500655
# bad: [0f29f5871c165e346409f62d903f97cfad3894c5] Staging: rtl8192su: 
remove RTL8192SU ifdefs

git bisect bad 0f29f5871c165e346409f62d903f97cfad3894c5
# bad: [ada3fa15057205b7d3f727bba5cd26b5912e350f] Merge branch 
'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu

git bisect bad ada3fa15057205b7d3f727bba5cd26b5912e350f
# bad: [ada3fa15057205b7d3f727bba5cd26b5912e350f] Merge branch 
'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu

git bisect bad ada3fa15057205b7d3f727bba5cd26b5912e350f
# good: [decee2e8a9538ae5476e6cb3f4b7714c92a04a2b] V4L/DVB (12485): 
zl10353: correct implementation of FE_READ_UNCORRECTED_BLOCKS

git bisect good decee2e8a9538ae5476e6cb3f4b7714c92a04a2b
# good: [0ee7e4d6d4f58c3b2d9f0ca8ad8f63abda8694b1] V4L/DVB (12694): 
gspca - vc032x: Change the start exchanges of the sensor hv7131r.

git bisect good 0ee7e4d6d4f58c3b2d9f0ca8ad8f63abda8694b1
# good: [f58dc01ba2ca9fe3ab2ba4ca43d9c8a735cf62d8] percpu: generalize 
first chunk allocator selection

git bisect good f58dc01ba2ca9fe3ab2ba4ca43d9c8a735cf62d8
# good: [2f82af08fcc7dc01a7e98a49a5995a77e32a2925] Nicolas Pitre has a 
new email address

git bisect good 2f82af08fcc7dc01a7e98a49a5995a77e32a2925
# good: [cf88c79006bd6a09ad725ba0b34c0e23db20b19e] vmalloc: separate out 
insert_vmalloc_vm()

git bisect good cf88c79006bd6a09ad725ba0b34c0e23db20b19e
# good: [4518e6a0c038b98be4c480e6f4481e8676bd15dd] x86,percpu: use 
embedding for 64bit NUMA and page for 32bit NUMA

git bisect good 4518e6a0c038b98be4c480e6f4481e8676bd15dd
# good: [bcb2107fdbecef3de55d597d23453747af81ba88] sparc64: use 
embedding percpu first chunk allocator

git bisect good bcb2107fdbecef3de55d597d23453747af81ba88
# good: [5579fd7e6aed8860ea0c8e3f11897493153b10ad] Merge branch 
'for-next' into for-linus

git bisect good 5579fd7e6aed8860ea0c8e3f11897493153b10ad


Oh, wait, that commit was tested, in the middle of the log above.

-Andrew




Re: kernel bug in kvm_intel

2009-10-30 Thread Andrew Theurer
On Thu, 2009-10-15 at 15:18 -0500, Andrew Theurer wrote:
> On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote:
> > On 10/13/2009 11:04 PM, Andrew Theurer wrote:
> > >
> > >> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the
> > >> surrounding code.
> > >>
> > >> Thinking about it, it probably _is_ what you showed, due to module page
> > >> alignment.  But please verify this; I can't reconcile the fault address
> > >> (9fe9a2b) with %rsp at the time of the fault.
> > >>  
> > > Here is the start of the function:
> > >
> > >
> > >> 3884 <vmx_vcpu_run>:
> > >>  3884:   55                      push   %rbp
> > >>  3885:   48 89 e5                mov    %rsp,%rbp
> > >>  
> > > and 0x26d later is 0x3af1:
> > >
> > >
> > >>  3ad2:   4c 8b b1 88 01 00 00    mov    0x188(%rcx),%r14
> > >>  3ad9:   4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> > >>  3ae0:   48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> > >>  3ae7:   75 05                   jne    3aee
> > >>  3ae9:   0f 01 c2                vmlaunch
> > >>  3aec:   eb 03                   jmp    3af1
> > >>  3aee:   0f 01 c3                vmresume
> > >>  3af1:   48 87 0c 24             xchg   %rcx,(%rsp)
> > >>  3af5:   48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> > >>  3afc:   48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> > >>  3b03:   ff 34 24                pushq  (%rsp)
> > >>  3b06:   8f 81 20 01 00 00       popq   0x120(%rcx)
> > >>  
> > >
> > 
> > Ok.  So it faults on the xchg instruction, rsp is 8806369ffc80 but 
> > the fault address is 9fe9a2b4.  So it looks like the IDT is 
> > corrupted.
> > 

I have finally bisected and isolated this to the following commit:

ada3fa15057205b7d3f727bba5cd26b5912e350f
http://git.kernel.org/?p=virt/kvm/kvm.git;a=commit;h=ada3fa15057205b7d3f727bba5cd26b5912e350f
> Merge branch 'for-linus' of git://git./linux/kernel/git/tj/percpu
> 
> * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (46 
> commits)
>   powerpc64: convert to dynamic percpu allocator
>   sparc64: use embedding percpu first chunk allocator
>   percpu: kill lpage first chunk allocator
>   x86,percpu: use embedding for 64bit NUMA and page for 32bit NUMA
>   percpu: update embedding first chunk allocator to handle sparse units
>   percpu: use group information to allocate vmap areas sparsely
>   vmalloc: implement pcpu_get_vm_areas()
>   vmalloc: separate out insert_vmalloc_vm()
>   percpu: add chunk->base_addr
>   percpu: add pcpu_unit_offsets[]
>   percpu: introduce pcpu_alloc_info and pcpu_group_info
>   percpu: move pcpu_lpage_build_unit_map() and pcpul_lpage_dump_cfg() upward
>   percpu: add @align to pcpu_fc_alloc_fn_t
>   percpu: make @dyn_size mandatory for pcpu_setup_first_chunk()
>   percpu: drop @static_size from first chunk allocators
>   percpu: generalize first chunk allocator selection
>   percpu: build first chunk allocators selectively
>   percpu: rename 4k first chunk allocator to page
>   percpu: improve boot messages
>   percpu: fix pcpu_reclaim() locking

The previous commit (5579fd7e6aed8860ea0c8e3f11897493153b10ad) does not have
this problem.  FYI, this problem only occurs when oprofile is active.

Any idea what in this commit might be the issue?

-Andrew



Re: [PATCH 1/3] introduce VMSTATE_U64

2009-10-27 Thread Andrew Theurer

> On Tue, Oct 20, 2009 at 08:40:26AM +0900, Avi Kivity wrote:
> > On 10/17/2009 04:27 AM, Glauber Costa wrote:
> >> This is a patch actually written by Juan, which, according to him,
> >> he plans on posting to qemu.git. Problem is that linux defines
> >> u64 in a way that is type-uncompatible with uint64_t.
> >>
> >> I am including it here, because it is a dependency to my patch series
> >> that follows.
> >>
> >>
> >
> > Why can't we store these values in qemu as uint64_ts?
> Because then we have to redefine the whole structure in qemu.
> 
> the proposal is to simply pick the structures directly from linux. I believe
> it is much easier, and the versioning scheme in vmstate will help us get 
> around any changes they might suffer in the future.
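
(For context, the u64/uint64_t incompatibility mentioned above amounts to
the following on x86-64; an illustrative snippet, not code from the patch:)

/* Kernel headers define __u64 as "unsigned long long", while glibc's
 * uint64_t is "unsigned long" on x86-64.  Same width, but distinct types,
 * so mixing pointers to them warns, and breaks builds that use -Werror. */
#include <stdint.h>
#include <linux/types.h>

void example(void)
{
	__u64 kernel_val = 0;
	uint64_t *p = &kernel_val;	/* incompatible-pointer-types warning */
	(void)p;
}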

I get build errors with this.  Is there something extra I need to do?  I am 
currently running a rhel54 2.6.18 kernel.

  CC    qdev-properties.o
savevm.c: In function ‘get_u64’:
savevm.c:856: error: ‘__u64’ undeclared (first use in this function)
savevm.c:856: error: (Each undeclared identifier is reported only once
savevm.c:856: error: for each function it appears in.)
savevm.c:856: error: ‘v’ undeclared (first use in this function)
savevm.c: In function ‘put_u64’:
savevm.c:863: error: ‘__u64’ undeclared (first use in this function)
savevm.c:863: error: ‘v’ undeclared (first use in this function)
make[1]: *** [savevm.o] Error 1
make[1]: *** Waiting for unfinished jobs
make: *** [build-all] Error 2


Thanks,

-Andrew







Re: kernel bug in kvm_intel

2009-10-15 Thread Andrew Theurer
On Thu, 2009-10-15 at 02:10 +0900, Avi Kivity wrote:
> On 10/13/2009 11:04 PM, Andrew Theurer wrote:
> >
> >> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the
> >> surrounding code.
> >>
> >> Thinking about it, it probably _is_ what you showed, due to module page
> >> alignment.  But please verify this; I can't reconcile the fault address
> >> (9fe9a2b) with %rsp at the time of the fault.
> >>  
> > Here is the start of the function:
> >
> >
> >> 3884 <vmx_vcpu_run>:
> >>  3884:   55                      push   %rbp
> >>  3885:   48 89 e5                mov    %rsp,%rbp
> >>  
> > and 0x26d later is 0x3af1:
> >
> >
> >>  3ad2:   4c 8b b1 88 01 00 00    mov    0x188(%rcx),%r14
> >>  3ad9:   4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> >>  3ae0:   48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> >>  3ae7:   75 05                   jne    3aee
> >>  3ae9:   0f 01 c2                vmlaunch
> >>  3aec:   eb 03                   jmp    3af1
> >>  3aee:   0f 01 c3                vmresume
> >>  3af1:   48 87 0c 24             xchg   %rcx,(%rsp)
> >>  3af5:   48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> >>  3afc:   48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> >>  3b03:   ff 34 24                pushq  (%rsp)
> >>  3b06:   8f 81 20 01 00 00       popq   0x120(%rcx)
> >>  
> >
> 
> Ok.  So it faults on the xchg instruction, rsp is 8806369ffc80 but 
> the fault address is 9fe9a2b4.  So it looks like the IDT is 
> corrupted.
> 
> Can you check what's around 9fe9a2b4 in System.map?

85d85b24 B __bss_stop
85d86000 B __brk_base
85d96000 b .brk.dmi_alloc
85da6000 B __brk_limit
ff60 T vgettimeofday
ff600100 t vread_tsc
ff600130 t vread_hpet
ff600140 D __vsyscall_gtod_data
ff600400 T vtime

-Andrew




Re: kernel bug in kvm_intel

2009-10-13 Thread Andrew Theurer
On Tue, 2009-10-13 at 08:50 +0200, Avi Kivity wrote:
> On 10/12/2009 08:42 PM, Andrew Theurer wrote:
> > On Sun, 2009-10-11 at 07:19 +0200, Avi Kivity wrote:
> >
> >> On 10/09/2009 10:04 PM, Andrew Theurer wrote:
> >>  
> >>> This is on latest master branch on kvm.git and qemu-kvm.git, running
> >>> 12 Windows Server2008 VMs, and using oprofile.  I ran again without
> >>> oprofile and did not get the BUG.  I am wondering if anyone else is
> >>> seeing this.
> >>>
> >>> Thanks,
> >>>
> >>> -Andrew
> >>>
> >>>
> >>>> Oct  9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel
> >>>> paging request at 9fe9a2b4
> >>>> Oct  9 11:55:13 virtvictory-eth0 kernel: IP: []
> >>>> vmx_vcpu_run+0x26d/0x64f [kvm_intel]
> >>>>  
> >> Can you run this through objdump or gdb to see what source this
> >> corresponds to?
> >>
> >>  
> > Somewhere here I think (?)
> >
> > objdump -d
> >
> 
> 
> Look at the address where vmx_vcpu_run starts, add 0x26d, and show the 
> surrounding code.
> 
> Thinking about it, it probably _is_ what you showed, due to module page 
> alignment.  But please verify this; I can't reconcile the fault address 
> (9fe9a2b) with %rsp at the time of the fault.

Here is the start of the function:

> 3884 <vmx_vcpu_run>:
> 3884:   55                      push   %rbp
> 3885:   48 89 e5                mov    %rsp,%rbp

and 0x26d later is 0x3af1:

> 3ad2:   4c 8b b1 88 01 00 00    mov    0x188(%rcx),%r14
> 3ad9:   4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> 3ae0:   48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> 3ae7:   75 05                   jne    3aee
> 3ae9:   0f 01 c2                vmlaunch
> 3aec:   eb 03                   jmp    3af1
> 3aee:   0f 01 c3                vmresume
> 3af1:   48 87 0c 24             xchg   %rcx,(%rsp)
> 3af5:   48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> 3afc:   48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> 3b03:   ff 34 24                pushq  (%rsp)
> 3b06:   8f 81 20 01 00 00       popq   0x120(%rcx)


-Andrew



Re: kernel bug in kvm_intel

2009-10-12 Thread Andrew Theurer
On Sun, 2009-10-11 at 07:19 +0200, Avi Kivity wrote:
> On 10/09/2009 10:04 PM, Andrew Theurer wrote:
> > This is on latest master branch on kvm.git and qemu-kvm.git, running 
> > 12 Windows Server2008 VMs, and using oprofile.  I ran again without 
> > oprofile and did not get the BUG.  I am wondering if anyone else is 
> > seeing this.
> >
> > Thanks,
> >
> > -Andrew
> >
> >> Oct  9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel 
> >> paging request at 9fe9a2b4
> >> Oct  9 11:55:13 virtvictory-eth0 kernel: IP: [] 
> >> vmx_vcpu_run+0x26d/0x64f [kvm_intel]
> 
> Can you run this through objdump or gdb to see what source this 
> corresponds to?
> 

Somewhere here I think (?)

objdump -d
> 3ad9:   4c 8b b9 90 01 00 00    mov    0x190(%rcx),%r15
> 3ae0:   48 8b 89 20 01 00 00    mov    0x120(%rcx),%rcx
> 3ae7:   75 05                   jne    3aee
> 3ae9:   0f 01 c2                vmlaunch
> 3aec:   eb 03                   jmp    3af1
> 3aee:   0f 01 c3                vmresume
> 3af1:   48 87 0c 24             xchg   %rcx,(%rsp)
> 3af5:   48 89 81 18 01 00 00    mov    %rax,0x118(%rcx)
> 3afc:   48 89 99 30 01 00 00    mov    %rbx,0x130(%rcx)
> 3b03:   ff 34 24                pushq  (%rsp)
> 3b06:   8f 81 20 01 00 00       popq   0x120(%rcx)
> 3b0c:   48 89 91 28 01 00 00    mov    %rdx,0x128(%rcx)


objdump -S
> /* Enter guest mode */
> "jne .Llaunched \n\t"
> __ex(ASM_VMX_VMLAUNCH) "\n\t"
> "jmp .Lkvm_vmx_return \n\t"
> ".Llaunched: " __ex(ASM_VMX_VMRESUME) "\n\t"
> ".Lkvm_vmx_return: "
> /* Save guest registers, load host registers, keep flags */
> "xchg %0, (%%"R"sp) \n\t"
> "mov %%"R"ax, %c[rax](%0) \n\t"
> "mov %%"R"bx, %c[rbx](%0) \n\t"
> "push"Q" (%%"R"sp); pop"Q" %c[rcx](%0) \n\t"
> "mov %%"R"dx, %c[rdx](%0) \n\t"
> "mov %%"R"si, %c[rsi](%0) \n\t"




kernel bug in kvm_intel

2009-10-09 Thread Andrew Theurer
This is on latest master branch on kvm.git and qemu-kvm.git, running 12 
Windows Server2008 VMs, and using oprofile.  I ran again without 
oprofile and did not get the BUG.  I am wondering if anyone else is 
seeing this.


Thanks,

-Andrew


Oct  9 11:55:13 virtvictory-eth0 kernel: BUG: unable to handle kernel paging 
request at 9fe9a2b4
Oct  9 11:55:13 virtvictory-eth0 kernel: IP: [] 
vmx_vcpu_run+0x26d/0x64f [kvm_intel]
Oct  9 11:55:13 virtvictory-eth0 kernel: PGD 1003067 PUD 1007063 PMD 0 
Oct  9 11:55:13 virtvictory-eth0 kernel: Oops:  [#5] SMP 
Oct  9 11:55:13 virtvictory-eth0 kernel: last sysfs file: /sys/devices/virtual/net/br4/bridge/topology_change_detected
Oct  9 11:55:13 virtvictory-eth0 kernel: CPU 6 
Oct  9 11:55:13 virtvictory-eth0 kernel: Modules linked in: oprofile tun hidp l2cap crc16 bluetooth rfkill lockd sunrpc bridge stp af_packet ipv6 binfmt_misc dm_multipath scsi_dh video output sbs sbshc pci_slot fan container battery ac parport_pc lp parport kvm_intel kvm joydev sr_mod cdrom sg cdc_ether usbnet mii usbhid hid serio_raw rtc_cmos rtc_core rtc_lib button thermal thermal_sys hwmon pata_acpi bnx2 i2c_i801 ide_pci_generic iTCO_wdt i2c_core ata_generic iTCO_vendor_support ioatdma dca pcspkr dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod ide_gd_mod ide_core usb_storage ata_piix libata shpchp pci_hotplug qla2xxx scsi_transport_fc scsi_tgt sd_mod crc_t10dif scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd usbcore [last unloaded: oprofile]

Oct  9 11:55:13 virtvictory-eth0 kernel: Pid: 6495, comm: qemu-system-x86 
Tainted: G  D2.6.32-rc3-autokern1 #1 IBM System x -[7947AC1]-
Oct  9 11:55:13 virtvictory-eth0 kernel: RIP: 0010:[]  
[] vmx_vcpu_run+0x26d/0x64f [kvm_intel]
Oct  9 11:55:13 virtvictory-eth0 kernel: RSP: 0018:8806369ffc80  EFLAGS: 00010002 
Oct  9 11:55:13 virtvictory-eth0 kernel: RAX: 0004001f RBX: 0200 RCX: 0001

Oct  9 11:55:13 virtvictory-eth0 kernel: RDX:  RSI: 
 RDI: 8000
Oct  9 11:55:13 virtvictory-eth0 kernel: RBP:  R08: 
fa80025180a8 R09: f800016ca4f0
Oct  9 11:55:13 virtvictory-eth0 kernel: R10: 7797003630747070 R11: 
fa60039039b8 R12: a003
Oct  9 11:55:13 virtvictory-eth0 kernel: R13:  R14: 
 R15: 
Oct  9 11:55:13 virtvictory-eth0 kernel: FS:  40ae6940() 
GS:88099fe8() knlGS:ffe66000
Oct  9 11:55:13 virtvictory-eth0 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 
80050033
Oct  9 11:55:13 virtvictory-eth0 kernel: CR2: 9fe9a2b4 CR3: 
0006375ec000 CR4: 26f0
Oct  9 11:55:13 virtvictory-eth0 kernel: DR0:  DR1: 
 DR2: 
Oct  9 11:55:13 virtvictory-eth0 kernel: DR3:  DR6: 
0ff0 DR7: 0400
Oct  9 11:55:13 virtvictory-eth0 kernel: Process qemu-system-x86 (pid: 6495, 
threadinfo 8806369fe000, task 880632056480)
Oct  9 11:55:13 virtvictory-eth0 kernel: Stack:
Oct  9 11:55:13 virtvictory-eth0 kernel:  8806320916c0 8806369ffd88 
6c14 8806369ffca8
Oct  9 11:55:13 virtvictory-eth0 kernel: <0> 8806320916c0 8806369ffce8 
a0293bfa 8806369ffee8
Oct  9 11:55:13 virtvictory-eth0 kernel: <0> 0001 0300 
8806320916c0 002c
Oct  9 11:55:13 virtvictory-eth0 kernel: Call Trace:
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
emulate_instruction+0x28a/0x2bc [kvm]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
handle_apic_access+0x20/0x4b [kvm_intel]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
vmx_handle_exit+0xe1/0x48b [kvm_intel]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
save_msrs+0x39/0x50 [kvm_intel]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
apic_update_ppr+0x23/0x51 [kvm]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
__up_read+0x8f/0x97
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
kvm_arch_vcpu_ioctl_run+0x6b6/0xa92 [kvm]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
kvm_vcpu_ioctl+0xf6/0x5c0 [kvm]
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
lapic_next_event+0x18/0x1c
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
clockevents_program_event+0x73/0x7c
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
tick_dev_program_event+0x2a/0x9c
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
vfs_ioctl+0x2a/0x77
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
do_vfs_ioctl+0x445/0x496
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
sys_futex+0x111/0x12f
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
sys_ioctl+0x57/0x7a
Oct  9 11:55:13 virtvictory-eth0 kernel:  [] ? 
system_call_fastpath+0x16/0x1b




Re: kvm scaling question

2009-09-15 Thread Andrew Theurer
On Mon, 2009-09-14 at 17:19 -0600, Bruce Rogers wrote:
> On 9/11/2009 at 3:53 PM, Marcelo Tosatti  wrote:
> > On Fri, Sep 11, 2009 at 09:36:10AM -0600, Bruce Rogers wrote:
> >> I am wondering if anyone has investigated how well kvm scales when 
> > supporting many guests, or many vcpus or both.
> >> 
> >> I'll do some investigations into the per vm memory overhead and
> >> play with bumping the max vcpu limit way beyond 16, but hopefully
> >> someone can comment on issues such as locking problems that are known
> >> to exist and needing to be addressed to increased parallellism,
> >> general overhead percentages which can help provide consolidation
> >> expectations, etc.
> > 
> > I suppose it depends on the guest and workload. With an EPT host and
> > 16-way Linux guest doing kernel compilations, on recent kernel, i see:
> > 
> > # Samples: 98703304
> > #
> > # Overhead  Command  Shared Object  Symbol
> > #   ...  .  ..
> > #
> > 97.15%   sh  [kernel]   [k] 
> > vmx_vcpu_run
> >  0.27%   sh  [kernel]   [k] 
> > kvm_arch_vcpu_ioctl_
> >  0.12%   sh  [kernel]   [k] 
> > default_send_IPI_mas
> >  0.09%   sh  [kernel]   [k] 
> > _spin_lock_irq
> > 
> > Which is pretty good. Without EPT/NPT the mmu_lock seems to be the major
> > bottleneck to parallelism.
> > 
> >> Also, when I did a simple experiment with vcpu overcommitment, I was
> >> surprised how quickly performance suffered (just bringing a Linux vm
> >> up), since I would have assumed the additional vcpus would have been
> >> halted the vast majority of the time. On a 2 proc box, overcommitment
> >> to 8 vcpus in a guest (I know this isn't a good usage scenario, but
> >> does provide some insights) caused the boot time to increase to almost
> >> exponential levels. At 16 vcpus, it took hours to just reach the gui
> >> login prompt.
> > 
> > One probable reason for that are vcpus which hold spinlocks in the guest
> > are scheduled out in favour of vcpus which spin on that same lock.
> 
> I suspected it might be a whole lot of spinning happening. That does seem 
> most likely. I was just surprised how bad the behavior was.

I have collected lock_stat info on a similar vcpu over-commit
configuration, but on an EPT system, and saw a very significant amount of
spinning.  However, if you don't have EPT or NPT, I would bet that's the
first problem.  IMO, I am a little surprised simply booting is such a
problem.  It would be interesting to see what lock_stat shows on your
guest after booting with 16 vcpus.

I have observed that shortening the time between vcpus being scheduled
can help mitigate the problem with lock holder preemption (presumably
because the spinning vcpu is de-scheduled earlier and the vcpu holding
the lock is scheduled sooner), but I imagine there are other unwanted
side-effects like lower cache hits.

-Andrew

> 
> Bruce
> 



Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data

2009-09-04 Thread Andrew Theurer

Brian Jackson wrote:

On Friday 04 September 2009 09:48:17 am Andrew Theurer wrote:


Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way?  (OK, I am not running SMT now,
but eventually we will be) A long time ago, we tested P4's with HT, and
a polling idle in one thread always negatively impacted performance in
the sibling thread.

FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads).  I
have been running into a problem with virtIO stor driver for windows on
anything much newer than kvm-87.  I compiled the driver from the new git
tree, installed OK, but still had the same error.  Finally, I removed
the serial number feature in the virtio-blk in qemu, and I can now get
the driver to work in Windows.


What were the symptoms you were seeing (i.e. define "a problem").


Device manager reports "a problem code 10" occurred, and the driver 
cannot initialize.


Vadim Rozenfeld informed me:

There is a sanity check in the code, which checks the I/O range and fails if it is
not equal to 40h.
Recent virtio-blk devices have an I/O range equal to 0x400 (serial number feature), so our
signed viostor driver will fail on the latest KVMs.  This problem was fixed
and committed to SVN some time ago.

I assumed the fix was to the virtio windows driver, but I could not get 
the driver I compiled from latest git to work either (only on 
qemu-kvm-87).  So, I just backed out the serial number feature in qemu, 
and it worked.  FWIW, the linux virtio-blk driver never had a problem.





So, not really any good news on performance with latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user  nice  system   irq  softirq guest   idle  iowait
5.79  0.00  9.28     0.08  1.00    20.81   58.78  4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user  nice  system   irq  softirq guest   idle  iowait
6.57  0.00  10.86    0.08  1.02    21.35   55.90  4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user  nice  system   irq  softirq guest   idle  iowait
6.60  0.00  10.91    0.09  1.03    21.35   55.71  4.31
total busy: 39.98

diff of profiles, p1=qemu-kvm-87, p2=qemu-master




18x more samples for gfn_to_memslot_unali*, 37x for
emulator_read_emula*, and more CPU time in guest mode.

One other thing I decided to try was some cpu binding.  I know this is
not practical for production, but I wanted to see if there's any benefit
at all.  One reason was that a coworker here tried binding the qemu
thread for the vcpu and the qemu IO thread to the same cpu.  On a
networking test, guest->local-host, throughput was up about 2x.
Obviously there was a nice effect of being on the same cache.  I
wondered, even without full bore throughput tests, could we see any
benefit here.  So, I bound each pair of VMs to a dedicated core.  What I
saw was about a 6% improvement in performance.  For a system which has
pretty incredible memory performance and is not that busy, I was
surprised that I got 6%.  I am not advocating binding, but what I do
wonder:  on 1-way VMs, if we keep all the qemu threads together on the
same CPU, but still allowing the scheduler to move them (all of them at
once) to different cpus over time, would we see the same benefit?

One other thing:  So far I have not been using preadv/pwritev.  I assume
I need a more recent glibc (on 2.5 now) for qemu to take advantage of
this?


Getting p(read|write)v working almost doubled my virtio-net throughput in a 
Linux guest. Not quite as much in Windows guests. Yes you need glibc-2.10. I 
think some distros might have backported it to 2.9. You will also need some 
support for it in your system includes.


Thanks, I will try a newer glibc, or maybe just move to a newer Linux 
installation which happens to have a newer glibc.


-Andrew



Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data

2009-09-04 Thread Andrew Theurer
On Tue, 2009-09-01 at 21:23 +0300, Avi Kivity wrote:
> On 09/01/2009 09:12 PM, Andrew Theurer wrote:
> > Here's a run from branch debugreg with thread debugreg storage +
> > conditionally reload dr6:
> >
> > user  nice  system   irq  softirq guest   idle  iowait
> > 5.79  0.00  9.28     0.08  1.00    20.81   58.78  4.26
> > total busy: 36.97
> >
> > Previous run that had avoided calling adjust_vmx_controls twice:
> >
> > user  nice  system   irq  softirq guest   idle  iowait
> > 5.81  0.00  9.48     0.08  1.04    21.32   57.86  4.41
> > total busy: 37.73
> >
> > A relative reduction in CPU cycles of 2%
> >
> 
> That was an easy fruit to pick.  Too bad it was a regression that we 
> introduced.
> 
> > new oprofile:
> >
> >
> >> samples  %app name symbol name
> >> 876648   54.1555  kvm-intel.ko vmx_vcpu_run
> >> 37595 2.3225  qemu-system-x86_64   cpu_physical_memory_rw
> >> 35623 2.2006  qemu-system-x86_64   phys_page_find_alloc
> >> 24874 1.5366  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> >> native_write_msr_safe
> >> 17710 1.0940  libc-2.5.so  memcpy
> >> 14664 0.9059  kvm.ko   kvm_arch_vcpu_ioctl_run
> >> 14577 0.9005  qemu-system-x86_64   qemu_get_ram_ptr
> >> 12528 0.7739  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> >> native_read_msr_safe
> >> 10979 0.6782  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> >> copy_user_generic_string
> >> 9979  0.6165  qemu-system-x86_64   virtqueue_get_head
> >> 9371  0.5789  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 schedule
> >> 8333  0.5148  qemu-system-x86_64   virtqueue_avail_bytes
> >> 7899  0.4880  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fget_light
> >> 7289  0.4503  qemu-system-x86_64   main_loop_wait
> >> 7217  0.4458  qemu-system-x86_64   lduw_phys
> >>  
> 
> This is almost entirely host virtio.  I can reduce native_write_msr_safe 
> by a bit, but not much.
> 
> >> 6821  0.4214  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> >> audit_syscall_exit
> >> 6749  0.4169  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 do_select
> >> 5919  0.3657  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> >> audit_syscall_entry
> >> 5466  0.3377  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 kfree
> >> 4887  0.3019  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fput
> >> 4689  0.2897  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __switch_to
> >> 4636  0.2864  
> >> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 mwait_idle
> >>  
> 
> Still not idle=poll, it may shave off 0.2%.

Won't this affect SMT in a negative way?  (OK, I am not running SMT now,
but eventually we will be) A long time ago, we tested P4's with HT, and
a polling idle in one thread always negatively impacted performance in
the sibling thread.

FWIW, I did try idle=halt, and it was slightly worse.

I did get a chance to try the latest qemu (master and next heads).  I
have been running into a problem with virtIO stor driver for windows on
anything much newer than kvm-87.  I compiled the driver from the new git
tree, installed OK, but still had the same error.  Finally, I removed
the serial number feature in the virtio-blk in qemu, and I can now get
the driver to work in Windows.

So, not really any good news on performance with latest qemu builds.
Performance is slightly worse:

qemu-kvm-87
user  nice  system   irq  softirq guest   idle  iowait
5.79  0.00  9.28  0.08  1.00  20.81  58.78  4.26
total busy: 36.97

qemu-kvm-88-905-g6025b2d (master)
user  nice  system   irq  softirq guest   idle  iowait
6.57  0.00  10.86  0.08  1.02  21.35  55.90  4.21
total busy: 39.89

qemu-kvm-88-910-gbf8a05b (next)
user  nice  system   irq  softirq guest   idle  iowait
6.60  0.00  10.91  0.09  1.03  21.35  55.71  4.31
total busy: 39.98

diff of profiles, p1=qemu-kvm-87, p2=qemu-master


> profile1 is qemu-kvm-87
> profile2 is qemu-master
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit 
> mask of 0x00 (No unit mask) count 1000
> total samples (ts1) for profi

Re: [PATCH] KVM: Use thread debug register storage instead of kvm specific data

2009-09-01 Thread Andrew Theurer
On Tue, 2009-09-01 at 12:47 +0300, Avi Kivity wrote:
> On 09/01/2009 12:44 PM, Avi Kivity wrote:
> > Instead of saving the debug registers from the processor to a kvm data
> > structure, rely in the debug registers stored in the thread structure.
> > This allows us not to save dr6 and dr7.
> >
> > Reduces lightweight vmexit cost by 350 cycles, or 11 percent.
> >
> 
> Andrew, this is now available as the 'debugreg' branch of kvm.git.  
> Given the massive performance improvement, it will be interesting to see 
> how the test results change.
> 
> Marcelo, please queue this for 2.6.32, and I think it's even suitable 
> for -stable.
> 

Here's a run from branch debugreg with thread debugreg storage +
conditionally reload dr6:

user  nice  system   irq  softirq guest   idle  iowait
5.79  0.00  9.28  0.08  1.00  20.81  58.78  4.26
total busy: 36.97

Previous run that had avoided calling adjust_vmx_controls twice:

user  nice  system   irq  softirq guest   idle  iowait
5.81  0.00  9.48  0.08  1.04  21.32  57.86  4.41
total busy: 37.73

A relative reduction in CPU cycles of 2%

new oprofile:

> samples  %        app name             symbol name
> 876648   54.1555  kvm-intel.ko vmx_vcpu_run
> 37595 2.3225  qemu-system-x86_64   cpu_physical_memory_rw
> 35623 2.2006  qemu-system-x86_64   phys_page_find_alloc
> 24874 1.5366  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> native_write_msr_safe
> 17710 1.0940  libc-2.5.so  memcpy
> 14664 0.9059  kvm.ko   kvm_arch_vcpu_ioctl_run
> 14577 0.9005  qemu-system-x86_64   qemu_get_ram_ptr
> 12528 0.7739  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> native_read_msr_safe
> 10979 0.6782  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> copy_user_generic_string
> 9979  0.6165  qemu-system-x86_64   virtqueue_get_head
> 9371  0.5789  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 schedule
> 8333  0.5148  qemu-system-x86_64   virtqueue_avail_bytes
> 7899  0.4880  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fget_light
> 7289  0.4503  qemu-system-x86_64   main_loop_wait
> 7217  0.4458  qemu-system-x86_64   lduw_phys
> 6821  0.4214  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> audit_syscall_exit
> 6749  0.4169  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 do_select
> 5919  0.3657  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> audit_syscall_entry
> 5466  0.3377  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 kfree
> 4887  0.3019  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 fput
> 4689  0.2897  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __switch_to
> 4636  0.2864  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 mwait_idle
> 4505  0.2783  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 getnstimeofday
> 4453  0.2751  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 system_call
> 4403  0.2720  kvm.ko   kvm_load_guest_fpu
> 4285  0.2647  kvm.ko   kvm_put_guest_fpu
> 4241  0.2620  libpthread-2.5.sopthread_mutex_lock
> 4172  0.2577  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 
> unroll_tree_refs
> 4100  0.2533  qemu-system-x86_64   kvm_run
> 4044  0.2498  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __down_read
> 3978  0.2457  qemu-system-x86_64   ldl_phys
> 3669  0.2267  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 do_vfs_ioctl
> 3655  0.2258  
> vmlinux-2.6.31-rc5_debugreg_v2.6.31-rc3-3441-g479fa73-autokern1 __up_read
> 

A diff of this and previous run's oprofile:


> profile1 is [./oprofile.before]
> profile2 is [./oprofile.after]
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit 
> mask of 0x00 (No unit mask) count 1000
> total samples (ts1) for profile1 is 1661542 
> total samples (ts2) for profile2 is 1618760 (includes multiplier of 1.00)
> functions which have a abs(pct2-pct1) < 0.02 are not displayed
> 
>                          pct2:     pct1:
>                          100*      100*      pct2
>     s1     s2    s2/s1   s2/ts1    s1/ts1    -pct1   symbol               bin
> ------ ------ -------- --------- --------- ------- -------------------- -------
>  1559  2747  1.76/1   0.165   0.094  0.071 dput   
> vmlinux
> 34764 35623  1.02/1   2.144   2.092  0.052 phys_page_find_alloc  
> qemu
>  5170  5919  1.14/1   0.356   0.311  0.045 audit_syscall_entry
> vmlinux
>  3593  4172  1.16/1 

Re: [PATCH] don't call adjust_vmx_controls() second time

2009-08-31 Thread Andrew Theurer

Avi Kivity wrote:

On 08/27/2009 11:42 PM, Andrew Theurer wrote:

On Thu, 2009-08-27 at 19:21 +0300, Avi Kivity wrote:
  

On 08/27/2009 06:41 PM, Gleb Natapov wrote:


Don't call adjust_vmx_controls() two times for the same control.
It restores options that was dropped earlier.

   

Applied, thanks.  Andrew, if you rerun your benchmark atop kvm.git
'next' branch, I believe you will see dramatically better results.
 

Yes!  CPU is much lower:
user  nice  system   irq softirq  guest   idle  iowait
5.81  0.00  9.48  0.08  1.04  21.32  57.86  4.41

previous CPU:
user  nice  system   irq  softirq guest   idle  iowait
5.67  0.00  11.64  0.09  1.05  31.90  46.06  3.59

   


How does it compare to the other hypervisor now?


My original results for other hypervisor were a little inaccurate.  They 
mistakenly used 2 vcpu guests. New runs with 1 vcpu guests (as used in 
kvm) have slightly lower CPU utilization.  Anyway, here's the breakdown:


                           CPU     percent more CPU
kvm-master/qemu-kvm-87:    50.15   78%
kvm-next/qemu-kvm-87:      37.73   34%




new oprofile:

  

samples  %        app name             symbol name
885444   53.2905  kvm-intel.ko vmx_vcpu_run
 


guest mode = good


38090 2.2924  qemu-system-x86_64   cpu_physical_memory_rw
34764 2.0923  qemu-system-x86_64   phys_page_find_alloc
14730 0.8865  qemu-system-x86_64   qemu_get_ram_ptr
10814 0.6508  vmlinux-2.6.31-rc5-autokern1 copy_user_generic_string
10871 0.6543  qemu-system-x86_64   virtqueue_get_head
8557  0.5150  qemu-system-x86_64   virtqueue_avail_bytes
7173  0.4317  qemu-system-x86_64   lduw_phys
4122  0.2481  qemu-system-x86_64   ldl_phys
3339  0.2010  qemu-system-x86_64   virtqueue_num_heads
4129  0.2485  libpthread-2.5.sopthread_mutex_lock

 


virtio and related qemu overhead: 8.2%.


25278 1.5214  vmlinux-2.6.31-rc5-autokern1 native_write_msr_safe
12278 0.7390  vmlinux-2.6.31-rc5-autokern1 native_read_msr_safe
 


This will be reduced if we move virtio to kernel context.


Are there plans to move that to kernel for disk, too?


12380 0.7451  vmlinux-2.6.31-rc5-autokern1 native_set_debugreg
3550  0.2137  vmlinux-2.6.31-rc5-autokern1 native_get_debugreg
 


A lot less than before, but still annoying.


4631  0.2787  vmlinux-2.6.31-rc5-autokern1 mwait_idle


 


idle=halt may improve this, mwait is slow.


I can try idle-halt on the host.  I actually assumed it would be using 
that, but I'll check.


Thanks,

-Andrew


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] don't call adjust_vmx_controls() second time

2009-08-27 Thread Andrew Theurer
On Thu, 2009-08-27 at 19:21 +0300, Avi Kivity wrote:
> On 08/27/2009 06:41 PM, Gleb Natapov wrote:
> > Don't call adjust_vmx_controls() two times for the same control.
> > It restores options that was dropped earlier.
> >
> 
> Applied, thanks.  Andrew, if you rerun your benchmark atop kvm.git 
> 'next' branch, I believe you will see dramatically better results.

Yes!  CPU is much lower:
user  nice  system   irq softirq  guest   idle  iowait
5.81  0.00  9.48  0.08  1.04  21.32  57.86  4.41

previous CPU:
user  nice  system   irq  softirq guest   idle  iowait
5.67  0.00  11.64  0.09  1.05  31.90  46.06  3.59

new oprofile:

> samples  %        app name             symbol name
> 885444   53.2905  kvm-intel.ko vmx_vcpu_run
> 38090 2.2924  qemu-system-x86_64   cpu_physical_memory_rw
> 34764 2.0923  qemu-system-x86_64   phys_page_find_alloc
> 25278 1.5214  vmlinux-2.6.31-rc5-autokern1 native_write_msr_safe
> 18205 1.0957  libc-2.5.so  memcpy
> 14730 0.8865  qemu-system-x86_64   qemu_get_ram_ptr
> 14189 0.8540  kvm.ko   kvm_arch_vcpu_ioctl_run
> 12380 0.7451  vmlinux-2.6.31-rc5-autokern1 native_set_debugreg
> 12278 0.7390  vmlinux-2.6.31-rc5-autokern1 native_read_msr_safe
> 10871 0.6543  qemu-system-x86_64   virtqueue_get_head
> 10814 0.6508  vmlinux-2.6.31-rc5-autokern1 copy_user_generic_string
> 9080  0.5465  vmlinux-2.6.31-rc5-autokern1 fget_light
> 9015  0.5426  vmlinux-2.6.31-rc5-autokern1 schedule
> 8557  0.5150  qemu-system-x86_64   virtqueue_avail_bytes
> 7805  0.4697  vmlinux-2.6.31-rc5-autokern1 do_select
> 7173  0.4317  qemu-system-x86_64   lduw_phys
> 7019  0.4224  qemu-system-x86_64   main_loop_wait
> 6979  0.4200  vmlinux-2.6.31-rc5-autokern1 audit_syscall_exit
> 5571  0.3353  vmlinux-2.6.31-rc5-autokern1 kfree
> 5170  0.3112  vmlinux-2.6.31-rc5-autokern1 audit_syscall_entry
> 5086  0.3061  vmlinux-2.6.31-rc5-autokern1 fput
> 4631  0.2787  vmlinux-2.6.31-rc5-autokern1 mwait_idle
> 4584  0.2759  kvm.ko   kvm_load_guest_fpu
> 4491  0.2703  vmlinux-2.6.31-rc5-autokern1 system_call
> 4461  0.2685  vmlinux-2.6.31-rc5-autokern1 __switch_to
> 4431  0.2667  kvm.ko   kvm_put_guest_fpu
> 4371  0.2631  vmlinux-2.6.31-rc5-autokern1 __down_read
> 4290  0.2582  qemu-system-x86_64   kvm_run
> 4218  0.2539  vmlinux-2.6.31-rc5-autokern1 getnstimeofday
> 4129  0.2485  libpthread-2.5.sopthread_mutex_lock
> 4122  0.2481  qemu-system-x86_64   ldl_phys
> 4100  0.2468  vmlinux-2.6.31-rc5-autokern1 do_vfs_ioctl
> 3811  0.2294  kvm.ko   find_highest_vector
> 3593  0.2162  vmlinux-2.6.31-rc5-autokern1 unroll_tree_refs
> 3560  0.2143  vmlinux-2.6.31-rc5-autokern1 try_to_wake_up
> 3550  0.2137  vmlinux-2.6.31-rc5-autokern1 native_get_debugreg
> 3506  0.2110  kvm-intel.ko vmcs_writel
> 3487  0.2099  vmlinux-2.6.31-rc5-autokern1 task_rq_lock
> 3434  0.2067  vmlinux-2.6.31-rc5-autokern1 __up_read
> 3368  0.2027  librt-2.5.so clock_gettime
> 3339  0.2010  qemu-system-x86_64   virtqueue_num_heads
> 

Thanks very much for the fix!

-Andrew

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance data when running Windows VMs

2009-08-26 Thread Andrew Theurer
On Wed, 2009-08-26 at 11:27 -0500, Brian Jackson wrote:
> On Wednesday 26 August 2009 11:14:57 am Andrew Theurer wrote:
> 
> > >
> > > > I/O on the host was not what I would call very high:  outbound network
> > > > averaged at 163 Mbit/s inbound was 8 Mbit/s, while disk read ops was
> > > > 243/sec and write ops was 561/sec
> > >
> > > What was the disk bandwidth used?  Presumably, direct access to the
> > > volume with cache=off?
> >
> > 2.4 MB/sec write, 0.6MB/sec read, cache=none
> > The VMs' boot disks are IDE, but apps use their second disk which is
> > virtio.
> 
> 
> In my testing, I got better performance from IDE than the new virtio block 
> driver for windows. There appears to be some optimization left to do on them.

Thanks Brian.  I will try IDE on both VM disks to see how it compares.

-Andrew

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance data when running Windows VMs

2009-08-26 Thread Andrew Theurer
On Wed, 2009-08-26 at 19:26 +0300, Avi Kivity wrote:
> On 08/26/2009 07:14 PM, Andrew Theurer wrote:
> > On Wed, 2009-08-26 at 18:44 +0300, Avi Kivity wrote:
> >
> >> On 08/26/2009 05:57 PM, Andrew Theurer wrote:
> >>  
> >>> I recently gathered some performance data when running Windows Server
> >>> 2008 VMs, and I wanted to share it here.  There are 12 Windows
> >>> Server2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent
> >>> execution of 6 J2EE type benchmarks.  Each benchmark needs a App VM and
> >>> a Database VM.  The benchmark clients inject a fixed rate of requests
> >>> which yields X% CPU utilization on the host.  A different hypervisor was
> >>> compared; KVM used about 60% more CPU cycles to complete the same amount
> >>> of work.  Both had their hypervisor specific paravirt IO drivers in the
> >>> VMs.
> >>>
> >>> Server is a 2 socket Core/i7, SMT off, with 72 GB memory
> >>>
> >>>
> >> Did you use large pages?
> >>  
> > Yes.
> >
> 
> The stats show 'largepage = 12'.  Something's wrong.  There's a commit 
> (7736d680) that's supposed to fix largepage support for kvm-87, maybe 
> it's incomplete.

How strange.  /proc/meminfo showed that almost all of the pages were
used:

HugePages_Total:   12556
HugePages_Free:  220
HugePages_Rsvd:0
HugePages_Surp:0
Hugepagesize:   2048 kB

I just assumed they were used properly.  Maybe not.
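
[Aside, not from the thread: one way to sanity-check the backing is to
snapshot the hugepage counters before and after the guest starts; for a 2 GB
guest backed by 2 MB pages, HugePages_Free should drop by roughly 1024.  A
minimal sketch:]

/* hugecheck.c: print the hugepage counters from /proc/meminfo */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[256];

    if (!f)
        return 1;
    while (fgets(line, sizeof(line), f))
        if (!strncmp(line, "HugePages", 9) ||
            !strncmp(line, "Hugepagesize", 12))
            fputs(line, stdout);
    fclose(f);
    return 0;
}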

> >>> I/O on the host was not what I would call very high:  outbound network
> >>> averaged at 163 Mbit/s inbound was 8 Mbit/s, while disk read ops was
> >>> 243/sec and write ops was 561/sec
> >>>
> >>>
> >> What was the disk bandwidth used?  Presumably, direct access to the
> >> volume with cache=off?
> >>  
> > 2.4 MB/sec write, 0.6MB/sec read, cache=none
> > The VMs' boot disks are IDE, but apps use their second disk which is
> > virtio.
> >
> 
> Chickenfeed.
> 
> Do the network stats include interguest traffic?  I presume *all* of the 
> traffic was interguest.

Sar network data:

>            IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s
> Average:      lo      0.00      0.00      0.00      0.00
> Average:    usb0      0.39      0.19      0.02      0.01
> Average:    eth0   2968.83   5093.02    340.13   6966.64
> Average:    eth1   2992.92   5124.08    342.75   7008.53
> Average:    eth2   1455.53   2500.63    167.45   3421.64
> Average:    eth3   1500.59   2574.36    171.98   3524.82
> Average:     br0      2.41      0.95      0.32      0.13
> Average:     br1      1.52      0.00      0.20      0.00
> Average:     br2      1.52      0.00      0.20      0.00
> Average:     br3      1.52      0.00      0.20      0.00
> Average:     br4      0.00      0.00      0.00      0.00
> Average:    tap3    669.38    708.07    290.89    140.81
> Average:  tap109    678.53    723.58    294.07    143.31
> Average:  tap215    673.20    711.47    291.99    141.78
> Average:  tap321    675.26    719.33    293.01    142.37
> Average:   tap27    679.23    729.90    293.86    143.60
> Average:  tap133    680.17    734.08    294.33    143.85
> Average:    tap2   1002.24   2214.19   3458.54    457.95
> Average:  tap108   1021.85   2246.53   3491.02    463.48
> Average:  tap214   1002.81   2195.22   3411.80    457.28
> Average:  tap320   1017.43   2241.49   3508.20    462.54
> Average:   tap26   1028.52   2237.98   3483.84    462.53
> Average:  tap132   1034.05   2240.89   3493.37    463.32

tap0-99 go to eth0, 100-199 to eth1, 200-299 to eth2, 300-399 to eth3.
There is some inter-guest traffic between VM pairs (like taps 2&3,
108&109, etc.) but not that significant.

> 
> >> linux-aio should help reduce cpu usage.
> >>  
> > I assume this is in a newer version of Qemu?
> >
> 
> No, posted and awaiting merge.
> 
> >> Could it be that Windows uses the debug registers?  Maybe we're
> >> incorrectly deciding to switch them.
> >>  
> > I was wondering about that.  I was thinking of just backing out the
> > support for debugregs and see what happens.
> >
> > Did the up/down_read seem kind of high?  Are we doing a lot of locking?
> >
> 
> It is.  We do.  Marcelo made some threats to remove this lock.

Thanks,

-Andrew


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Performance data when running Windows VMs

2009-08-26 Thread Andrew Theurer
On Wed, 2009-08-26 at 18:44 +0300, Avi Kivity wrote:
> On 08/26/2009 05:57 PM, Andrew Theurer wrote:
> > I recently gathered some performance data when running Windows Server
> > 2008 VMs, and I wanted to share it here.  There are 12 Windows
> > Server2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent
> > execution of 6 J2EE type benchmarks.  Each benchmark needs a App VM and
> > a Database VM.  The benchmark clients inject a fixed rate of requests
> > which yields X% CPU utilization on the host.  A different hypervisor was
> > compared; KVM used about 60% more CPU cycles to complete the same amount
> > of work.  Both had their hypervisor specific paravirt IO drivers in the
> > VMs.
> >
> > Server is a 2 socket Core/i7, SMT off, with 72 GB memory
> >
> 
> Did you use large pages?

Yes.
> 
> > Host kernel used was kvm.git v2.6.31-rc3-3419-g6df4865
> > Qemu was kvm-87.  I tried a few newer versions of Qemu; none of them
> > worked with the RedHat virtIO Windows drivers.  I tried:
> >
> > f3600c589a9ee5ea4c0fec74ed4e06a15b461d52
> > 0.11.0-rc1
> > 0.10.6
> > kvm-88
> >
> > All but 0.10.6 had "Problem code 10" driver error in the VM.  0.10.6 had
> > "a disk read error occurred" very early in the booting of the VM.
> >
> 
> Yan?
> 
> > I/O on the host was not what I would call very high:  outbound network
> > averaged at 163 Mbit/s inbound was 8 Mbit/s, while disk read ops was
> > 243/sec and write ops was 561/sec
> >
> 
> What was the disk bandwidth used?  Presumably, direct access to the 
> volume with cache=off?

2.4 MB/sec write, 0.6MB/sec read, cache=none
The VMs' boot disks are IDE, but apps use their second disk which is
virtio.

> linux-aio should help reduce cpu usage.

I assume this is in a newer version of Qemu?

> > Host CPU breakdown was the following:
> >
> > user  nice  system irq  softirq guest  idle  iowait
> > 5.67  0.00  11.64  0.09  1.05  31.90  46.06  3.59
> >
> >
> > The amount of kernel time had me concerned.  Here is oprofile:
> >
> 
> user+system is about 55% of guest time, and it's all overhead.
> 
> >> samples  %        app name             symbol name
> >> 1163422  52.3744  kvm-intel.ko vmx_vcpu_run
> >> 103996    4.6816  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> native_set_debugreg
> >> 81036 3.6480  kvm.ko   kvm_arch_vcpu_ioctl_run
> >> 37913 1.7068  qemu-system-x86_64   cpu_physical_memory_rw
> >> 34720 1.5630  qemu-system-x86_64   phys_page_find_alloc
> >>  
> 
> We should really optimize these two.
> 
> >> 23234 1.0459  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> native_write_msr_safe
> >> 20964 0.9437  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> native_get_debugreg
> >> 17628 0.7936  libc-2.5.so  memcpy
> >> 16587 0.7467  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> __down_read
> >> 15681 0.7059  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> __up_read
> >> 15466 0.6962  kvm.ko   find_highest_vector
> >> 14611 0.6578  qemu-system-x86_64   qemu_get_ram_ptr
> >> 11254 0.5066  kvm-intel.ko vmcs_writel
> >> 11133 0.5012  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> copy_user_generic_string
> >> 10917 0.4915  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> native_read_msr_safe
> >> 10760 0.4844  qemu-system-x86_64   virtqueue_get_head
> >> 9025  0.4063  kvm-intel.ko vmx_handle_exit
> >> 8953  0.4030  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> schedule
> >> 8753  0.3940  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> fget_light
> >> 8465  0.3811  qemu-system-x86_64   virtqueue_avail_bytes
> >> 8185  0.3685  kvm-intel.ko handle_cr
> >> 8069  0.3632  kvm.ko   kvm_set_irq
> >> 7697  0.3465  kvm.ko   kvm_lapic_sync_from_vapic
> >> 7586  0.3415  qemu-system-x86_64   main_loop_wait
> >> 7480  0.3367  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> do_select
> >> 7121  0.3206  qemu-system-x86_64   lduw_phys
> >> 7003  0.3153  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> >> audit_syscall_

Performance data when running Windows VMs

2009-08-26 Thread Andrew Theurer
I recently gathered some performance data when running Windows Server
2008 VMs, and I wanted to share it here.  There are 12 Windows
Server2008 64-bit VMs (1 vcpu, 2 GB) running which handle the concurrent
execution of 6 J2EE type benchmarks.  Each benchmark needs a App VM and
a Database VM.  The benchmark clients inject a fixed rate of requests
which yields X% CPU utilization on the host.  A different hypervisor was
compared; KVM used about 60% more CPU cycles to complete the same amount
of work.  Both had their hypervisor specific paravirt IO drivers in the
VMs.

Server is a 2 socket Core/i7, SMT off, with 72 GB memory

Host kernel used was kvm.git v2.6.31-rc3-3419-g6df4865
Qemu was kvm-87.  I tried a few newer versions of Qemu; none of them
worked with the RedHat virtIO Windows drivers.  I tried:

f3600c589a9ee5ea4c0fec74ed4e06a15b461d52
0.11.0-rc1
0.10.6
kvm-88

All but 0.10.6 had "Problem code 10" driver error in the VM.  0.10.6 had
"a disk read error occurred" very early in the booting of the VM.

I/O on the host was not what I would call very high:  outbound network
averaged at 163 Mbit/s inbound was 8 Mbit/s, while disk read ops was
243/sec and write ops was 561/sec

Host CPU breakdown was the following:

user  nice  system irq  softirq guest  idle  iowait
5.67  0.00  11.64  0.09  1.05  31.90  46.06  3.59


The amount of kernel time had me concerned.  Here is oprofile:


> samples  %        app name             symbol name
> 1163422  52.3744  kvm-intel.ko vmx_vcpu_run
> 103996    4.6816  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> native_set_debugreg
> 81036 3.6480  kvm.ko   kvm_arch_vcpu_ioctl_run
> 37913 1.7068  qemu-system-x86_64   cpu_physical_memory_rw
> 34720 1.5630  qemu-system-x86_64   phys_page_find_alloc
> 23234 1.0459  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> native_write_msr_safe
> 20964 0.9437  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> native_get_debugreg
> 17628 0.7936  libc-2.5.so  memcpy
> 16587 0.7467  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> __down_read
> 15681 0.7059  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> __up_read
> 15466 0.6962  kvm.ko   find_highest_vector
> 14611 0.6578  qemu-system-x86_64   qemu_get_ram_ptr
> 11254 0.5066  kvm-intel.ko vmcs_writel
> 11133 0.5012  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> copy_user_generic_string
> 10917 0.4915  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> native_read_msr_safe
> 10760 0.4844  qemu-system-x86_64   virtqueue_get_head
> 9025  0.4063  kvm-intel.ko vmx_handle_exit
> 8953  0.4030  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> schedule
> 8753  0.3940  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> fget_light
> 8465  0.3811  qemu-system-x86_64   virtqueue_avail_bytes
> 8185  0.3685  kvm-intel.ko handle_cr
> 8069  0.3632  kvm.ko   kvm_set_irq
> 7697  0.3465  kvm.ko   kvm_lapic_sync_from_vapic
> 7586  0.3415  qemu-system-x86_64   main_loop_wait
> 7480  0.3367  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> do_select
> 7121  0.3206  qemu-system-x86_64   lduw_phys
> 7003  0.3153  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> audit_syscall_exit
> 6062  0.2729  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 kfree
> 5477  0.2466  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 fput
> 5454  0.2455  kvm.ko   kvm_lapic_get_cr8
> 5096  0.2294  kvm.ko   kvm_load_guest_fpu
> 5057  0.2277  kvm.ko   apic_update_ppr
> 4929  0.2219  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> up_read
> 4900  0.2206  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> audit_syscall_entry
> 4866  0.2191  kvm.ko   kvm_apic_has_interrupt
> 4670  0.2102  kvm-intel.ko skip_emulated_instruction
> 4644  0.2091  kvm.ko   kvm_cpu_has_interrupt
> 4548  0.2047  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> __switch_to
> 4328  0.1948  kvm.ko   kvm_apic_accept_pic_intr
> 4303  0.1937  libpthread-2.5.sopthread_mutex_lock
> 4235  0.1906  vmlinux-2.6.31-rc5-v2.6.31-rc3-3419-g6df4865-autokern1 
> system_call
> 4175  0.1879  kvm.ko   kvm_put_guest_fpu
> 4170  0.1877  qemu-system-x86_64   ldl_phys
> 4098  0.1845  kvm-intel.ko vmx_set_interrupt_shadow
> 4003  0.1802  qemu-system-x86_64   kvm_run

I was wondering why the get/set debugreg was so high.  I don't recall
seeing this much with Linux VMs.

Here is an average of kvm_stat:


> efer_relo  0
> exits  1262814
> fpu_reloa  103842
> halt_exit  9918
> halt_wak

Re: Windows Server 2008 VM performance

2009-06-03 Thread Andrew Theurer

Avi Kivity wrote:

Andrew Theurer wrote:


Is there a virtio_block driver to test?  


There is, but it isn't available yet.

OK.  Can I assume a better virtio_net driver is in the works as well?


Can we find the root cause of the exits (is there a way to get stack 
dump or something that can show where there are coming from)?


Marcelo is working on a super-duper easy to use kvm trace which can 
show what's going on.  The old one is reasonably easy though it 
exports less data.  If you can generate some traces, I'll have a look 
at them.


Thanks Avi.  I'll try out kvm-86 and see if I can generate some kvm 
trace data.


-Andrew

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Windows Server 2008 VM performance

2009-06-02 Thread Andrew Theurer
I've been looking at how KVM handles windows guests, and I am a little 
concerned with the CPU overhead.  My test case is as follows:


I am running 4 instances of a J2EE benchmark.  Each instance needs one 
application server and one DB server.  8 VMs in total are used.


I have the same App and DB software for Linux and Windows (and same 
versions) so I can compare between Linux and Windows.  I also have 
another hypervisor which I can test both Windows and Linux VMs.


The host has EPT capable processors.  VMs in KVM are backed with large 
pages.



Test results:

Config                                 CPU utilization
-------------------------------------  ---------------
KVM-85
  Windows Server 2008 64-bit VMs       44.84
  RedHat 5.3 w/ 2.6.29 64-bit VMs      24.56
Other-Hypervisor
  Windows Server 2008 64-bit VMs       30.63
  RedHat 5.3 w/ 2.6.18 64-bit VMs      27.13

-KVM running Windows VMs uses 46% more CPU than the Other-Hypervisor
-The Other-Hypervisor provides an optimized virtual network driver
-KVM results listed above did not use virtio_net or virtio_disk for 
Windows, but do for Linux
-One extra KVM run (not listed above) was made with virtio_net for 
Windows VMs but only reduced CPU by 2%
-Most of the CPU overhead could be attributed to the DB VMs, where there 
is about 5 MB/sec writes per VM

-I don't have a virtio_block driver for Windows to test.  Does one exist?
-All tests above had 2 vCPUS per VM


Here's a comparison of kvm_stat between Windows (run1) and Linux (run2):

              run1    run2   run1/run2
              ----    ----   ---------
efer_relo:  0 0 1
exits:      1206880 121916 9.899
fpu_reloa:  210969 20863 10.112
halt_exit:  15092 13222 1.141
halt_wake:  14466  9294 1.556
host_stat: 211066 45117 4.678
hypercall:  0 0 1
insn_emul: 119582 38126 3.136
insn_emul:  0 0 1
invlpg   :  0 0 1
io_exits : 131051 26349 4.974
irq_exits:   8128 12937 0.628
irq_injec:  29955 21825 1.373
irq_windo:   2504  2022 1.238
kvm_reque:  0 0 1
largepage:  164 0.009
mmio_exit:  59224 0   Inf
mmu_cache:  0 3 0.000
mmu_flood:  0 0 1
mmu_pde_z:  0 0 1
mmu_pte_u:  0 0 1
mmu_pte_w:  0 0 1
mmu_recyc:  0 0 1
mmu_shado:  0 0 1
mmu_unsyn:  0 0 1
mmu_unsyn:  0 0 1
nmi_injec:  0 0 1
nmi_windo:  0 0 1
pf_fixed :  167 0.009
pf_guest :  0 0 1
remote_tl:  0 0 1
request_n:  0 0 1
signal_ex:  0 0 1
tlb_flush:  220 14037 0.016


10x the number of exits, a problem?

I happened to try just one vCPU per VM for KVM/Windows VMs, and I was 
surprised how much of a difference it made:


Config                                             CPU utilization
-------------------------------------------------  ---------------
KVM-85
  Windows Server 2008 64-bit VMs, 2 vCPU per VM 44.84
  Windows Server 2008 64-bit VMs, 1 vCPU per VM 36.44

A 19% reduction in CPU utilization vs KVM/Windows-2vCPU!  Does not 
explain all the overhead (vs Other-Hypervisor, 2 vCPUs per VM) but, that 
sure seems like a lot between 1 to 2 vCPUs for KVM/Windows-VMs.  I have 
not run with 1 vCPU per VM with Other-Hypervisor, but I will soon.  
Anyway, I also collected kvm_stat for the 1 vCPU case, and here it is 
compared to KVM/Linux VMs with 2 vCPUs:


              run1    run2   run1/run2
              ----    ----   ---------
efer_relo:  0 0 1
exits:      1184471 121916 9.715
fpu_reloa: 192766 20863 9.240
halt_exit:   4697 13222 0.355
halt_wake:   4360  9294 0.469
host_stat: 192828 45117 4.274
hypercall:  0 0 1
insn_emul: 130487 38126 3.422
insn_emul:  0 0 1
invlpg   :  0 0 1
io_exits : 114430 26349 4.343
irq_exits:   7075 12937 0.547
irq_injec:  29930 21825 1.371
irq_windo:   2391  2022 1.182
kvm_reque:  0 0 1
largepage:  064 0.001
mmio_exit:  69028 0   Inf
mmu_cache:  0   

Re: KVM performance vs. Xen

2009-04-30 Thread Andrew Theurer
Here are the SMT off results.  This workload is designed to not 
over-saturate the CPU, so you have to pick a number of server sets to 
ensure that.  With SMT on, 4 sets was enough for KVM, but 5 was too much 
(we start seeing response time errors).  For SMT off, I tried to size the 
load as high as we can go without running into these errors.  For KVM, 
that's 3 (18 guests) and for Xen, that's 4 (24 guests).  The throughput 
has a fairly linear relationship to the number of server sets used, but 
has a bit of wiggle room (mostly affected by response times getting 
longer and longer, but not exceeding the requirement set forth).  
Anyway, the relative throughputs for these are "1.0" for KVM and "1.34" 
for Xen.  CPU utilization is 78.71% for KVM and 87.83% for Xen. 


If we normalize to CPU utilization, Xen is doing about 20% more throughput per unit of CPU: (1.34 / 87.83) / (1.0 / 78.71) ≈ 1.20.

Avi Kivity wrote:

Anthony Liguori wrote:


Previously, the block API only exposed non-vector interfaces and 
bounced vectored operations to a linear buffer.  That's been 
eliminated now though so we need to update the linux-aio patch to 
implement a vectored backend interface.


However, it is an apples to apples comparison in terms of copying 
since the same is true with the thread pool.  My take away was that 
the thread pool overhead isn't the major source of issues.


If the overhead is dominated by copying, then you won't see the 
difference.  Once the copying is eliminated, the comparison may yield 
different results.  We should certainly see a difference in context 
switches.
I would like to test this the proper way.  What do I need to do to 
ensure these copies are eliminated?  I am on a 2.6.27 kernel, am I 
missing anything there?  Anthony, would you be willing to provide a 
patch to support the changes in the block API?


One cause of context switches won't be eliminated - the non-saturating 
workload causes us to switch to the idle thread, which incurs a 
heavyweight exit.  This doesn't matter since we're idle anyway, but 
when we switch back, we incur a heavyweight entry.
I have not looked at the schedstat or ftrace yet, but will soon.  Maybe 
it will tell us a little more about the context switches.


Here's a sample of the kvm_stat:

efer_relo  exits  fpu_reloa  halt_exit  halt_wake  host_stat  hypercall  
insn_emul  insn_emul invlpg   io_exits  irq_exits  irq_injec  irq_windo  
kvm_reque  largepage  mmio_exit  mmu_cache  mmu_flood  mmu_pde_z  mmu_pte_u  
mmu_pte_w  mmu_recyc  mmu_shado  mmu_unsyn  mmu_unsyn  nmi_injec  nmi_windo   
pf_fixed   pf_guest  remote_tl  request_n  signal_ex  tlb_flush
0 233866  53994  20353  16209 119812  0 
 48879  0  0  75666  44917  34772   3984
  0187  0 10  0  0  0  
0  0  0  0  0  0  0202  
0  0  0  0  17698
0 244556  67321  15570  12364 116226  0 
 49865  0  0  69357  56131  32860   4449
  0  -1895  0 19  0  0  0  
0 21 21  0  0  0  0   1117  
0  0  0  0  21586
0 230788  71382  10619   7920 109151  0 
 44354  0  0  62561  60074  28322   4841
  0103  0 13  0  0  0  
0  0  0  0  0  0  0122  
0  0  0  0  22702
0 275259  82605  14326  11148 127293  0 
 53738  0  0  73438  70707  34724   5373
  0859  0 15  0  0  0  
0 21 21  0  0  0  0874  
0  0  0  0  26723
0 250576  58760  20368  16476 128296  0 
 50936  0  0  80439  51219  36329   4621
  0  -1170  0  8  0  0  0  
0 22 22  0  0  0  0   1333  
0  0  0  0  18508
0 244746  59650  19480  15657 122721  0 
 49882  0  0  76011  50453  35352   4523
  0201  0 11  0  0  0  
0 21 21  0  0  0  0212  
0  0  0  0  19163
0 251724  71715  14049  10920 117255  0 
 49924  0  0  70173  58040  32328   5058

Re: KVM performance vs. Xen

2009-04-30 Thread Andrew Theurer

Avi Kivity wrote:

Anthony Liguori wrote:

Avi Kivity wrote:


1) I'm seeing about 2.3% in scheduler functions [that I recognize].
Does that seems a bit excessive?


Yes, it is.  If there is a lot of I/O, this might be due to the 
thread pool used for I/O.


This is why I wrote the linux-aio patch.  It only reduced CPU 
consumption by about 2% although I'm not sure if that's absolute or 
relative.  Andrew?
If  I recall correctly, it was 2.4% and relative.  But with 2.3% in 
scheduler functions, that's what I expected.


Was that before or after the entire path was made copyless?
If this is referring to the preadv/writev support, no, I have not tested 
with that.


-Andrew


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM performance vs. Xen

2009-04-30 Thread Andrew Theurer

Avi Kivity wrote:

Andrew Theurer wrote:

Avi Kivity wrote:




What's the typical I/O load (disk and network bandwidth) while the 
tests are running?

This is average throughput:
network: Tx: 79 MB/sec  Rx: 5 MB/sec


MB as in Byte or Mb as in bit?
Byte.  There are 4 x 1 Gb adapters, each handling about 20 MB/sec or 160 
Mbit/sec.



disk: read: 17 MB/sec  write: 40 MB/sec


This could definitely cause the extra load, especially if it's many 
small requests (compared to a few large ones).
I don't have the request sizes at my fingertips, but we have to use a 
lot of disks to support this I/O, so I think it's safe to assume there 
are a lot more requests than a simple large sequential read/write.



The host hardware:
A 2 socket, 8 core Nehalem with SMT, and EPT enabled, lots of 
disks, 4 x

1 GB Ethenret


CPU time measurements with SMT can vary wildly if the system is not 
fully loaded.  If the scheduler happens to schedule two threads on a 
single core, both of these threads will generate less work compared 
to if they were scheduled on different cores.
Understood.  Even if at low loads, the scheduler does the right thing 
and spreads out to all the cores first, once it goes beyond 50% util, 
the CPU util can climb at a much higher rate (compared to a linear 
increase in work) because it then starts scheduling 2 threads per 
core, and each thread can do less work.  I have always wanted 
something which could more accurately show the utilization of a 
processor core, but I guess we have to use what we have today.  I 
will run again with SMT off to see what we get.


On the other hand, without SMT you will get to overcommit much faster, 
so you'll have scheduling artifacts.  Unfortunately there's no good 
answer here (except to improve the SMT scheduler).


Yes, it is.  If there is a lot of I/O, this might be due to the 
thread pool used for I/O.
I have an older patch which makes a small change to posix_aio_thread.c 
by trying to keep the thread pool size a bit lower than it is today.  
I will dust that off and see if it helps.


Really, I think linux-aio support can help here.
Yes, I think that would work for real block devices, but would that help 
for files?  I am using real block devices right now, but it would be 
nice to also see a benefit for files in a file-system.  Or maybe I am 
misunderstanding this, and linux-aio can be used on files?
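
[Sketch, not from the thread: kernel AIO via libaio does accept regular files,
but it is only truly asynchronous when the file is opened with O_DIRECT, which
in turn requires aligned buffers.  Path and sizes below are made up; build
with -laio.]

/* aio_file.c: submit and reap one read against a file with kernel AIO */
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx;
    struct iocb cb, *cbs[1] = { &cb };
    struct io_event ev;
    void *buf;
    int fd;

    memset(&ctx, 0, sizeof(ctx));           /* io_setup() wants a zeroed ctx */
    if (io_setup(8, &ctx) < 0)
        return 1;

    fd = open("/tmp/example.img", O_RDONLY | O_DIRECT);   /* made-up path */
    if (fd < 0)
        return 1;

    if (posix_memalign(&buf, 4096, 4096))   /* O_DIRECT needs aligned buffers */
        return 1;

    io_prep_pread(&cb, fd, buf, 4096, 0);
    if (io_submit(ctx, 1, cbs) != 1)
        return 1;

    /* block until completion; a real I/O backend would not block here */
    if (io_getevents(ctx, 1, 1, &ev, NULL) == 1)
        printf("read returned %ld\n", (long)ev.res);

    close(fd);
    io_destroy(ctx);
    return 0;
}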


-Andrew





Yes, there is a scheduler tracer, though I have no idea how to 
operate it.


Do you have kvm_stat logs?
Sorry, I don't, but I'll run that next time.  BTW, I did not notice a 
batch/log mode the last time I ran kvm_stat.  Or maybe it was not 
obvious to me.  Is there an ideal way to run kvm_stat without a 
curses like output?


You're probably using an ancient version:

$ kvm_stat --help
Usage: kvm_stat [options]

Options:
 -h, --helpshow this help message and exit
 -1, --once, --batch   run in batch mode for one second
 -l, --log run in logging mode (like vmstat)
 -f FIELDS, --fields=FIELDS
   fields to display (regex)





--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM performance vs. Xen

2009-04-30 Thread Andrew Theurer

Avi Kivity wrote:

Andrew Theurer wrote:

I wanted to share some performance data for KVM and Xen.  I thought it
would be interesting to share some performance results especially
compared to Xen, using a more complex situation like heterogeneous
server consolidation.

The Workload:
The workload is one that simulates a consolidation of servers on to a
single host.  There are 3 server types: web, imap, and app (j2ee).  In
addition, there are other "helper" servers which are also consolidated:
a db server, which helps out with the app server, and an nfs server,
which helps out with the web server (a portion of the docroot is nfs
mounted).  There is also one other server that is simply idle.  All 6
servers make up one set.  The first 3 server types are sent requests,
which in turn may send requests to the db and nfs helper servers.  The
request rate is throttled to produce a fixed amount of work.  In order
to increase utilization on the host, more sets of these servers are
used.  The clients which send requests also have a response time
requirement which is monitored.  The following results have passed the
response time requirements.



What's the typical I/O load (disk and network bandwidth) while the 
tests are running?

This is average throughput:
network: Tx: 79 MB/sec  Rx: 5 MB/sec
disk: read: 17 MB/sec  write: 40 MB/sec



The host hardware:
A 2 socket, 8 core Nehalem with SMT, and EPT enabled, lots of disks, 4 x
1 GB Ethenret


CPU time measurements with SMT can vary wildly if the system is not 
fully loaded.  If the scheduler happens to schedule two threads on a 
single core, both of these threads will generate less work compared to 
if they were scheduled on different cores.
Understood.  Even if at low loads, the scheduler does the right thing 
and spreads out to all the cores first, once it goes beyond 50% util, 
the CPU util can climb at a much higher rate (compared to a linear 
increase in work) because it then starts scheduling 2 threads per core, 
and each thread can do less work.  I have always wanted something which 
could more accurately show the utilization of a processor core, but I 
guess we have to use what we have today.  I will run again with SMT off 
to see what we get.




Test Results:
The throughput is equal in these tests, as the clients throttle the work
(this is assuming you don't run out of a resource on the host).  What's
telling is the CPU used to do the same amount of work:

Xen:  52.85%
KVM:  66.93%

So, KVM requires 66.93/52.85 = 26.6% more CPU to do the same amount of
work. Here's the breakdown:

total   user   nice   system   irq    softirq   guest
66.90   7.20   0.00   12.94    0.35   3.39      43.02

Comparing guest time to all other busy time, that's a 23.88/43.02 = 55%
overhead for virtualization.  I certainly don't expect it to be 0, but
55% seems a bit high.  So, what's the reason for this overhead?  At the
bottom is oprofile output of top functions for KVM.  Some observations:

1) I'm seeing about 2.3% in scheduler functions [that I recognize].
Does that seems a bit excessive?


Yes, it is.  If there is a lot of I/O, this might be due to the thread 
pool used for I/O.
I have an older patch which makes a small change to posix_aio_thread.c by 
trying to keep the thread pool size a bit lower than it is today.  I 
will dust that off and see if it helps.



2) cpu_physical_memory_rw due to not using preadv/pwritev?


I think both virtio-net and virtio-blk use memcpy().


3) vmx_[save|load]_host_state: I take it this is from guest switches?


These are called when you context-switch from a guest, and, much more 
frequently, when you enter qemu.



We have 180,000 context switches a second.  Is this more than expected?



Way more.  Across 16 logical cpus, this is >10,000 cs/sec/cpu.


I wonder if schedstats can show why we context switch (need to let
someone else run, yielded, waiting on io, etc).



Yes, there is a scheduler tracer, though I have no idea how to operate 
it.


Do you have kvm_stat logs?
Sorry, I don't, but I'll run that next time.  BTW, I did not notice a 
batch/log mode the last time I ran kvm_stat.  Or maybe it was not 
obvious to me.  Is there an ideal way to run kvm_stat without a curses 
like output?


-Andrew


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM performance vs. Xen

2009-04-29 Thread Andrew Theurer

Nakajima, Jun wrote:

On 4/29/2009 7:41:50 AM, Andrew Theurer wrote:
  

I wanted to share some performance data for KVM and Xen.  I thought it
would be interesting to share some performance results especially
compared to Xen, using a more complex situation like heterogeneous
server consolidation.

The Workload:
The workload is one that simulates a consolidation of servers on to a
single host.  There are 3 server types: web, imap, and app (j2ee).  In
addition, there are other "helper" servers which are also
consolidated: a db server, which helps out with the app server, and an
nfs server, which helps out with the web server (a portion of the docroot is 
nfs mounted).
There is also one other server that is simply idle.  All 6 servers
make up one set.  The first 3 server types are sent requests, which in
turn may send requests to the db and nfs helper servers.  The request
rate is throttled to produce a fixed amount of work.  In order to
increase utilization on the host, more sets of these servers are used.
The clients which send requests also have a response time requirement
which is monitored.  The following results have passed the response
time requirements.

The host hardware:
A 2 socket, 8 core Nehalem with SMT, and EPT enabled, lots of disks, 4
x
1 GB Ethenret

The host software:
Both Xen and KVM use the same host Linux OS, SLES11.  KVM uses the
2.6.27.19-5-default kernel and Xen uses the 2.6.27.19-5-xen kernel.  I
have tried 2.6.29 for KVM, but results are actually worse.  KVM
modules are rebuilt with kvm-85.  Qemu is also from kvm-85.  Xen
version is "3.3.1_18546_12-3.1".

The guest software:
All guests are RedHat 5.3.  The same disk images are used but
different kernels. Xen uses the RedHat Xen kernel and KVM uses 2.6.29
with all paravirt build options enabled.  Both use PV I/O drivers.  Software 
used:
Apache, PHP, Java, Glassfish, Postgresql, and Dovecot.




Just for clarification. So are you using PV (Xen) Linux on Xen, not HVM? Is 
that 32-bit or 64-bit?
  

PV, 64-bit.

-Andrew

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


KVM performance vs. Xen

2009-04-29 Thread Andrew Theurer
_avail_bytes
1651070 0.2623  vmlinux-2.6.27.19-5-default do_select
1643139 0.2611  vmlinux-2.6.27.19-5-default update_curr
1640495 0.2606  vmlinux-2.6.27.19-5-default kmem_cache_free
1606493 0.2552  libpthread-2.9.so   pthread_mutex_lock
1549536 0.2462  qemu-system-x86_64  lduw_phys
1535539 0.2440  vmlinux-2.6.27.19-5-default tg_shares_up
1438468 0.2285  vmlinux-2.6.27.19-5-default mwait_idle
1316461 0.2092  vmlinux-2.6.27.19-5-default __down_read
1282486 0.2038  vmlinux-2.6.27.19-5-default native_read_tsc
1226069 0.1948  oprofiled   odb_update_node
1224551 0.1946  vmlinux-2.6.27.19-5-default sched_clock_cpu
1222684 0.1943  tun.ko  tun_chr_aio_read
1194034 0.1897  vmlinux-2.6.27.19-5-default task_rq_lock
1186129 0.1884  kvm.ko  x86_decode_insn
1131644 0.1798  bnx2.ko bnx2_start_xmit
1115575 0.1772  vmlinux-2.6.27.19-5-default enqueue_hrtimer
1044329 0.1659  vmlinux-2.6.27.19-5-default native_sched_clock
988546  0.1571  vmlinux-2.6.27.19-5-default fput
980615  0.1558  vmlinux-2.6.27.19-5-default __up_read
942270  0.1497  qemu-system-x86_64  kvm_run
925076  0.1470  kvm-intel.kovmcs_writel
889220  0.1413  vmlinux-2.6.27.19-5-default dev_queue_xmit
884786  0.1406  kvm.ko  kvm_apic_has_interrupt
880421  0.1399  librt-2.9.so    /lib64/librt-2.9.so
880306  0.1399  vmlinux-2.6.27.19-5-default nf_iterate


-Andrew Theurer




--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


boot problems with if=virtio

2009-04-27 Thread Andrew Theurer
I know there have been a couple other threads here about booting with 
if=virtio, but I think this might be a different problem, not sure:


I am using kvm.git (41b76d8d0487c26d6d4d3fe53c1ff59b3236f096)
and qemu-kvm.git (8f7a30dbc40a1d4c09275566f9ed9647ed1ee50f)
and Linux 2.6.30-rc3

It appears to build fine.  I am trying to run the following command:

name=newcastle-xmailt01
dev1=/dev/disk/by-id/scsi-3600a0b8f1eb1069748d8c230
dev2=/dev/disk/by-id/scsi-3600a0b8f1eb106dc48f45432
macaddr=00:50:56:00:00:06
tap=tap6
cpus=1
mem=1024
/usr/local/bin/qemu-system-x86_64 -name $name\
   -drive file=$dev1,if=virtio,boot=on,cache=none\
   -drive file=$dev2,if=virtio,boot=off,cache=none\
   -m $mem  -net nic,model=virtio,vlan=0,macaddr=$macaddr\
   -net tap,vlan=0,ifname=$tap,script=/etc/qemu-ifup -vnc 127.0.0.1:6 \
   -smp $cpus -daemonize



...and I get "Boot failed: could not read the boot disk"

This did work with the kvm-userspace.git (kvm-85rc6).  I can get this to 
work with a Windows VM, using IDE.  Was there a recent change to the 
-drive options that I am missing?


Thanks,

-Andrew

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: patch for virtual machine oriented scheduling(1)

2009-04-23 Thread Andrew Theurer

alex wrote:

the following patches provide an extra control (besides the control of
the Linux scheduler) over the execution of vcpu threads.

In this patch, Xen's credit
scheduler(http://wiki.xensource.com/xenwiki/CreditScheduler) is used.
Users can use the "cat" and
"echo" commands to view and control a guest OS's credit.
e.g.,
[r...@localhost ~]#  echo "weight=500" > /proc/kvm/12345
will change the credit of guest whose qemu process has the pid 12345 to be 500.

The patch consists of 3 parts:
1. modification to the standard KVM
2. modification to the Xen scheduler
3. helper functions
  
Just wondering, was it not possible to introduce a new scheduling class 
in the current scheduler?  My impression was that the current scheduler 
was fairly modular and should allow this.


-Andrew



However, some are unnecessary in the latest Linux kernel.

The difficulties in the ports lie in:
1. Linux does not provide a timer mechanism where the timer function is
bound to a dedicated CPU:
if one cpu receives another cpu's schedule timer
expiration, an IPI is used to relay it.
2. before Linux 2.6.27, smp_call_function_xxx() cannot be re-entered.
   if kvm is sending an IPI at the time of relaying timer expiration
information, deadlock would occur in kernel versions below 2.6.27

In my implementation, tasklets are used to run the function of
scheduling, and  kernel thread is used to send IPI(in kernels above
2.6.27, this is unnecessary)

Originally, this code is developed at the release version of KVM-83.
In order to post it, I ported to the latest .git tree.  As a result,
modifications to files like external-module-compat-comm.h are omited.

NOTE:
1. Because my not having an AMD machine, only intel platforms are tested.
2. Because sched_setaffinity() is used (while Linux does not export
this symbol), the way of loading kvm modules are changed to be
./myins 
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present

2009-04-01 Thread Andrew Theurer

Sheng Yang wrote:

Oops... Thanks very much for reporting! I can't believe we weren't aware of
that...

Could you please try the attached patch? Thanks!
  

Tested and works great.  Thanks!

-Andrew

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index aba41ae..8d6465b 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1195,15 +1195,6 @@ static __init int setup_vmcs_config(struct vmcs_config 
*vmcs_conf)
  vmx_capability.ept, vmx_capability.vpid);
}

-   if (!cpu_has_vmx_vpid())
-   enable_vpid = 0;
-
-   if (!cpu_has_vmx_ept())
-   enable_ept = 0;
-
-   if (!(vmcs_config.cpu_based_2nd_exec_ctrl & 
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
-   flexpriority_enabled = 0;
-
min = 0;
 #ifdef CONFIG_X86_64
min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;
@@ -1307,6 +1298,15 @@ static __init int hardware_setup(void)
if (boot_cpu_has(X86_FEATURE_NX))
kvm_enable_efer_bits(EFER_NX);

+   if (!cpu_has_vmx_vpid())
+   enable_vpid = 0;
+
+   if (!cpu_has_vmx_ept())
+   enable_ept = 0;
+
+   if (!(vmcs_config.cpu_based_2nd_exec_ctrl & 
SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
+   flexpriority_enabled = 0;
+
return alloc_kvm_area();
 }

  


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


EPT support breakage on: KVM: VMX: Zero ept module parameter if ept is not present

2009-03-31 Thread Andrew Theurer

I cannot get EPT support to work on commit:
21f65ab2c582594a69dcb1484afa9f88b3414b4f
KVM: VMX: Zero ept module parameter if ept is not present

I see tons of pf_guest from kvm_stat, whereas the previous commit has none.
I am using "ept=1" module option for kvm-intel.

This is on Nehalem processors.

-Andrew


commit diff:

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 8b1b9b8..96a19f8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -265,7 +265,7 @@ static inline int cpu_has_vmx_ept(void)

static inline int vm_need_ept(void)
{
-   return (cpu_has_vmx_ept() && enable_ept);
+   return enable_ept;
}

static inline int vm_need_virtualize_apic_accesses(struct kvm *kvm)
@@ -1205,6 +1205,9 @@ static __init int setup_vmcs_config(struct 
vmcs_config *vmcs_conf)

   if (!cpu_has_vmx_vpid())
   enable_vpid = 0;

+   if (!cpu_has_vmx_ept())
+   enable_ept = 0;
+
   min = 0;
#ifdef CONFIG_X86_64
   min |= VM_EXIT_HOST_ADDR_SPACE_SIZE;

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] KVM: Defer remote tlb flushes on invlpg (v3)

2009-03-19 Thread Andrew Theurer

Avi Kivity wrote:

KVM currently flushes the tlbs on all cpus when emulating invlpg.  This
is because at the time of invlpg we lose track of the page, and leaving
stale tlb entries could cause the guest to access the page when it is
later freed (say after being swapped out).

However we have a second chance to flush the tlbs, when an mmu notifier is
called to let us know the host pte has been invalidated.  We can safely
defer the flush to this point, which occurs much less frequently.  Of course,
we still do a local tlb flush when emulating invlpg.
  
I should be able to run some performance comparisons with this in the 
next day or two.


-Andrew

Signed-off-by: Avi Kivity 
---

Changes from v2:
- dropped remote flushes from guest pagetable write protect paths
- fixed up memory barriers
- use existing local tlb flush in invlpg, no need to add another one

 arch/x86/kvm/mmu.c |3 +--
 arch/x86/kvm/paging_tmpl.h |5 +
 include/linux/kvm_host.h   |2 ++
 virt/kvm/kvm_main.c|   17 +++--
 4 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2a36f7f..f0ea56c 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -1184,8 +1184,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
for_each_sp(pages, sp, parents, i)
protected |= rmap_write_protect(vcpu->kvm, sp->gfn);

-   if (protected)
-   kvm_flush_remote_tlbs(vcpu->kvm);
+   kvm_flush_remote_tlbs_cond(vcpu->kvm, protected);

for_each_sp(pages, sp, parents, i) {
kvm_sync_page(vcpu, sp);
diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 855eb71..2273b26 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -445,7 +445,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
gpa_t pte_gpa = -1;
int level;
u64 *sptep;
-   int need_flush = 0;

spin_lock(&vcpu->kvm->mmu_lock);

@@ -465,7 +464,7 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
rmap_remove(vcpu->kvm, sptep);
if (is_large_pte(*sptep))
--vcpu->kvm->stat.lpages;
-   need_flush = 1;
+   vcpu->kvm->remote_tlbs_dirty = true;
}
set_shadow_pte(sptep, shadow_trap_nonpresent_pte);
break;
@@ -475,8 +474,6 @@ static void FNAME(invlpg)(struct kvm_vcpu *vcpu, gva_t gva)
break;
}

-   if (need_flush)
-   kvm_flush_remote_tlbs(vcpu->kvm);
spin_unlock(&vcpu->kvm->mmu_lock);

if (pte_gpa == -1)
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 11eb702..b779c57 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -125,6 +125,7 @@ struct kvm_kernel_irq_routing_entry {
 struct kvm {
struct mutex lock; /* protects the vcpus array and APIC accesses */
spinlock_t mmu_lock;
+   bool remote_tlbs_dirty;
struct rw_semaphore slots_lock;
struct mm_struct *mm; /* userspace tied to this vm */
int nmemslots;
@@ -235,6 +236,7 @@ void kvm_resched(struct kvm_vcpu *vcpu);
 void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
 void kvm_flush_remote_tlbs(struct kvm *kvm);
+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond);
 void kvm_reload_remote_mmus(struct kvm *kvm);

 long kvm_arch_dev_ioctl(struct file *filp,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 68b217e..12afa50 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -758,10 +758,18 @@ static bool make_all_cpus_request(struct kvm *kvm, 
unsigned int req)

 void kvm_flush_remote_tlbs(struct kvm *kvm)
 {
+   kvm->remote_tlbs_dirty = false;
+   smp_wmb();
if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
++kvm->stat.remote_tlb_flush;
 }

+void kvm_flush_remote_tlbs_cond(struct kvm *kvm, bool cond)
+{
+   if (cond || kvm->remote_tlbs_dirty)
+   kvm_flush_remote_tlbs(kvm);
+}
+
 void kvm_reload_remote_mmus(struct kvm *kvm)
 {
make_all_cpus_request(kvm, KVM_REQ_MMU_RELOAD);
@@ -841,8 +849,7 @@ static void kvm_mmu_notifier_invalidate_page(struct 
mmu_notifier *mn,
spin_unlock(&kvm->mmu_lock);

/* we've to flush the tlb before the pages can be freed */
-   if (need_tlb_flush)
-   kvm_flush_remote_tlbs(kvm);
+   kvm_flush_remote_tlbs_cond(kvm, need_tlb_flush);

 }

@@ -866,8 +873,7 @@ static void kvm_mmu_notifier_invalidate_range_start(struct 
mmu_notifier *mn,
spin_unlock(&kvm->mmu_lock);

/* we've to flush the tlb before the pages can be freed */
-   if (need_tlb_flush)
-   kvm_flush_