Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 10/04/2012 06:14 PM, Avi Kivity wrote: On 10/04/2012 12:56 PM, Raghavendra K T wrote: On 10/03/2012 10:55 PM, Avi Kivity wrote: On 10/03/2012 04:29 PM, Raghavendra K T wrote: * Avi Kivity [2012-09-27 14:03:59]: On 09/27/2012 01:23 PM, Raghavendra K T wrote: [...] 2) looking at the result (comparing A & C) , I do feel we have significant in iterating over vcpus (when compared to even vmexit) so We still would need undercommit fix sugested by PeterZ (improving by 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. Okay. Do you mean we can have something like + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } in the Peter's patch ? ( I thought lot about && or || . Both seem to have their own cons ). But that should be only when we have short term imbalance, as PeterZ told. I'm missing the context. What is p_rq? p_rq is the run queue of target vcpu. What I was trying below was to address Rik concern. Suppose rq of source vcpu has one task, but target probably has two task, with a eligible vcpu waiting to be scheduled. What I mean was: if can_yield_to_process_in_current_rq do that else if can_yield_to_process_in_other_rq do that else return -ESRCH I think you are saying we have to check the run queue of the source vcpu, if we have a vcpu belonging to same VM and try yield to that? ignoring whatever the target vcpu we received for yield_to. Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm belonging to same run queue first. If we don't succeed, go again for a vcpu in different runqueue. Right. Prioritize vcpus that are cheap to yield to. But may return bad results if all vcpus on the current runqueue are spinners, so probably not a good idea. Okay. 'll drop vcpu from same rq idea now. Does it add more overhead especially in <= 1x scenario? The current runqueue should have just our vcpu in that case, so low overhead. But it's a bad idea due to the above scenario. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 10/04/2012 12:56 PM, Raghavendra K T wrote: > On 10/03/2012 10:55 PM, Avi Kivity wrote: >> On 10/03/2012 04:29 PM, Raghavendra K T wrote: >>> * Avi Kivity [2012-09-27 14:03:59]: >>> On 09/27/2012 01:23 PM, Raghavendra K T wrote: >> >>> [...] > 2) looking at the result (comparing A & C) , I do feel we have > significant in iterating over vcpus (when compared to even vmexit) > so We still would need undercommit fix sugested by PeterZ > (improving by > 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. >>> >>> Okay. Do you mean we can have something like >>> >>> + if (rq->nr_running == 1 && p_rq->nr_running == 1) { >>> + yielded = -ESRCH; >>> + goto out_irq; >>> + } >>> >>> in the Peter's patch ? >>> >>> ( I thought lot about && or || . Both seem to have their own cons ). >>> But that should be only when we have short term imbalance, as PeterZ >>> told. >> >> I'm missing the context. What is p_rq? > > p_rq is the run queue of target vcpu. > What I was trying below was to address Rik concern. Suppose > rq of source vcpu has one task, but target probably has two task, > with a eligible vcpu waiting to be scheduled. > >> >> What I mean was: >> >>if can_yield_to_process_in_current_rq >> do that >>else if can_yield_to_process_in_other_rq >> do that >>else >> return -ESRCH > > I think you are saying we have to check the run queue of the > source vcpu, if we have a vcpu belonging to same VM and try yield to > that? ignoring whatever the target vcpu we received for yield_to. > > Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm > belonging to same run queue first. If we don't succeed, go again for > a vcpu in different runqueue. Right. Prioritize vcpus that are cheap to yield to. But may return bad results if all vcpus on the current runqueue are spinners, so probably not a good idea. > Does it add more overhead especially in <= 1x scenario? The current runqueue should have just our vcpu in that case, so low overhead. But it's a bad idea due to the above scenario. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 10/03/2012 10:55 PM, Avi Kivity wrote: On 10/03/2012 04:29 PM, Raghavendra K T wrote: * Avi Kivity [2012-09-27 14:03:59]: On 09/27/2012 01:23 PM, Raghavendra K T wrote: [...] 2) looking at the result (comparing A & C) , I do feel we have significant in iterating over vcpus (when compared to even vmexit) so We still would need undercommit fix sugested by PeterZ (improving by 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. Okay. Do you mean we can have something like + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } in the Peter's patch ? ( I thought lot about && or || . Both seem to have their own cons ). But that should be only when we have short term imbalance, as PeterZ told. I'm missing the context. What is p_rq? p_rq is the run queue of target vcpu. What I was trying below was to address Rik concern. Suppose rq of source vcpu has one task, but target probably has two task, with a eligible vcpu waiting to be scheduled. What I mean was: if can_yield_to_process_in_current_rq do that else if can_yield_to_process_in_other_rq do that else return -ESRCH I think you are saying we have to check the run queue of the source vcpu, if we have a vcpu belonging to same VM and try yield to that? ignoring whatever the target vcpu we received for yield_to. Or is it that kvm_vcpu_yield_to should now check the vcpus of same vm belonging to same run queue first. If we don't succeed, go again for a vcpu in different runqueue. Does it add more overhead especially in <= 1x scenario? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 10/03/2012 04:29 PM, Raghavendra K T wrote: > * Avi Kivity [2012-09-27 14:03:59]: > >> On 09/27/2012 01:23 PM, Raghavendra K T wrote: >> >> > [...] >> > 2) looking at the result (comparing A & C) , I do feel we have >> > significant in iterating over vcpus (when compared to even vmexit) >> > so We still would need undercommit fix sugested by PeterZ (improving by >> > 140%). ? >> >> Looking only at the current runqueue? My worry is that it misses a lot >> of cases. Maybe try the current runqueue first and then others. >> > > Okay. Do you mean we can have something like > > + if (rq->nr_running == 1 && p_rq->nr_running == 1) { > + yielded = -ESRCH; > + goto out_irq; > + } > > in the Peter's patch ? > > ( I thought lot about && or || . Both seem to have their own cons ). > But that should be only when we have short term imbalance, as PeterZ > told. I'm missing the context. What is p_rq? What I mean was: if can_yield_to_process_in_current_rq do that else if can_yield_to_process_in_other_rq do that else return -ESRCH -- error compiling committee.c: too many arguments to function
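For concreteness, the two-pass selection sketched above could look roughly like this inside the PLE handler. It is purely illustrative: on_same_runqueue() is a hypothetical helper (the scheduler exports no such test to KVM today), the return-value convention of kvm_vcpu_yield_to() is assumed, and, as noted elsewhere in the thread, the idea was eventually dropped because it gives bad results when every vcpu on the current runqueue is itself a spinner.

/*
 * Illustrative sketch only, not a real patch: try vcpus that share the
 * source vcpu's runqueue first (cheap to yield to), then the rest, and
 * report -ESRCH if no candidate was found.
 */
static int kvm_vcpu_try_directed_yield(struct kvm_vcpu *me)
{
	struct kvm_vcpu *vcpu;
	int i, pass;

	for (pass = 0; pass < 2; pass++) {
		kvm_for_each_vcpu(i, vcpu, me->kvm) {
			if (vcpu == me)
				continue;
			if (pass == 0 && !on_same_runqueue(me, vcpu))
				continue;	/* pass 0: current runqueue only */
			if (kvm_vcpu_yield_to(vcpu) > 0)	/* assumed to report success */
				return 0;
		}
	}
	return -ESRCH;
}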
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
* Avi Kivity [2012-09-27 14:03:59]: > On 09/27/2012 01:23 PM, Raghavendra K T wrote: > >> [...] > > 2) looking at the result (comparing A & C) , I do feel we have > > significant in iterating over vcpus (when compared to even vmexit) > > so We still would need undercommit fix sugested by PeterZ (improving by > > 140%). ? > > Looking only at the current runqueue? My worry is that it misses a lot > of cases. Maybe try the current runqueue first and then others. > Okay. Do you mean we can have something like + if (rq->nr_running == 1 && p_rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } in the Peter's patch ? ( I thought lot about && or || . Both seem to have their own cons ). But that should be only when we have short term imbalance, as PeterZ told. I am experimenting all these for V2 patch. Will come back with analysis and patch. > Or were you referring to something else?
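For reference, the quoted fragment is meant to sit near the top of yield_to() in kernel/sched/core.c, before the runqueue locks are taken. An abridged sketch of where it would land (based only on the fragment above, not on a final patch; the existing candidate checks are elided):

int __sched yield_to(struct task_struct *p, bool preempt)
{
	struct task_struct *curr = current;
	struct rq *rq, *p_rq;
	unsigned long flags;
	int yielded = 0;

	local_irq_save(flags);
	rq = this_rq();

again:
	p_rq = task_rq(p);
	/*
	 * If both the source and the target runqueue hold a single
	 * runnable task, there is nobody to yield to and no point in
	 * taking the runqueue locks: bail out early (the undercommit
	 * fast path being discussed).
	 */
	if (rq->nr_running == 1 && p_rq->nr_running == 1) {
		yielded = -ESRCH;
		goto out_irq;
	}

	double_rq_lock(rq, p_rq);
	/* ... existing eligibility checks and the yield_to_task() call ... */
	double_rq_unlock(rq, p_rq);
out_irq:
	local_irq_restore(flags);

	if (yielded > 0)
		schedule();

	return yielded;
}

Callers then need to treat -ESRCH ("no point in yielding") differently from 0 ("tried but did not yield").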
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/28/2012 01:40 PM, Andrew Theurer wrote: >> >> >> >> >> IIRC, with defer preemption : >> >> we will have hook in spinlock/unlock path to measure depth of lock held, >> >> and shared with host scheduler (may be via MSRs now). >> >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather >> >> give say one chance. >> > >> > A downside is that we have to do that even when undercommitted. > > Hopefully vcpu preemption is very rare when undercommitted, so it should > not happen much at all. As soon as you're preempted, you're effectively overcommitted (even if the system as a whole is undercommitted). What I meant was that you need to communicate your lock state to the host, and with fine-grained locking this can happen a lot. It may be as simple as an increment/decrement instruction though. -- error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/28/2012 05:10 PM, Andrew Theurer wrote: On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote: On 09/27/2012 05:33 PM, Avi Kivity wrote: On 09/27/2012 01:23 PM, Raghavendra K T wrote: [...] Also there may be a lot of false positives (deferred preemptions even when there is no contention). It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]? The design/protocol Vatsa had in mind was something like this: - the scheduler does not let a lock-holding vcpu run forever; it may grant one extra chance worth only a few ticks, and when it does, it also sets an indication that the chance has been given. - once the vcpu has released (all) the lock(s), if it had been given that chance, it clears the indication (the ACK) and relinquishes the cpu.
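The "increment/decrement instruction" Avi mentions above could, for instance, be a per-cpu lock-depth counter in memory shared with the host. A very rough sketch of the guest side of such a handshake, with every name hypothetical and the shared-memory registration and host-side logic omitted:

struct pv_preempt_info {
	u32 lock_depth;		/* bumped in spin_lock, dropped in spin_unlock */
	u32 grace_given;	/* set by the host when it deferred a preemption */
};
DECLARE_PER_CPU(struct pv_preempt_info, pv_preempt_info);

static inline void pv_lock_enter(void)
{
	this_cpu_inc(pv_preempt_info.lock_depth);
}

static inline void pv_lock_exit(void)
{
	if (this_cpu_dec_return(pv_preempt_info.lock_depth) == 0 &&
	    this_cpu_read(pv_preempt_info.grace_given)) {
		this_cpu_write(pv_preempt_info.grace_given, 0);	/* the ACK */
		/* relinquish the cpu, e.g. via a yield hypercall */
	}
}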
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 06:40 -0500, Andrew Theurer wrote: > It will be interesting to see how this behaves with a very high lock > activity in a guest. Once the scheduler defers preemption, is it for > a > fixed amount of time, or does it know to cut the deferral short as > soon > as the lock depth is reduced [by x]? Since the locks live in a guest/userspace, we don't even know they're held at all, let alone when that state changes. Also, afaik PLE simply exits the guest whenever you do a busy-wait; there's no guarantee it's due to a lock at all, we could be waiting for a 'virtual' hardware resource or whatnot.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote: > On 09/27/2012 05:33 PM, Avi Kivity wrote: > > On 09/27/2012 01:23 PM, Raghavendra K T wrote: > >>> > >>> This gives us a good case for tracking preemption on a per-vm basis. As > >>> long as we aren't preempted, we can keep the PLE window high, and also > >>> return immediately from the handler without looking for candidates. > >> > >> 1) So do you think, deferring preemption patch ( Vatsa was mentioning > >> long back) is also another thing worth trying, so we reduce the chance > >> of LHP. > > > > Yes, we have to keep it in mind. It will be useful for fine grained > > locks, not so much so coarse locks or IPIs. > > > > Agree. > > > I would still of course prefer a PLE solution, but if we can't get it to > > work we can consider preemption deferral. > > > > Okay. > > >> > >> IIRC, with defer preemption : > >> we will have hook in spinlock/unlock path to measure depth of lock held, > >> and shared with host scheduler (may be via MSRs now). > >> Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather > >> give say one chance. > > > > A downside is that we have to do that even when undercommitted. Hopefully vcpu preemption is very rare when undercommitted, so it should not happen much at all. > > > > Also there may be a lot of false positives (deferred preemptions even > > when there is no contention). It will be interesting to see how this behaves with a very high lock activity in a guest. Once the scheduler defers preemption, is it for a fixed amount of time, or does it know to cut the deferral short as soon as the lock depth is reduced [by x]? > > Yes. That is a worry. > > > > >> > >> 2) looking at the result (comparing A & C) , I do feel we have > >> significant in iterating over vcpus (when compared to even vmexit) > >> so We still would need undercommit fix sugested by PeterZ (improving by > >> 140%). ? > > > > Looking only at the current runqueue? My worry is that it misses a lot > > of cases. Maybe try the current runqueue first and then others. > > > > Or were you referring to something else? > > No. I was referring to the same thing. > > However. I had tried following also (which works well to check > undercommited scenario). But thinking to use only for yielding in case > of overcommit (yield in overcommit suggested by Rik) and keep > undercommit patch as suggested by PeterZ > > [ patch is not in proper diff I suppose ]. > > Will test them. > > Peter, Can I post your patch with your from/sob.. in V2? > Please let me know.. 
> > --- > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c > index 28f00bc..9ed3759 100644 > --- a/virt/kvm/kvm_main.c > +++ b/virt/kvm/kvm_main.c > @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct > kvm_vcpu *vcpu) > return eligible; > } > #endif > + > +bool kvm_overcommitted() > +{ > + unsigned long load; > + > + load = avenrun[0] + FIXED_1/200; > + load = load >> FSHIFT; > + load = (load << 7) / num_online_cpus(); > + > + if (load > 128) > + return true; > + > + return false; > +} > + > void kvm_vcpu_on_spin(struct kvm_vcpu *me) > { > struct kvm *kvm = me->kvm; > @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) > int pass; > int i; > > + if (!kvm_overcommitted()) > + return; > + > kvm_vcpu_set_in_spin_loop(me, true); > /* >* We boost the priority of a VCPU that is runnable but not -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-28 at 11:08 +0530, Raghavendra K T wrote: > > Peter, Can I post your patch with your from/sob.. in V2? > Please let me know.. Yeah I guess ;-)
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/28/2012 11:15 AM, H. Peter Anvin wrote: On 09/27/2012 10:38 PM, Raghavendra K T wrote: + +bool kvm_overcommitted() +{ This better not be C... I think you mean I should have had something like kvm_overcommitted(void) (and perhaps a different function name), or is it about the body of the function?
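For readers following along, the complaint is presumably about the empty parameter list: in C, bool kvm_overcommitted() declares a function taking an unspecified number of arguments (old, pre-prototype style), while bool kvm_overcommitted(void) declares one taking none, which is what kernel style requires. A minimal illustration:

bool kvm_overcommitted();	/* unspecified parameters -- callers may pass anything */
bool kvm_overcommitted(void);	/* a real prototype: takes no arguments */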
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 10:38 PM, Raghavendra K T wrote: + +bool kvm_overcommitted() +{ This better not be C... -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 05:33 PM, Avi Kivity wrote: On 09/27/2012 01:23 PM, Raghavendra K T wrote: This gives us a good case for tracking preemption on a per-vm basis. As long as we aren't preempted, we can keep the PLE window high, and also return immediately from the handler without looking for candidates. 1) So do you think, deferring preemption patch ( Vatsa was mentioning long back) is also another thing worth trying, so we reduce the chance of LHP. Yes, we have to keep it in mind. It will be useful for fine grained locks, not so much so coarse locks or IPIs. Agree. I would still of course prefer a PLE solution, but if we can't get it to work we can consider preemption deferral. Okay. IIRC, with defer preemption : we will have hook in spinlock/unlock path to measure depth of lock held, and shared with host scheduler (may be via MSRs now). Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather give say one chance. A downside is that we have to do that even when undercommitted. Also there may be a lot of false positives (deferred preemptions even when there is no contention). Yes. That is a worry. 2) looking at the result (comparing A & C) , I do feel we have significant in iterating over vcpus (when compared to even vmexit) so We still would need undercommit fix sugested by PeterZ (improving by 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. Or were you referring to something else? No. I was referring to the same thing. However. I had tried following also (which works well to check undercommited scenario). But thinking to use only for yielding in case of overcommit (yield in overcommit suggested by Rik) and keep undercommit patch as suggested by PeterZ [ patch is not in proper diff I suppose ]. Will test them. Peter, Can I post your patch with your from/sob.. in V2? Please let me know.. --- diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index 28f00bc..9ed3759 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -1620,6 +1620,21 @@ bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu) return eligible; } #endif + +bool kvm_overcommitted() +{ + unsigned long load; + + load = avenrun[0] + FIXED_1/200; + load = load >> FSHIFT; + load = (load << 7) / num_online_cpus(); + + if (load > 128) + return true; + + return false; +} + void kvm_vcpu_on_spin(struct kvm_vcpu *me) { struct kvm *kvm = me->kvm; @@ -1629,6 +1644,9 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me) int pass; int i; + if (!kvm_overcommitted()) + return; + kvm_vcpu_set_in_spin_loop(me, true); /* * We boost the priority of a VCPU that is runnable but not -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
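To unpack the proposed check: avenrun[0] is the kernel's 1-minute load average in FSHIFT-bit (11-bit) fixed point, FIXED_1/200 is the same 0.5% round-up used when printing /proc/loadavg, the shift by FSHIFT takes the integer part, and scaling by 128 per online CPU turns "load > 128" into "more runnable tasks than online CPUs". A worked example with assumed numbers:

/*
 * Illustration only: 32 online CPUs, 1-minute load average of 40.0.
 * FSHIFT is 11 and FIXED_1 is (1 << FSHIFT), from include/linux/sched.h.
 */
unsigned long load;

load = (40UL << FSHIFT) + FIXED_1 / 200;	/* 40.0, plus the 0.5% round-up    */
load = load >> FSHIFT;				/* -> 40, the integer load         */
load = (load << 7) / 32;			/* -> 160, per-CPU load times 128  */
/* 160 > 128, i.e. load per CPU > 1.0, so kvm_overcommitted() reports overcommit. */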
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Thu, 2012-09-27 at 14:03 +0200, Avi Kivity wrote: > On 09/27/2012 01:23 PM, Raghavendra K T wrote: > >> > >> This gives us a good case for tracking preemption on a per-vm basis. As > >> long as we aren't preempted, we can keep the PLE window high, and also > >> return immediately from the handler without looking for candidates. > > > > 1) So do you think, deferring preemption patch ( Vatsa was mentioning > > long back) is also another thing worth trying, so we reduce the chance > > of LHP. > > Yes, we have to keep it in mind. It will be useful for fine grained > locks, not so much so coarse locks or IPIs. > > I would still of course prefer a PLE solution, but if we can't get it to > work we can consider preemption deferral. > > > > > IIRC, with defer preemption : > > we will have hook in spinlock/unlock path to measure depth of lock held, > > and shared with host scheduler (may be via MSRs now). > > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather > > give say one chance. > > A downside is that we have to do that even when undercommitted. > > Also there may be a lot of false positives (deferred preemptions even > when there is no contention). > > > > > 2) looking at the result (comparing A & C) , I do feel we have > > significant in iterating over vcpus (when compared to even vmexit) > > so We still would need undercommit fix sugested by PeterZ (improving by > > 140%). ? > > Looking only at the current runqueue? My worry is that it misses a lot > of cases. Maybe try the current runqueue first and then others. > > Or were you referring to something else? > > > > > So looking back at threads/ discussions so far, I am trying to > > summarize, the discussions so far. I feel, at least here are the few > > potential candidates to go in: > > > > 1) Avoiding double runqueue lock overhead (Andrew Theurer/ PeterZ) > > 2) Dynamically changing PLE window (Avi/Andrew/Chegu) > > 3) preempt_notify handler to identify preempted VCPUs (Avi) > > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ) > > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik) > > 6) Pv spinlock > > 7) Jiannan's proposed improvements > > 8) Defer preemption patches > > > > Did we miss anything (or added extra?) > > > > So here are my action items: > > - I plan to repost this series with what PeterZ, Rik suggested with > > performance analysis. > > - I ll go back and explore on (3) and (6) .. > > > > Please Let me know.. > > Undoubtedly we'll think of more stuff. But this looks like a good start. 9) lazy gang-like scheduling with PLE to cover the non-gang-like exceptions (/me runs and hides from scheduler folks) -Andrew Theurer -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 01:23 PM, Raghavendra K T wrote: >> >> This gives us a good case for tracking preemption on a per-vm basis. As >> long as we aren't preempted, we can keep the PLE window high, and also >> return immediately from the handler without looking for candidates. > > 1) So do you think, deferring preemption patch ( Vatsa was mentioning > long back) is also another thing worth trying, so we reduce the chance > of LHP. Yes, we have to keep it in mind. It will be useful for fine grained locks, not so much so coarse locks or IPIs. I would still of course prefer a PLE solution, but if we can't get it to work we can consider preemption deferral. > > IIRC, with defer preemption : > we will have hook in spinlock/unlock path to measure depth of lock held, > and shared with host scheduler (may be via MSRs now). > Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather > give say one chance. A downside is that we have to do that even when undercommitted. Also there may be a lot of false positives (deferred preemptions even when there is no contention). > > 2) looking at the result (comparing A & C) , I do feel we have > significant in iterating over vcpus (when compared to even vmexit) > so We still would need undercommit fix sugested by PeterZ (improving by > 140%). ? Looking only at the current runqueue? My worry is that it misses a lot of cases. Maybe try the current runqueue first and then others. Or were you referring to something else? > > So looking back at threads/ discussions so far, I am trying to > summarize, the discussions so far. I feel, at least here are the few > potential candidates to go in: > > 1) Avoiding double runqueue lock overhead (Andrew Theurer/ PeterZ) > 2) Dynamically changing PLE window (Avi/Andrew/Chegu) > 3) preempt_notify handler to identify preempted VCPUs (Avi) > 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ) > 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik) > 6) Pv spinlock > 7) Jiannan's proposed improvements > 8) Defer preemption patches > > Did we miss anything (or added extra?) > > So here are my action items: > - I plan to repost this series with what PeterZ, Rik suggested with > performance analysis. > - I ll go back and explore on (3) and (6) .. > > Please Let me know.. Undoubtedly we'll think of more stuff. But this looks like a good start. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 03:58 PM, Andrew Jones wrote: On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote: On 09/25/2012 08:30 PM, Dor Laor wrote: On 09/24/2012 02:02 PM, Raghavendra K T wrote: On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection. (even for a non pvticket lock version) For instance, let's say you have N threads trying to grab the lock, you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation. Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking Yap This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor Thank you, I think this is an excellent idea. ( Though I am trying to put all the pieces together you mentioned). So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in guest. Only thing I am missing is, How to generate different combinations of the lock. Okay, let me see if I can come with a solid model for this. Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed its static but it can also be controlled through the temporary /sys interface. No, I am not there yet. So In summary, we are suffering with inconsistent benchmark result, while measuring the benefit of our improvement in PLE/pvlock etc.. Are you measuring the combined throughput of all running guests, or just looking at the results of the benchmarks in a single test guest? I've done some benchmarking as well and my stddevs look pretty good for kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that full sequence of tests (one with the overcommit levels in scrambled order). The relative stddevs for each of the sets of 5 runs look pretty good, and the data for the 2 runs match nicely as well. 
To try and get consistent results I do the following - interleave the memory of all guests across all numa nodes on the machine - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test guest I was not doing this. - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before each run was doing already as you know - use a ramdisk for the benchmark output files on all running guests Yes.. this is also helpful - no periodically running services installed on the test guest - HT is turned off as you do, although I'd like to try running again with it turned back on Although, I still need to run again measuring the combined throughput of all running vms (including the ones launched just to generate busy vcpus). Maybe my results won't be as consistent then... May be. I take average from all the VMs.. Drew So good point from your suggestion is, - Giving predictability to workload that runs in guest, so that we have pi-pi comparison of improvement. - we can easily tune the workload via sysfs, and we can have script to automate them. What is complicated is: - How can we simulate a workload close to what we measure with benchmarks? - How can we mimic lock holding time/ lock hierarchy close to the way it is seen with real workloads (for e.g. highly contended zone lru lock with similar amount of lockholding times). - How close it would be to when we forget about other types of spinning (for e.g, flush_tlb). So I feel it is not as trivial as it looks like. -- To unsubscribe f
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 02:06 PM, Avi Kivity wrote: On 09/25/2012 03:40 PM, Raghavendra K T wrote: On 09/24/2012 07:46 PM, Raghavendra K T wrote: On 09/24/2012 07:24 PM, Peter Zijlstra wrote: On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: However Rik had a genuine concern in the cases where runqueue is not equally distributed and lockholder might actually be on a different run queue but not running. Load should eventually get distributed equally -- that's what the load-balancer is for -- so this is a temporary situation. We already try and favour the non running vcpu in this case, that's what yield_to_task_fair() is about. If its still not eligible to run, tough luck. Yes, I agree. Do you think instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_onlinecpus) To what purpose? Also, global stuff is expensive, so you should try and stay away from it as hard as you possibly can. Yes, that concern only had made me to fall back to rq->nr_running. Will come back with the result soon. Got the result with the patches: So here is the result, Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with 1x and 2x overcommits Base = 3.6.0-rc5 + ple handler optimization patches A = Base + checking rq_running in vcpu_on_spin() patch B = Base + checking rq->nr_running in sched/core C = Base - PLE ---+---+---+---+---+ |Ebizzy result (rec/sec higher is better) | ---+---+---+---+---+ |Base | A | B| C | ---+---+---+---+---+ 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750| 2x | 2536.2500 | 2458.5000 | 2426.3750 |48.5000| ---+---+---+---+---+ % improvements w.r.t BASE ---++++ | A |B | C | ---++++ 1x | 206.37603 | 139.70410 | 210.19323 | 2x | -3.06555 | -4.33218 | -98.08773 | ---++++ we are getting the benefit of almost PLE disabled case with this approach. With patch B, we have dropped a bit in gain. (because we still would iterate vcpus until we decide to do a directed yield). This gives us a good case for tracking preemption on a per-vm basis. As long as we aren't preempted, we can keep the PLE window high, and also return immediately from the handler without looking for candidates. 1) So do you think, deferring preemption patch ( Vatsa was mentioning long back) is also another thing worth trying, so we reduce the chance of LHP. IIRC, with defer preemption : we will have hook in spinlock/unlock path to measure depth of lock held, and shared with host scheduler (may be via MSRs now). Host scheduler 'prefers' not to preempt lock holding vcpu. (or rather give say one chance. 2) looking at the result (comparing A & C) , I do feel we have significant in iterating over vcpus (when compared to even vmexit) so We still would need undercommit fix sugested by PeterZ (improving by 140%). ? So looking back at threads/ discussions so far, I am trying to summarize, the discussions so far. I feel, at least here are the few potential candidates to go in: 1) Avoiding double runqueue lock overhead (Andrew Theurer/ PeterZ) 2) Dynamically changing PLE window (Avi/Andrew/Chegu) 3) preempt_notify handler to identify preempted VCPUs (Avi) 4) Avoiding iterating over VCPUs in undercommit scenario. (Raghu/PeterZ) 5) Avoiding unnecessary spinning in overcommit scenario (Raghu/Rik) 6) Pv spinlock 7) Jiannan's proposed improvements 8) Defer preemption patches Did we miss anything (or added extra?) So here are my action items: - I plan to repost this series with what PeterZ, Rik suggested with performance analysis. - I ll go back and explore on (3) and (6) .. 
Please Let me know..
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 12:28 PM, Andrew Jones wrote: >> No, I am not there yet. >> >> So In summary, we are suffering with inconsistent benchmark result, >> while measuring the benefit of our improvement in PLE/pvlock etc.. > > Are you measuring the combined throughput of all running guests, or > just looking at the results of the benchmarks in a single test guest? > > I've done some benchmarking as well and my stddevs look pretty good for > kcbench, ebizzy, dbench, and sysbench-memory. I do 5 runs for each > overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that > full sequence of tests (one with the overcommit levels in scrambled > order). The relative stddevs for each of the sets of 5 runs look pretty > good, and the data for the 2 runs match nicely as well. > > To try and get consistent results I do the following > - interleave the memory of all guests across all numa nodes on the > machine > - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test > guest > - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before > each run > - use a ramdisk for the benchmark output files on all running guests > - no periodically running services installed on the test guest > - HT is turned off as you do, although I'd like to try running again > with it turned back on > > Although, I still need to run again measuring the combined throughput > of all running vms (including the ones launched just to generate busy > vcpus). Maybe my results won't be as consistent then... > Another way to test is to execute perf stat -e 'kvm_exit exit_reason==40' sleep 10 to see how many PAUSEs were intercepted in a given time (except I just invented the filter syntax). The fewer we get, the more useful work the system does. This ignores kvm_vcpu_on_spin overhead though, so it's just a rough measure. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
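For what it's worth, with the kvm:kvm_exit tracepoint the filter could probably be spelled as perf stat -e 'kvm:kvm_exit' --filter 'exit_reason == 40' -a sleep 10, or via perf record/report if the installed perf stat lacks --filter; 40 is EXIT_REASON_PAUSE_INSTRUCTION on VMX, so this counts only PAUSE-triggered exits. Exact filter support depends on the perf version, so treat the command line as a starting point rather than a recipe.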
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/27/2012 11:49 AM, Raghavendra K T wrote: On 09/25/2012 08:30 PM, Dor Laor wrote: On 09/24/2012 02:02 PM, Raghavendra K T wrote: On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection. (even for a non pvticket lock version) For instance, let's say you have N threads trying to grab the lock, you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation. Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking Yap This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor Thank you, I think this is an excellent idea. ( Though I am trying to put all the pieces together you mentioned). So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in guest. Only thing I am missing is, How to generate different combinations of the lock. Okay, let me see if I can come with a solid model for this. Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed its static but it can also be controlled through the temporary /sys interface. No, I am not there yet. So In summary, we are suffering with inconsistent benchmark result, while measuring the benefit of our improvement in PLE/pvlock etc.. So good point from your suggestion is, - Giving predictability to workload that runs in guest, so that we have pi-pi comparison of improvement. - we can easily tune the workload via sysfs, and we can have script to automate them. What is complicated is: - How can we simulate a workload close to what we measure with benchmarks? - How can we mimic lock holding time/ lock hierarchy close to the way it is seen with real workloads (for e.g. highly contended zone lru lock with similar amount of lockholding times). You can spin for a similar instruction count that you're interested - How close it would be to when we forget about other types of spinning (for e.g, flush_tlb). So I feel it is not as trivial as it looks like. 
Indeed, this is mainly a tool that can serve to optimize a few synthetic workloads. I still believe it is worth going through this exercise, since a 100% predictable and controlled case can help us purely assess the state of the PLE and pvticket code. Otherwise we're dealing with too many parameters and assumptions at once. Dor
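A minimal sketch of what such a module could look like is below. Everything here is illustrative: the names are made up, thread teardown, per-thread result collection and the deterministic ticket-ordering clone are omitted, and the hold time is a crude udelay() stand-in for execute_dummy_cpu_computation(). Writing a lock index to the sysfs file spawns one spinner thread on that lock.

/* locktest.c -- illustrative sketch only, not a working harness. */
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/spinlock.h>
#include <linux/kobject.h>
#include <linux/sysfs.h>
#include <linux/delay.h>
#include <linux/err.h>

#define NR_TEST_LOCKS	(10 * NR_CPUS)	/* "an array of spin locks *10 times the amount of vcpus" */

static spinlock_t test_locks[NR_TEST_LOCKS];
static unsigned long hold_time_us = 10;	/* assumed per-acquisition hold time */

static int spinner_fn(void *data)
{
	spinlock_t *lock = data;
	unsigned long long sum = 0;

	while (!kthread_should_stop()) {
		spin_lock(lock);
		udelay(hold_time_us);	/* stand-in for execute_dummy_cpu_computation(time) */
		sum++;
		spin_unlock(lock);
		cond_resched();
	}
	pr_info("locktest: spinner done, sum=%llu\n", sum);
	return 0;
}

/* echo <lock index> > /sys/kernel/locktest_spawn starts one spinner thread. */
static ssize_t spawn_store(struct kobject *kobj, struct kobj_attribute *attr,
			   const char *buf, size_t count)
{
	unsigned int idx;
	struct task_struct *t;

	if (kstrtouint(buf, 0, &idx) || idx >= NR_TEST_LOCKS)
		return -EINVAL;

	t = kthread_run(spinner_fn, &test_locks[idx], "locktest/%u", idx);
	if (IS_ERR(t))
		return PTR_ERR(t);
	return count;	/* real code would keep t so the thread can be stopped later */
}

static struct kobj_attribute spawn_attr = __ATTR(locktest_spawn, 0200, NULL, spawn_store);

static int __init locktest_init(void)
{
	int i;

	for (i = 0; i < NR_TEST_LOCKS; i++)
		spin_lock_init(&test_locks[i]);

	return sysfs_create_file(kernel_kobj, &spawn_attr.attr);
}

static void __exit locktest_exit(void)
{
	sysfs_remove_file(kernel_kobj, &spawn_attr.attr);
}

module_init(locktest_init);
module_exit(locktest_exit);
MODULE_LICENSE("GPL");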
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Thu, Sep 27, 2012 at 03:19:45PM +0530, Raghavendra K T wrote: > On 09/25/2012 08:30 PM, Dor Laor wrote: > >On 09/24/2012 02:02 PM, Raghavendra K T wrote: > >>On 09/24/2012 02:12 PM, Dor Laor wrote: > >>>In order to help PLE and pvticketlock converge I thought that a small > >>>test code should be developed to test this in a predictable, > >>>deterministic way. > >>> > >>>The idea is to have a guest kernel module that spawn a new thread each > >>>time you write to a /sys/ entry. > >>> > >>>Each such a thread spins over a spin lock. The specific spin lock is > >>>also chosen by the /sys/ interface. Let's say we have an array of spin > >>>locks *10 times the amount of vcpus. > >>> > >>>All the threads are running a > >>>while (1) { > >>> > >>>spin_lock(my_lock); > >>>sum += execute_dummy_cpu_computation(time); > >>>spin_unlock(my_lock); > >>> > >>>if (sys_tells_thread_to_die()) break; > >>>} > >>> > >>>print_result(sum); > >>> > >>>Instead of calling the kernel's spin_lock functions, clone them and make > >>>the ticket lock order deterministic and known (like a linear walk of all > >>>the threads trying to catch that lock). > >> > >>By Cloning you mean hierarchy of the locks? > > > >No, I meant to clone the implementation of the current spin lock code in > >order to set any order you may like for the ticket selection. > >(even for a non pvticket lock version) > > > >For instance, let's say you have N threads trying to grab the lock, you > >can always make the ticket go linearly from 1->2...->N. > >Not sure it's a good idea, just a recommendation. > > > >>Also I believe time should be passed via sysfs / hardcoded for each > >>type of lock we are mimicking > > > >Yap > > > >> > >>> > >>>This way you can easy calculate: > >>>1. the score of a single vcpu running a single thread > >>>2. the score of sum of all thread scores when #thread==#vcpu all > >>>taking the same spin lock. The overall sum should be close as > >>>possible to #1. > >>>3. Like #2 but #threads > #vcpus and other versions of #total vcpus > >>>(belonging to all VMs) > #pcpus. > >>>4. Create #thread == #vcpus but let each thread have it's own spin > >>>lock > >>>5. Like 4 + 2 > >>> > >>>Hopefully this way will allows you to judge and evaluate the exact > >>>overhead of scheduling VMs and threads since you have the ideal result > >>>in hand and you know what the threads are doing. > >>> > >>>My 2 cents, Dor > >>> > >> > >>Thank you, > >>I think this is an excellent idea. ( Though I am trying to put all the > >>pieces together you mentioned). So overall we should be able to measure > >>the performance of pvspinlock/PLE improvements with a deterministic > >>load in guest. > >> > >>Only thing I am missing is, > >>How to generate different combinations of the lock. > >> > >>Okay, let me see if I can come with a solid model for this. > >> > > > >Do you mean the various options for PLE/pvticket/other? I haven't > >thought of it and assumed its static but it can also be controlled > >through the temporary /sys interface. > > > > No, I am not there yet. > > So In summary, we are suffering with inconsistent benchmark result, > while measuring the benefit of our improvement in PLE/pvlock etc.. Are you measuring the combined throughput of all running guests, or just looking at the results of the benchmarks in a single test guest? I've done some benchmarking as well and my stddevs look pretty good for kcbench, ebizzy, dbench, and sysbench-memory. 
I do 5 runs for each overcommit level (1.0 - 3.0, stepped by .25 or .5), and 2 runs of that full sequence of tests (one with the overcommit levels in scrambled order). The relative stddevs for each of the sets of 5 runs look pretty good, and the data for the 2 runs match nicely as well. To try and get consistent results I do the following - interleave the memory of all guests across all numa nodes on the machine - echo 0 > /proc/sys/kernel/randomize_va_space on both host and test guest - echo 3 > /proc/sys/vm/drop_caches on both host and test guest before each run - use a ramdisk for the benchmark output files on all running guests - no periodically running services installed on the test guest - HT is turned off as you do, although I'd like to try running again with it turned back on Although, I still need to run again measuring the combined throughput of all running vms (including the ones launched just to generate busy vcpus). Maybe my results won't be as consistent then... Drew > > So good point from your suggestion is, > - Giving predictability to workload that runs in guest, so that we have > pi-pi comparison of improvement. > > - we can easily tune the workload via sysfs, and we can have script to > automate them. > > What is complicated is: > - How can we simulate a workload close to what we measure with > benchmarks? > - How can we mimic lock holding time/ lock hierarchy close to the way > it is seen with real workloads (for e.g. highly contended zone lru lock > with similar amount o
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/26/2012 06:27 PM, Andrew Jones wrote: On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote: On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote: On 09/24/2012 05:04 PM, Peter Zijlstra wrote: On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote: In some special scenarios like #vcpu<= #pcpu, PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU. What's the costly thing? The vm-exit, the yield (which should be a nop if its the only task there) or something else entirely? Both vmexit and yield_to() actually, because unsuccessful yield_to() overall is costly in PLE handler. This is because when we have large guests, say 32/16 vcpus, and one vcpu is holding lock, rest of the vcpus waiting for the lock, when they do PL-exit, each of the vcpu try to iterate over rest of vcpu list in the VM and try to do directed yield (unsuccessful). (O(n^2) tries). this results is fairly high amount of cpu burning and double run queue lock contention. (if they were spinning probably lock progress would have been faster). As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which seems little complex to achieve currently. OK, so the vmexit stays and we need to improve yield_to. Can't we do this check sooner as well, as it only requires per-cpu data? If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task() and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test into kvm code would allow us to do other kvm things as a result of the check in order to avoid some vmexits. It looks like we should be able to avoid some without much complexity by just making a per-vm ple_window variable, and then, when we hit the nr_running == 1 condition, also doing vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)) Reset the window to the default value when we successfully yield (and maybe we should limit the number of bumps). We indeed checked early in original undercommit patch and it has given result closer to PLE disabled case. But Agree with Peter that it is ugly to export nr_running info to ple handler. Looking at the result and comparing result of A and C, Base = 3.6.0-rc5 + ple handler optimization patches A = Base + checking rq_running in vcpu_on_spin() patch B = Base + checking rq->nr_running in sched/core C = Base - PLE % improvements w.r.t BASE ---++++ | A |B | C | ---++++ 1x | 206.37603 | 139.70410 | 210.19323 | I have a feeling that vmexit has not caused significant overhead compared to iterating over vcpus in PLE handler.. Does it not sound so? But vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)) is worth trying. I will have to see it eventually. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
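The per-VM PLE window bump suggested above could be prototyped roughly as follows. The kvm->arch.ple_window field and the constants are invented for illustration; vmcs_write32() and the PLE_WINDOW field are the real VMX interfaces, but the write has to happen per vcpu with that vcpu's VMCS loaded, so a real patch would propagate the per-VM value at the next vcpu entry rather than writing it directly like this.

#define PLE_WINDOW_DEFAULT	4096
#define PLE_WINDOW_BUMP		4096
#define PLE_WINDOW_MAX		(16 * PLE_WINDOW_DEFAULT)

/*
 * Sketch only: grow the window when a PLE exit found nothing useful to
 * yield to (e.g. the nr_running == 1 case), so the guest spins longer
 * before exiting again.
 */
static void grow_ple_window(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;

	if (kvm->arch.ple_window < PLE_WINDOW_MAX)
		kvm->arch.ple_window += PLE_WINDOW_BUMP;
	vmcs_write32(PLE_WINDOW, kvm->arch.ple_window);
}

/* Reset to the default after a successful directed yield. */
static void reset_ple_window(struct kvm_vcpu *vcpu)
{
	vcpu->kvm->arch.ple_window = PLE_WINDOW_DEFAULT;
	vmcs_write32(PLE_WINDOW, PLE_WINDOW_DEFAULT);
}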
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/26/2012 05:57 PM, Konrad Rzeszutek Wilk wrote: On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote: On 09/24/2012 02:02 PM, Raghavendra K T wrote: On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection. (even for a non pvticket lock version) Wouldn't that defeat the purpose of trying the test the different implementations that try to fix the lock-holder preemption problem? You want something that you can shoe-in for all work-loads - also for this test system. Hmm true. I think it is indeed difficult to shoe-in all workloads. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/25/2012 08:30 PM, Dor Laor wrote: On 09/24/2012 02:02 PM, Raghavendra K T wrote: On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection. (even for a non pvticket lock version) For instance, let's say you have N threads trying to grab the lock, you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation. Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking Yap This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor Thank you, I think this is an excellent idea. ( Though I am trying to put all the pieces together you mentioned). So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in guest. Only thing I am missing is, How to generate different combinations of the lock. Okay, let me see if I can come with a solid model for this. Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed its static but it can also be controlled through the temporary /sys interface. No, I am not there yet. So In summary, we are suffering with inconsistent benchmark result, while measuring the benefit of our improvement in PLE/pvlock etc.. So good point from your suggestion is, - Giving predictability to workload that runs in guest, so that we have pi-pi comparison of improvement. - we can easily tune the workload via sysfs, and we can have script to automate them. What is complicated is: - How can we simulate a workload close to what we measure with benchmarks? - How can we mimic lock holding time/ lock hierarchy close to the way it is seen with real workloads (for e.g. highly contended zone lru lock with similar amount of lockholding times). - How close it would be to when we forget about other types of spinning (for e.g, flush_tlb). So I feel it is not as trivial as it looks like. 
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/25/2012 03:40 PM, Raghavendra K T wrote: > On 09/24/2012 07:46 PM, Raghavendra K T wrote: >> On 09/24/2012 07:24 PM, Peter Zijlstra wrote: >>> On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: However Rik had a genuine concern in the cases where runqueue is not equally distributed and lockholder might actually be on a different run queue but not running. >>> >>> Load should eventually get distributed equally -- that's what the >>> load-balancer is for -- so this is a temporary situation. >>> >>> We already try and favour the non running vcpu in this case, that's what >>> yield_to_task_fair() is about. If its still not eligible to run, tough >>> luck. >> >> Yes, I agree. >> >>> Do you think instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_onlinecpus) >>> >>> To what purpose? Also, global stuff is expensive, so you should try and >>> stay away from it as hard as you possibly can. >> >> Yes, that concern only had made me to fall back to rq->nr_running. >> >> Will come back with the result soon. > > Got the result with the patches: > So here is the result, > > Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with > 1x and 2x overcommits > > Base = 3.6.0-rc5 + ple handler optimization patches > A = Base + checking rq_running in vcpu_on_spin() patch > B = Base + checking rq->nr_running in sched/core > C = Base - PLE > > ---+---+---+---+---+ >|Ebizzy result (rec/sec higher is better) | > ---+---+---+---+---+ >|Base | A | B| C | > ---+---+---+---+---+ > 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750| > 2x | 2536.2500 | 2458.5000 | 2426.3750 |48.5000| > ---+---+---+---+---+ > >% improvements w.r.t BASE > ---++++ >| A |B | C | > ---++++ > 1x | 206.37603 | 139.70410 | 210.19323 | > 2x | -3.06555 | -4.33218 | -98.08773 | > ---++++ > > we are getting the benefit of almost PLE disabled case with this > approach. With patch B, we have dropped a bit in gain. > (because we still would iterate vcpus until we decide to do a directed > yield). This gives us a good case for tracking preemption on a per-vm basis. As long as we aren't preempted, we can keep the PLE window high, and also return immediately from the handler without looking for candidates. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Wed, 2012-09-26 at 15:39 +0200, Andrew Jones wrote: > On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote: > > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote: > > > Wouldn't a clean solution be to promote a task's scheduler > > > class to the spinner class when we PLE (or come from some special > > > syscall > > > for userspace spinlocks?)? > > > > Userspace spinlocks are typically employed to avoid syscalls.. > > I'm guessing there could be a slow path - spin N times and then give > up and yield. Much better they should do a blocking futex call or so, once you do the syscall you're in kernel space anyway and have paid the transition cost. > > > > > That class would be higher priority than the > > > fair class and would schedule in FIFO order, but it would only run its > > > tasks for short periods before switching. > > > > Since lock hold times aren't limited, esp. for things like userspace > > 'spin' locks, you've got a very good denial of service / opportunity for > > abuse right there. > > Maybe add some throttling to avoid overuse/maliciousness? At which point you're pretty much back to where you started. A much better approach is using things like priority inheritance, which can be extended to cover the fair class just fine.. Also note that user-space spinning is inherently prone to live-locks when combined with the static priority RT scheduling classes. In general its a very bad idea.. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Wed, Sep 26, 2012 at 03:26:11PM +0200, Peter Zijlstra wrote: > On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote: > > Wouldn't a clean solution be to promote a task's scheduler > > class to the spinner class when we PLE (or come from some special > > syscall > > for userspace spinlocks?)? > > Userspace spinlocks are typically employed to avoid syscalls.. I'm guessing there could be a slow path - spin N times and then give up and yield. > > > That class would be higher priority than the > > fair class and would schedule in FIFO order, but it would only run its > > tasks for short periods before switching. > > Since lock hold times aren't limited, esp. for things like userspace > 'spin' locks, you've got a very good denial of service / opportunity for > abuse right there. Maybe add some throttling to avoid overuse/maliciousness? > > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Wed, 2012-09-26 at 15:20 +0200, Andrew Jones wrote: > Wouldn't a clean solution be to promote a task's scheduler > class to the spinner class when we PLE (or come from some special > syscall > for userspace spinlocks?)? Userspace spinlocks are typically employed to avoid syscalls.. > That class would be higher priority than the > fair class and would schedule in FIFO order, but it would only run its > tasks for short periods before switching. Since lock hold times aren't limited, esp. for things like userspace 'spin' locks, you've got a very good denial of service / opportunity for abuse right there. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, Sep 24, 2012 at 06:20:12PM +0200, Avi Kivity wrote: > On 09/24/2012 06:03 PM, Peter Zijlstra wrote: > > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote: > >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote: > >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: > >> >> However Rik had a genuine concern in the cases where runqueue is not > >> >> equally distributed and lockholder might actually be on a different run > >> >> queue but not running. > >> > > >> > Load should eventually get distributed equally -- that's what the > >> > load-balancer is for -- so this is a temporary situation. > >> > >> What's the expected latency? This is the whole problem. Eventually the > >> scheduler would pick the lock holder as well, the problem is that it's > >> in the millisecond scale while lock hold times are in the microsecond > >> scale, leading to a 1000x slowdown. > > > > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes > > something like accurate or fast, never both. > > > >> If we want to yield, we really want to boost someone. > > > > Now if only you knew which someone ;-) This non-modified guest nonsense > > is such a snake pit.. but you know how I feel about all that. > > Actually if I knew that in addition to boosting someone, I also unboost > myself enough to be preempted, it wouldn't matter. While boosting the > lock holder is good, the main point is not spinning and doing useful > work instead. We can detect spinners and avoid boosting them. > > That's the motivation for the "donate vruntime" approach I wanted earlier. I'll probably get shot for the suggestion, but doesn't this problem merit another scheduler class? We want FIFO order for a special class of tasks, "spinners". Wouldn't a clean solution be to promote a task's scheduler class to the spinner class when we PLE (or come from some special syscall for userspace spinlocks?)? That class would be higher priority than the fair class and would schedule in FIFO order, but it would only run its tasks for short periods before switching. Also, after each task is run its scheduler class would get reset down to its original class (fair). At least at first thought this looks to me to be cleaner than the next and skip hinting, plus it helps guarantee that the lock holder gets scheduled before the tasks waiting on that lock. Drew > > > > >> > We already try and favour the non running vcpu in this case, that's what > >> > yield_to_task_fair() is about. If its still not eligible to run, tough > >> > luck. > >> > >> Crazy idea: instead of yielding, just run that other vcpu in the thread > >> that would otherwise spin. I can see about a million objections to this > >> already though. > > > > Yah.. you want me to list a few? :-) It would require synchronization > > with the other cpu to pull its task -- one really wants to avoid it also > > running it. > > Yeah, it's quite a horrible idea. > > > > > Do this at a high enough frequency and you're dead too. > > > > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state > > associated with a vcpu thread and use the preemption notifiers to sort > > things against the scheduler or somesuch. > > That's what I thought when I wrote this, but I can't, I might be > preempted in random kvm code. So my state includes the host stack and > registers. Maybe we can special-case when we interrupt guest mode. 
> > > > >> >> Do you think instead of using rq->nr_running, we could get a global > >> >> sense of load using avenrun (something like avenrun/num_onlinecpus) > >> > > >> > To what purpose? Also, global stuff is expensive, so you should try and > >> > stay away from it as hard as you possibly can. > >> > >> Spinning is also expensive. How about we do the global stuff every N > >> times, to amortize the cost (and reduce contention)? > > > > Nah, spinning isn't expensive, its a waste of time, similar end result > > for someone who wants to do useful work though, but not the same cause. > > > > Pick N and I'll come up with a scenario for which its wrong ;-) > > Sure. But if it's rare enough, then that's okay for us. > > > Anyway, its an ugly problem and one I really want to contain inside the > > insanity that created it (virt), lets not taint the rest of the kernel > > more than we need to. > > Agreed. Though given that postgres and others use userspace spinlocks, > maybe it's not just virt. > > -- > error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, Sep 24, 2012 at 02:36:05PM +0200, Peter Zijlstra wrote: > On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote: > > On 09/24/2012 05:04 PM, Peter Zijlstra wrote: > > > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote: > > >> In some special scenarios like #vcpu<= #pcpu, PLE handler may > > >> prove very costly, because there is no need to iterate over vcpus > > >> and do unsuccessful yield_to burning CPU. > > > > > > What's the costly thing? The vm-exit, the yield (which should be a nop > > > if its the only task there) or something else entirely? > > > > > Both vmexit and yield_to() actually, > > > > because unsuccessful yield_to() overall is costly in PLE handler. > > > > This is because when we have large guests, say 32/16 vcpus, and one > > vcpu is holding lock, rest of the vcpus waiting for the lock, when they > > do PL-exit, each of the vcpu try to iterate over rest of vcpu list in > > the VM and try to do directed yield (unsuccessful). (O(n^2) tries). > > > > this results is fairly high amount of cpu burning and double run queue > > lock contention. > > > > (if they were spinning probably lock progress would have been faster). > > As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which > > seems little complex to achieve currently. > > OK, so the vmexit stays and we need to improve yield_to. Can't we do this check sooner as well, as it only requires per-cpu data? If we do it way back in kvm_vcpu_on_spin, then we avoid get_pid_task() and a bunch of read barriers from kvm_for_each_vcpu. Also, moving the test into kvm code would allow us to do other kvm things as a result of the check in order to avoid some vmexits. It looks like we should be able to avoid some without much complexity by just making a per-vm ple_window variable, and then, when we hit the nr_running == 1 condition, also doing vmcs_write32(PLE_WINDOW, (kvm->ple_window += PLE_WINDOW_BUMP)) Reset the window to the default value when we successfully yield (and maybe we should limit the number of bumps). Drew > > How about something like the below, that would allow breaking out of the > for-each-vcpu loop and simply going back into the vm, right? > > --- > kernel/sched/core.c | 25 +++-- > 1 file changed, 19 insertions(+), 6 deletions(-) > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c > index b38f00e..5d5b355 100644 > --- a/kernel/sched/core.c > +++ b/kernel/sched/core.c > @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield); > * It's the caller's job to ensure that the target task struct > * can't go away on us before we can do any checks. > * > - * Returns true if we indeed boosted the target task. > + * Returns: > + * true (>0) if we indeed boosted the target task. > + * false (0) if we failed to boost the target. > + * -ESRCH if there's no task to yield to. > */ > bool __sched yield_to(struct task_struct *p, bool preempt) > { > @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool > preempt) > local_irq_save(flags); > rq = this_rq(); > > + /* > + * If we're the only runnable task on the rq, there's absolutely no > + * point in yielding. 
> + */ > + if (rq->nr_running == 1) { > + yielded = -ESRCH; > + goto out_irq; > + } > + > again: > p_rq = task_rq(p); > double_rq_lock(rq, p_rq); > @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool > preempt) > } > > if (!curr->sched_class->yield_to_task) > - goto out; > + goto out_unlock; > > if (curr->sched_class != p->sched_class) > - goto out; > + goto out_unlock; > > if (task_running(p_rq, p) || p->state) > - goto out; > + goto out_unlock; > > yielded = curr->sched_class->yield_to_task(rq, p, preempt); > if (yielded) { > @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool > preempt) > resched_task(p_rq->curr); > } > > -out: > +out_unlock: > double_rq_unlock(rq, p_rq); > +out_irq: > local_irq_restore(flags); > > - if (yielded) > + if (yielded > 0) > schedule(); > > return yielded; > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
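A sketch of the per-VM ple_window bump suggested above. The kvm->ple_window field and the BUMP/MAX constants are invented for illustration, and the vmcs_write32() would have to run on the vcpu whose VMCS is currently loaded; this is not a posted patch.

#define KVM_PLE_WINDOW_DEFAULT	4096
#define KVM_PLE_WINDOW_BUMP	4096
#define KVM_PLE_WINDOW_MAX	(16 * KVM_PLE_WINDOW_DEFAULT)

/* Called from the PLE handler when rq->nr_running == 1: widen the window
 * so this guest exits less often while it is not overcommitted. */
static void grow_ple_window(struct kvm_vcpu *vcpu)
{
	struct kvm *kvm = vcpu->kvm;	/* kvm->ple_window is hypothetical */

	if (kvm->ple_window < KVM_PLE_WINDOW_MAX)
		kvm->ple_window += KVM_PLE_WINDOW_BUMP;
	vmcs_write32(PLE_WINDOW, kvm->ple_window);
}

/* Called after a successful directed yield: contention is real again,
 * go back to the default window. */
static void reset_ple_window(struct kvm_vcpu *vcpu)
{
	vcpu->kvm->ple_window = KVM_PLE_WINDOW_DEFAULT;
	vmcs_write32(PLE_WINDOW, KVM_PLE_WINDOW_DEFAULT);
}

The cap on the window doubles as the "limit the number of bumps" Drew mentions.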
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Tue, Sep 25, 2012 at 05:00:30PM +0200, Dor Laor wrote: > On 09/24/2012 02:02 PM, Raghavendra K T wrote: > >On 09/24/2012 02:12 PM, Dor Laor wrote: > >>In order to help PLE and pvticketlock converge I thought that a small > >>test code should be developed to test this in a predictable, > >>deterministic way. > >> > >>The idea is to have a guest kernel module that spawn a new thread each > >>time you write to a /sys/ entry. > >> > >>Each such a thread spins over a spin lock. The specific spin lock is > >>also chosen by the /sys/ interface. Let's say we have an array of spin > >>locks *10 times the amount of vcpus. > >> > >>All the threads are running a > >>while (1) { > >> > >>spin_lock(my_lock); > >>sum += execute_dummy_cpu_computation(time); > >>spin_unlock(my_lock); > >> > >>if (sys_tells_thread_to_die()) break; > >>} > >> > >>print_result(sum); > >> > >>Instead of calling the kernel's spin_lock functions, clone them and make > >>the ticket lock order deterministic and known (like a linear walk of all > >>the threads trying to catch that lock). > > > >By Cloning you mean hierarchy of the locks? > > No, I meant to clone the implementation of the current spin lock > code in order to set any order you may like for the ticket > selection. > (even for a non pvticket lock version) Wouldn't that defeat the purpose of trying the test the different implementations that try to fix the lock-holder preemption problem? You want something that you can shoe-in for all work-loads - also for this test system. > > For instance, let's say you have N threads trying to grab the lock, > you can always make the ticket go linearly from 1->2...->N. > Not sure it's a good idea, just a recommendation. So round-robin. Could you make NCPUS threads, pin them to CPUs, and set them to be SCHED_RR? Or NCPUS*2 to overcommit. > > >Also I believe time should be passed via sysfs / hardcoded for each > >type of lock we are mimicking > > Yap > > > > >> > >>This way you can easy calculate: > >>1. the score of a single vcpu running a single thread > >>2. the score of sum of all thread scores when #thread==#vcpu all > >>taking the same spin lock. The overall sum should be close as > >>possible to #1. > >>3. Like #2 but #threads > #vcpus and other versions of #total vcpus > >>(belonging to all VMs) > #pcpus. > >>4. Create #thread == #vcpus but let each thread have it's own spin > >>lock > >>5. Like 4 + 2 > >> > >>Hopefully this way will allows you to judge and evaluate the exact > >>overhead of scheduling VMs and threads since you have the ideal result > >>in hand and you know what the threads are doing. > >> > >>My 2 cents, Dor > >> > > > >Thank you, > >I think this is an excellent idea. ( Though I am trying to put all the > >pieces together you mentioned). So overall we should be able to measure > >the performance of pvspinlock/PLE improvements with a deterministic > >load in guest. > > > >Only thing I am missing is, > >How to generate different combinations of the lock. > > > >Okay, let me see if I can come with a solid model for this. > > > > Do you mean the various options for PLE/pvticket/other? I haven't > thought of it and assumed its static but it can also be controlled > through the temporary /sys interface. > > Thanks for following up! 
> Dor
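For the "NCPUS threads, pinned, SCHED_RR" idea above, a small userspace sketch (a hypothetical harness, not part of the proposed guest module): it only shows the pinning and round-robin policy, not the deterministic ticket ordering, and it needs CAP_SYS_NICE to set SCHED_RR.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

#define NTHREADS 4

static pthread_spinlock_t lock;
static volatile int stop;
static unsigned long counts[NTHREADS];

static void *spinner(void *arg)
{
	long id = (long)arg;

	while (!stop) {
		pthread_spin_lock(&lock);
		counts[id]++;			/* stand-in for the dummy computation */
		pthread_spin_unlock(&lock);
	}
	return NULL;
}

int main(void)
{
	pthread_t tids[NTHREADS];
	struct sched_param sp = { .sched_priority = 1 };
	long i;

	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (i = 0; i < NTHREADS; i++) {
		pthread_attr_t attr;
		cpu_set_t cpus;

		pthread_attr_init(&attr);
		CPU_ZERO(&cpus);
		CPU_SET(i, &cpus);		/* pin thread i to cpu i */
		pthread_attr_setaffinity_np(&attr, sizeof(cpus), &cpus);
		pthread_create(&tids[i], &attr, spinner, (void *)i);
		pthread_setschedparam(tids[i], SCHED_RR, &sp);	/* round-robin */
	}
	sleep(10);
	stop = 1;
	for (i = 0; i < NTHREADS; i++) {
		pthread_join(tids[i], NULL);
		printf("thread %ld: %lu iterations\n", i, counts[i]);
	}
	return 0;
}

The NCPUS*2 overcommit variant would pin threads modulo the cpu count.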
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 02:02 PM, Raghavendra K T wrote: On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? No, I meant to clone the implementation of the current spin lock code in order to set any order you may like for the ticket selection. (even for a non pvticket lock version) For instance, let's say you have N threads trying to grab the lock, you can always make the ticket go linearly from 1->2...->N. Not sure it's a good idea, just a recommendation. Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking Yap This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor Thank you, I think this is an excellent idea. ( Though I am trying to put all the pieces together you mentioned). So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in guest. Only thing I am missing is, How to generate different combinations of the lock. Okay, let me see if I can come with a solid model for this. Do you mean the various options for PLE/pvticket/other? I haven't thought of it and assumed its static but it can also be controlled through the temporary /sys interface. Thanks for following up! Dor -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 07:46 PM, Raghavendra K T wrote:
On 09/24/2012 07:24 PM, Peter Zijlstra wrote:
On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote:

However Rik had a genuine concern in the cases where runqueue is not equally distributed and the lockholder might actually be on a different run queue but not running.

Load should eventually get distributed equally -- that's what the load-balancer is for -- so this is a temporary situation.

We already try and favour the non running vcpu in this case, that's what yield_to_task_fair() is about. If its still not eligible to run, tough luck.

Yes, I agree.

Do you think instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_online_cpus)?

To what purpose? Also, global stuff is expensive, so you should try and stay away from it as hard as you possibly can.

Yes, that concern only had made me fall back to rq->nr_running.

Will come back with the result soon.

Got the result with the patches. So here is the result:

Tried this on a 32 core ple box with HT disabled. 32 guest vcpus with 1x and 2x overcommits.

Base = 3.6.0-rc5 + ple handler optimization patches
A = Base + checking rq_running in vcpu_on_spin() patch
B = Base + checking rq->nr_running in sched/core
C = Base - PLE

             Ebizzy result (rec/sec, higher is better)
     +-----------+-----------+-----------+-----------+
     |   Base    |     A     |     B     |     C     |
+----+-----------+-----------+-----------+-----------+
| 1x | 2374.1250 | 7273.7500 | 5690.8750 | 7364.3750 |
| 2x | 2536.2500 | 2458.5000 | 2426.3750 |   48.5000 |
+----+-----------+-----------+-----------+-----------+

             % improvements w.r.t. Base
     +-----------+-----------+-----------+
     |     A     |     B     |     C     |
+----+-----------+-----------+-----------+
| 1x | 206.37603 | 139.70410 | 210.19323 |
| 2x |  -3.06555 |  -4.33218 | -98.08773 |
+----+-----------+-----------+-----------+

We are getting almost the benefit of the PLE-disabled case with this approach. With patch B, we have dropped a bit in gain (because we still would iterate over vcpus until we decide to do a directed yield).
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 06:03 PM, Peter Zijlstra wrote: > On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote: >> On 09/24/2012 03:54 PM, Peter Zijlstra wrote: >> > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: >> >> However Rik had a genuine concern in the cases where runqueue is not >> >> equally distributed and lockholder might actually be on a different run >> >> queue but not running. >> > >> > Load should eventually get distributed equally -- that's what the >> > load-balancer is for -- so this is a temporary situation. >> >> What's the expected latency? This is the whole problem. Eventually the >> scheduler would pick the lock holder as well, the problem is that it's >> in the millisecond scale while lock hold times are in the microsecond >> scale, leading to a 1000x slowdown. > > Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes > something like accurate or fast, never both. > >> If we want to yield, we really want to boost someone. > > Now if only you knew which someone ;-) This non-modified guest nonsense > is such a snake pit.. but you know how I feel about all that. Actually if I knew that in addition to boosting someone, I also unboost myself enough to be preempted, it wouldn't matter. While boosting the lock holder is good, the main point is not spinning and doing useful work instead. We can detect spinners and avoid boosting them. That's the motivation for the "donate vruntime" approach I wanted earlier. > >> > We already try and favour the non running vcpu in this case, that's what >> > yield_to_task_fair() is about. If its still not eligible to run, tough >> > luck. >> >> Crazy idea: instead of yielding, just run that other vcpu in the thread >> that would otherwise spin. I can see about a million objections to this >> already though. > > Yah.. you want me to list a few? :-) It would require synchronization > with the other cpu to pull its task -- one really wants to avoid it also > running it. Yeah, it's quite a horrible idea. > > Do this at a high enough frequency and you're dead too. > > Anyway, you can do this inside the KVM stuff, simply flip the vcpu state > associated with a vcpu thread and use the preemption notifiers to sort > things against the scheduler or somesuch. That's what I thought when I wrote this, but I can't, I might be preempted in random kvm code. So my state includes the host stack and registers. Maybe we can special-case when we interrupt guest mode. > >> >> Do you think instead of using rq->nr_running, we could get a global >> >> sense of load using avenrun (something like avenrun/num_onlinecpus) >> > >> > To what purpose? Also, global stuff is expensive, so you should try and >> > stay away from it as hard as you possibly can. >> >> Spinning is also expensive. How about we do the global stuff every N >> times, to amortize the cost (and reduce contention)? > > Nah, spinning isn't expensive, its a waste of time, similar end result > for someone who wants to do useful work though, but not the same cause. > > Pick N and I'll come up with a scenario for which its wrong ;-) Sure. But if it's rare enough, then that's okay for us. > Anyway, its an ugly problem and one I really want to contain inside the > insanity that created it (virt), lets not taint the rest of the kernel > more than we need to. Agreed. Though given that postgres and others use userspace spinlocks, maybe it's not just virt. 
--
error compiling committee.c: too many arguments to function
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 17:51 +0200, Avi Kivity wrote: > On 09/24/2012 03:54 PM, Peter Zijlstra wrote: > > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: > >> However Rik had a genuine concern in the cases where runqueue is not > >> equally distributed and lockholder might actually be on a different run > >> queue but not running. > > > > Load should eventually get distributed equally -- that's what the > > load-balancer is for -- so this is a temporary situation. > > What's the expected latency? This is the whole problem. Eventually the > scheduler would pick the lock holder as well, the problem is that it's > in the millisecond scale while lock hold times are in the microsecond > scale, leading to a 1000x slowdown. Yeah I know.. Heisenberg's uncertainty applied to SMP computing becomes something like accurate or fast, never both. > If we want to yield, we really want to boost someone. Now if only you knew which someone ;-) This non-modified guest nonsense is such a snake pit.. but you know how I feel about all that. > > We already try and favour the non running vcpu in this case, that's what > > yield_to_task_fair() is about. If its still not eligible to run, tough > > luck. > > Crazy idea: instead of yielding, just run that other vcpu in the thread > that would otherwise spin. I can see about a million objections to this > already though. Yah.. you want me to list a few? :-) It would require synchronization with the other cpu to pull its task -- one really wants to avoid it also running it. Do this at a high enough frequency and you're dead too. Anyway, you can do this inside the KVM stuff, simply flip the vcpu state associated with a vcpu thread and use the preemption notifiers to sort things against the scheduler or somesuch. > >> Do you think instead of using rq->nr_running, we could get a global > >> sense of load using avenrun (something like avenrun/num_onlinecpus) > > > > To what purpose? Also, global stuff is expensive, so you should try and > > stay away from it as hard as you possibly can. > > Spinning is also expensive. How about we do the global stuff every N > times, to amortize the cost (and reduce contention)? Nah, spinning isn't expensive, its a waste of time, similar end result for someone who wants to do useful work though, but not the same cause. Pick N and I'll come up with a scenario for which its wrong ;-) Anyway, its an ugly problem and one I really want to contain inside the insanity that created it (virt), lets not taint the rest of the kernel more than we need to. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 03:54 PM, Peter Zijlstra wrote: > On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: >> However Rik had a genuine concern in the cases where runqueue is not >> equally distributed and lockholder might actually be on a different run >> queue but not running. > > Load should eventually get distributed equally -- that's what the > load-balancer is for -- so this is a temporary situation. What's the expected latency? This is the whole problem. Eventually the scheduler would pick the lock holder as well, the problem is that it's in the millisecond scale while lock hold times are in the microsecond scale, leading to a 1000x slowdown. If we want to yield, we really want to boost someone. > We already try and favour the non running vcpu in this case, that's what > yield_to_task_fair() is about. If its still not eligible to run, tough > luck. Crazy idea: instead of yielding, just run that other vcpu in the thread that would otherwise spin. I can see about a million objections to this already though. >> Do you think instead of using rq->nr_running, we could get a global >> sense of load using avenrun (something like avenrun/num_onlinecpus) > > To what purpose? Also, global stuff is expensive, so you should try and > stay away from it as hard as you possibly can. Spinning is also expensive. How about we do the global stuff every N times, to amortize the cost (and reduce contention)? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 07:24 PM, Peter Zijlstra wrote: On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: However Rik had a genuine concern in the cases where runqueue is not equally distributed and lockholder might actually be on a different run queue but not running. Load should eventually get distributed equally -- that's what the load-balancer is for -- so this is a temporary situation. We already try and favour the non running vcpu in this case, that's what yield_to_task_fair() is about. If its still not eligible to run, tough luck. Yes, I agree. Do you think instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_onlinecpus) To what purpose? Also, global stuff is expensive, so you should try and stay away from it as hard as you possibly can. Yes, that concern only had made me to fall back to rq->nr_running. Will come back with the result soon. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 18:59 +0530, Raghavendra K T wrote: > However Rik had a genuine concern in the cases where runqueue is not > equally distributed and lockholder might actually be on a different run > queue but not running. Load should eventually get distributed equally -- that's what the load-balancer is for -- so this is a temporary situation. We already try and favour the non running vcpu in this case, that's what yield_to_task_fair() is about. If its still not eligible to run, tough luck. > Do you think instead of using rq->nr_running, we could get a global > sense of load using avenrun (something like avenrun/num_onlinecpus) To what purpose? Also, global stuff is expensive, so you should try and stay away from it as hard as you possibly can. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 06:06 PM, Peter Zijlstra wrote: On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote: On 09/24/2012 05:04 PM, Peter Zijlstra wrote: On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote: In some special scenarios like #vcpu<= #pcpu, PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU. What's the costly thing? The vm-exit, the yield (which should be a nop if its the only task there) or something else entirely? Both vmexit and yield_to() actually, because unsuccessful yield_to() overall is costly in PLE handler. This is because when we have large guests, say 32/16 vcpus, and one vcpu is holding lock, rest of the vcpus waiting for the lock, when they do PL-exit, each of the vcpu try to iterate over rest of vcpu list in the VM and try to do directed yield (unsuccessful). (O(n^2) tries). this results is fairly high amount of cpu burning and double run queue lock contention. (if they were spinning probably lock progress would have been faster). As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which seems little complex to achieve currently. OK, so the vmexit stays and we need to improve yield_to. How about something like the below, that would allow breaking out of the for-each-vcpu loop and simply going back into the vm, right? --- kernel/sched/core.c | 25 +++-- 1 file changed, 19 insertions(+), 6 deletions(-) diff --git a/kernel/sched/core.c b/kernel/sched/core.c index b38f00e..5d5b355 100644 --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield); * It's the caller's job to ensure that the target task struct * can't go away on us before we can do any checks. * - * Returns true if we indeed boosted the target task. + * Returns: + * true (>0) if we indeed boosted the target task. + * false (0) if we failed to boost the target. + * -ESRCH if there's no task to yield to. */ bool __sched yield_to(struct task_struct *p, bool preempt) { @@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt) local_irq_save(flags); rq = this_rq(); + /* +* If we're the only runnable task on the rq, there's absolutely no +* point in yielding. +*/ + if (rq->nr_running == 1) { + yielded = -ESRCH; + goto out_irq; + } + again: p_rq = task_rq(p); double_rq_lock(rq, p_rq); @@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt) } if (!curr->sched_class->yield_to_task) - goto out; + goto out_unlock; if (curr->sched_class != p->sched_class) - goto out; + goto out_unlock; if (task_running(p_rq, p) || p->state) - goto out; + goto out_unlock; yielded = curr->sched_class->yield_to_task(rq, p, preempt); if (yielded) { @@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt) resched_task(p_rq->curr); } -out: +out_unlock: double_rq_unlock(rq, p_rq); +out_irq: local_irq_restore(flags); - if (yielded) + if (yielded> 0) schedule(); return yielded; Yes, I think this is a nice idea. Any future users of yield_to also would benefit from this. we will have to iterate only till first attempt to yield_to. I 'll run the test with this patch. However Rik had a genuine concern in the cases where runqueue is not equally distributed and lockholder might actually be on a different run queue but not running. 
Do you think that instead of using rq->nr_running, we could get a global sense of load using avenrun (something like avenrun/num_online_cpus)?
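For reference, a sketch of what the avenrun-based check being floated here could look like; host_looks_overcommitted() is a made-up helper, and the global average is only recomputed every few seconds, which is part of the cost/staleness concern raised in this thread.

#include <linux/sched.h>	/* get_avenrun(), FIXED_1 */
#include <linux/cpumask.h>	/* num_online_cpus() */

/* Made-up helper: treat the host as overcommitted when the 1-minute load
 * average exceeds one runnable task per online cpu.  get_avenrun() reports
 * fixed-point values; FIXED_1/200 is the same rounding /proc/loadavg uses. */
static bool host_looks_overcommitted(void)
{
	unsigned long loads[3];

	get_avenrun(loads, FIXED_1 / 200, 0);
	return loads[0] > (unsigned long)num_online_cpus() * FIXED_1;
}

The PLE handler could bail out early when this returns false, but the value lags the actual runqueue state by up to the load-average sampling period.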
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Mon, 2012-09-24 at 17:22 +0530, Raghavendra K T wrote:
> On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
> > On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:
> >> In some special scenarios like #vcpu <= #pcpu, PLE handler may
> >> prove very costly, because there is no need to iterate over vcpus
> >> and do unsuccessful yield_to burning CPU.
> >
> > What's the costly thing? The vm-exit, the yield (which should be a nop
> > if its the only task there) or something else entirely?
>
> Both vmexit and yield_to() actually,
>
> because unsuccessful yield_to() overall is costly in PLE handler.
>
> This is because when we have large guests, say 32/16 vcpus, and one
> vcpu is holding lock, rest of the vcpus waiting for the lock, when they
> do PL-exit, each of the vcpu try to iterate over rest of vcpu list in
> the VM and try to do directed yield (unsuccessful). (O(n^2) tries).
>
> this results is fairly high amount of cpu burning and double run queue
> lock contention.
>
> (if they were spinning probably lock progress would have been faster).
> As Avi/Chegu Vinod had felt it is better to avoid vmexit itself, which
> seems little complex to achieve currently.

OK, so the vmexit stays and we need to improve yield_to.

How about something like the below, that would allow breaking out of the
for-each-vcpu loop and simply going back into the vm, right?

---
 kernel/sched/core.c | 25 +++++++++++++++++++------
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b38f00e..5d5b355 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4272,7 +4272,10 @@ EXPORT_SYMBOL(yield);
  * It's the caller's job to ensure that the target task struct
  * can't go away on us before we can do any checks.
  *
- * Returns true if we indeed boosted the target task.
+ * Returns:
+ * true (>0) if we indeed boosted the target task.
+ * false (0) if we failed to boost the target.
+ * -ESRCH if there's no task to yield to.
  */
 bool __sched yield_to(struct task_struct *p, bool preempt)
 {
@@ -4284,6 +4287,15 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	local_irq_save(flags);
 	rq = this_rq();
 
+	/*
+	 * If we're the only runnable task on the rq, there's absolutely no
+	 * point in yielding.
+	 */
+	if (rq->nr_running == 1) {
+		yielded = -ESRCH;
+		goto out_irq;
+	}
+
 again:
 	p_rq = task_rq(p);
 	double_rq_lock(rq, p_rq);
@@ -4293,13 +4305,13 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 	}
 
 	if (!curr->sched_class->yield_to_task)
-		goto out;
+		goto out_unlock;
 
 	if (curr->sched_class != p->sched_class)
-		goto out;
+		goto out_unlock;
 
 	if (task_running(p_rq, p) || p->state)
-		goto out;
+		goto out_unlock;
 
 	yielded = curr->sched_class->yield_to_task(rq, p, preempt);
 	if (yielded) {
@@ -4312,11 +4324,12 @@ bool __sched yield_to(struct task_struct *p, bool preempt)
 		resched_task(p_rq->curr);
 	}
 
-out:
+out_unlock:
 	double_rq_unlock(rq, p_rq);
+out_irq:
 	local_irq_restore(flags);
 
-	if (yielded)
+	if (yielded > 0)
 		schedule();
 
 	return yielded;
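A sketch of how the PLE handler could consume the tri-state return value in the patch above; vcpu_eligible() stands in for the candidate filtering already done in virt/kvm/kvm_main.c, and kvm_vcpu_yield_to() is assumed to forward yield_to()'s return value.

void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int yielded;
	int i;

	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (vcpu == me || !vcpu_eligible(vcpu))	/* hypothetical filter */
			continue;

		yielded = kvm_vcpu_yield_to(vcpu);
		if (yielded > 0)
			break;		/* boosted a candidate, done */
		if (yielded == -ESRCH)
			break;		/* alone on this runqueue: stop searching, reenter the guest */
		/* yielded == 0: candidate not usable, try the next vcpu */
	}
}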
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 02:12 PM, Dor Laor wrote: In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). By Cloning you mean hierarchy of the locks? Also I believe time should be passed via sysfs / hardcoded for each type of lock we are mimicking This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor Thank you, I think this is an excellent idea. ( Though I am trying to put all the pieces together you mentioned). So overall we should be able to measure the performance of pvspinlock/PLE improvements with a deterministic load in guest. Only thing I am missing is, How to generate different combinations of the lock. Okay, let me see if I can come with a solid model for this. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/24/2012 05:04 PM, Peter Zijlstra wrote:
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, the PLE handler may prove very costly, because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU.

What's the costly thing? The vm-exit, the yield (which should be a nop if it's the only task there) or something else entirely?

Both the vmexit and yield_to(), actually, because unsuccessful yield_to() is overall costly in the PLE handler.

This is because when we have large guests, say 32/16 vcpus, and one vcpu is holding a lock with the rest of the vcpus waiting for it, then when they do a PL-exit, each of the vcpus tries to iterate over the rest of the vcpu list in the VM and do a directed yield (unsuccessfully) -- O(n^2) tries.

This results in a fairly high amount of CPU burning and double run queue lock contention.

(If they were spinning, lock progress would probably have been faster.) As Avi/Chegu Vinod had felt, it is better to avoid the vmexit itself, which seems a little complex to achieve currently.
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On Fri, 2012-09-21 at 17:29 +0530, Raghavendra K T wrote: > In some special scenarios like #vcpu <= #pcpu, PLE handler may > prove very costly, because there is no need to iterate over vcpus > and do unsuccessful yield_to burning CPU. What's the costly thing? The vm-exit, the yield (which should be a nop if its the only task there) or something else entirely? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
In order to help PLE and pvticketlock converge I thought that a small test code should be developed to test this in a predictable, deterministic way. The idea is to have a guest kernel module that spawn a new thread each time you write to a /sys/ entry. Each such a thread spins over a spin lock. The specific spin lock is also chosen by the /sys/ interface. Let's say we have an array of spin locks *10 times the amount of vcpus. All the threads are running a while (1) { spin_lock(my_lock); sum += execute_dummy_cpu_computation(time); spin_unlock(my_lock); if (sys_tells_thread_to_die()) break; } print_result(sum); Instead of calling the kernel's spin_lock functions, clone them and make the ticket lock order deterministic and known (like a linear walk of all the threads trying to catch that lock). This way you can easy calculate: 1. the score of a single vcpu running a single thread 2. the score of sum of all thread scores when #thread==#vcpu all taking the same spin lock. The overall sum should be close as possible to #1. 3. Like #2 but #threads > #vcpus and other versions of #total vcpus (belonging to all VMs) > #pcpus. 4. Create #thread == #vcpus but let each thread have it's own spin lock 5. Like 4 + 2 Hopefully this way will allows you to judge and evaluate the exact overhead of scheduling VMs and threads since you have the ideal result in hand and you know what the threads are doing. My 2 cents, Dor On 09/21/2012 08:36 PM, Raghavendra K T wrote: On 09/21/2012 06:48 PM, Chegu Vinod wrote: On 9/21/2012 4:59 AM, Raghavendra K T wrote: In some special scenarios like #vcpu <= #pcpu, PLE handler may prove very costly, Yes. because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU. An idea to solve this is: 1) As Avi had proposed we can modify hardware ple_window dynamically to avoid frequent PL-exit. Yes. We had to do this to get around some scaling issues for large (>20way) guests (with no overcommitment) Do you mean you already have some solution tested for this? As part of some experimentation we even tried "switching off" PLE too :( Honestly, Your this experiment and Andrew Theurer's observations were the motivation for this patch. (IMHO, it is difficult to decide when we have mixed type of VMs). Agree. Not sure if the following alternatives have also been looked at : - Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence). The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier) ? Ccing Drew also. We had a good discussion on this idea last time. (sorry that I forgot to include in patch series) May be a good idea when we know the load in advance.. - Can the PLE feature ( in VT) be "enhanced" to be made a per guest attribute ? IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc. I entirely agree on this point. (though have not tried above approaches). Hope to see more expert opinions pouring in. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 09/21/2012 06:48 PM, Chegu Vinod wrote: On 9/21/2012 4:59 AM, Raghavendra K T wrote: In some special scenarios like #vcpu <= #pcpu, PLE handler may prove very costly, Yes. because there is no need to iterate over vcpus and do unsuccessful yield_to burning CPU. An idea to solve this is: 1) As Avi had proposed we can modify hardware ple_window dynamically to avoid frequent PL-exit. Yes. We had to do this to get around some scaling issues for large (>20way) guests (with no overcommitment) Do you mean you already have some solution tested for this? As part of some experimentation we even tried "switching off" PLE too :( Honestly, Your this experiment and Andrew Theurer's observations were the motivation for this patch. (IMHO, it is difficult to decide when we have mixed type of VMs). Agree. Not sure if the following alternatives have also been looked at : - Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence). The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier) ? Ccing Drew also. We had a good discussion on this idea last time. (sorry that I forgot to include in patch series) May be a good idea when we know the load in advance.. - Can the PLE feature ( in VT) be "enhanced" to be made a per guest attribute ? IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc. I entirely agree on this point. (though have not tried above approaches). Hope to see more expert opinions pouring in. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 0/2] kvm: Improving undercommit,overcommit scenarios in PLE handler
On 9/21/2012 4:59 AM, Raghavendra K T wrote:

In some special scenarios like #vcpu <= #pcpu, PLE handler may prove very costly,

Yes.

because there is no need to iterate over vcpus and do unsuccessful yield_to, burning CPU. An idea to solve this is:
1) As Avi had proposed, we can modify the hardware ple_window dynamically to avoid frequent PL-exits.

Yes. We had to do this to get around some scaling issues for large (>20-way) guests (with no overcommitment). As part of some experimentation we even tried "switching off" PLE too :(

(IMHO, it is difficult to decide when we have mixed types of VMs).

Agree. Not sure if the following alternatives have also been looked at:

- Could the behavior associated with the "ple_window" be modified to be a function of some [new] per-guest attribute (which can be conveyed to the host as part of the guest launch sequence)? The user can choose to set this [new] attribute for a given guest. This would help avoid the frequent exits due to PLE (as Avi had mentioned earlier).

- Can the PLE feature (in VT) be "enhanced" to be made a per-guest attribute?

IMHO, the approach of not taking a frequent exit is better than taking an exit and returning back from the handler etc.

Thanks
Vinod

Another idea, proposed in the first patch, is to identify the non-overcommit case and just return from the PLE handler. There are many ways to identify a non-overcommit scenario:

1) Using loadavg etc. (get_avenrun/calc_global_load/this_cpu_load)
2) Explicitly checking nr_running()/num_online_cpus()
3) Checking the source vcpu's runqueue length.

Not sure how we can make use of (1) effectively / how to use it. (2) has significant overhead since it iterates over all cpus. So this patch uses the third method. (I feel it is uglier to export the runqueue length, but am expecting suggestions on this.)

In the second patch: when we have a large number of small guests, it is possible that a spinning vcpu fails to yield_to any vcpu of the same VM and goes back to spinning. This is also not effective when we are over-committed. Instead, we do a schedule() so that we give other VMs a chance to run.

Raghavendra K T (2):
  Handle undercommitted guest case in PLE handler
  Be courteous to other VMs in overcommitted scenario in PLE handler

Results:
base    = 3.6.0-rc5 + ple handler optimization patches from kvm tree.
patched = base + patch1 + patch2
machine: x240 with 16 cores, HT enabled (32 cpu threads).
32 vcpu guest with 8GB RAM.

              ebizzy (records/sec, higher is better)
+------------+-----------+------------+-----------+-----------+
|    base    |   stddev  |   patched  |   stdev   |  %improve |
+------------+-----------+------------+-----------+-----------+
| 11293.3750 |  624.4378 | 18209.6250 |  371.7061 |  61.24166 |
|  3641.8750 |  468.9400 |  3725.5000 |  253.7823 |   2.29621 |
+------------+-----------+------------+-----------+-----------+

              kernbench (time in sec, lower is better)
+------------+-----------+------------+-----------+-----------+
|    base    |   stddev  |   patched  |   stdev   |  %improve |
+------------+-----------+------------+-----------+-----------+
|    30.6020 |    1.3018 |    30.8287 |    1.1517 |  -0.74080 |
|    64.0825 |    2.3764 |    63.4721 |    5.0191 |   0.95252 |
|    95.8638 |    8.7030 |    94.5988 |    8.3832 |   1.31958 |
+------------+-----------+------------+-----------+-----------+

Note: on an mx3850x5 machine with 32 cores and HT disabled I got around 209% (ebizzy) and 6% (kernbench) improvement for the 1x scenario.

Thanks to Srikar for his active participation in discussing ideas and reviewing the patch.

Please let me know your suggestions and comments.
---
 include/linux/sched.h |    1 +
 kernel/sched/core.c   |    6 ++++++
 virt/kvm/kvm_main.c   |    7 +++++++
 3 files changed, 14 insertions(+), 0 deletions(-)
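To make the two RFC ideas concrete, a hedged sketch follows; rq_nr_running() is a stand-in for whatever interface the series actually uses to expose the current runqueue length, and the handler body is paraphrased rather than copied from the posted patches.

/* kernel/sched/core.c: hypothetical helper exposing the length of the
 * current cpu's runqueue to the PLE handler. */
unsigned long rq_nr_running(void)
{
	return this_rq()->nr_running;
}
EXPORT_SYMBOL_GPL(rq_nr_running);

/* virt/kvm/kvm_main.c: how the two patches could use it. */
void kvm_vcpu_on_spin(struct kvm_vcpu *me)
{
	bool yielded = false;

	/* Patch 1 idea: undercommit.  This vcpu is alone on its runqueue,
	 * so iterating over the other vcpus cannot boost anything useful;
	 * just go back to the guest. */
	if (rq_nr_running() == 1)
		return;

	/* ... existing directed-yield search sets 'yielded' ... */

	/* Patch 2 idea: overcommit, but no eligible target was found.
	 * Be courteous and let some other VM's vcpu run instead of spinning. */
	if (!yielded)
		schedule();
}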