Re: [PATCH 1/2] ARM: KVM: Yield CPU when vcpu executes a WFE

Anup Patel Wed, 09 Oct 2013 07:53:23 -0700

On Wed, Oct 9, 2013 at 8:20 PM, Anup Patel <a...@brainfault.org> wrote:
> On Wed, Oct 9, 2013 at 7:48 PM, Marc Zyngier <marc.zyng...@arm.com> wrote:
>> On 09/10/13 14:26, Gleb Natapov wrote:
>>> On Wed, Oct 09, 2013 at 03:09:54PM +0200, Alexander Graf wrote:
>>>>
>>>> On 07.10.2013, at 18:53, Gleb Natapov <g...@redhat.com> wrote:
>>>>
>>>>> On Mon, Oct 07, 2013 at 06:30:04PM +0200, Alexander Graf wrote:
>>>>>>
>>>>>> On 07.10.2013, at 18:16, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>>>>>
>>>>>>> On 07/10/13 17:04, Alexander Graf wrote:
>>>>>>>>
>>>>>>>> On 07.10.2013, at 17:40, Marc Zyngier <marc.zyng...@arm.com> wrote:
>>>>>>>>
>>>>>>>>> On an (even slightly) oversubscribed system, spinlocks are quickly
>>>>>>>>> becoming a bottleneck, as some vcpus are spinning, waiting for a
>>>>>>>>> lock to be released, while the vcpu holding the lock may not be
>>>>>>>>> running at all.
>>>>>>>>>
>>>>>>>>> This creates contention, and the observed slowdown is 40x for
>>>>>>>>> hackbench. No, this isn't a typo.
>>>>>>>>>
>>>>>>>>> The solution is to trap blocking WFEs and tell KVM that we're now
>>>>>>>>> spinning. This ensures that other vcpus will get a scheduling boost,
>>>>>>>>> allowing the lock to be released more quickly.
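This is a minimal C model of the dispatch described above, for illustration only. A trapped WFE is treated as "this vcpu is spinning" and yields. A trapped WFI still blocks until an interrupt arrives. The names WFX_IS_WFE, struct vcpu_model and handle_wfx() are hypothetical. Treating syndrome bit 0 as the WFE/WFI indicator is our reading of the ARM exception syndrome encoding. The counters only stand in for the KVM calls a real handler would make, which we assume are kvm_vcpu_on_spin() and kvm_vcpu_block().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name for the syndrome bit that distinguishes WFE from WFI.
 * Assumption: bit 0 of the trapped-WFx syndrome is 1 for WFE, 0 for WFI. */
#define WFX_IS_WFE (UINT32_C(1) << 0)

enum wfx_action { WFX_BLOCK, WFX_YIELD };

/* Hypothetical per-vcpu counters standing in for real scheduler calls. */
struct vcpu_model {
    unsigned yields;  /* times the vcpu told the host "I am spinning" */
    unsigned blocks;  /* times the vcpu was put to sleep until an interrupt */
};

/* Handle a trapped WFI/WFE.  A blocking WFE suggests a guest spinlock whose
 * holder is not running, so yield to let other vcpus run.  A WFI means the
 * guest is idle, so block until an interrupt is pending. */
enum wfx_action handle_wfx(struct vcpu_model *v, uint32_t syndrome)
{
    assert(v != NULL);
    if (syndrome & WFX_IS_WFE) {
        v->yields++;  /* assumed to correspond to kvm_vcpu_on_spin() */
        return WFX_YIELD;
    }
    v->blocks++;      /* assumed to correspond to kvm_vcpu_block() */
    return WFX_BLOCK;
}
```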
>>>>>>>>>
>>>>>>>>> From a performance point of view: hackbench 1 process 1000
>>>>>>>>>
>>>>>>>>> 2xA15 host (baseline):  1.843s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/o patch:  2.083s
>>>>>>>>> 4xA15 guest w/o patch: 80.212s
>>>>>>>>>
>>>>>>>>> 2xA15 guest w/ patch:   2.072s
>>>>>>>>> 4xA15 guest w/ patch:   3.202s
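The hackbench figures can be put in proportion by computing the ratios directly from the timings quoted above. The helper below is purely illustrative, and its name is hypothetical. The only inputs are the quoted numbers.

```c
#include <assert.h>

/* Ratio of a measured run time to a reference run time (> 1 means slower). */
double slowdown(double measured_s, double reference_s)
{
    assert(reference_s > 0.0);
    return measured_s / reference_s;
}
```

slowdown(80.212, 1.843) is about 43.5 and slowdown(80.212, 2.083) is about 38.5, so the "40x" falls between the two possible baselines. With the patch, slowdown(3.202, 1.843) is about 1.74. That makes the 4-vcpu guest roughly 25 times faster than without it (80.212 / 3.202).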
>>>>>>>>
>>>>>>>> I'm confused. You went from 2.083s when not exiting on spin locks to
>>>>>>>> 2.072s when exiting on _every_ spin lock that didn't immediately
>>>>>>>> succeed. I would've expected the second number to be worse rather than
>>>>>>>> better. I assume it's within jitter, but I'm still puzzled why you don't
>>>>>>>> see any significant drop in performance.
>>>>>>>
>>>>>>> The key is in the ARM ARM:
>>>>>>>
>>>>>>> B1.14.9: "When HCR.TWE is set to 1, and the processor is in a Non-secure
>>>>>>> mode other than Hyp mode, execution of a WFE instruction generates a Hyp
>>>>>>> Trap exception if, ignoring the value of the HCR.TWE bit, conditions
>>>>>>> permit the processor to suspend execution."
>>>>>>>
>>>>>>> So, on a non-overcommitted system, you rarely hit a blocking spinlock,
>>>>>>> and hence rarely trap. Otherwise, performance would go down the drain
>>>>>>> very quickly.
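The quoted ARM ARM rule can be modelled as a small state machine. This sketch simplifies under stated assumptions. It models only the event register and HCR.TWE, ignores interrupts and other wake-up events, and uses hypothetical type and function names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum wfe_outcome {
    WFE_CONTINUE,    /* event register was set: consume it, no suspension */
    WFE_SUSPEND,     /* would suspend and TWE=0: core sleeps in the guest */
    WFE_TRAP_TO_HYP  /* would suspend and TWE=1: trap so the host can yield */
};

struct core_model {
    bool event_register;  /* set by an event such as SEV, consumed by WFE */
    bool hcr_twe;         /* HCR.TWE: trap WFEs that would suspend */
};

enum wfe_outcome execute_wfe(struct core_model *c)
{
    assert(c != NULL);
    if (c->event_register) {
        /* An event is pending, so conditions do not permit suspension:
         * WFE clears the event register and returns without trapping. */
        c->event_register = false;
        return WFE_CONTINUE;
    }
    /* No event pending: the core would suspend, so TWE decides. */
    return c->hcr_twe ? WFE_TRAP_TO_HYP : WFE_SUSPEND;
}
```

This is why a lock that is released quickly costs nothing extra. If an event (for example a SEV issued on unlock) is already pending, the WFE falls through and no trap is taken.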
>>>>>>
>>>>>> Well, it's the same as pause-loop exiting on x86, but there we have
>>>>>> special hardware features to only ever exit after n turnarounds. I
>>>>>> wonder why we have those when we could just as easily exit on every
>>>>>> blocking path.
>>>>>>
>>>>> It will hurt performance if the vcpu that holds the lock is running.
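For comparison with the x86 approach, here is a deliberately simplified spin-exit policy in which the vcpu is forced out only after a run of failed spins. Real pause-loop exiting is configured in cycles (a gap and a window), not iteration counts, so counting iterations here is an assumption made for clarity. All names are hypothetical. A threshold of 1 approximates the ARM behaviour, where the first blocking WFE traps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct spin_exit_policy {
    unsigned threshold;    /* failed spins tolerated before exiting */
    unsigned consecutive;  /* length of the current run of failed spins */
};

/* Record one failed spin iteration; returns true if the vcpu should exit. */
bool spin_should_exit(struct spin_exit_policy *p)
{
    assert(p != NULL && p->threshold > 0);
    if (++p->consecutive >= p->threshold) {
        p->consecutive = 0;  /* start counting afresh after an exit */
        return true;
    }
    return false;
}

/* The lock was acquired: forget the current run of failed spins. */
void spin_acquired(struct spin_exit_policy *p)
{
    assert(p != NULL);
    p->consecutive = 0;
}
```

The higher threshold covers the case Gleb describes. If the lock holder is running, the lock is likely released before the threshold is reached, so no exit is taken.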
>>>>
>>>> Apparently not so on ARM. At least that's what Marc's numbers are showing. 
>>>> I'm not sure what exactly that means. Basically his logic is "if we spin, 
>>>> the holder must have been preempted". And it seems to work out 
>>>> surprisingly well.
>>
>> Yes. I basically assume that contention should be rare, and that ending
>> up in a *blocking* WFE is a sign that we're in thrashing mode already
>> (no event is pending).
>>
>>>>
>>> For uncontended locks it makes sense. We need to recheck whether the x86
>>> assumption still holds there, but x86 locks are ticket locks, which
>>> suffer not only from lock-holder preemption but also from lock-waiter
>>> preemption, and that makes the overcommit problem even worse.
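A minimal ticket lock in C11 atomics shows why waiter preemption hurts. It is a sketch, not the kernel's implementation. Tickets are served strictly in order, so if the vcpu holding the next ticket is preempted, the lock sits idle after release even though other vcpus are waiting for it. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

void ticket_init(struct ticket_lock *l)
{
    assert(l != NULL);
    atomic_init(&l->next, 0u);
    atomic_init(&l->owner, 0u);
}

/* Take a place in the queue; returns this waiter's ticket. */
unsigned ticket_take(struct ticket_lock *l)
{
    return atomic_fetch_add(&l->next, 1u);
}

bool ticket_is_mine(struct ticket_lock *l, unsigned ticket)
{
    return atomic_load(&l->owner) == ticket;
}

void ticket_acquire(struct ticket_lock *l)
{
    unsigned me = ticket_take(l);
    while (!ticket_is_mine(l, me))
        ;  /* an ARM implementation would typically wait with WFE here */
}

/* Hand the lock to the next ticket in FIFO order, whether or not
 * that waiter is currently running. */
void ticket_release(struct ticket_lock *l)
{
    atomic_fetch_add(&l->owner, 1u);
}
```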
>>
>> Locks are ticket-based on ARM as well. But there is one key difference here
>> with x86 (or at least what I understand of it, which is very close to
>> none): We only trap if we would have blocked anyway. In our case, it is
>> almost always better to give up the CPU to someone else rather than
>> waiting for some event to take the CPU out of sleep.
>


(Please ignore previous incomplete reply ....)

The benefits of "Yield CPU when vcpu executes a WFE" seem to depend on:
1. How spinlocks are implemented in the guest OS
(Note: we cannot assume that the underlying guest OS is always Linux)
2. How severe spinlock contention is in the guest
(Note: here too we cannot make assumptions about the workloads running in the guest)

It would be good if we could enable/disable "Yield CPU when vcpu executes a WFE"
via Kconfig.
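One way the suggested Kconfig switch could look, as a sketch only. CONFIG_KVM_ARM_WFE_YIELD is a hypothetical symbol and is not part of the patch under discussion. guest_hcr() is a hypothetical helper. Placing TWE at bit 14 is our reading of the ARMv7 HCR layout and should be checked against the ARM ARM. With the option off, HCR.TWE stays clear, so blocking WFEs are not trapped and the guest simply waits on the physical CPU.

```c
#include <stdint.h>

/* Assumed position of HCR.TWE in the ARMv7 HCR (bit 14); verify in the ARM ARM. */
#define HCR_TWE_BIT (UINT32_C(1) << 14)

/* Hypothetical helper: compute the HCR value to program for a guest. */
uint32_t guest_hcr(uint32_t base)
{
#ifdef CONFIG_KVM_ARM_WFE_YIELD    /* hypothetical Kconfig symbol */
    return base | HCR_TWE_BIT;     /* trap blocking WFEs so KVM can yield */
#else
    return base & ~HCR_TWE_BIT;    /* let blocking WFEs sleep in the guest */
#endif
}
```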

--Anup

>
>
>>
>>         M.
