[PATCH] use unfair spinlock when running on hypervisor.
Ticket spinlocks ensure fairness by introducing a FIFO of cpus waiting for the
spinlock to be released. This works great on real HW, but when running on a
hypervisor it introduces a very heavy performance hit if physical cpus are
overcommitted (up to 35% in my test). The reason for the performance hit is
that the vcpu at the head of the FIFO can be descheduled, so no other vcpu
waiting on the lock can proceed. This results in big time gaps where no vcpu
is in the critical section. Even worse, the waiters are constantly spinning
and not allowing the vcpu at the head of the FIFO to be scheduled to run.

The patch below patches the ticket spinlock code to behave similarly to the
old unfair spinlock when a hypervisor is detected. After patching, the
unlocked condition remains the same (next == owner), but if a vcpu detects
that the lock is already taken it spins until (next == owner == 0) and, when
that condition is met, it tries to reacquire the lock. Unlock simply zeroes
head and tail. Trylock stays the same since the unlocked condition stays the
same.

I ran two guests with 16 vcpus each, one guest with this patch and another
without. Inside those guests I ran a kernel compilation, "time make -j 16
all". Here are the results of two runs:

patched:                          unpatched:
real    13m34.674s                real    19m35.827s
user    96m2.638s                 user    102m38.665s
sys     56m14.991s                sys     158m22.470s

real    13m3.768s                 real    19m4.375s
user    95m34.509s                user    111m9.903s
sys     53m40.550s                sys     141m59.370s

Note that for 6 minutes the unpatched guest ran without cpu oversubscription,
since the patched guest was already idle.

Running kernbench on the host with and without the patch:

Unpatched:                                  Patched:
Average Half load -j 8 Run (std deviation):
Elapsed Time 312.538 (0.779404)             Elapsed Time 311.974 (0.258031)
User Time 2154.89 (6.62261)                 User Time 2151.94 (1.96702)
System Time 233.78 (1.96642)                System Time 236.472 (0.695572)
Percent CPU 763.8 (1.64317)                 Percent CPU 765 (0.707107)
Context Switches 39348.2 (1542.87)          Context Switches 41522.4 (1193.13)
Sleeps 871688 (5596.52)                     Sleeps 877317 (1135.83)

Average Optimal load -j 16 Run (std deviation):
Elapsed Time 259.878 (0.444826)             Elapsed Time 258.842 (0.413122)
User Time 2549.22 (415.685)                 User Time 2538.42 (407.383)
System Time 261.084 (28.8145)               System Time 263.847 (28.8638)
Percent CPU 1003.4 (252.566)                Percent CPU 1003.5 (251.407)
Context Switches 228418 (199308)            Context Switches 228894 (197533)
Sleeps 898721 (28757.2)                     Sleeps 902020 (26074.5)

Signed-off-by: Gleb Natapov

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 3089f70..b919b54 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -60,19 +60,27 @@
 static __always_inline void __ticket_spin_lock(arch_spinlock_t *lock)
 {
-	short inc = 0x0100;
+	short inc;

 	asm volatile (
+		"1:\t\n"
+		"mov $0x100, %0\n\t"
 		LOCK_PREFIX "xaddw %w0, %1\n"
-		"1:\t"
+		"2:\t"
 		"cmpb %h0, %b0\n\t"
-		"je 2f\n\t"
+		"je 4f\n\t"
+		"3:\t\n"
 		"rep ; nop\n\t"
-		"movb %1, %b0\n\t"
 		/* don't need lfence here, because loads are in-order */
-		"jmp 1b\n"
-		"2:"
-		: "+Q" (inc), "+m" (lock->slock)
+		ALTERNATIVE(
+		"movb %1, %b0\n\t"
+		"jmp 2b\n",
+		"nop", X86_FEATURE_HYPERVISOR)"\n\t"
+		"cmpw $0, %1\n\t"
+		"jne 3b\n\t"
+		"jmp 1b\n\t"
+		"4:"
+		: "=Q" (inc), "+m" (lock->slock)
 		:
 		: "memory", "cc");
 }
@@ -98,10 +106,13 @@ static __always_inline int __ticket_spin_trylock(arch_spinlock_t *lock)

 static __always_inline void __ticket_spin_unlock(arch_spinlock_t *lock)
 {
-	asm volatile(UNLOCK_LOCK_PREFIX "incb %0"
-		     : "+m" (lock->slock)
-		     :
-		     : "memory", "cc");
+	asm volatile(
+		ALTERNATIVE(UNLOCK_LOCK_PREFIX "incb (%0);"ASM_NOP3,
+			    UNLOCK_LOCK_PREFIX "movw $0, (%0)",
+			    X86_FEATURE_HYPERVISOR)
+		:
+		: "Q" (&lock->slock)
+		: "memory", "cc");
 }
 #else
 #define TICKET_SHIFT 16

--
			Gleb.
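For readers who find the inline asm above hard to follow, here is a minimal
user-space C11 model of what the patched lock does. It is a sketch for
illustration only: the 16-bit field layout mirrors the idea, but none of the
kernel's LOCK_PREFIX or alternatives/patching machinery is represented, and
the names are made up.

#include <stdatomic.h>
#include <stdint.h>

/* low byte = owner (head), high byte = next (tail); initialise to 0 */
typedef struct { _Atomic uint16_t slock; } model_lock_t;

static void model_lock(model_lock_t *lock)
{
	for (;;) {
		/* "xaddw $0x100": grab a ticket from the high byte */
		uint16_t old = atomic_fetch_add_explicit(&lock->slock, 0x100,
							 memory_order_acquire);
		if ((uint8_t)(old >> 8) == (uint8_t)old)
			return;		/* next == owner: lock acquired */

		/* Unfair path: wait until the word is fully idle
		 * (next == owner == 0), then retry with a fresh ticket. */
		while (atomic_load_explicit(&lock->slock,
					    memory_order_relaxed) != 0)
			;		/* "rep ; nop" in the real code */
	}
}

static void model_unlock(model_lock_t *lock)
{
	/* "movw $0, slock": zero both head and tail */
	atomic_store_explicit(&lock->slock, 0, memory_order_release);
}

Note how the unlocked test itself is unchanged (next == owner); fairness is
lost because every waiter re-races for a fresh ticket once the word returns
to zero, which is exactly the trade-off debated in the replies below.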
Re: [PATCH] use unfair spinlock when running on hypervisor.
Gleb Natapov writes:
>
> The patch below patches the ticket spinlock code to behave similarly to
> the old unfair spinlock when a hypervisor is detected. After patching, the unlocked

The question is what happens when you have a system with unfair memory and
you run the hypervisor on that. There it could be much worse. Your new code
would starve again, right?

There's a reason the ticket spinlocks were added in the first place.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
> Gleb Natapov writes:
> >
> > The patch below patches the ticket spinlock code to behave similarly to
> > the old unfair spinlock when a hypervisor is detected. After patching, the unlocked
>
> The question is what happens when you have a system with unfair
> memory and you run the hypervisor on that. There it could be much worse.
>
How much worse could the performance hit be?

> Your new code would starve again, right?
>
Yes, of course it may starve with an unfair spinlock. Since vcpus are not
always running, there is a much smaller chance that a vcpu on a remote memory
node will starve forever. Old kernels with unfair spinlocks are running
fine in VMs on NUMA machines with various loads.

> There's a reason the ticket spinlocks were added in the first place.
>
I understand that reason and do not propose to go back to the old spinlock
on physical HW! But with virtualization the performance hit is unbearable.

--
			Gleb.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tue, Jun 01, 2010 at 07:24:14PM +0300, Gleb Natapov wrote:
> On Tue, Jun 01, 2010 at 05:53:09PM +0200, Andi Kleen wrote:
> > Gleb Natapov writes:
> > >
> > > The patch below patches the ticket spinlock code to behave similarly to
> > > the old unfair spinlock when a hypervisor is detected. After patching, the unlocked
> >
> > The question is what happens when you have a system with unfair
> > memory and you run the hypervisor on that. There it could be much worse.
> >
> How much worse could the performance hit be?

It depends on the workload. Overall it means that a contended
lock can have much higher latencies.

If you want to study some examples see the locking problems the
RT people have with their heavy weight mutex-spinlocks.

But the main problem is that in the worst case you can see extremely
long stalls (up to a second has been observed), which then turns
into a correctness issue.

> > Your new code would starve again, right?
> >
> Yes, of course it may starve with an unfair spinlock. Since vcpus are not
> always running, there is a much smaller chance that a vcpu on a remote memory
> node will starve forever. Old kernels with unfair spinlocks are running
> fine in VMs on NUMA machines with various loads.

Try it on a NUMA system with unfair memory.

> > There's a reason the ticket spinlocks were added in the first place.
> >
> I understand that reason and do not propose to go back to the old spinlock
> on physical HW! But with virtualization the performance hit is unbearable.

Extreme unfairness can be unbearable too.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>>>>
>>> Yes, of course it may starve with an unfair spinlock. Since vcpus are not
>>> always running, there is a much smaller chance that a vcpu on a remote
>>> memory node will starve forever. Old kernels with unfair spinlocks are
>>> running fine in VMs on NUMA machines with various loads.
>>
> Try it on a NUMA system with unfair memory.

We are running everything on NUMA (since all modern machines are now NUMA).
At what scale do the issues become observable?

>>> I understand that reason and do not propose to go back to the old spinlock
>>> on physical HW! But with virtualization the performance hit is unbearable.
>>
> Extreme unfairness can be unbearable too.

Well, the question is what happens first. In our experience, vcpu
overcommit is a lot more painful. People will never see the NUMA
unfairness issue if they can't use kvm due to the vcpu overcommit problem.

What I'd like to see eventually is a short-term-unfair, long-term-fair
spinlock. Might make sense for bare metal as well. But it won't be
easy to write.

--
error compiling committee.c: too many arguments to function
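To make the idea concrete, here is a purely illustrative user-space sketch of
one possible shape for such a lock: newcomers may barge on an inner
test-and-set lock for a bounded number of attempts, and anyone who exceeds
the threshold falls back to a FIFO ticket queue. It reduces, but does not
strictly bound, starvation, and it is not the lock being asked for here -
just one way the "short-term-unfair, long-term-fair" trade-off could be
expressed. All names and the threshold are made up.

#include <stdatomic.h>

#define BARGE_ATTEMPTS 64

struct hybrid_lock {
	atomic_flag inner;		/* guards the critical section; ATOMIC_FLAG_INIT */
	_Atomic unsigned queue_next;	/* FIFO ticket queue for starved waiters */
	_Atomic unsigned queue_owner;
};

static void hybrid_lock_acquire(struct hybrid_lock *l)
{
	/* Short-term-unfair fast path: barge a few times. */
	for (int i = 0; i < BARGE_ATTEMPTS; i++)
		if (!atomic_flag_test_and_set_explicit(&l->inner,
						       memory_order_acquire))
			return;

	/* Longer-term-fair slow path: wait FIFO, one queued waiter at a time. */
	unsigned ticket = atomic_fetch_add(&l->queue_next, 1);
	while (atomic_load(&l->queue_owner) != ticket)
		;	/* cpu_relax() equivalent */
	while (atomic_flag_test_and_set_explicit(&l->inner,
						 memory_order_acquire))
		;
	atomic_fetch_add(&l->queue_owner, 1);	/* admit the next queued waiter */
}

static void hybrid_lock_release(struct hybrid_lock *l)
{
	atomic_flag_clear_explicit(&l->inner, memory_order_release);
}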
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
> We are running everything on NUMA (since all modern machines are now NUMA).
> At what scale do the issues become observable?

On Intel platforms it's visible starting with 4 sockets.

>>> I understand that reason and do not propose to go back to the old spinlock
>>> on physical HW! But with virtualization the performance hit is unbearable.
>>
>> Extreme unfairness can be unbearable too.
>
> Well, the question is what happens first. In our experience, vcpu
> overcommit is a lot more painful. People will never see the NUMA
> unfairness issue if they can't use kvm due to the vcpu overcommit problem.

You really have to address both; if you don't fix them both, users will
eventually run into one of them and be unhappy.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>> Try it on a NUMA system with unfair memory.
> We are running everything on NUMA (since all modern machines are now
> NUMA). At what scale do the issues become observable?

My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
perfectly-running NUMA=n kernel I'm running.

Or did you mean a less broad phrase than "all modern machines"?
Re: [PATCH] use unfair spinlock when running on hypervisor.
Avi Kivity wrote:
> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>>>>
>>> Yes, of course it may starve with an unfair spinlock. Since vcpus are not
>>> always running, there is a much smaller chance that a vcpu on a remote
>>> memory node will starve forever. Old kernels with unfair spinlocks are
>>> running fine in VMs on NUMA machines with various loads.
>>>
>> Try it on a NUMA system with unfair memory.
>>
>
> We are running everything on NUMA (since all modern machines are now
> NUMA). At what scale do the issues become observable?
>
>>> I understand that reason and do not propose to go back to the old spinlock
>>> on physical HW! But with virtualization the performance hit is unbearable.
>>>
>> Extreme unfairness can be unbearable too.
>>
>
> Well, the question is what happens first. In our experience, vcpu
> overcommit is a lot more painful. People will never see the NUMA
> unfairness issue if they can't use kvm due to the vcpu overcommit problem.

Gleb's observed performance hit seems to be a rather mild throughput
depression compared with creating a worst case by enforcing vcpu
overcommit. Running a single guest with 2:1 overcommit on a 4 core
machine, I saw over an order of magnitude slowdown vs. 1:1 commit with
the same kernel build test. Others have reported similar results.

How close you'll get to that scenario depends on host scheduling dynamics,
and statistically the number of opened and stalled lock held paths waiting
to be contended. So I'd expect to see quite variable numbers for
guest-guest aggravation of this problem.

> What I'd like to see eventually is a short-term-unfair, long-term-fair
> spinlock. Might make sense for bare metal as well. But it won't be
> easy to write.

Collecting the contention/usage statistics on a per spinlock basis seems
complex. I believe a practical approximation to this are adaptive mutexes
where upon hitting a spin time threshold, punt and let the scheduler
reconcile fairness. See the sketch after this message for the general
shape of that policy.

-john

--
john.coo...@third-harmonic.com
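As a rough illustration of the "punt after a spin threshold" idea, here it is
with plain POSIX primitives; the threshold value is arbitrary, and a kernel
or paravirt version would yield to the hypervisor instead of blocking.

#include <pthread.h>

#define SPIN_THRESHOLD 1000

static void adaptive_lock(pthread_mutex_t *m)
{
	/* Spin briefly, hoping the holder is running and about to unlock. */
	for (int i = 0; i < SPIN_THRESHOLD; i++)
		if (pthread_mutex_trylock(m) == 0)
			return;

	/* Punt: block and let the scheduler run the lock holder. */
	pthread_mutex_lock(m);
}

glibc's PTHREAD_MUTEX_ADAPTIVE_NP mutex type implements roughly this policy
for userspace mutexes.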
Re: [PATCH] use unfair spinlock when running on hypervisor.
> Collecting the contention/usage statistics on a per spinlock
> basis seems complex. I believe a practical approximation
> to this are adaptive mutexes where upon hitting a spin
> time threshold, punt and let the scheduler reconcile fairness.

That would probably work, except: how do you get the
adaptive spinlock into a paravirt op without slowing
down a standard kernel?

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tuesday, June 1, 2010 at 19:52 +0300, Avi Kivity wrote:
> What I'd like to see eventually is a short-term-unfair, long-term-fair
> spinlock. Might make sense for bare metal as well. But it won't be
> easy to write.
>

This thread rings a bell here :)

Yes, ticket spinlocks are sometimes slower, especially in workloads where a
spinlock needs to be taken several times to handle one unit of work, and
many cpus are competing.

We currently have kind of a similar problem in the network stack, and we
have a patch to speed up the xmit path by an order of magnitude, letting one
cpu (the consumer cpu) get unfair access to the (ticket) spinlock. (It can
compete with no more than one other cpu.)

Boost from ~50.000 to ~600.000 pps on a dual quad core machine (E5450
@3.00GHz) on a particular workload (many cpus want to xmit their packets).

(patch: http://patchwork.ozlabs.org/patch/53163/)

It could be possible to write such a generic beast, with a cascade of
regular ticket spinlocks?

One ticket spinlock at the first stage (only if some conditions are met, aka
slow path), then a 'primary' spinlock at the second stage.

// generic implementation
// (x86 could use 16bit fields for users_in & users_out)
struct cascade_lock {
	atomic_t	users_in;
	int		users_out;
	spinlock_t	primlock;
	spinlock_t	slowpathlock; // could be outside of this structure, shared by many 'cascade_locks'
};

/*
 * In kvm case, you might call hypervisor when slowpathlock is about to be taken ?
 * When a cascade lock is unlocked, and relocked right after, this cpu has unfair
 * priority and could get the lock before cpus blocked in slowpathlock (especially if
 * an hypervisor call was done)
 *
 * In network xmit path, the dequeue thread would use highprio_user=true mode
 * In network xmit path, the 'contended' enqueueing thread would set a negative threshold,
 * to force a 'lowprio_user' mode.
 */
void cascade_lock(struct cascade_lock *l, bool highprio_user, int threshold)
{
	bool slowpath = false;

	atomic_inc(&l->users_in); // no real need for atomic_inc_return()
	if (atomic_read(&l->users_in) - l->users_out > threshold && !highprio_user) {
		spin_lock(&l->slowpathlock);
		slowpath = true;
	}
	spin_lock(&l->primlock);
	if (slowpath)
		spin_unlock(&l->slowpathlock);
}

void cascade_unlock(struct cascade_lock *l)
{
	l->users_out++;
	spin_unlock(&l->primlock);
}
Re: [PATCH] use unfair spinlock when running on hypervisor.
On 06/01/2010 08:39 PM, valdis.kletni...@vt.edu wrote:
>> We are running everything on NUMA (since all modern machines are now
>> NUMA). At what scale do the issues become observable?
>
> My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
> perfectly-running NUMA=n kernel I'm running.
>
> Or did you mean a less broad phrase than "all modern machines"?

All modern two socket and above boards, sorry.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On 06/01/2010 08:27 PM, Andi Kleen wrote:
> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>> We are running everything on NUMA (since all modern machines are now
>> NUMA). At what scale do the issues become observable?
>
> On Intel platforms it's visible starting with 4 sockets.

Can you recommend a benchmark that shows bad behaviour? I'll run it with
ticket spinlocks and Gleb's patch. I have a 4-way Nehalem-EX, presumably
the huge number of threads will magnify the problem even more there.

>>>> I understand that reason and do not propose to go back to the old spinlock
>>>> on physical HW! But with virtualization the performance hit is unbearable.
>>>
>>> Extreme unfairness can be unbearable too.
>>
>> Well, the question is what happens first. In our experience, vcpu
>> overcommit is a lot more painful. People will never see the NUMA
>> unfairness issue if they can't use kvm due to the vcpu overcommit problem.
>
> You really have to address both; if you don't fix them both, users will
> eventually run into one of them and be unhappy.

That's definitely the long term plan. I consider Gleb's patch the first
step.

Do you have any idea how we can tackle both problems?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
> That's definitely the long term plan. I consider Gleb's patch the
> first step.
>
> Do you have any idea how we can tackle both problems?

I recall Xen posting some solution for a similar problem:

http://lkml.org/lkml/2010/1/29/45

Wouldn't a similar approach help KVM as well?

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On 06/01/2010 10:39 AM, valdis.kletni...@vt.edu wrote:
> On Tue, 01 Jun 2010 19:52:28 +0300, Avi Kivity said:
>> On 06/01/2010 07:38 PM, Andi Kleen wrote:
>>>> Your new code would starve again, right?
>>> Try it on a NUMA system with unfair memory.
>> We are running everything on NUMA (since all modern machines are now
>> NUMA). At what scale do the issues become observable?
>
> My 6-month-old laptop is NUMA? Comes as a surprise to me, and to the
> perfectly-running NUMA=n kernel I'm running.
>
> Or did you mean a less broad phrase than "all modern machines"?

All modern multisocket machines, unless configured in interleaved memory
mode.

	-hpa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
> On 06/01/2010 08:27 PM, Andi Kleen wrote:
>> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>>> We are running everything on NUMA (since all modern machines are now
>>> NUMA). At what scale do the issues become observable?
>>
>> On Intel platforms it's visible starting with 4 sockets.
>
> Can you recommend a benchmark that shows bad behaviour? I'll run it with

Pretty much anything with high lock contention.

> ticket spinlocks and Gleb's patch. I have a 4-way Nehalem-EX, presumably
> the huge number of threads will magnify the problem even more there.

Yes, more threads cause more lock contention too.

> Do you have any idea how we can tackle both problems?

Apparently Xen has something, perhaps that can be leveraged (but I haven't
looked at their solution in detail).

Otherwise I would probably try to start with an adaptive spinlock that at
some point calls into the HV (or updates shared memory?), like john cooper
suggested. The tricky part here would be to find the thresholds and fit
that state into paravirt ops and the standard spinlock_t.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On 06/02/2010 11:50 AM, Andi Kleen wrote:
> On Wed, Jun 02, 2010 at 05:51:14AM +0300, Avi Kivity wrote:
>> On 06/01/2010 08:27 PM, Andi Kleen wrote:
>>> On Tue, Jun 01, 2010 at 07:52:28PM +0300, Avi Kivity wrote:
>>>> We are running everything on NUMA (since all modern machines are now
>>>> NUMA). At what scale do the issues become observable?
>>>
>>> On Intel platforms it's visible starting with 4 sockets.
>>
>> Can you recommend a benchmark that shows bad behaviour? I'll run it with
>
> Pretty much anything with high lock contention.

Okay, we'll try to measure it here as soon as we can switch it into numa
mode.

>> Do you have any idea how we can tackle both problems?
>
> Apparently Xen has something, perhaps that can be leveraged (but I haven't
> looked at their solution in detail).
>
> Otherwise I would probably try to start with an adaptive spinlock that at
> some point calls into the HV (or updates shared memory?), like john cooper
> suggested. The tricky part here would be to find the thresholds and fit
> that state into paravirt ops and the standard spinlock_t.

There are two separate problems: the more general problem is that the
hypervisor can put a vcpu to sleep while holding a lock, causing other
vcpus to spin until the end of their time slice. This can only be
addressed with hypervisor help.

The second problem is that the extreme fairness of ticket locks causes lots
of context switches if the hypervisor helps, and aggravates the first
problem horribly if it doesn't (since now a vcpu will spin waiting for its
ticket even if the lock is unlocked).

So yes, we'll need hypervisor assistance, but even with that we'll need to
reduce ticket lock fairness (retaining global fairness but sacrificing some
local fairness). I imagine that will be helpful for non-virt as well, as
local unfairness reduces bounciness.

--
error compiling committee.c: too many arguments to function
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>
> There are two separate problems: the more general problem is that
> the hypervisor can put a vcpu to sleep while holding a lock, causing
> other vcpus to spin until the end of their time slice. This can
> only be addressed with hypervisor help.

Fyi - I have an early patch ready to address this issue. Basically I am
using host-kernel memory (mmap'ed into the guest as io-memory via the
ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
section, which is read by the host scheduler to defer preemption.

Guest side:

static inline void spin_lock(spinlock_t *lock)
{
	raw_spin_lock(&lock->rlock);
+	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
}

static inline void spin_unlock(spinlock_t *lock)
{
+	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
	raw_spin_unlock(&lock->rlock);
}

[similar changes to other spinlock variants]

Host side:

@@ -860,6 +866,17 @@ check_preempt_tick(struct cfs_rq *cfs_rq
	ideal_runtime = sched_slice(cfs_rq, curr);
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
+		if ((sched_feat(DEFER_PREEMPT)) && (rq_of(cfs_rq)->curr->ghptr)) {
+			int defer_preempt = rq_of(cfs_rq)->curr->ghptr->defer_preempt;
+			if (((defer_preempt & 0x) == 0xfeed) && ((defer_preempt & 0x) != 0)) {
+				if ((rq_of(cfs_rq)->curr->grace_defer++ < sysctl_sched_preempt_defer_count)) {
+					rq_of(cfs_rq)->defer_preempt++;
+					return;
+				} else
+					rq_of(cfs_rq)->force_preempt++;
+			}
+		}
		resched_task(rq_of(cfs_rq)->curr);
		/*
		 * The current task ran long enough, ensure it doesn't get

[similar changes introduced at other preemption points in sched_fair.c]

Note that the guest can only request that preemption be deferred (and not
disabled) via this mechanism. I have seen good improvement (~15%) in a
kernel compile benchmark with sysctl_sched_preempt_defer_count set to a low
value of just 2 (i.e. we can defer preemption by a maximum of two ticks). I
intend to clean up and post the patches pretty soon for comments.

One pathological case where this may actually hurt is routines in the guest
like flush_tlb_others_ipi() which take a spinlock and then enter a while()
loop waiting for other cpus to ack something. In this case, deferring
preemption just because the guest is in a critical section actually hurts!
Hopefully the upper bound for deferring preemption and the fact that such
routines may not be frequently hit should help alleviate such situations.

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thursday, June 3, 2010 at 09:50 +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>>
>> There are two separate problems: the more general problem is that
>> the hypervisor can put a vcpu to sleep while holding a lock, causing
>> other vcpus to spin until the end of their time slice. This can
>> only be addressed with hypervisor help.
>
> Fyi - I have an early patch ready to address this issue. Basically I am
> using host-kernel memory (mmap'ed into the guest as io-memory via the
> ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
> section, which is read by the host scheduler to defer preemption.
>
> Guest side:
>
> static inline void spin_lock(spinlock_t *lock)
> {
>	raw_spin_lock(&lock->rlock);
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;

1) __this_cpu_inc() should be faster

2) Isn't it a bit late to do this increment _after_
   raw_spin_lock(&lock->rlock); ?

> }
>
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>	raw_spin_unlock(&lock->rlock);
> }
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 06:51:51AM +0200, Eric Dumazet wrote:
>> Guest side:
>>
>> static inline void spin_lock(spinlock_t *lock)
>> {
>>	raw_spin_lock(&lock->rlock);
>> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
>
> 1) __this_cpu_inc() should be faster

Ok .. thanks for that tip.

> 2) Isn't it a bit late to do this increment _after_
>    raw_spin_lock(&lock->rlock); ?

I think so. My worry about doing it earlier is that we may set the
defer_preempt hint for the wrong vcpu (if, let's say, the guest application
thread is preempted by the guest kernel and later migrated to another vcpu
after it sets the hint and before it acquires the lock).

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>>
>> There are two separate problems: the more general problem is that
>> the hypervisor can put a vcpu to sleep while holding a lock, causing
>> other vcpus to spin until the end of their time slice. This can
>> only be addressed with hypervisor help.
>
> Fyi - I have an early patch ready to address this issue. Basically I am
> using host-kernel memory (mmap'ed into the guest as io-memory via the
> ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
> section, which is read by the host scheduler to defer preemption.

Looks like a nice simple way to handle this for the kernel.

However I suspect user space will hit the same issue sooner
or later. I assume your way is not easily extensible to futexes?

> One pathological case where this may actually hurt is routines in the guest
> like flush_tlb_others_ipi() which take a spinlock and then enter a while()
> loop waiting for other cpus to ack something. In this case, deferring
> preemption just because the guest is in a critical section actually hurts!
> Hopefully the upper bound for deferring preemption and the fact that such
> routines may not be frequently hit should help alleviate such situations.

So do you defer during the whole spinlock region or just during the spin?

I assume the first?

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
>> Fyi - I have an early patch ready to address this issue. Basically I am
>> using host-kernel memory (mmap'ed into the guest as io-memory via the
>> ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
>> section, which is read by the host scheduler to defer preemption.
>
> Looks like a nice simple way to handle this for the kernel.

The idea is not new. It has been discussed for example at [1].

> However I suspect user space will hit the same issue sooner
> or later. I assume your way is not easily extensible to futexes?

I had thought that most userspace lock implementations avoid spinning for
long times? i.e. they spin for a short while and sleep beyond a threshold?
If that is the case, we shouldn't be burning a lot of cycles unnecessarily
spinning in userspace ..

> So do you defer during the whole spinlock region or just during the spin?
>
> I assume the first?

My current implementation just blindly defers by a tick and checks if it is
safe to preempt in the next tick - otherwise it gives more grace ticks until
the threshold is crossed (after which we forcibly preempt it).

In future, I was thinking that the host scheduler can hint back to the guest
that it was given some "grace" time, which the guest can use to yield when
it comes out of the locked section.

- vatsa

1. http://l4ka.org/publications/2004/Towards-Scalable-Multiprocessor-Virtual-Machines-VM04.pdf
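A purely illustrative sketch of what that "hint back" could look like on the
guest unlock path, in the style of the snippets above - the guest_hint
struct, the deferred_by_host field, and hypercall_yield() are made-up names
for this example, not existing interfaces:

static inline void spin_unlock(spinlock_t *lock)
{
	struct guest_hint *gh = __get_cpu_var(gh_vcpu_ptr);

	gh->defer_preempt--;
	raw_spin_unlock(&lock->rlock);

	/* If the host granted us grace ticks while we held the lock,
	 * give the time back now that we are out of the critical section. */
	if (gh->deferred_by_host && !gh->defer_preempt) {
		gh->deferred_by_host = 0;
		hypercall_yield();
	}
}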
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 10:52:51AM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
>> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>>>
>>> There are two separate problems: the more general problem is that
>>> the hypervisor can put a vcpu to sleep while holding a lock, causing
>>> other vcpus to spin until the end of their time slice. This can
>>> only be addressed with hypervisor help.
>>
>> Fyi - I have an early patch ready to address this issue. Basically I am
>> using host-kernel memory (mmap'ed into the guest as io-memory via the
>> ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
>> section, which is read by the host scheduler to defer preemption.
>
> Looks like a nice simple way to handle this for the kernel.
>
> However I suspect user space will hit the same issue sooner
> or later. I assume your way is not easily extensible to futexes?

Well, userspace has always had the problem, hypervisor or not. So sleeping
locks obviously help a lot with that.

But we do hit the problem at times. The MySQL sysbench scalability problem
was a fine example: http://ozlabs.org/~anton/linux/sysbench/

Performance would tank when threads oversubscribe CPUs because lock holders
would start getting preempted. This was due to nasty locking in MySQL as
well, mind you.

There are some ways to improve it. glibc I believe has an option to increase
thread priority when taking a mutex, which is similar to what we have here.

But it's a fairly broad problem for userspace. The resource may not be just
a lock but it could be IO as well.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 09:50:51AM +0530, Srivatsa Vaddagiri wrote:
> On Wed, Jun 02, 2010 at 12:00:27PM +0300, Avi Kivity wrote:
>>
>> There are two separate problems: the more general problem is that
>> the hypervisor can put a vcpu to sleep while holding a lock, causing
>> other vcpus to spin until the end of their time slice. This can
>> only be addressed with hypervisor help.
>
> Fyi - I have an early patch ready to address this issue. Basically I am
> using host-kernel memory (mmap'ed into the guest as io-memory via the
> ivshmem driver) to hint the host whenever the guest is in a spin-lock'ed
> section, which is read by the host scheduler to defer preemption.
>
> Guest side:
>
> static inline void spin_lock(spinlock_t *lock)
> {
>	raw_spin_lock(&lock->rlock);
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
> }
>
> static inline void spin_unlock(spinlock_t *lock)
> {
> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>	raw_spin_unlock(&lock->rlock);
> }
>
> [similar changes to other spinlock variants]

Great, this is a nice way to improve it.

You might want to consider playing with first taking a ticket, and
then, if we fail to acquire the lock immediately, incrementing
defer_preempt before we start spinning.

The downside of this would be if we waste all our slice on spinning
and then get preempted in the critical section. But with ticket locks
you can easily see how many entries in the queue are in front of you.
So you could experiment with starting to defer preempt when we
notice we are getting toward the head of the queue.

Have you also looked at how s390 checks if the owning vcpu is running
and if so it spins, if not yields to the hypervisor. Something like
turning it into an adaptive lock. This could be applicable as well.
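A user-space sketch of that ordering against a plain ticket lock -
defer_preempt here is just a thread-local stand-in for the shared guest/host
hint, and the "close to the head" test is simply the (ticket - owner)
distance; none of these names are from the actual patch:

#include <stdatomic.h>
#include <stdint.h>

struct ticket_lock {
	_Atomic uint16_t next;		/* tail: tickets handed out */
	_Atomic uint16_t owner;		/* head: ticket currently served */
};

static _Thread_local unsigned defer_preempt;	/* stand-in for the shared hint */

static void ticket_lock_hinted(struct ticket_lock *l)
{
	uint16_t ticket = atomic_fetch_add(&l->next, 1);
	int hinted = 0;

	while (atomic_load(&l->owner) != ticket) {
		uint16_t pos = (uint16_t)(ticket - atomic_load(&l->owner));

		/* Only start asking for preemption deferral once we are
		 * (nearly) at the head of the queue. */
		if (pos <= 1 && !hinted) {
			defer_preempt++;
			hinted = 1;
		}
		/* cpu_relax() equivalent */
	}
	if (!hinted)
		defer_preempt++;	/* uncontended, or raced straight to the head */
}

static void ticket_unlock_hinted(struct ticket_lock *l)
{
	atomic_fetch_add(&l->owner, 1);
	defer_preempt--;
}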
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Tue, 2010-06-01 at 21:36 +0200, Andi Kleen wrote:
>> Collecting the contention/usage statistics on a per spinlock
>> basis seems complex. I believe a practical approximation
>> to this are adaptive mutexes where upon hitting a spin
>> time threshold, punt and let the scheduler reconcile fairness.
>
> That would probably work, except: how do you get the
> adaptive spinlock into a paravirt op without slowing
> down a standard kernel?

It only ever comes into play in the case where the spinlock is contended
anyway -- surely it shouldn't be _that_ much of a performance issue?

See the way that ppc64 handles it -- on a machine with overcommitted
virtual cpus, it will call __spin_yield (arch/powerpc/lib/locks.c) on
contention, which may cause the virtual CPU to donate its hypervisor
timeslice to the vCPU which is actually holding the lock in question.

--
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
>> Guest side:
>>
>> static inline void spin_lock(spinlock_t *lock)
>> {
>>	raw_spin_lock(&lock->rlock);
>> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
>> }
>>
>> static inline void spin_unlock(spinlock_t *lock)
>> {
>> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>>	raw_spin_unlock(&lock->rlock);
>> }
>>
>> [similar changes to other spinlock variants]
>
> Great, this is a nice way to improve it.
>
> You might want to consider playing with first taking a ticket, and
> then, if we fail to acquire the lock immediately, incrementing
> defer_preempt before we start spinning.
>
> The downside of this would be if we waste all our slice on spinning
> and then get preempted in the critical section. But with ticket locks
> you can easily see how many entries in the queue are in front of you.
> So you could experiment with starting to defer preempt when we
> notice we are getting toward the head of the queue.

Mm - my goal is to avoid long spin times in the first place (because the
owning vcpu was descheduled at an unfortunate time, i.e. while it was
holding a lock). In that sense, I am targeting preemption-defer of the lock
*holder* rather than of the lock acquirer. So ideally, whenever somebody
tries to grab a lock it should be free most of the time; it can be held
only if the owner is currently running - which means we won't have to spin
too long for the lock.

> Have you also looked at how s390 checks if the owning vcpu is running
> and if so it spins, if not yields to the hypervisor. Something like
> turning it into an adaptive lock. This could be applicable as well.

I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does
gang scheduling of vcpus, which reduces the severity of this problem very
much - essentially lock acquirer/holder are run simultaneously on different
cpus all the time. Gang scheduling is on my list of things to look at much
later (although I have been warned that it's a scalability nightmare!).

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 05:34:50PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 08:38:55PM +1000, Nick Piggin wrote:
>>> Guest side:
>>>
>>> static inline void spin_lock(spinlock_t *lock)
>>> {
>>>	raw_spin_lock(&lock->rlock);
>>> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt++;
>>> }
>>>
>>> static inline void spin_unlock(spinlock_t *lock)
>>> {
>>> +	__get_cpu_var(gh_vcpu_ptr)->defer_preempt--;
>>>	raw_spin_unlock(&lock->rlock);
>>> }
>>>
>>> [similar changes to other spinlock variants]
>>
>> Great, this is a nice way to improve it.
>>
>> You might want to consider playing with first taking a ticket, and
>> then, if we fail to acquire the lock immediately, incrementing
>> defer_preempt before we start spinning.
>>
>> The downside of this would be if we waste all our slice on spinning
>> and then get preempted in the critical section. But with ticket locks
>> you can easily see how many entries in the queue are in front of you.
>> So you could experiment with starting to defer preempt when we
>> notice we are getting toward the head of the queue.
>
> Mm - my goal is to avoid long spin times in the first place (because the
> owning vcpu was descheduled at an unfortunate time, i.e. while it was
> holding a lock). In that sense, I am targeting preemption-defer of the lock
> *holder* rather than of the lock acquirer. So ideally, whenever somebody
> tries to grab a lock it should be free most of the time; it can be held
> only if the owner is currently running - which means we won't have to spin
> too long for the lock.

Holding a ticket in the queue is effectively the same as holding the
lock, from the pov of processes waiting behind.

The difference of course is that CPU cycles do not directly reduce
latency of ticket holders (only the owner). Spinlock critical sections
should tend to be several orders of magnitude shorter than context
switch times. So if you preempt the guy waiting at the head of the
queue, then it's almost as bad as preempting the lock holder.

>> Have you also looked at how s390 checks if the owning vcpu is running
>> and if so it spins, if not yields to the hypervisor. Something like
>> turning it into an adaptive lock. This could be applicable as well.
>
> I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does
> gang scheduling of vcpus, which reduces the severity of this problem very
> much - essentially lock acquirer/holder are run simultaneously on different
> cpus all the time. Gang scheduling is on my list of things to look at much
> later (although I have been warned that it's a scalability nightmare!).

It effectively is pretty well an adaptive lock. The spinlock itself
doesn't sleep of course, but it yields to the hypervisor if the owner
has been preempted. This is pretty close to analogous with Linux adaptive
mutexes.

s390 also has the diag9c instruction which I suppose somehow boosts
priority of a preempted contended lock holder. In spite of any other
possible optimizations in their hypervisor like gang scheduling,
diag9c apparently provides quite a large improvement in some cases.

And they aren't even using ticket spinlocks!!

So I think these things are fairly important to look at.
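For comparison, the rough shape of that owner-aware behaviour as a sketch -
vcpu_is_running() and hypervisor_yield_to() are hypothetical stand-ins for
platform facilities of this kind, not existing APIs, and the lock itself is a
plain test-and-set rather than anything s390 actually uses:

#include <stdatomic.h>
#include <stdbool.h>

/* hypothetical platform hooks -- not real interfaces */
extern bool vcpu_is_running(int cpu);
extern void hypervisor_yield_to(int cpu);

struct owner_lock {
	_Atomic int holder;	/* -1 = free, otherwise cpu id of the owner */
};

static void owner_lock_acquire(struct owner_lock *l, int mycpu)
{
	int expected = -1;

	while (!atomic_compare_exchange_weak(&l->holder, &expected, mycpu)) {
		/* Spin while the owner is actually running; otherwise hand
		 * our timeslice to it so it can get out of the lock. */
		if (expected != -1 && !vcpu_is_running(expected))
			hypervisor_yield_to(expected);
		expected = -1;		/* retry the cmpxchg */
	}
}

static void owner_lock_release(struct owner_lock *l)
{
	atomic_store(&l->holder, -1);
}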
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> Holding a ticket in the queue is effectively the same as holding the
> lock, from the pov of processes waiting behind.
>
> The difference of course is that CPU cycles do not directly reduce
> latency of ticket holders (only the owner). Spinlock critical sections
> should tend to be several orders of magnitude shorter than context
> switch times. So if you preempt the guy waiting at the head of the
> queue, then it's almost as bad as preempting the lock holder.

Ok got it - although that approach is not advisable in some cases, for
example when the lock holder vcpu and lock acquirer vcpu are scheduled on
the same pcpu by the hypervisor (which was experimented with in [1], where
they found a huge hit in perf).

I agree that in general we should look at deferring preemption of the lock
acquirer, especially when it's at the "head", as you suggest - I will
consider that approach as the next step (want to incrementally progress
basically!).

>>> Have you also looked at how s390 checks if the owning vcpu is running
>>> and if so it spins, if not yields to the hypervisor. Something like
>>> turning it into an adaptive lock. This could be applicable as well.
>>
>> I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does
>> gang scheduling of vcpus, which reduces the severity of this problem very
>> much - essentially lock acquirer/holder are run simultaneously on different
>> cpus all the time. Gang scheduling is on my list of things to look at much
>> later (although I have been warned that it's a scalability nightmare!).
>
> It effectively is pretty well an adaptive lock. The spinlock itself
> doesn't sleep of course, but it yields to the hypervisor if the owner
> has been preempted. This is pretty close to analogous with Linux adaptive
> mutexes.

Oops, you are right - sorry, should have checked more closely earlier. Given
that we may not be able to always guarantee that locked critical sections
will not be preempted (ex: when a real-time task takes over), we will need a
combination of both approaches (i.e. request preemption defer on the lock
hold path + yield on the lock acquire path if the owner is !scheduled). The
advantage of the former approach is that it could reduce job turnaround
times in most cases (as the lock is available when we want it, or we don't
have to wait too long for it).

> s390 also has the diag9c instruction which I suppose somehow boosts
> priority of a preempted contended lock holder. In spite of any other
> possible optimizations in their hypervisor like gang scheduling,
> diag9c apparently provides quite a large improvement in some cases.

Ok - thanks for that pointer - will have a look at diag9c.

> So I think these things are fairly important to look at.

I agree ..

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
> Ok got it - although that approach is not advisable in some cases, for
> example when the lock holder vcpu and lock acquirer vcpu are scheduled on
> the same pcpu by the hypervisor (which was experimented with in [1], where
> they found a huge hit in perf).

1. http://lkml.org/lkml/2010/4/13/464

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 06:28:21PM +0530, Srivatsa Vaddagiri wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
>> Holding a ticket in the queue is effectively the same as holding the
>> lock, from the pov of processes waiting behind.
>>
>> The difference of course is that CPU cycles do not directly reduce
>> latency of ticket holders (only the owner). Spinlock critical sections
>> should tend to be several orders of magnitude shorter than context
>> switch times. So if you preempt the guy waiting at the head of the
>> queue, then it's almost as bad as preempting the lock holder.
>
> Ok got it - although that approach is not advisable in some cases, for
> example when the lock holder vcpu and lock acquirer vcpu are scheduled on
> the same pcpu by the hypervisor (which was experimented with in [1], where
> they found a huge hit in perf).

Sure, but if you had adaptive yielding, that solves that problem.

> I agree that in general we should look at deferring preemption of the lock
> acquirer, especially when it's at the "head", as you suggest - I will
> consider that approach as the next step (want to incrementally progress
> basically!).
>
>>>> Have you also looked at how s390 checks if the owning vcpu is running
>>>> and if so it spins, if not yields to the hypervisor. Something like
>>>> turning it into an adaptive lock. This could be applicable as well.
>>>
>>> I don't think even s390 does adaptive spinlocks. Also afaik s390 zVM does
>>> gang scheduling of vcpus, which reduces the severity of this problem very
>>> much - essentially lock acquirer/holder are run simultaneously on
>>> different cpus all the time. Gang scheduling is on my list of things to
>>> look at much later (although I have been warned that it's a scalability
>>> nightmare!).
>>
>> It effectively is pretty well an adaptive lock. The spinlock itself
>> doesn't sleep of course, but it yields to the hypervisor if the owner
>> has been preempted. This is pretty close to analogous with Linux adaptive
>> mutexes.
>
> Oops, you are right - sorry, should have checked more closely earlier.
> Given that we may not be able to always guarantee that locked critical
> sections will not be preempted (ex: when a real-time task takes over), we
> will need a combination of both approaches (i.e. request preemption defer
> on the lock hold path + yield on the lock acquire path if the owner is
> !scheduled). The advantage of the former approach is that it could reduce
> job turnaround times in most cases (as the lock is available when we want
> it, or we don't have to wait too long for it).

Both I think would be good. It might be interesting to talk with the
s390 guys and see if they can look at ticket locks and preempt defer
techniques too (considering they already do the other half of the
equation well).

>> s390 also has the diag9c instruction which I suppose somehow boosts
>> priority of a preempted contended lock holder. In spite of any other
>> possible optimizations in their hypervisor like gang scheduling,
>> diag9c apparently provides quite a large improvement in some cases.
>
> Ok - thanks for that pointer - will have a look at diag9c.
>
>> So I think these things are fairly important to look at.
>
> I agree ..
>
> - vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 11:45:00PM +1000, Nick Piggin wrote:
>> Ok got it - although that approach is not advisable in some cases, for
>> example when the lock holder vcpu and lock acquirer vcpu are scheduled on
>> the same pcpu by the hypervisor (which was experimented with in [1], where
>> they found a huge hit in perf).
>
> Sure, but if you had adaptive yielding, that solves that problem.

I guess so.

>> Oops, you are right - sorry, should have checked more closely earlier.
>> Given that we may not be able to always guarantee that locked critical
>> sections will not be preempted (ex: when a real-time task takes over), we
>> will need a combination of both approaches (i.e. request preemption defer
>> on the lock hold path + yield on the lock acquire path if the owner is
>> !scheduled). The advantage of the former approach is that it could reduce
>> job turnaround times in most cases (as the lock is available when we want
>> it, or we don't have to wait too long for it).
>
> Both I think would be good. It might be interesting to talk with the
> s390 guys and see if they can look at ticket locks and preempt defer
> techniques too (considering they already do the other half of the
> equation well).

Martin/Heiko,
	Do you want to comment on this?

- vatsa
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 12:06:39PM +0100, David Woodhouse wrote:
> On Tue, 2010-06-01 at 21:36 +0200, Andi Kleen wrote:
>>> Collecting the contention/usage statistics on a per spinlock
>>> basis seems complex. I believe a practical approximation
>>> to this are adaptive mutexes where upon hitting a spin
>>> time threshold, punt and let the scheduler reconcile fairness.
>>
>> That would probably work, except: how do you get the
>> adaptive spinlock into a paravirt op without slowing
>> down a standard kernel?
>
> It only ever comes into play in the case where the spinlock is contended
> anyway -- surely it shouldn't be _that_ much of a performance issue?

The problem is fitting the state into the u32.

Also "lightly contended" is not that uncommon.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
> And they aren't even using ticket spinlocks!!

I suppose they simply don't have unfair memory. Makes things easier.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.
Re: [PATCH] use unfair spinlock when running on hypervisor.
On Thu, Jun 03, 2010 at 05:17:30PM +0200, Andi Kleen wrote:
> On Thu, Jun 03, 2010 at 10:38:32PM +1000, Nick Piggin wrote:
>> And they aren't even using ticket spinlocks!!
>
> I suppose they simply don't have unfair memory. Makes things easier.

That would certainly be a part of it. I'm sure they have stronger fairness
guarantees at the expense of some performance. We saw the spinlock
starvation first on 8-16 core Opterons I think, whereas Altix had been over
1024 cores and POWER7 is at 1024 threads now, apparently without reported
problems.

However I think more is needed than simply "fair" memory at the cache
coherency level, considering that, for example, s390 implements its lock
simply by retrying cas until it succeeds. You could perfectly round-robin
all cache requests for the lock word, but one core could quite easily always
find it is granted the cacheline when the lock is already taken.

So I think actively enforcing fairness at the lock level would be required.
Something like: if it is detected that a core is not making progress on a
tight cas loop, then it will need to enter a queue of cores where the head
of the queue is always granted the cacheline first after it has been
dirtied. Interrupts will need to be ignored from this logic.

This still doesn't solve the problem of an owner unfairly releasing and
grabbing the lock again; they could have more detection to handle that.

I don't know how far hardware goes. Maybe it is enough to statistically
avoid starvation if memory is pretty fair. But it does seem a lot easier to
enforce fairness in software.
Re: [PATCH] use unfair spinlock when running on hypervisor.
> That would certainly be a part of it. I'm sure they have stronger fairness
> guarantees at the expense of some performance. We saw the spinlock
> starvation first on 8-16 core Opterons I think, whereas Altix had been over
> 1024 cores and POWER7 is at 1024 threads now, apparently without reported
> problems.

I suppose P7 handles that in the HV through the pvcall.

Altix AFAIK has special hardware for this in the interconnect, but as
individual nodes get larger and have more cores you'll start seeing it there
too.

In general we now have the problem that, with increasing core counts per
socket, each NUMA node can be a fairly large SMP by itself, and several of
the old SMP scalability problems that were fixed by having per-node data
structures are back now. For example this is a serious problem with the zone
locks in some workloads now on 8core+HT systems.

> So I think actively enforcing fairness at the lock level would be
> required. Something like if it is detected that a core is not making

I suppose how that exactly works is IBM's secret sauce. Anyways, as long as
there are no reports I wouldn't worry about it.

-Andi

--
a...@linux.intel.com -- Speaking for myself only.