Back in 2008 Gregory Haskins noticed that there was unfair locking when dealing with double_lock_balance(). The old code (which still exists when CONFIG_PREEMPT is disabled), on contention, will check the ordering of the run-queue locks (by address) to determine if it should release the current lock before taking the second lock.
if (!spin_trylock(busiest->lock)) { if (busiest < this_rq) { spin_unlock(&this_rq->lock); spin_lock(&busiest->lock); spin_lock(&this_rq->lock); } else spin_lock(&busiest->lock); } This gives unfair access for higher CPUs who are trying to take a contended lock. For lower numbered CPUs (with lower address of their rq structure), it wont release its lock when trying to take the next lock. This can cause an escalation of latency in locking if there's a waiter on the current 'this_rq->lock', which now must also wait for the 'busiest->lock' to be taken. The solution was to simply release the current (this_rq) lock and then take both locks. spin_unlock(&this_rq->lock); double_rq_lock(this_rq, busiest); Where double_rq_lock() takes the locks in order of the rq's address. This way, if there was a waiter on the rq lock, due to the ticket spinlocks, it would grab the lock immediately, and the current owner will have to wait for it now that it released the lock. What I found while testing an 80 CPU core was a large "ping-pong"ing around of the locks. Let's say that an RT task on a CPU is woken up from another CPU, and the CPU that the task was woken on is also running an RT task and tries to push: CPU 0 CPU 1 ----- ----- [ wake up ] spin_lock(cpu1_rq->lock); spin_lock(cpu1_rq->lock) double_lock_balance() [ release cpu1_rq->lock ] spin_lock(cpu1_rq->lock) [due to ticket, now acquires cpu1_rq->lock ] [goes to push task] double_lock_balance() [ release cpu1_rq->lock ] [ acquires lock ] spin_lock(cpu2_rq->lock) [ blocks as cpu2 is using it ] And then CPU2 would call double_lock_balance() and the ping pong continues. What I could not understand about Gregory's patch is that regardless of contention, the currently held lock is always released, opening up a window for this ping ponging to occur. When I changed the code to only release on contention of the second lock, things improved tremendously. The problem that Gregory was facing was that there was an unfair access when there was contention. But what his solution did was to always release the lock even when there was no contention on the second lock, which happened to cause more contention later on. I talked with Peter Zijlstra about this, and he pointed me to the thread where this patch was discussed back in 2008. As I read the thread, Nick Piggin brought up this very issue. After the little discussion, it was determined to let the waiting task have the lock even when there was no contention to be even more "fair". The problem though, Nick's suggestion was never tested. If it had been back then, I'm sure that Gregory would have decided to only release the lock if there was contention on the second lock. Link: http://lkml.kernel.org/r/20080825201534.23217.14936.st...@dev.haskins.net Suggested-by: Nick Piggin <nickpig...@yahoo.com.au> Signed-off-by: Steven Rostedt <rost...@goodmis.org> --- diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index ec2e8d23527e..3ed9ef770085 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -1548,10 +1548,15 @@ static inline int _double_lock_balance(struct rq *this_rq, struct rq *busiest) __acquires(busiest->lock) __acquires(this_rq->lock) { - raw_spin_unlock(&this_rq->lock); - double_rq_lock(this_rq, busiest); + int ret = 0; + + if (unlikely(!raw_spin_trylock(&busiest->lock))) { + raw_spin_unlock(&this_rq->lock); + double_rq_lock(this_rq, busiest); + ret = 1; + } - return 1; + return ret; } #else