Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 1/25/2013 11:05 AM, Rik van Riel wrote:
> [...]
> For the v4 series, I added code to keep the maximum spinlock delay
> to a small value when running in a virtual machine. That should
> solve the performance regression seen in virtual machines. The
> performance issue observed with AIM7 is still a mystery.
> [...]
> Please let me know if you manage to break this code in any way, so
> I can fix it...

I got back on the machine and re-ran the AIM7-highsystime
microbenchmark with 2000 users and 100 jobs per user on 20, 40 and
80 vcpu guests, using a 3.7.5 kernel with and without Rik's latest
patches.

Host platform: 8 socket (80 core) Westmere with 1TB RAM.

Config 1 : 3.7.5 base running on host and in the guest
Config 2 : 3.7.5 + Rik's patches running on host and in the guest

(Note: I didn't change the PLE settings on the host. The severe drop
at 40 way and 80 way is due to the un-optimized PLE handler. Raghu's
PLE fixes should address those.)

          Config 1   Config 2
20vcpu -    170K       168K
40vcpu -     37K        37K
80vcpu -   10.5K      11.5K

Not much difference between the two configs (need to test it along
with Raghu's fixes).

BTW, I noticed that there were results posted using the AIM7-compute
workload earlier. AIM7-highsystime is a lot more kernel intensive.

FYI
Vinod
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
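The semaphore lock/unlock test case discussed in this thread (N threads hammering one semaphore, reporting aggregate operations per second) can be approximated in userspace along the following lines. This is a hypothetical reconstruction for illustration, not Rik's actual test program; all names (`run_bench`, `bench_worker`) are made up:

```c
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

#define MAX_THREADS 64

static sem_t bench_sem;
static volatile int bench_stop;

struct worker_state {
    long ops;
    char pad[64 - sizeof(long)]; /* avoid false sharing of counters */
};

static void *bench_worker(void *arg)
{
    struct worker_state *w = arg;
    while (!bench_stop) {
        sem_wait(&bench_sem);  /* "lock"   */
        sem_post(&bench_sem);  /* "unlock" */
        w->ops++;
    }
    return NULL;
}

/* Run nthreads workers for `seconds`, return aggregate ops/second. */
long run_bench(int nthreads, int seconds)
{
    pthread_t tid[MAX_THREADS];
    static struct worker_state state[MAX_THREADS];
    long total = 0;

    bench_stop = 0;
    sem_init(&bench_sem, 0, 1);
    for (int i = 0; i < nthreads; i++) {
        state[i].ops = 0;
        pthread_create(&tid[i], NULL, bench_worker, &state[i]);
    }
    sleep(seconds);
    bench_stop = 1;
    for (int i = 0; i < nthreads; i++) {
        pthread_join(tid[i], NULL);
        total += state[i].ops;
    }
    sem_destroy(&bench_sem);
    return total / seconds;
}
```

A single-threaded run stays in one CPU's cache; adding threads makes the semaphore's cache line (and the spinlock protecting it) bounce between CPUs, which is exactly the contention pattern the backoff series targets.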
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 01/26/2013 12:35 AM, Rik van Riel wrote:
> [...]
> For the v4 series, I added code to keep the maximum spinlock delay
> to a small value when running in a virtual machine. That should
> solve the performance regression seen in virtual machines.
> [...]
> Please let me know if you manage to break this code in any way, so
> I can fix it...

After the introduction of patch 5, which limits the delay loops to 16,
I am no longer seeing the degradation in virtual guests that I
reported earlier, but improvements.

For the whole series:

Reviewed-by: Raghavendra K T <raghavendra...@linux.vnet.ibm.com>
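The VM-specific cap from patch 5 (limit the delay loop to 16 when virtualized) can be sketched as below. The actual patch presumably uses the kernel's own paravirt/hypervisor detection; here, purely for illustration, guest detection uses the x86 hypervisor-present bit (CPUID leaf 1, ECX bit 31, which KVM and other hypervisors set). The constant names and the bare-metal cap value are made up; only the "cap at 16 in a guest" figure comes from the thread:

```c
#include <cpuid.h>
#include <stdbool.h>

/* Illustrative constants -- only DELAY_CAP_GUEST (16) is taken from
 * the thread; the bare-metal cap here is an arbitrary placeholder. */
#define DELAY_CAP_BARE_METAL  (1u << 16)
#define DELAY_CAP_GUEST       16u

/* CPUID.1:ECX bit 31 is the hypervisor-present bit on x86. */
static bool running_in_vm(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return false;
    return ecx & (1u << 31);
}

unsigned int max_spinlock_delay(void)
{
    return running_in_vm() ? DELAY_CAP_GUEST : DELAY_CAP_BARE_METAL;
}
```

Capping the delay in a guest matters because a long busy-wait burns the vcpu's timeslice while the lock holder's vcpu may be preempted entirely, which is why uncapped backoff regressed virtualized workloads.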
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Sat, 2013-01-26 at 13:05 +0100, Ingo Molnar wrote:
> * Mike Galbraith wrote:
> > On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> > > The performance issue observed with AIM7 is still a mystery.
> >
> > Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40
> > core box. Stock scheduler knobs are too preempt happy, produce
> > unstable results. I twiddle them as below to stabilize
> > results.
> >
> > I'm testing a load balancing series from Alex Shi with AIM7
> > and whatnot, added your series on top of it and retested.
> > What I see is improvement.
>
> Since we want both series that's good news, agreed?

Yes, both look good.

BTW, the balance and powersaving low end improvements in the Alex set
are not due to any select_idle_sibling() misbehavior as one might
suspect when looking at the set; it's innocent (this time), it's the
balancing policies themselves. I don't see where it comes from, gotta
be some turbo thingy methinks, but who cares, better is better.

-Mike
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
* Mike Galbraith wrote:
> On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> > The performance issue observed with AIM7 is still a mystery.
>
> Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40
> core box. Stock scheduler knobs are too preempt happy, produce
> unstable results. I twiddle them as below to stabilize
> results.
>
> I'm testing a load balancing series from Alex Shi with AIM7
> and whatnot, added your series on top of it and retested.
> What I see is improvement.

Since we want both series that's good news, agreed?

Thanks,

	Ingo
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> The performance issue observed with AIM7 is still a mystery.

Hm. AIM7 mystery _may_ be the same crud I see on a 4 node 40 core
box. Stock scheduler knobs are too preempt happy, produce unstable
results. I twiddle them as below to stabilize results.

I'm testing a load balancing series from Alex Shi with AIM7 and
whatnot, added your series on top of it and retested. What I see is
improvement.

Oodles of numbers follow. Sorry that your numbers are mixed in with
my numbers, but this is just an excerpt from my test log, and I'm too
lazy to reformat and filter. You can save wear and tear on your
eyeballs by just poking 'D'. There does appear to be evidence that
your patch set improved this load though, so in case you want to see
numbers, here come a bunch, a quick scroll-by may be worth it. The
very heavy load end did not improve, which seems odd, but whatever.

Numbers...

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

aim7 compute
       3.8.0-performance                        3.8.0-balance                            3.8.0-powersaving
Tasks   jobs/min jti jobs/min/task  real    cpu    jobs/min jti jobs/min/task  real    cpu    jobs/min jti jobs/min/task  real    cpu
    1     432.86 100      432.8571 14.00   3.99      433.48 100      433.4764 13.98   3.97      433.17 100      433.1665 13.99   3.98
    1     437.23 100      437.2294 13.86   3.85      436.60 100      436.5994 13.88   3.86      435.66 100      435.6578 13.91   3.90
    1     434.10 100      434.0974 13.96   3.95      436.29 100      436.2851 13.89   3.89      436.29 100      436.2851 13.89   3.87
    5    2400.95  99      480.1902 12.62  12.49     2554.81  98      510.9612 11.86   7.55     2487.68  98      497.5369 12.18   8.22
    5    2341.58  99      468.3153 12.94  13.95     2578.72  99      515.7447 11.75   7.25     2527.11  99      505.4212 11.99   7.90
    5    2350.66  99      470.1319 12.89  13.66     2600.86  99      520.1717 11.65   7.09     2508.28  98      501.6556 12.08   8.24
   10    4291.78  99      429.1785 14.12  40.14     5334.51  99      533.4507 11.36  11.13     5183.92  98      518.3918 11.69  12.15
   10    4334.76  99      433.4764 13.98  38.70     5311.13  99      531.1131 11.41  11.23     5215.15  99      521.5146 11.62  12.53
   10    4273.62  99      427.3625 14.18  40.29     5287.96  99      528.7958 11.46  11.46     5144.31  98      514.4312 11.78  12.32
   20    8487.39  94      424.3697 14.28  63.14    10594.41  99      529.7203 11.44  23.72    10575.92  99      528.7958 11.46  22.08
   20    8387.54  97      419.3772 14.45  77.01    10575.92  98      528.7958 11.46  23.41    10520.83  99      526.0417 11.52  21.88
   20    8713.16  95      435.6578 13.91  55.10    10659.63  99      532.9815 11.37  24.17    10539.13  99      526.9565 11.50  22.13
   40   16786.70  99      419.6676 14.44 170.08    19469.88  98      486.7470 12.45  60.78    19967.05  98      499.1763 12.14  51.40
   40   16728.78  99      418.2195 14.49 172.96    19627.53  98      490.6883 12.35  65.26    20386.88  98      509.6720 11.89  46.91
   40   16763.49  99      419.0871 14.46 171.42    20033.06  98      500.8264 12.10  51.44    20682.59  98      517.0648 11.72  42.45
   80   33024.52  98      412.8065 14.68 355.10    33205.48  98      415.0685 14.60 336.90    33690.06  97      421.1258 14.39 248.91
   80   33002.04  99      412.5255 14.69 356.27    33949.58  96      424.3697 14.28 283.87    33160.05  97      414.5007 14.62 264.85
   80   33047.03  99      413.0879 14.67 355.22    33137.39  98      414.2174 14.63 338.92    33526.97  97      419.0871 14.46 257.31
  160   64254.47  98      401.5905 15.09 391.30    64000.00  98      400.0000 15.15 396.87    65073.83  97      406.7114 14.90 371.09
  160   64468.09  98      402.9255 15.04 390.28    64553.93  98      403.4621 15.02 389.49    64640.00  98      404.0000 15.00 379.82
  160   64297.08  98      401.8568 15.08 389.45    64856.19  98      405.3512 14.95 383.64    64683.12  98      404.2695 14.99 379.43
  320  121579.94  98
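For reference, the three knob values Mike lists translate into nanosecond writes to the scheduler sysctls like this. The procfs paths below are as on 3.8-era kernels (these files are available with CONFIG_SCHED_DEBUG=y and must be written as root); they are shown as an assumption, not taken from Mike's mail:

```shell
# Mike's values, converted to nanoseconds (3.8-era sysctl paths).
echo 24000000 > /proc/sys/kernel/sched_latency_ns
echo  8000000 > /proc/sys/kernel/sched_min_granularity_ns
echo 10000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
```

Larger granularity values mean fewer preemptions, which is what stabilizes the otherwise noisy AIM7 results Mike describes.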
[PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
Many spinlocks are embedded in data structures; having many CPUs
pounce on the cache line the lock is in will slow down the lock
holder, and can cause system performance to fall off a cliff.

The paper "Non-scalable locks are dangerous" is a good reference:

  http://pdos.csail.mit.edu/papers/linux:lock.pdf

In the Linux kernel, spinlocks are optimized for the case of there
not being contention. After all, if there is contention, the data
structure can be improved to reduce or eliminate lock contention.

Likewise, the spinlock API should remain simple, and the common case
of the lock not being contended should remain as fast as ever.

However, since spinlock contention should be fairly uncommon, we can
add functionality into the spinlock slow path that keeps system
performance from falling off a cliff when there is lock contention.

Proportional delay in ticket locks is delaying the time between
checking the ticket based on a delay factor, and the number of CPUs
ahead of us in the queue for this lock. Checking the lock less often
allows the lock holder to continue running, resulting in better
throughput and preventing performance from dropping off a cliff.

The test case has a number of threads locking and unlocking a
semaphore. With just one thread, everything sits in the CPU cache and
throughput is around 2.6 million operations per second, with a 5-10%
variation.

Once a second thread gets involved, data structures bounce from CPU
to CPU, and performance deteriorates to about 1.25 million operations
per second, with a 5-10% variation.

However, as more and more threads get added to the mix, performance
with the vanilla kernel continues to deteriorate. Once I hit 24
threads, on a 24 CPU, 4 node test system, performance is down to
about 290k operations/second.

With a proportional backoff delay added to the spinlock code,
performance with 24 threads goes up to about 400k operations/second
with a 50x delay, and about 900k operations/second with a 250x delay.
However, with a 250x delay, performance with 2-5 threads is worse
than with a 50x delay.

Making the code auto-tune the delay factor results in a system that
performs well with both light and heavy lock contention, and should
also protect against the (likely) case of the fixed delay factor
being wrong for other hardware.

The attached graph shows the performance of the multi-threaded
semaphore lock/unlock test case, with 1-24 threads, on the vanilla
kernel, with 10x, 50x, and 250x proportional delay, as well as the
v1 patch series with autotuning for 2x and 2.7x spinning before the
lock is obtained, and with the v2 series.

The v2 series integrates several ideas from Michel Lespinasse and
Eric Dumazet, which should result in better throughput and nicer
behaviour in situations with contention on multiple locks.

For the v3 series, I tried out all the ideas suggested by Michel.
They made perfect sense, but in the end it turned out they did not
work as well as the simple, aggressive "try to make the delay longer"
policy I have now. Several small bug fixes and cleanups have been
integrated.

For the v4 series, I added code to keep the maximum spinlock delay
to a small value when running in a virtual machine. That should solve
the performance regression seen in virtual machines. The performance
issue observed with AIM7 is still a mystery.

Performance is within the margin of error of v2, so the graph has
not been updated.

Please let me know if you manage to break this code in any way, so
I can fix it...

--
All rights reversed.

attachment: spinlock-backoff-v2.png
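The proportional delay scheme described above can be modeled in userspace with C11 atomics. This is a simplified sketch of the idea only, not the kernel patch itself: the delay factor is fixed here (the actual series auto-tunes it per CPU and caps it in guests), and the type and constant names are illustrative:

```c
#include <stdatomic.h>

typedef struct {
    atomic_uint next;   /* next ticket to hand out          */
    atomic_uint owner;  /* ticket currently holding the lock */
} ticket_lock_t;

/* Fixed delay factor for illustration; the real series tunes this. */
static const unsigned int delay_factor = 50;

static inline void cpu_relax_hint(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ __volatile__("pause");
#endif
}

void ticket_lock(ticket_lock_t *l)
{
    unsigned int ticket = atomic_fetch_add(&l->next, 1);

    for (;;) {
        unsigned int cur = atomic_load(&l->owner);

        if (cur == ticket)
            return;
        /* Back off proportionally to our distance from the head of
         * the queue: waiters further back poll the lock's cache line
         * less often, so the holder (and the next-in-line waiter)
         * keep the line longer and make faster progress. */
        unsigned int waiters_ahead = ticket - cur;
        for (unsigned int i = 0; i < waiters_ahead * delay_factor; i++)
            cpu_relax_hint();
    }
}

void ticket_unlock(ticket_lock_t *l)
{
    atomic_fetch_add(&l->owner, 1);
}
```

The uncontended fast path is untouched: a thread whose ticket equals `owner` returns without delaying at all, which preserves the "common case stays fast" property the cover letter insists on.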
[PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
Many spinlocks are embedded in data structures; having many CPUs pounce on the cache line the lock is in will slow down the lock holder, and can cause system performance to fall off a cliff. The paper Non-scalable locks are dangerous is a good reference: http://pdos.csail.mit.edu/papers/linux:lock.pdf In the Linux kernel, spinlocks are optimized for the case of there not being contention. After all, if there is contention, the data structure can be improved to reduce or eliminate lock contention. Likewise, the spinlock API should remain simple, and the common case of the lock not being contended should remain as fast as ever. However, since spinlock contention should be fairly uncommon, we can add functionality into the spinlock slow path that keeps system performance from falling off a cliff when there is lock contention. Proportional delay in ticket locks is delaying the time between checking the ticket based on a delay factor, and the number of CPUs ahead of us in the queue for this lock. Checking the lock less often allows the lock holder to continue running, resulting in better throughput and preventing performance from dropping off a cliff. The test case has a number of threads locking and unlocking a semaphore. With just one thread, everything sits in the CPU cache and throughput is around 2.6 million operations per second, with a 5-10% variation. Once a second thread gets involved, data structures bounce from CPU to CPU, and performance deteriorates to about 1.25 million operations per second, with a 5-10% variation. However, as more and more threads get added to the mix, performance with the vanilla kernel continues to deteriorate. Once I hit 24 threads, on a 24 CPU, 4 node test system, performance is down to about 290k operations/second. With a proportional backoff delay added to the spinlock code, performance with 24 threads goes up to about 400k operations/second with a 50x delay, and about 900k operations/second with a 250x delay. 
However, with a 250x delay, performance with 2-5 threads is worse than with a 50x delay. Making the code auto-tune the delay factor results in a system that performs well with both light and heavy lock contention, and should also protect against the (likely) case of the fixed delay factor being wrong for other hardware. The attached graph shows the performance of the multi threaded semaphore lock/unlock test case, with 1-24 threads, on the vanilla kernel, with 10x, 50x, and 250x proportional delay, as well as the v1 patch series with autotuning for 2x and 2.7x spinning before the lock is obtained, and with the v2 series. The v2 series integrates several ideas from Michel Lespinasse and Eric Dumazet, which should result in better throughput and nicer behaviour in situations with contention on multiple locks. For the v3 series, I tried out all the ideas suggested by Michel. They made perfect sense, but in the end it turned out they did not work as well as the simple, aggressive try to make the delay longer policy I have now. Several small bug fixes and cleanups have been integrated. For the v4 series, I added code to keep the maximum spinlock delay to a small value when running in a virtual machine. That should solve the performance regression seen in virtual machines. The performance issue observed with AIM7 is still a mystery. Performance is within the margin of error of v2, so the graph has not been update. Please let me know if you manage to break this code in any way, so I can fix it... -- All rights reversed. attachment: spinlock-backoff-v2.png
Re: [PATCH -v4 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Fri, 2013-01-25 at 14:05 -0500, Rik van Riel wrote:
> The performance issue observed with AIM7 is still a mystery.

Hm. The AIM7 mystery _may_ be the same crud I see on a 4 node 40 core box. Stock scheduler knobs are too preempt happy, and produce unstable results; I twiddle them as below to stabilize results.

I'm testing a load balancing series from Alex Shi with AIM7 and whatnot; I added your series on top of it and retested. What I see is improvement.

Oodles of numbers follow. Sorry that your numbers are mixed in with my numbers, but this is just an excerpt from my test log, and I'm too lazy to reformat and filter. You can save wear and tear on your eyeballs by just poking 'D'. There does appear to be evidence that your patch set improved this load though, so in case you want to see numbers, here come a bunch; a quick scroll-by may be worth it. The very heavy load end did not improve, which seems odd, but whatever.

Numbers...

sched_latency_ns = 24ms
sched_min_granularity_ns = 8ms
sched_wakeup_granularity_ns = 10ms

aim7 compute
            3.8.0-performance                        3.8.0-balance                            3.8.0-powersaving
Tasks   jobs/min jti jobs/min/task  real    cpu   jobs/min jti jobs/min/task  real    cpu   jobs/min jti jobs/min/task  real    cpu
    1     432.86 100      432.8571 14.00   3.99     433.48 100      433.4764 13.98   3.97     433.17 100      433.1665 13.99   3.98
    1     437.23 100      437.2294 13.86   3.85     436.60 100      436.5994 13.88   3.86     435.66 100      435.6578 13.91   3.90
    1     434.10 100      434.0974 13.96   3.95     436.29 100      436.2851 13.89   3.89     436.29 100      436.2851 13.89   3.87
    5    2400.95  99      480.1902 12.62  12.49    2554.81  98      510.9612 11.86   7.55    2487.68  98      497.5369 12.18   8.22
    5    2341.58  99      468.3153 12.94  13.95    2578.72  99      515.7447 11.75   7.25    2527.11  99      505.4212 11.99   7.90
    5    2350.66  99      470.1319 12.89  13.66    2600.86  99      520.1717 11.65   7.09    2508.28  98      501.6556 12.08   8.24
   10    4291.78  99      429.1785 14.12  40.14    5334.51  99      533.4507 11.36  11.13    5183.92  98      518.3918 11.69  12.15
   10    4334.76  99      433.4764 13.98  38.70    5311.13  99      531.1131 11.41  11.23    5215.15  99      521.5146 11.62  12.53
   10    4273.62  99      427.3625 14.18  40.29    5287.96  99      528.7958 11.46  11.46    5144.31  98      514.4312 11.78  12.32
   20    8487.39  94      424.3697 14.28  63.14   10594.41  99      529.7203 11.44  23.72   10575.92  99      528.7958 11.46  22.08
   20    8387.54  97      419.3772 14.45  77.01   10575.92  98      528.7958 11.46  23.41   10520.83  99      526.0417 11.52  21.88
   20    8713.16  95      435.6578 13.91  55.10   10659.63  99      532.9815 11.37  24.17   10539.13  99      526.9565 11.50  22.13
   40   16786.70  99      419.6676 14.44 170.08   19469.88  98      486.7470 12.45  60.78   19967.05  98      499.1763 12.14  51.40
   40   16728.78  99      418.2195 14.49 172.96   19627.53  98      490.6883 12.35  65.26   20386.88  98      509.6720 11.89  46.91
   40   16763.49  99      419.0871 14.46 171.42   20033.06  98      500.8264 12.10  51.44   20682.59  98      517.0648 11.72  42.45
   80   33024.52  98      412.8065 14.68 355.10   33205.48  98      415.0685 14.60 336.90   33690.06  97      421.1258 14.39 248.91
   80   33002.04  99      412.5255 14.69 356.27   33949.58  96      424.3697 14.28 283.87   33160.05  97      414.5007 14.62 264.85
   80   33047.03  99      413.0879 14.67 355.22   33137.39  98      414.2174 14.63 338.92   33526.97  97      419.0871 14.46 257.31
  160   64254.47  98      401.5905 15.09 391.30   64000.00  98      400.0000 15.15 396.87   65073.83  97      406.7114 14.90 371.09
  160   64468.09  98      402.9255 15.04 390.28   64553.93  98      403.4621 15.02 389.49   64640.00  98      404.0000 15.00 379.82
  160   64297.08  98      401.8568 15.08 389.45   64856.19  98      405.3512 14.95 383.64   64683.12  98      404.2695 14.99 379.43
  320  121579.94  98
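The scheduler knob values quoted above can be applied via sysctl, converted from milliseconds to the nanosecond units the knobs use. On kernels of this era these CFS tunables are exposed under kernel.* when CONFIG_SCHED_DEBUG=y; the exact paths (and, on later kernels, whether they exist at all) vary, so treat this as a sketch rather than a portable recipe.

```shell
# 24ms / 8ms / 10ms from the post, expressed in nanoseconds.
sysctl -w kernel.sched_latency_ns=24000000
sysctl -w kernel.sched_min_granularity_ns=8000000
sysctl -w kernel.sched_wakeup_granularity_ns=10000000
```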