Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
* Mike Galbraith wrote:

> On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote:
>> On 01/10/2013 10:19 AM, Mike Galbraith wrote:
>>> On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
>>>
>>>> Please let me know if you manage to break this code in any way,
>>>> so I can fix it...
>>>
>>> I didn't break it, but did let it play with rq->lock contention. Using
>>> cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
>>> pull around appears to have been a ~dead heat.
>>
>> Good to hear that the code seems to be robust. It seems to
>> help prevent performance degradation in some workloads, and
>> nobody seems to have found regressions yet.
>
> I had hoped for a bit of positive, but a wash isn't surprising
> given the profile. I tried tbench too, didn't expect to see
> anything at all there, and got that.. so both results are
> positive in that respect.

Ok, that's good. Rik, mind re-sending the latest series with all the
acks and Reviewed-by's added?

Thanks,

	Ingo
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 01/12/2013 01:41 AM, Rik van Riel wrote:
> On 01/10/2013 12:36 PM, Raghavendra K T wrote:
>> * Rafael Aquini [2013-01-10 00:27:23]:
>>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>>> x base_3.8rc2
>>>> + rik_backoff
>>>>     N        Min        Max     Median        Avg     Stddev
>>>> x   8    222.977     231.16    227.735    227.388  3.1512986
>>>> +   8     218.75    232.347   229.1035  228.25425  4.2730225
>>>> No difference proven at 95.0% confidence
>>>
>>> I got similar results on smaller systems (1 socket, dual-cores and
>>> quad-cores) when running Rik's latest series, no big difference for
>>> good nor for worse, but I also think Rik's work is meant to address
>>> bigger systems with more cores contending for any given spinlock.
>>
>> I was able to do the test on the same 32 core machine with 4 guests
>> (8GB RAM, 32 vcpu). Here are the results.
>>
>> base = 3.8-rc2
>> patched = base + Rik V3 backoff series [patch 1-4]
>
> I believe I understand why this is happening. Modern Intel and AMD
> CPUs have a feature called Pause Loop Exiting (PLE) and Pause Filter
> (PF), respectively. This feature is used to trap to the host when the
> guest is spinning on a spinlock. This allows the host to run something
> else, and lets the spinner temporarily yield the CPU.
>
> Effectively, this causes the KVM code to already do a limited amount
> of spinlock backoff in the host. Adding more backoff code in the guest
> can lead to wild delays in acquiring locks, and generally bad
> performance.

Yes, agree with you.

> I suspect that when running in a virtual machine, we should limit the
> delay factor to something much smaller, since the host will take care
> of most of the backoff for us.

Even for the non-PLE case I believe it would be difficult to tune the
delay, because of VCPU scheduling and LHP (lock holder preemption).

> Maybe a maximum delay value of ~10 would do the trick for KVM guests.
>
> We should be able to get this right by placing the value for the
> maximum delay in a __read_mostly section and setting it to something
> small from an init function when we detect we are running in a
> virtual machine.
>
> Let me cook up, and test, a patch that does that...

Sure.. Awaiting and happy to test the patches. I also tried a few
things on my own, and also looked at how it behaves without patch 4.
Nothing helped.
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 01/10/2013 12:36 PM, Raghavendra K T wrote:
> * Rafael Aquini [2013-01-10 00:27:23]:
>> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>>> x base_3.8rc2
>>> + rik_backoff
>>>     N        Min        Max     Median        Avg     Stddev
>>> x   8    222.977     231.16    227.735    227.388  3.1512986
>>> +   8     218.75    232.347   229.1035  228.25425  4.2730225
>>> No difference proven at 95.0% confidence
>>
>> I got similar results on smaller systems (1 socket, dual-cores and
>> quad-cores) when running Rik's latest series, no big difference for
>> good nor for worse, but I also think Rik's work is meant to address
>> bigger systems with more cores contending for any given spinlock.
>
> I was able to do the test on the same 32 core machine with 4 guests
> (8GB RAM, 32 vcpu). Here are the results.
>
> base = 3.8-rc2
> patched = base + Rik V3 backoff series [patch 1-4]

I believe I understand why this is happening. Modern Intel and AMD
CPUs have a feature called Pause Loop Exiting (PLE) and Pause Filter
(PF), respectively. This feature is used to trap to the host when the
guest is spinning on a spinlock. This allows the host to run something
else, and lets the spinner temporarily yield the CPU.

Effectively, this causes the KVM code to already do a limited amount
of spinlock backoff in the host. Adding more backoff code in the guest
can lead to wild delays in acquiring locks, and generally bad
performance.

I suspect that when running in a virtual machine, we should limit the
delay factor to something much smaller, since the host will take care
of most of the backoff for us.

Maybe a maximum delay value of ~10 would do the trick for KVM guests.

We should be able to get this right by placing the value for the
maximum delay in a __read_mostly section and setting it to something
small from an init function when we detect we are running in a
virtual machine.

Let me cook up, and test, a patch that does that...

-- 
All rights reversed
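A minimal sketch of the tweak Rik describes above (keeping the maximum
backoff delay in a __read_mostly variable and clamping it from an init
function when a hypervisor is detected) might look like the following.
This is illustrative only, not the patch Rik later posted: the names
max_spinlock_delay and init_spinlock_delay and the numeric values are
made up for this sketch, and the guest check simply uses the existing
X86_FEATURE_HYPERVISOR CPU feature bit.

	/* Illustrative sketch only -- names are hypothetical, not from the series. */

	#include <linux/init.h>
	#include <linux/cache.h>
	#include <asm/processor.h>
	#include <asm/cpufeature.h>

	/*
	 * Upper bound on the per-waiter backoff delay.  It is read on
	 * every contended spin and written essentially never, hence
	 * __read_mostly.
	 */
	#define MAX_SPINLOCK_DELAY_BARE_METAL	16000
	#define MAX_SPINLOCK_DELAY_GUEST	10	/* host PLE/PF already backs off */

	static int max_spinlock_delay __read_mostly = MAX_SPINLOCK_DELAY_BARE_METAL;

	static int __init init_spinlock_delay(void)
	{
		/*
		 * When running as a guest, the host's Pause Loop Exiting /
		 * Pause Filter handling already yields the CPU for us, so a
		 * large guest-side delay only adds latency.  Clamp it.
		 */
		if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
			max_spinlock_delay = MAX_SPINLOCK_DELAY_GUEST;

		return 0;
	}
	core_initcall(init_spinlock_delay);

The point of __read_mostly here is that the cap is consulted on every
contended spin but written only once at boot, so it is kept away from
frequently written data.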
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 1/8/2013 2:26 PM, Rik van Riel wrote:
<...>
> Performance is within the margin of error of v2, so the graph has
> not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...

Attached below is some preliminary data with one of the AIM7
micro-benchmark workloads (i.e. high_systime). This is a kernel
intensive workload which does tons of forks/execs etc. and stresses
quite a few of the same set of spinlocks and semaphores.

Observed a drop in performance as we go to 40-way and 80-way.
Wondering if the backoff keeps increasing to such an extent that it
actually starts to hurt, given the nature of this workload? Also, in
the case of 80-way, observed quite a bit of variation from run to
run...

Also ran it inside a single KVM guest. There were some perf dips, but
interestingly didn't observe the same level of drop (compared to the
drop in the native case) as the guest size was scaled up to 40 vcpu
or 80 vcpu.

FYI
Vinod

---

Platform: 8 socket (80 core) Westmere with 1TB RAM.

Workload: AIM7-high_systime microbenchmark - 2000 users & 100 jobs per
user. Values reported are Jobs Per Minute (higher is better). The
values are the average of 3 runs.

1) Native run:

Config 1: 3.7 kernel
Config 2: 3.7 + Rik's 1-4 patches

----------------------------------------------------
              20-way      40-way      80-way
----------------------------------------------------
Config 1      ~179K       ~159K       ~146K
Config 2      ~180K       ~134K       ~21K-43K  <- high variation!
----------------------------------------------------

(Note: Used numactl to restrict the workload to 2 sockets (20-way) and
4 sockets (40-way))

2) KVM run:

Single guest of different sizes (no overcommit, NUMA enabled in the
guest).

Note: This kernel intensive micro-benchmark exposes the PLE handler
issue, esp. for large guests. Since Raghu's PLE changes are not yet
upstream, I have just run with the current PLE handler and then with
PLE disabled (ple_gap=0).

Config 1: Host & Guest at 3.7
Config 2: Host & Guest at 3.7 + Rik's 1-4 patches

----------------------------------------------------------------
              20vcpu/128G      40vcpu/256G      80vcpu/512G
              (on 2 sockets)   (on 4 sockets)   (on 8 sockets)
----------------------------------------------------------------
Config 1      ~144K            ~39K             ~10K
----------------------------------------------------------------
Config 2      ~143K            ~37.5K           ~11K
----------------------------------------------------------------

Config 3: Host & Guest at 3.7 AND ple_gap=0
Config 4: Host & Guest at 3.7 + Rik's 1-4 patches AND ple_gap=0

----------------------------------------------------------------
Config 3      ~154K            ~131K            ~116K
----------------------------------------------------------------
Config 4      ~151K            ~130K            ~115K
----------------------------------------------------------------

(Note: Used numactl to restrict qemu to 2 sockets (20-way) and
4 sockets (40-way))
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Thu, 2013-01-10 at 10:31 -0500, Rik van Riel wrote:
> On 01/10/2013 10:19 AM, Mike Galbraith wrote:
>> On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
>>
>>> Please let me know if you manage to break this code in any way,
>>> so I can fix it...
>>
>> I didn't break it, but did let it play with rq->lock contention. Using
>> cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
>> pull around appears to have been a ~dead heat.
>
> Good to hear that the code seems to be robust. It seems to
> help prevent performance degradation in some workloads, and
> nobody seems to have found regressions yet.

I had hoped for a bit of positive, but a wash isn't surprising
given the profile. I tried tbench too, didn't expect to see
anything at all there, and got that.. so both results are
positive in that respect.

	-Mike
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
* Rafael Aquini [2013-01-10 00:27:23]:
> On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
>> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
>> x base_3.8rc2
>> + rik_backoff
>>     N        Min        Max     Median        Avg     Stddev
>> x   8    222.977     231.16    227.735    227.388  3.1512986
>> +   8     218.75    232.347   229.1035  228.25425  4.2730225
>> No difference proven at 95.0% confidence
>
> I got similar results on smaller systems (1 socket, dual-cores and
> quad-cores) when running Rik's latest series, no big difference for
> good nor for worse, but I also think Rik's work is meant to address
> bigger systems with more cores contending for any given spinlock.

I was able to do the test on the same 32 core machine with 4 guests
(8GB RAM, 32 vcpu). Here are the results.

base = 3.8-rc2
patched = base + Rik V3 backoff series [patch 1-4]

+------------+-----------+------------+-----------+-----------+
               kernbench (sec, lower is better)
+------------+-----------+------------+-----------+-----------+
        base       stdev      patched       stdev    %improve
+------------+-----------+------------+-----------+-----------+
     44.3000      2.0404      46.7928      1.7518    -5.62709
     94.8262      5.1444     102.4737      7.8406    -8.06475
    156.0540     14.5797     167.6888      9.7110    -7.45562
    202.3225     15.8906     213.1435     17.1778    -5.34839
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
               sysbench (sec, lower is better)
+------------+-----------+------------+-----------+-----------+
        base       stdev      patched       stdev    %improve
+------------+-----------+------------+-----------+-----------+
     16.8512      0.4164      17.7301      0.3761    -5.21565
     13.0411      0.4115      12.9380      0.1554     0.79058
     18.4573      0.2123      18.4662      0.2005    -0.04822
     24.2021      0.1713      24.3690      0.3270    -0.68961
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
               ebizzy (records/sec, higher is better)
+------------+-----------+------------+-----------+-----------+
        base       stdev      patched       stdev    %improve
+------------+-----------+------------+-----------+-----------+
   2494.4000     27.5447    2400.6000     83.4255    -3.76042
   2636.6000    302.9658    2757.5000    147.5137     4.58545
   2236.8333    239.6531    2131.6667    156.1534    -4.70158
   1768.8750    142.5437    1901.3750    295.2147     7.49064
+------------+-----------+------------+-----------+-----------+

+------------+-----------+------------+-----------+-----------+
               dbench (throughput in MB/sec, higher is better)
+------------+-----------+------------+-----------+-----------+
        base       stdev      patched       stdev    %improve
+------------+-----------+------------+-----------+-----------+
  10076.9180   2410.9655    5870.7460   4297.4532         xxx
   2152.5220     88.2853    1517.8270     61.9742   -29.48611
   1334.9608     34.3247    1078.4275     38.2288   -19.21654
    946.6355     32.0426     753.0757     25.5302   -20.44713
+------------+-----------+------------+-----------+-----------+

Please note that I have put the dbench 1x %improve as xxx, since I
observed very high variance in that result.
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 01/10/2013 10:19 AM, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
>
>> Please let me know if you manage to break this code in any way,
>> so I can fix it...
>
> I didn't break it, but did let it play with rq->lock contention. Using
> cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
> pull around appears to have been a ~dead heat.

Good to hear that the code seems to be robust. It seems to
help prevent performance degradation in some workloads, and
nobody seems to have found regressions yet.

Thank you for testing.
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Tue, 2013-01-08 at 17:26 -0500, Rik van Riel wrote:
> Please let me know if you manage to break this code in any way,
> so I can fix it...

I didn't break it, but did let it play with rq->lock contention. Using
cyclictest -Smp99 -i 100 -d 0, with 3 rt tasks for pull_rt_task() to
pull around appears to have been a ~dead heat.

               3.6.11                                        3.6.11-spinlock

   PerfTop:   78852 irqs/sec  kernel:96.4%  exact:  0.0% [1000Hz cycles],  (all, 80 CPUs)

     samples  pcnt  function                       samples  pcnt  function
   _________  ____  ____________________________ _________  ____  ____________________________

   468341.00  52.0% cpupri_set                   471786.00  52.0% cpupri_set
   110259.00  12.2% _raw_spin_lock                88963.00   9.8% ticket_spin_lock_wait
    78863.00   8.8% native_write_msr_safe         77109.00   8.5% native_write_msr_safe
    42882.00   4.8% __schedule                    48858.00   5.4% native_write_cr0
    40930.00   4.5% native_write_cr0              47038.00   5.2% __schedule
    13718.00   1.5% finish_task_switch            24775.00   2.7% _raw_spin_lock
    13188.00   1.5% plist_del                     13117.00   1.4% plist_del
    13078.00   1.5% _raw_spin_lock_irqsave        12372.00   1.4% ttwu_do_wakeup
    12083.00   1.3% ttwu_do_wakeup                11553.00   1.3% _raw_spin_lock_irqsave
     8359.00   0.9% pull_rt_task                   8186.00   0.9% pull_rt_task
     6979.00   0.8% apic_timer_interrupt           7989.00   0.9% finish_task_switch
     4623.00   0.5% __enqueue_rt_entity            6430.00   0.7% apic_timer_interrupt
     3961.00   0.4% resched_task                   4721.00   0.5% resched_task
     3942.00   0.4% __switch_to                    4109.00   0.5% __switch_to
     3128.00   0.3% _raw_spin_trylock              2917.00   0.3% rcu_idle_exit_common
     3081.00   0.3% __tick_nohz_idle_enter         2897.00   0.3% __local_bh_enable
     2561.00   0.3% update_curr_rt                 2873.00   0.3% _raw_spin_trylock
     2385.00   0.3% _raw_spin_lock_irq             2674.00   0.3% __enqueue_rt_entity
     2190.00   0.2% __local_bh_enable              2434.00   0.3% update_curr_rt
     1904.00   0.2% rcu_idle_exit_common           2161.00   0.2% hrtimer_interrupt
     1870.00   0.2% clockevents_program_event      2106.00   0.2% ktime_get_update_offsets
     1828.00   0.2% hrtimer_interrupt              1766.00   0.2% tick_nohz_idle_exit
     1741.00   0.2% do_nanosleep                   1608.00   0.2% __tick_nohz_idle_enter
     1681.00   0.2% sys_clock_nanosleep            1437.00   0.2% do_nanosleep
     1639.00   0.2% pick_next_task_rt              1428.00   0.2% hrtimer_init
     1630.00   0.2% pick_next_task_stop            1320.00   0.1% sched_clock_idle_sleep_event
     1535.00   0.2% _raw_spin_unlock_irqrestore    1290.00   0.1% sys_clock_nanosleep
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On Wed, Jan 09, 2013 at 06:20:35PM +0530, Raghavendra K T wrote:
> I ran kernbench on 32 core (mx3850) machine with 3.8-rc2 base.
> x base_3.8rc2
> + rik_backoff
>     N        Min        Max     Median        Avg     Stddev
> x   8    222.977     231.16    227.735    227.388  3.1512986
> +   8     218.75    232.347   229.1035  228.25425  4.2730225
> No difference proven at 95.0% confidence

I got similar results on smaller systems (1 socket, dual-cores and
quad-cores) when running Rik's latest series, no big difference for
good nor for worse, but I also think Rik's work is meant to address
bigger systems with more cores contending for any given spinlock.
Re: [PATCH 0/5] x86,smp: make ticket spinlock proportional backoff w/ auto tuning
On 01/09/2013 03:56 AM, Rik van Riel wrote:
> Many spinlocks are embedded in data structures; having many CPUs
> pounce on the cache line the lock is in will slow down the lock
> holder, and can cause system performance to fall off a cliff.
>
> The paper "Non-scalable locks are dangerous" is a good reference:
>
>   http://pdos.csail.mit.edu/papers/linux:lock.pdf
>
> In the Linux kernel, spinlocks are optimized for the case of
> there not being contention. After all, if there is contention,
> the data structure can be improved to reduce or eliminate
> lock contention.
>
> Likewise, the spinlock API should remain simple, and the
> common case of the lock not being contended should remain
> as fast as ever.
>
> However, since spinlock contention should be fairly uncommon,
> we can add functionality into the spinlock slow path that keeps
> system performance from falling off a cliff when there is lock
> contention.
>
> Proportional delay in ticket locks is delaying the time between
> checking the ticket based on a delay factor, and the number of
> CPUs ahead of us in the queue for this lock. Checking the lock
> less often allows the lock holder to continue running, resulting
> in better throughput and preventing performance from dropping
> off a cliff.
>
> The test case has a number of threads locking and unlocking a
> semaphore. With just one thread, everything sits in the CPU cache
> and throughput is around 2.6 million operations per second, with
> a 5-10% variation.
>
> Once a second thread gets involved, data structures bounce from
> CPU to CPU, and performance deteriorates to about 1.25 million
> operations per second, with a 5-10% variation.
>
> However, as more and more threads get added to the mix, performance
> with the vanilla kernel continues to deteriorate. Once I hit 24
> threads, on a 24 CPU, 4 node test system, performance is down to
> about 290k operations/second.
>
> With a proportional backoff delay added to the spinlock code,
> performance with 24 threads goes up to about 400k operations/second
> with a 50x delay, and about 900k operations/second with a 250x
> delay. However, with a 250x delay, performance with 2-5 threads is
> worse than with a 50x delay.
>
> Making the code auto-tune the delay factor results in a system that
> performs well with both light and heavy lock contention, and should
> also protect against the (likely) case of the fixed delay factor
> being wrong for other hardware.
>
> The attached graph shows the performance of the multi threaded
> semaphore lock/unlock test case, with 1-24 threads, on the vanilla
> kernel, with 10x, 50x, and 250x proportional delay, as well as the
> v1 patch series with autotuning for 2x and 2.7x spinning before the
> lock is obtained, and with the v2 series.
>
> The v2 series integrates several ideas from Michel Lespinasse and
> Eric Dumazet, which should result in better throughput and nicer
> behaviour in situations with contention on multiple locks.
>
> For the v3 series, I tried out all the ideas suggested by Michel.
> They made perfect sense, but in the end it turned out they did not
> work as well as the simple, aggressive "try to make the delay longer"
> policy I have now. Several small bug fixes and cleanups have been
> integrated.
>
> Performance is within the margin of error of v2, so the graph has
> not been updated.
>
> Please let me know if you manage to break this code in any way,
> so I can fix it...

The patch series no longer shows the weird behaviour caused by the
underflow (pointed out by Michael) and looks fine.

I ran kernbench on a 32 core (mx3850) machine with a 3.8-rc2 base.

x base_3.8rc2
+ rik_backoff
    N        Min        Max     Median        Avg     Stddev
x   8    222.977     231.16    227.735    227.388  3.1512986
+   8     218.75    232.347   229.1035  228.25425  4.2730225
No difference proven at 95.0% confidence

The run did not show much difference, but I believe a spinlock stress
test would have shown the benefit. I'll start running benchmarks now
on KVM guests and come back with a report.
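For readers unfamiliar with the proportional backoff idea described in
the cover letter quoted above, here is a minimal sketch of the concept.
It is not Rik's actual patch: the lock type is simplified, the helper
name sketch_ticket_spin_wait is invented, and it uses a fixed delay
factor, whereas the real series auto-tunes the per-cpu delay.

	/*
	 * Illustrative sketch of proportional ticket-spinlock backoff,
	 * in the spirit of the cover letter above.  NOT the actual patch.
	 */

	#include <linux/compiler.h>	/* ACCESS_ONCE() */
	#include <asm/processor.h>	/* cpu_relax() */

	struct sketch_ticket_lock {
		unsigned short head;	/* ticket currently being served */
		unsigned short tail;	/* next ticket to hand out */
	};

	#define SPIN_DELAY_FACTOR	50	/* fixed "50x" delay, illustration only */

	/* Called after we grabbed ticket 'my_ticket' and found the lock busy. */
	static void sketch_ticket_spin_wait(struct sketch_ticket_lock *lock,
					    unsigned short my_ticket)
	{
		unsigned short head;

		while ((head = ACCESS_ONCE(lock->head)) != my_ticket) {
			/*
			 * Each waiter ahead of us holds the lock for roughly one
			 * critical section.  Delay proportionally to the number
			 * of waiters ahead, so the lock cache line is re-read
			 * about once per expected handover instead of being
			 * bounced continuously between CPUs.
			 */
			unsigned int loops =
				(unsigned short)(my_ticket - head) * SPIN_DELAY_FACTOR;

			while (loops--)
				cpu_relax();
		}
	}

The auto-tuning in the actual series exists precisely because a fixed
factor like the 50x used here is too small for some machines and
workloads and too large for others, as the 50x vs. 250x numbers in the
cover letter show.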