Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On Tue, 16 Jun 2015, Juergen Gross wrote:
> AFAIK there are no outstanding questions for more than one month now.
> I'd appreciate some feedback or acceptance of these patches.

They are against dead code, which will be gone soon. We switched over to
queued locks.

Thanks,

	tglx
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
AFAIK there are no outstanding questions for more than one month now.
I'd appreciate some feedback or acceptance of these patches.


Juergen

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking and
> unlocking functions. Especially unlocking is no longer just one
> instruction, but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal, unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, for 32 and
> 64 bits. Functional testing done on bare metal and as Xen dom0. Correct
> patching verified by disassembly of the active kernel.
>
> Juergen Gross (6):
>   x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>   x86: move decision about clearing slowpath flag into arch_spin_lock()
>   x86: introduce new pvops function clear_slowpath
>   x86: introduce new pvops function spin_unlock
>   x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>   x86: remove no longer needed paravirt_ticketlocks_enabled
>
>  arch/x86/Kconfig                      |  1 -
>  arch/x86/include/asm/paravirt.h       | 13 +
>  arch/x86/include/asm/paravirt_types.h | 12
>  arch/x86/include/asm/spinlock.h       | 53 ---
>  arch/x86/include/asm/spinlock_types.h |  3 +-
>  arch/x86/kernel/kvm.c                 | 14 +
>  arch/x86/kernel/paravirt-spinlocks.c  | 42 +--
>  arch/x86/kernel/paravirt.c            | 12
>  arch/x86/kernel/paravirt_patch_32.c   | 25 +
>  arch/x86/kernel/paravirt_patch_64.c   | 24
>  arch/x86/xen/spinlock.c               | 23 +--
>  include/linux/spinlock_api_smp.h      |  2 +-
>  kernel/Kconfig.locks                  |  7 +++--
>  kernel/Kconfig.preempt                |  3 +-
>  kernel/locking/spinlock.c             |  2 +-
>  lib/Kconfig.debug                     |  1 -
>  16 files changed, 154 insertions(+), 83 deletions(-)
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
Ping? Anything missing from my side?

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking and
> unlocking functions. Especially unlocking is no longer just one
> instruction, but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal, unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, for 32 and
> 64 bits. Functional testing done on bare metal and as Xen dom0. Correct
> patching verified by disassembly of the active kernel.
>
> Juergen Gross (6):
>   x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>   x86: move decision about clearing slowpath flag into arch_spin_lock()
>   x86: introduce new pvops function clear_slowpath
>   x86: introduce new pvops function spin_unlock
>   x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>   x86: remove no longer needed paravirt_ticketlocks_enabled
>
>  arch/x86/Kconfig                      |  1 -
>  arch/x86/include/asm/paravirt.h       | 13 +
>  arch/x86/include/asm/paravirt_types.h | 12
>  arch/x86/include/asm/spinlock.h       | 53 ---
>  arch/x86/include/asm/spinlock_types.h |  3 +-
>  arch/x86/kernel/kvm.c                 | 14 +
>  arch/x86/kernel/paravirt-spinlocks.c  | 42 +--
>  arch/x86/kernel/paravirt.c            | 12
>  arch/x86/kernel/paravirt_patch_32.c   | 25 +
>  arch/x86/kernel/paravirt_patch_64.c   | 24
>  arch/x86/xen/spinlock.c               | 23 +--
>  include/linux/spinlock_api_smp.h      |  2 +-
>  kernel/Kconfig.locks                  |  7 +++--
>  kernel/Kconfig.preempt                |  3 +-
>  kernel/locking/spinlock.c             |  2 +-
>  lib/Kconfig.debug                     |  1 -
>  16 files changed, 154 insertions(+), 83 deletions(-)
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 05/17/2015 07:30 AM, Ingo Molnar wrote:
> * Juergen Gross wrote:
>
>> On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
>>> On 05/03/2015 10:55 PM, Juergen Gross wrote:
>>>> I did a small measurement of the pure locking functions on bare metal
>>>> without and with my patches.
>>>>
>>>> spin_lock() for the first time (lock and code not in cache) dropped
>>>> from about 600 to 500 cycles.
>>>>
>>>> spin_unlock() for the first time dropped from 145 to 87 cycles.
>>>>
>>>> spin_lock() in a loop dropped from 48 to 45 cycles.
>>>>
>>>> spin_unlock() in the same loop dropped from 24 to 22 cycles.
>>>
>>> Did you isolate icache hot/cold from dcache hot/cold? It seems to me
>>> the main difference will be whether the branch predictor is warmed up
>>> rather than whether the lock itself is in dcache, but it's much more
>>> likely that the lock code is in icache if the code is lock intensive,
>>> making the cold case moot. But that's pure speculation.
>>>
>>> Could you see any differences in workloads beyond microbenchmarks?
>>>
>>> Not that it's my call at all, but I think we'd need to see some
>>> concrete improvements in real workloads before adding the complexity
>>> of more pvops.
>>
>> I did another test on a larger machine:
>>
>> 25 kernel builds (time make -j 32) on a 32 core machine. Before each
>> build "make clean" was called; the first result after boot was omitted
>> to avoid disk cache warmup effects.
>>
>> System time without my patches: 861.5664 +/- 3.3665 s
>>             with my patches:    852.2269 +/- 3.6629 s
>
> So how does the profile look in the guest, before/after the PV spinlock
> patches? I'm a bit surprised to see so much spinlock overhead.

I did another test in Xen dom0:

System time without my patches: 2903 +/- 2 s
            with my patches:    2904 +/- 2 s

BTW, this was what I expected: there should be no significant change in
system time, as the only real difference between the two variants in a
guest is an additional 2-byte nop in the inlined unlock function call,
another one in the lock call, and one jmp instruction less in the lock
call.

What I didn't expect was the huge performance difference between native
and guest. The configuration used (32 cores with hyperthreads enabled)
surely is one reason for the difference, but this still seems to be too
much. I double-checked the results on bare metal; they are still more or
less the same (I did only one kernel build, resulting in 862 seconds of
system time). There seems to be a lot of room for improvement, but that
is another story.

Regarding spinlock overhead: I think the reason I saw about 1% less
system time with my patches was mainly due to fewer cache misses.
Inlining the unlock function avoided an additional instruction cache
miss for the unlock function.

KT Raghavendra did some benchmarks with only small user programs and
high kernel load which showed nearly no effect at all.

Additionally I've compared the two kernels using bloat-o-meter:

add/remove: 11/13 grow/shrink: 654/603 up/down: 6046/-31754 (-25708)

with some hot path functions going down in size quite nicely, e.g.:

__raw_spin_unlock_irq    336    90   -246


Juergen
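[The extra 2-byte nop mentioned above is a by-product of paravirt call
site patching. The following is a rough sketch of the mechanism only:
the "unlock" slot name, the patch_unlock_site() helper and the chosen
native instruction are illustrative assumptions, not code from the
series. DEF_NATIVE() and paravirt_patch_insns() are the existing x86
patching primitives.]

#include <asm/paravirt_types.h>

/*
 * Hypothetical native replacement for an "unlock" pvops slot; the real
 * instruction depends on ticket size and increment. DEF_NATIVE() emits
 * the byte sequence between start_/end_ symbols.
 */
DEF_NATIVE(pv_lock_ops, unlock, "addb $1, (%rdi)");

/*
 * At boot the patcher rewrites every inlined call site: on bare metal
 * the native bytes above replace the indirect call, and any room left
 * over in the site is filled with nops by the patching core. In a
 * guest the site keeps its (shorter) call, which is where the extra
 * 2-byte nop comes from.
 */
static unsigned patch_unlock_site(void *insnbuf, unsigned len)
{
	return paravirt_patch_insns(insnbuf, len,
				    start_pv_lock_ops_unlock,
				    end_pv_lock_ops_unlock);
}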
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
* Juergen Gross wrote:

> On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
> >On 05/03/2015 10:55 PM, Juergen Gross wrote:
> >>I did a small measurement of the pure locking functions on bare metal
> >>without and with my patches.
> >>
> >>spin_lock() for the first time (lock and code not in cache) dropped from
> >>about 600 to 500 cycles.
> >>
> >>spin_unlock() for the first time dropped from 145 to 87 cycles.
> >>
> >>spin_lock() in a loop dropped from 48 to 45 cycles.
> >>
> >>spin_unlock() in the same loop dropped from 24 to 22 cycles.
> >
> >Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
> >main difference will be whether the branch predictor is warmed up rather
> >than whether the lock itself is in dcache, but it's much more likely that
> >the lock code is in icache if the code is lock intensive, making the cold
> >case moot. But that's pure speculation.
> >
> >Could you see any differences in workloads beyond microbenchmarks?
> >
> >Not that it's my call at all, but I think we'd need to see some concrete
> >improvements in real workloads before adding the complexity of more pvops.
>
> I did another test on a larger machine:
>
> 25 kernel builds (time make -j 32) on a 32 core machine. Before each
> build "make clean" was called; the first result after boot was omitted
> to avoid disk cache warmup effects.
>
> System time without my patches: 861.5664 +/- 3.3665 s
>             with my patches:    852.2269 +/- 3.6629 s

So how does the profile look in the guest, before/after the PV spinlock
patches? I'm a bit surprised to see so much spinlock overhead.

Thanks,

	Ingo
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
Ping?

On 04/30/2015 12:53 PM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking and
> unlocking functions. Especially unlocking is no longer just one
> instruction, but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal, unlocking is again basically one instruction.
>
> Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, for 32 and
> 64 bits. Functional testing done on bare metal and as Xen dom0. Correct
> patching verified by disassembly of the active kernel.
>
> Juergen Gross (6):
>   x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
>   x86: move decision about clearing slowpath flag into arch_spin_lock()
>   x86: introduce new pvops function clear_slowpath
>   x86: introduce new pvops function spin_unlock
>   x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
>   x86: remove no longer needed paravirt_ticketlocks_enabled
>
>  arch/x86/Kconfig                      |  1 -
>  arch/x86/include/asm/paravirt.h       | 13 +
>  arch/x86/include/asm/paravirt_types.h | 12
>  arch/x86/include/asm/spinlock.h       | 53 ---
>  arch/x86/include/asm/spinlock_types.h |  3 +-
>  arch/x86/kernel/kvm.c                 | 14 +
>  arch/x86/kernel/paravirt-spinlocks.c  | 42 +--
>  arch/x86/kernel/paravirt.c            | 12
>  arch/x86/kernel/paravirt_patch_32.c   | 25 +
>  arch/x86/kernel/paravirt_patch_64.c   | 24
>  arch/x86/xen/spinlock.c               | 23 +--
>  include/linux/spinlock_api_smp.h      |  2 +-
>  kernel/Kconfig.locks                  |  7 +++--
>  kernel/Kconfig.preempt                |  3 +-
>  kernel/locking/spinlock.c             |  2 +-
>  lib/Kconfig.debug                     |  1 -
>  16 files changed, 154 insertions(+), 83 deletions(-)
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 05/05/2015 07:21 PM, Jeremy Fitzhardinge wrote:
> On 05/03/2015 10:55 PM, Juergen Gross wrote:
>> I did a small measurement of the pure locking functions on bare metal
>> without and with my patches.
>>
>> spin_lock() for the first time (lock and code not in cache) dropped
>> from about 600 to 500 cycles.
>>
>> spin_unlock() for the first time dropped from 145 to 87 cycles.
>>
>> spin_lock() in a loop dropped from 48 to 45 cycles.
>>
>> spin_unlock() in the same loop dropped from 24 to 22 cycles.
>
> Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
> main difference will be whether the branch predictor is warmed up rather
> than whether the lock itself is in dcache, but it's much more likely that
> the lock code is in icache if the code is lock intensive, making the cold
> case moot. But that's pure speculation.
>
> Could you see any differences in workloads beyond microbenchmarks?
>
> Not that it's my call at all, but I think we'd need to see some concrete
> improvements in real workloads before adding the complexity of more pvops.

I did another test on a larger machine:

25 kernel builds (time make -j 32) on a 32 core machine. Before each
build "make clean" was called; the first result after boot was omitted
to avoid disk cache warmup effects.

System time without my patches: 861.5664 +/- 3.3665 s
            with my patches:    852.2269 +/- 3.6629 s


Juergen
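[For readers wondering how figures like "861.5664 +/- 3.3665 s" are
derived from 25 runs, here is a small userspace sketch computing the
sample mean and the standard error of the mean; whether exactly this
estimator was used for the numbers above is an assumption, and the
input values are illustrative only.]

#include <math.h>
#include <stdio.h>

/* Print "mean +/- standard error of the mean" for n samples. */
static void mean_pm(const double *x, int n)
{
	double sum = 0.0, var = 0.0, mean;
	int i;

	for (i = 0; i < n; i++)
		sum += x[i];
	mean = sum / n;

	for (i = 0; i < n; i++)
		var += (x[i] - mean) * (x[i] - mean);
	var /= n - 1;			/* unbiased sample variance */

	printf("%.4f +/- %.4f s\n", mean, sqrt(var / n));
}

int main(void)
{
	/* Illustrative inputs only, not the measured build times. */
	double secs[] = { 858.1, 864.0, 860.5, 862.9, 855.3 };

	mean_pm(secs, (int)(sizeof(secs) / sizeof(secs[0])));
	return 0;
}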
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 05/03/2015 10:55 PM, Juergen Gross wrote:
> I did a small measurement of the pure locking functions on bare metal
> without and with my patches.
>
> spin_lock() for the first time (lock and code not in cache) dropped from
> about 600 to 500 cycles.
>
> spin_unlock() for the first time dropped from 145 to 87 cycles.
>
> spin_lock() in a loop dropped from 48 to 45 cycles.
>
> spin_unlock() in the same loop dropped from 24 to 22 cycles.

Did you isolate icache hot/cold from dcache hot/cold? It seems to me the
main difference will be whether the branch predictor is warmed up rather
than whether the lock itself is in dcache, but it's much more likely that
the lock code is in icache if the code is lock intensive, making the cold
case moot. But that's pure speculation.

Could you see any differences in workloads beyond microbenchmarks?

Not that it's my call at all, but I think we'd need to see some concrete
improvements in real workloads before adding the complexity of more pvops.

    J
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 04/30/2015 06:39 PM, Jeremy Fitzhardinge wrote:
> On 04/30/2015 03:53 AM, Juergen Gross wrote:
>> Paravirtualized spinlocks produce some overhead even if the kernel is
>> running on bare metal. The main reason is the more complex locking and
>> unlocking functions. Especially unlocking is no longer just one
>> instruction, but so complex that it is no longer inlined.
>>
>> This patch series addresses this issue by adding two more pvops
>> functions to reduce the size of the inlined spinlock functions. When
>> running on bare metal, unlocking is again basically one instruction.
>
> Out of curiosity, is there a measurable difference?

I did a small measurement of the pure locking functions on bare metal
without and with my patches.

spin_lock() for the first time (lock and code not in cache) dropped from
about 600 to 500 cycles.

spin_unlock() for the first time dropped from 145 to 87 cycles.

spin_lock() in a loop dropped from 48 to 45 cycles.

spin_unlock() in the same loop dropped from 24 to 22 cycles.


Juergen
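[As a rough idea of how such cycle numbers can be obtained, here is a
minimal kernel-module sketch built on get_cycles(). It is not the
actual test harness used for the numbers above; a serious measurement
would additionally pin the task to one CPU, disable interrupts, and use
serializing timestamp reads.]

#include <linux/module.h>
#include <linux/spinlock.h>
#include <asm/timex.h>		/* get_cycles() */

#define BENCH_ITERS	1000000UL

static DEFINE_SPINLOCK(bench_lock);

static int __init lockbench_init(void)
{
	cycles_t t0, t1;
	unsigned long i;

	/* One warm-up pass so the loop measures the hot (cached) case. */
	spin_lock(&bench_lock);
	spin_unlock(&bench_lock);

	t0 = get_cycles();
	for (i = 0; i < BENCH_ITERS; i++) {
		spin_lock(&bench_lock);
		spin_unlock(&bench_lock);
	}
	t1 = get_cycles();

	pr_info("lock+unlock pair: ~%llu cycles\n",
		(unsigned long long)(t1 - t0) / BENCH_ITERS);
	return 0;
}

static void __exit lockbench_exit(void)
{
}

module_init(lockbench_init);
module_exit(lockbench_exit);
MODULE_LICENSE("GPL");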
[PATCH 0/6] x86: reduce paravirtualized spinlock overhead
Paravirtualized spinlocks produce some overhead even if the kernel is
running on bare metal. The main reason is the more complex locking and
unlocking functions. Especially unlocking is no longer just one
instruction, but so complex that it is no longer inlined.

This patch series addresses this issue by adding two more pvops
functions to reduce the size of the inlined spinlock functions. When
running on bare metal, unlocking is again basically one instruction.

Compile tested with CONFIG_PARAVIRT_SPINLOCKS on and off, for 32 and
64 bits. Functional testing done on bare metal and as Xen dom0. Correct
patching verified by disassembly of the active kernel.

Juergen Gross (6):
  x86: use macro instead of "0" for setting TICKET_SLOWPATH_FLAG
  x86: move decision about clearing slowpath flag into arch_spin_lock()
  x86: introduce new pvops function clear_slowpath
  x86: introduce new pvops function spin_unlock
  x86: switch config from UNINLINE_SPIN_UNLOCK to INLINE_SPIN_UNLOCK
  x86: remove no longer needed paravirt_ticketlocks_enabled

 arch/x86/Kconfig                      |  1 -
 arch/x86/include/asm/paravirt.h       | 13 +
 arch/x86/include/asm/paravirt_types.h | 12
 arch/x86/include/asm/spinlock.h       | 53 ---
 arch/x86/include/asm/spinlock_types.h |  3 +-
 arch/x86/kernel/kvm.c                 | 14 +
 arch/x86/kernel/paravirt-spinlocks.c  | 42 +--
 arch/x86/kernel/paravirt.c            | 12
 arch/x86/kernel/paravirt_patch_32.c   | 25 +
 arch/x86/kernel/paravirt_patch_64.c   | 24
 arch/x86/xen/spinlock.c               | 23 +--
 include/linux/spinlock_api_smp.h      |  2 +-
 kernel/Kconfig.locks                  |  7 +++--
 kernel/Kconfig.preempt                |  3 +-
 kernel/locking/spinlock.c             |  2 +-
 lib/Kconfig.debug                     |  1 -
 16 files changed, 154 insertions(+), 83 deletions(-)

-- 
2.1.4
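[A minimal sketch of the intended end state of the unlock path may help;
this is illustrative only, not the literal code from the series: the
pv_lock_ops.unlock slot and the __ticket_unlock_native() helper are
assumed names. PVOP_VCALL1(), __add(), TICKET_LOCK_INC and
UNLOCK_LOCK_PREFIX are the existing x86 primitives.]

static __always_inline void __ticket_unlock_native(arch_spinlock_t *lock)
{
	/* Native case: releasing the lock is a single add on the
	 * ticket head. */
	__add(&lock->tickets.head, TICKET_LOCK_INC, UNLOCK_LOCK_PREFIX);
}

static __always_inline void arch_spin_unlock(arch_spinlock_t *lock)
{
	/*
	 * One patchable pvops call site: in a guest this remains a
	 * call into the hypervisor-aware unlock (which may kick a
	 * waiter on the slowpath), while on bare metal the paravirt
	 * patcher overwrites it with the native sequence above, so
	 * the inlined fast path stays tiny in both cases.
	 */
	PVOP_VCALL1(pv_lock_ops.unlock, lock);
}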
Re: [PATCH 0/6] x86: reduce paravirtualized spinlock overhead
On 04/30/2015 03:53 AM, Juergen Gross wrote:
> Paravirtualized spinlocks produce some overhead even if the kernel is
> running on bare metal. The main reason is the more complex locking and
> unlocking functions. Especially unlocking is no longer just one
> instruction, but so complex that it is no longer inlined.
>
> This patch series addresses this issue by adding two more pvops
> functions to reduce the size of the inlined spinlock functions. When
> running on bare metal, unlocking is again basically one instruction.

Out of curiosity, is there a measurable difference?

    J