Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line
On 9/14/23 03:32, Leonardo Bras wrote: On Tue, Sep 12, 2023 at 09:08:34AM +0800, Guo Ren wrote: On Mon, Sep 11, 2023 at 11:34 PM Waiman Long wrote: On 9/10/23 04:29, guo...@kernel.org wrote: From: Guo Ren Allow cmdline to force the kernel to use queued_spinlock when CONFIG_RISCV_COMBO_SPINLOCKS=y. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ arch/riscv/kernel/setup.c | 16 +++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7dfb540c4f6c..61cacb8dfd0e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4693,6 +4693,8 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + qspinlock [RISCV] Force to use qspinlock or auto-detect spinlock. + qspinlock.numa_spinlock_threshold_ns= [NUMA, PV_OPS] Set the time threshold in nanoseconds for the number of intra-node lock hand-offs before the diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c index a447cf360a18..0f084f037651 100644 --- a/arch/riscv/kernel/setup.c +++ b/arch/riscv/kernel/setup.c @@ -270,6 +270,15 @@ static void __init parse_dtb(void) } #ifdef CONFIG_RISCV_COMBO_SPINLOCKS +bool enable_qspinlock_key = false; You can use __ro_after_init qualifier for enable_qspinlock_key. BTW, this is not a static key, just a simple flag. So what is the point of the _key suffix? Okay, I would change it to: bool enable_qspinlock_flag __ro_after_init = false; IIUC, this bool / flag is used in a single file, so it makes sense for it to be static. Being static means it does not need to be initialized to false, as it's standard to zero-fill this areas. Also, since it's a bool, it does not need to be called _flag. I would go with: static bool enable_qspinlock __ro_after_init; I actually was thinking about the same suggestion to add static. Then I realized that the flag was also used in another file in a later patch. Of course, if it turns out that this flag is no longer needed outside of this file, it should be static. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V10 07/19] riscv: qspinlock: errata: Introduce ERRATA_THEAD_QSPINLOCK
On 9/13/23 14:54, Palmer Dabbelt wrote: On Sun, 06 Aug 2023 22:23:34 PDT (-0700), sor...@fastmail.com wrote: On Wed, Aug 2, 2023, at 12:46 PM, guo...@kernel.org wrote: From: Guo Ren According to qspinlock requirements, RISC-V gives out a weak LR/SC forward progress guarantee which does not satisfy qspinlock. But many vendors could produce stronger forward guarantee LR/SC to ensure the xchg_tail could be finished in time on any kind of hart. T-HEAD is the vendor which implements strong forward guarantee LR/SC instruction pairs, so enable qspinlock for T-HEAD with errata help. T-HEAD early version of processors has the merge buffer delay problem, so we need ERRATA_WRITEONCE to support qspinlock. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- arch/riscv/Kconfig.errata | 13 + arch/riscv/errata/thead/errata.c | 24 arch/riscv/include/asm/errata_list.h | 20 arch/riscv/include/asm/vendorid_list.h | 3 ++- arch/riscv/kernel/cpufeature.c | 3 ++- 5 files changed, 61 insertions(+), 2 deletions(-) diff --git a/arch/riscv/Kconfig.errata b/arch/riscv/Kconfig.errata index 4745a5c57e7c..eb43677b13cc 100644 --- a/arch/riscv/Kconfig.errata +++ b/arch/riscv/Kconfig.errata @@ -96,4 +96,17 @@ config ERRATA_THEAD_WRITE_ONCE If you don't know what to do here, say "Y". +config ERRATA_THEAD_QSPINLOCK + bool "Apply T-Head queued spinlock errata" + depends on ERRATA_THEAD + default y + help + The T-HEAD C9xx processors implement strong fwd guarantee LR/SC to + match the xchg_tail requirement of qspinlock. + + This will apply the QSPINLOCK errata to handle the non-standard + behavior via using qspinlock instead of ticket_lock. + + If you don't know what to do here, say "Y". If this is to be applied, I would like to see a detailed explanation somewhere, preferably with citations, of: (a) The memory model requirements for qspinlock The part of qspinlock that causes problem with many RISC architectures is its use of a 16-bit xchg() function call which many RISC architectures cannot do it natively and have to be emulated with hopefully some forward progress guarantee. Except that one call, the other atomic operations are all 32 bit in size. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k
On 9/13/23 08:52, Guo Ren wrote: On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras wrote: On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote: On Mon, Sep 11, 2023 at 9:03 PM Waiman Long wrote: On 9/10/23 23:09, Guo Ren wrote: On Mon, Sep 11, 2023 at 10:35 AM Waiman Long wrote: On 9/10/23 04:28, guo...@kernel.org wrote: From: Guo Ren The target of xchg_tail is to write the tail to the lock value, so adding prefetchw could help the next cmpxchg step, which may decrease the cmpxchg retry loops of xchg_tail. Some processors may utilize this feature to give a forward guarantee, e.g., RISC-V XuanTie processors would block the snoop channel & irq for several cycles when prefetch.w instruction (from Zicbop extension) retired, which guarantees the next cmpxchg succeeds. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- kernel/locking/qspinlock.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index d3f99060b60f..96b54e2ade86 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock) */ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) { - u32 old, new, val = atomic_read(>val); + u32 old, new, val; + + prefetchw(>val); + val = atomic_read(>val); for (;;) { new = (val & _Q_LOCKED_PENDING_MASK) | tail; That looks a bit weird. You pre-fetch and then immediately read it. How much performance gain you get by this change alone? Maybe you can define an arch specific primitive that default back to atomic_read() if not defined. Thx for the reply. This is a generic optimization point I would like to talk about with you. First, prefetchw() makes cacheline an exclusive state and serves for the next cmpxchg loop semantic, which writes the idx_tail part of arch_spin_lock. The atomic_read only makes cacheline in the shared state, which couldn't give any guarantee for the next cmpxchg loop semantic. Micro-architecture could utilize prefetchw() to provide a strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD XuanTie processor would hold the exclusive cacheline state until the next cmpxchg write success. In the end, Let's go back to the principle: the xchg_tail is an atomic swap operation that contains write eventually, so giving a prefetchw() at the beginning is acceptable for all architectures.. I did realize afterward that prefetchw gets the cacheline in exclusive state. I will suggest you mention that in your commit log as well as adding a comment about its purpose in the code. Okay, I would do that in v12, thx. I would suggest adding a snippet from the ISA Extenstion doc: "A prefetch.w instruction indicates to hardware that the cache block whose effective address is the sum of the base address specified in rs1 and the sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b0, is likely to be accessed by a data write (i.e. store) in the near future." Good point, thx. qspinlock is generic code. I suppose this is for the RISCV architecture. You can mention that in the commit log as an example, but I prefer more generic comment especially in the code. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line
On 9/10/23 04:29, guo...@kernel.org wrote: From: Guo Ren Allow cmdline to force the kernel to use queued_spinlock when CONFIG_RISCV_COMBO_SPINLOCKS=y. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ arch/riscv/kernel/setup.c | 16 +++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7dfb540c4f6c..61cacb8dfd0e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4693,6 +4693,8 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + qspinlock [RISCV] Force to use qspinlock or auto-detect spinlock. + qspinlock.numa_spinlock_threshold_ns= [NUMA, PV_OPS] Set the time threshold in nanoseconds for the number of intra-node lock hand-offs before the diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c index a447cf360a18..0f084f037651 100644 --- a/arch/riscv/kernel/setup.c +++ b/arch/riscv/kernel/setup.c @@ -270,6 +270,15 @@ static void __init parse_dtb(void) } #ifdef CONFIG_RISCV_COMBO_SPINLOCKS +bool enable_qspinlock_key = false; You can use __ro_after_init qualifier for enable_qspinlock_key. BTW, this is not a static key, just a simple flag. So what is the point of the _key suffix? Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line
On 9/10/23 04:29, guo...@kernel.org wrote: From: Guo Ren Allow cmdline to force the kernel to use queued_spinlock when CONFIG_RISCV_COMBO_SPINLOCKS=y. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ arch/riscv/kernel/setup.c | 16 +++- 2 files changed, 17 insertions(+), 1 deletion(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 7dfb540c4f6c..61cacb8dfd0e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4693,6 +4693,8 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + qspinlock [RISCV] Force to use qspinlock or auto-detect spinlock. + qspinlock.numa_spinlock_threshold_ns= [NUMA, PV_OPS] Set the time threshold in nanoseconds for the number of intra-node lock hand-offs before the Your patch series is still based on top of numa-aware qspinlock patchset which isn't upstream yet. Please rebase it without that as that will cause merge conflict during upstream merge. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k
On 9/10/23 23:09, Guo Ren wrote: On Mon, Sep 11, 2023 at 10:35 AM Waiman Long wrote: On 9/10/23 04:28, guo...@kernel.org wrote: From: Guo Ren The target of xchg_tail is to write the tail to the lock value, so adding prefetchw could help the next cmpxchg step, which may decrease the cmpxchg retry loops of xchg_tail. Some processors may utilize this feature to give a forward guarantee, e.g., RISC-V XuanTie processors would block the snoop channel & irq for several cycles when prefetch.w instruction (from Zicbop extension) retired, which guarantees the next cmpxchg succeeds. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- kernel/locking/qspinlock.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index d3f99060b60f..96b54e2ade86 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock) */ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) { - u32 old, new, val = atomic_read(>val); + u32 old, new, val; + + prefetchw(>val); + val = atomic_read(>val); for (;;) { new = (val & _Q_LOCKED_PENDING_MASK) | tail; That looks a bit weird. You pre-fetch and then immediately read it. How much performance gain you get by this change alone? Maybe you can define an arch specific primitive that default back to atomic_read() if not defined. Thx for the reply. This is a generic optimization point I would like to talk about with you. First, prefetchw() makes cacheline an exclusive state and serves for the next cmpxchg loop semantic, which writes the idx_tail part of arch_spin_lock. The atomic_read only makes cacheline in the shared state, which couldn't give any guarantee for the next cmpxchg loop semantic. Micro-architecture could utilize prefetchw() to provide a strong forward progress guarantee for the xchg_tail, e.g., the T-HEAD XuanTie processor would hold the exclusive cacheline state until the next cmpxchg write success. In the end, Let's go back to the principle: the xchg_tail is an atomic swap operation that contains write eventually, so giving a prefetchw() at the beginning is acceptable for all architectures.. I did realize afterward that prefetchw gets the cacheline in exclusive state. I will suggest you mention that in your commit log as well as adding a comment about its purpose in the code. Thanks, Longman Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k
On 9/10/23 04:28, guo...@kernel.org wrote: From: Guo Ren The target of xchg_tail is to write the tail to the lock value, so adding prefetchw could help the next cmpxchg step, which may decrease the cmpxchg retry loops of xchg_tail. Some processors may utilize this feature to give a forward guarantee, e.g., RISC-V XuanTie processors would block the snoop channel & irq for several cycles when prefetch.w instruction (from Zicbop extension) retired, which guarantees the next cmpxchg succeeds. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- kernel/locking/qspinlock.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index d3f99060b60f..96b54e2ade86 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock) */ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail) { - u32 old, new, val = atomic_read(>val); + u32 old, new, val; + + prefetchw(>val); + val = atomic_read(>val); for (;;) { new = (val & _Q_LOCKED_PENDING_MASK) | tail; That looks a bit weird. You pre-fetch and then immediately read it. How much performance gain you get by this change alone? Maybe you can define an arch specific primitive that default back to atomic_read() if not defined. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V10 18/19] locking/qspinlock: Move pv_ops into x86 directory
On 8/11/23 20:24, Guo Ren wrote: On Sat, Aug 12, 2023 at 4:42 AM Waiman Long wrote: On 8/2/23 12:47, guo...@kernel.org wrote: From: Guo Ren The pv_ops belongs to x86 custom infrastructure and cleans up the cna_configure_spin_lock_slowpath() with standard code. This is preparation for riscv support CNA qspoinlock. CNA qspinlock has not been merged into mainline yet. I will suggest you drop the last 2 patches for now. Of course, you can provide benchmark data to boost the case for the inclusion of the CNA qspinlock patch into the mainline. Yes, my lazy, I would separate paravirt and CNA from this series. paravirt is OK, it is just that CNA hasn't been merged yet. Cheers, Longman Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V10 18/19] locking/qspinlock: Move pv_ops into x86 directory
On 8/2/23 12:47, guo...@kernel.org wrote: From: Guo Ren The pv_ops belongs to x86 custom infrastructure and cleans up the cna_configure_spin_lock_slowpath() with standard code. This is preparation for riscv support CNA qspoinlock. CNA qspinlock has not been merged into mainline yet. I will suggest you drop the last 2 patches for now. Of course, you can provide benchmark data to boost the case for the inclusion of the CNA qspinlock patch into the mainline. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V10 05/19] riscv: qspinlock: Introduce combo spinlock
On 8/2/23 12:46, guo...@kernel.org wrote: From: Guo Ren Combo spinlock could support queued and ticket in one Linux Image and select them during boot time via errata mechanism. Here is the func size (Bytes) comparison table below: TYPE: COMBO | TICKET | QUEUED arch_spin_lock : 106 | 60 | 50 arch_spin_unlock: 54| 36 | 26 arch_spin_trylock : 110 | 72 | 54 arch_spin_is_locked : 48| 34 | 20 arch_spin_is_contended : 56| 40 | 24 rch_spin_value_unlocked : 48| 34 | 24 One example of disassemble combo arch_spin_unlock: 0x8000409c <+14>:nop# detour slot 0x800040a0 <+18>:fence rw,w # queued spinlock start 0x800040a4 <+22>:sb zero,0(a4) # queued spinlock end 0x800040a8 <+26>:ld s0,8(sp) 0x800040aa <+28>:addisp,sp,16 0x800040ac <+30>:ret 0x800040ae <+32>:lw a5,0(a4) # ticket spinlock start 0x800040b0 <+34>:sext.w a5,a5 0x800040b2 <+36>:fence rw,w 0x800040b6 <+40>:addiw a5,a5,1 0x800040b8 <+42>:sllia5,a5,0x30 0x800040ba <+44>:srlia5,a5,0x30 0x800040bc <+46>:sh a5,0(a4) # ticket spinlock end 0x800040c0 <+50>:ld s0,8(sp) 0x800040c2 <+52>:addisp,sp,16 0x800040c4 <+54>:ret The qspinlock is smaller and faster than ticket-lock when all are in fast-path, and combo spinlock could provide a compatible Linux Image for different micro-arch design (weak/strict fwd guarantee) processors. Signed-off-by: Guo Ren Signed-off-by: Guo Ren --- arch/riscv/Kconfig| 9 +++- arch/riscv/include/asm/hwcap.h| 1 + arch/riscv/include/asm/spinlock.h | 87 ++- arch/riscv/kernel/cpufeature.c| 10 4 files changed, 104 insertions(+), 3 deletions(-) diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig index e89a3bea3dc1..119e774a3dcf 100644 --- a/arch/riscv/Kconfig +++ b/arch/riscv/Kconfig @@ -440,7 +440,7 @@ config NODES_SHIFT choice prompt "RISC-V spinlock type" - default RISCV_TICKET_SPINLOCKS + default RISCV_COMBO_SPINLOCKS config RISCV_TICKET_SPINLOCKS bool "Using ticket spinlock" @@ -452,6 +452,13 @@ config RISCV_QUEUED_SPINLOCKS help Make sure your micro arch LL/SC has a strong forward progress guarantee. Otherwise, stay at ticket-lock. + +config RISCV_COMBO_SPINLOCKS + bool "Using combo spinlock" + depends on SMP && MMU + select ARCH_USE_QUEUED_SPINLOCKS + help + Select queued spinlock or ticket-lock via errata. endchoice config RISCV_ALTERNATIVE diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h index f041bfa7f6a0..08ae75a694c2 100644 --- a/arch/riscv/include/asm/hwcap.h +++ b/arch/riscv/include/asm/hwcap.h @@ -54,6 +54,7 @@ #define RISCV_ISA_EXT_ZIFENCEI41 #define RISCV_ISA_EXT_ZIHPM 42 +#define RISCV_ISA_EXT_XTICKETLOCK 63 #define RISCV_ISA_EXT_MAX 64 #define RISCV_ISA_EXT_NAME_LEN_MAX32 diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h index c644a92d4548..9eb3ad31e564 100644 --- a/arch/riscv/include/asm/spinlock.h +++ b/arch/riscv/include/asm/spinlock.h @@ -7,11 +7,94 @@ #define _Q_PENDING_LOOPS (1 << 9) #endif I see why you separated the _Q_PENDING_LOOPS out. +#ifdef CONFIG_RISCV_COMBO_SPINLOCKS +#include + +#undef arch_spin_is_locked +#undef arch_spin_is_contended +#undef arch_spin_value_unlocked +#undef arch_spin_lock +#undef arch_spin_trylock +#undef arch_spin_unlock + +#include +#include + +#undef arch_spin_is_locked +#undef arch_spin_is_contended +#undef arch_spin_value_unlocked +#undef arch_spin_lock +#undef arch_spin_trylock +#undef arch_spin_unlock Perhaps you can add a macro like __no_arch_spinlock_redefine to disable the various arch_spin_* definition in qspinlock.h and ticket_spinlock.h. + +#define COMBO_DETOUR \ + asm_volatile_goto(ALTERNATIVE( \ + "nop",\ + "j %l[ticket_spin_lock]", \ + 0, \ + RISCV_ISA_EXT_XTICKETLOCK, \ + CONFIG_RISCV_COMBO_SPINLOCKS) \ + : : : : ticket_spin_lock); + +static __always_inline void arch_spin_lock(arch_spinlock_t *lock) +{ + COMBO_DETOUR + queued_spin_lock(lock); + return; +ticket_spin_lock: + ticket_spin_lock(lock); +} + +static __always_inline bool arch_spin_trylock(arch_spinlock_t *lock) +{ + COMBO_DETOUR + return queued_spin_trylock(lock); +ticket_spin_lock: + return ticket_spin_trylock(lock); +} + +static __always_inline
Re: [PATCH V10 04/19] riscv: qspinlock: Add basic queued_spinlock support
On 8/2/23 12:46, guo...@kernel.org wrote: \ diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h new file mode 100644 index ..c644a92d4548 --- /dev/null +++ b/arch/riscv/include/asm/spinlock.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __ASM_RISCV_SPINLOCK_H +#define __ASM_RISCV_SPINLOCK_H + +#ifdef CONFIG_QUEUED_SPINLOCKS +#define _Q_PENDING_LOOPS (1 << 9) +#endif + +#ifdef CONFIG_QUEUED_SPINLOCKS You can merge the two "#ifdef CONFIG_QUEUED_SPINLOCKS" into single one to avoid the duplication. Cheers, Longman +#include +#include +#else +#include +#endif + +#endif /* __ASM_RISCV_SPINLOCK_H */ ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v7] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
0.0 301279.3760.8 File Copy 256 bufsize 500 maxblocks1655.0 79291.3479.1 File Copy 4096 bufsize 8000 maxblocks 5800.01039755.2 1792.7 Pipe Throughput 12440.0 118701468.1 95419.2 Pipe-based Context Switching 4000.08073453.3 20183.6 Process Creation126.0 33440.9 2654.0 Shell Scripts (1 concurrent) 42.4 52722.6 12434.6 Shell Scripts (8 concurrent) 6.0 7050.4 11750.6 System Call Overhead 15000.06834371.5 4556.2 System Benchmarks Index Score8157.8 Signed-off-by: Guo Hui --- arch/x86/kernel/paravirt-spinlocks.c | 4 kernel/locking/osq_lock.c| 19 ++- 2 files changed, 22 insertions(+), 1 deletion(-) diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 9e1ea99ad..a2eb375e2 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void) __raw_callee_save___native_vcpu_is_preempted; } +DECLARE_STATIC_KEY_TRUE(vcpu_has_preemption); + void __init paravirt_set_cap(void) { if (!pv_is_native_spin_unlock()) @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void) if (!pv_is_native_vcpu_is_preempted()) setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT); + else + static_branch_disable(_has_preemption); } diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c index d5610ad52..f521b0f6d 100644 --- a/kernel/locking/osq_lock.c +++ b/kernel/locking/osq_lock.c @@ -27,6 +27,23 @@ static inline int node_cpu(struct optimistic_spin_node *node) return node->cpu - 1; } +#ifdef vcpu_is_preempted +DEFINE_STATIC_KEY_TRUE(vcpu_has_preemption); + +static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node) +{ + if (static_branch_likely(_has_preemption)) + return vcpu_is_preempted(node_cpu(node->prev)); + + return false; +} +#else +static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node) +{ + return false; +} +#endif + static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val) { int cpu_nr = encoded_cpu_val - 1; @@ -141,7 +158,7 @@ bool osq_lock(struct optimistic_spin_queue *lock) * polling, be careful. */ if (smp_cond_load_relaxed(>locked, VAL || need_resched() || - vcpu_is_preempted(node_cpu(node->prev + vcpu_is_preempted_node(node))) return true; /* unqueue */ Reviewed-by: Waiman Long ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v6] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/28/22 08:54, Guo Hui wrote: The instructions assigned to the vcpu_is_preempted function parameter in the X86 architecture physical machine are redundant instructions, causing the multi-core performance of Unixbench to drop by about 4% to 5%. The C function is as follows: static bool vcpu_is_preempted(long vcpu); The parameter 'vcpu' in the function osq_lock that calls the function vcpu_is_preempted is assigned as follows: The C code is in the function node_cpu: cpu = node->cpu - 1; The instructions corresponding to the C code are: mov 0x14(%rax),%edi sub $0x1,%edi The above instructions are unnecessary in the X86 Native operating environment, causing high cache-misses and degrading performance. This patch uses static_key to not execute this instruction in the Native runtime environment. The patch effect is as follows two machines, Unixbench runs with full core score: 1. Machine configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz CPU core: 40 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 948326591.2 81261.9 Double-Precision Whetstone 55.0 211986.3 38543.0 Execl Throughput 43.0 43453.2 10105.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 438936.2 1108.4 File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2 File Copy 4096 bufsize 8000 maxblocks 5800.01534674.7 2646.0 Pipe Throughput 12440.0 46482107.6 37365.0 Pipe-based Context Switching 4000.01915094.2 4787.7 Process Creation126.0 85442.2 6781.1 Shell Scripts (1 concurrent) 42.4 69400.7 16368.1 Shell Scripts (8 concurrent) 6.0 8877.2 14795.3 System Call Overhead 15000.04714906.1 3143.3 System Benchmarks Index Score7923.3 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 947032915.5 81151.1 Double-Precision Whetstone 55.0 211971.2 38540.2 Execl Throughput 43.0 45054.8 10477.9 File Copy 1024 bufsize 2000 maxblocks 3960.0 515024.9 1300.6 File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3 File Copy 4096 bufsize 8000 maxblocks 5800.01679995.9 2896.5 Pipe Throughput 12440.0 46466394.2 37352.4 Pipe-based Context Switching 4000.01898221.4 4745.6 Process Creation126.0 85653.1 6797.9 Shell Scripts (1 concurrent) 42.4 69437.3 16376.7 Shell Scripts (8 concurrent) 6.0 8898.9 14831.4 System Call Overhead 15000.04658746.7 3105.8 System Benchmarks Index Score8248.8 2. Machine configuration: Hygon C86 7185 32-core Processor CPU core: 128 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4 Double-Precision Whetstone 55.0 438969.9 79812.7 Execl Throughput 43.0 10108.6 2350.8 File Copy 1024 bufsize 2000 maxblocks 3960.0 275892.8696.7 File Copy 256 bufsize 500 maxblocks1655.0 72082.7435.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 925043.4 1594.9 Pipe Throughput 12440.0 118905512.5 95583.2 Pipe-based Context Switching 4000.07820945.7 19552.4 Process Creation126.0 31233.3 2478.8 Shell Scripts (1 concurrent) 42.4 49042.8 11566.7 Shell Scripts (8 concurrent) 6.0 6656.0 11093.3 System Call Overhead 15000.06816047.5 4544.0 System Benchmarks Index Score7756.6 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8 Double-Precision Whetstone 55.0 451847.2 82154.0 Execl Throughput 43.0 10595.1 2464.0 File Copy 1024 bufsize 2000 maxblocks 3960.0
Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/27/22 10:27, Guo Hui wrote: The instructions assigned to the vcpu_is_preempted function parameter in the X86 architecture physical machine are redundant instructions, causing the multi-core performance of Unixbench to drop by about 4% to 5%. The C function is as follows: static bool vcpu_is_preempted(long vcpu); The parameter 'vcpu' in the function osq_lock that calls the function vcpu_is_preempted is assigned as follows: The C code is in the function node_cpu: cpu = node->cpu - 1; The instructions corresponding to the C code are: mov 0x14(%rax),%edi sub $0x1,%edi The above instructions are unnecessary in the X86 Native operating environment, causing high cache-misses and degrading performance. This patch uses static_key to not execute this instruction in the Native runtime environment. The patch effect is as follows two machines, Unixbench runs with full core score: 1. Machine configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz CPU core: 40 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 948326591.2 81261.9 Double-Precision Whetstone 55.0 211986.3 38543.0 Execl Throughput 43.0 43453.2 10105.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 438936.2 1108.4 File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2 File Copy 4096 bufsize 8000 maxblocks 5800.01534674.7 2646.0 Pipe Throughput 12440.0 46482107.6 37365.0 Pipe-based Context Switching 4000.01915094.2 4787.7 Process Creation126.0 85442.2 6781.1 Shell Scripts (1 concurrent) 42.4 69400.7 16368.1 Shell Scripts (8 concurrent) 6.0 8877.2 14795.3 System Call Overhead 15000.04714906.1 3143.3 System Benchmarks Index Score7923.3 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 947032915.5 81151.1 Double-Precision Whetstone 55.0 211971.2 38540.2 Execl Throughput 43.0 45054.8 10477.9 File Copy 1024 bufsize 2000 maxblocks 3960.0 515024.9 1300.6 File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3 File Copy 4096 bufsize 8000 maxblocks 5800.01679995.9 2896.5 Pipe Throughput 12440.0 46466394.2 37352.4 Pipe-based Context Switching 4000.01898221.4 4745.6 Process Creation126.0 85653.1 6797.9 Shell Scripts (1 concurrent) 42.4 69437.3 16376.7 Shell Scripts (8 concurrent) 6.0 8898.9 14831.4 System Call Overhead 15000.04658746.7 3105.8 System Benchmarks Index Score8248.8 2. Machine configuration: Hygon C86 7185 32-core Processor CPU core: 128 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4 Double-Precision Whetstone 55.0 438969.9 79812.7 Execl Throughput 43.0 10108.6 2350.8 File Copy 1024 bufsize 2000 maxblocks 3960.0 275892.8696.7 File Copy 256 bufsize 500 maxblocks1655.0 72082.7435.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 925043.4 1594.9 Pipe Throughput 12440.0 118905512.5 95583.2 Pipe-based Context Switching 4000.07820945.7 19552.4 Process Creation126.0 31233.3 2478.8 Shell Scripts (1 concurrent) 42.4 49042.8 11566.7 Shell Scripts (8 concurrent) 6.0 6656.0 11093.3 System Call Overhead 15000.06816047.5 4544.0 System Benchmarks Index Score7756.6 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8 Double-Precision Whetstone 55.0 451847.2 82154.0 Execl Throughput 43.0 10595.1 2464.0 File Copy 1024 bufsize 2000 maxblocks 3960.0
Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/27/22 01:54, Guo Hui wrote: Thank you very much Longman, my patch is as you said, only disable node_cpu on X86, enable node_cpu on arm64, powerpc, s390 architectures; the code is in file arch/x86/kernel/paravirt-spinlocks.c: DECLARE_STATIC_KEY_FALSE(preemted_key); static_branch_enable(_key); the default value of preemted_key is false and the if conditional statement is reversed, the code is in file kernel/locking/osq_lock.c: DEFINE_STATIC_KEY_FALSE(preemted_key); static inline int node_cpu(struct optimistic_spin_node *node) { int cpu = 0; if (!static_branch_unlikely(_key)) cpu = node->cpu - 1; return cpu; } In this way, only one nop instruction is added to architectures arm64, powerpc and s390, including virtual machines, without any other changes. You are right. I am probably too tired last night to read the patch more carefully. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/26/22 22:13, Guo Hui wrote: The instructions assigned to the vcpu_is_preempted function parameter in the X86 architecture physical machine are redundant instructions, causing the multi-core performance of Unixbench to drop by about 4% to 5%. The C function is as follows: static bool vcpu_is_preempted(long vcpu); The parameter 'vcpu' in the function osq_lock that calls the function vcpu_is_preempted is assigned as follows: The C code is in the function node_cpu: cpu = node->cpu - 1; The instructions corresponding to the C code are: mov 0x14(%rax),%edi sub $0x1,%edi The above instructions are unnecessary in the X86 Native operating environment, causing high cache-misses and degrading performance. This patch uses static_key to not execute this instruction in the Native runtime environment. The code is as follows: DEFINE_STATIC_KEY_FALSE(preemted_key); static inline int node_cpu(struct optimistic_spin_node *node) { int cpu = 0; if (!static_branch_unlikely(_key)) cpu = node->cpu - 1; return cpu; } The patch effect is as follows two machines, Unixbench runs with full core score: 1. Machine configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz CPU core: 40 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 948326591.2 81261.9 Double-Precision Whetstone 55.0 211986.3 38543.0 Execl Throughput 43.0 43453.2 10105.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 438936.2 1108.4 File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2 File Copy 4096 bufsize 8000 maxblocks 5800.01534674.7 2646.0 Pipe Throughput 12440.0 46482107.6 37365.0 Pipe-based Context Switching 4000.01915094.2 4787.7 Process Creation126.0 85442.2 6781.1 Shell Scripts (1 concurrent) 42.4 69400.7 16368.1 Shell Scripts (8 concurrent) 6.0 8877.2 14795.3 System Call Overhead 15000.04714906.1 3143.3 System Benchmarks Index Score7923.3 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 947032915.5 81151.1 Double-Precision Whetstone 55.0 211971.2 38540.2 Execl Throughput 43.0 45054.8 10477.9 File Copy 1024 bufsize 2000 maxblocks 3960.0 515024.9 1300.6 File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3 File Copy 4096 bufsize 8000 maxblocks 5800.01679995.9 2896.5 Pipe Throughput 12440.0 46466394.2 37352.4 Pipe-based Context Switching 4000.01898221.4 4745.6 Process Creation126.0 85653.1 6797.9 Shell Scripts (1 concurrent) 42.4 69437.3 16376.7 Shell Scripts (8 concurrent) 6.0 8898.9 14831.4 System Call Overhead 15000.04658746.7 3105.8 System Benchmarks Index Score8248.8 2. Machine configuration: Hygon C86 7185 32-core Processor CPU core: 128 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4 Double-Precision Whetstone 55.0 438969.9 79812.7 Execl Throughput 43.0 10108.6 2350.8 File Copy 1024 bufsize 2000 maxblocks 3960.0 275892.8696.7 File Copy 256 bufsize 500 maxblocks1655.0 72082.7435.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 925043.4 1594.9 Pipe Throughput 12440.0 118905512.5 95583.2 Pipe-based Context Switching 4000.07820945.7 19552.4 Process Creation126.0 31233.3 2478.8 Shell Scripts (1 concurrent) 42.4 49042.8 11566.7 Shell Scripts (8 concurrent) 6.0 6656.0 11093.3 System Call Overhead 15000.06816047.5 4544.0 System Benchmarks Index Score7756.6 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/25/20 1:26 PM, Peter Zijlstra wrote: On Fri, Jul 24, 2020 at 03:10:59PM -0400, Waiman Long wrote: On 7/24/20 4:16 AM, Will Deacon wrote: On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. I want to say I hate it all, it adds instructions to a path we spend an aweful lot of time optimizing without really getting anything back for it. Will, how do you feel about it? I can see it potentially being useful for debugging, but I hate the limitation to 256 CPUs. Even arm64 is hitting that now. After thinking more about that, I think we can use all the remaining bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending, there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit locked_pending), we can put all possible cpu numbers into the lock. We can also just use smp_processor_id() without additional percpu data. That sounds horrific, wouldn't that destroy the whole point of using a byte for pending? You are right. I realized that later on and had sent a follow-up mail to correct that. Also, you're talking ~1% gains here. I think our collective time would be better spent off reviewing the CNA series and trying to make it more deterministic. I thought you guys are not interested in CNA. I do want to get CNA merged, if possible. Let review the current version again and see if there are ways we can further improve it. It's not a lack of interrest. We were struggling with the fairness issues and the complexity of the thing. I forgot the current state of matters, but at one point UNLOCK was O(n) in waiters, which is, of course, 'unfortunate'. I'll have to look up whatever notes remain, but the basic idea of keeping remote nodes on a secondary list is obviously breaking all sorts of fairness. After that they pile on a bunch of hacks to fix the worst of them, but it feels exactly like that, a bunch of hacks. One of the things I suppose we ought to do is see if some of the ideas of phase-fair locks can be applied to this. That could be a possible solution to ensure better fairness. That coupled with a chronic lack of time for anything :-( That is always true and I feel this way too:-) Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/24/20 3:10 PM, Waiman Long wrote: On 7/24/20 4:16 AM, Will Deacon wrote: On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. I want to say I hate it all, it adds instructions to a path we spend an aweful lot of time optimizing without really getting anything back for it. Will, how do you feel about it? I can see it potentially being useful for debugging, but I hate the limitation to 256 CPUs. Even arm64 is hitting that now. After thinking more about that, I think we can use all the remaining bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending, there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit locked_pending), we can put all possible cpu numbers into the lock. We can also just use smp_processor_id() without additional percpu data. Sorry, that doesn't work. The extra bits in the pending byte won't get cleared on unlock. That will have noticeable performance impact. Clearing the pending byte on unlock will cause other performance problem. So I guess we will have to limit the cpu number in the locked byte. Regards, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/6] powerpc: queued spinlocks and rwlocks
On 7/24/20 9:14 AM, Nicholas Piggin wrote: Updated with everybody's feedback (thanks all), and more performance results. What I've found is I might have been measuring the worst load point for the paravirt case, and by looking at a range of loads it's clear that queued spinlocks are overall better even on PV, doubly so when you look at the generally much improved worst case latencies. I have defaulted it to N even though I'm less concerned about the PV numbers now, just because I think it needs more stress testing. But it's very nicely selectable so should be low risk to include. All in all this is a very cool technology and great results especially on the big systems but even on smaller ones there are nice gains. Thanks Waiman and everyone who developed it. Thanks, Nick Nicholas Piggin (6): powerpc/pseries: move some PAPR paravirt functions to their own file powerpc: move spinlock implementation to simple_spinlock powerpc/64s: implement queued spinlocks and rwlocks powerpc/pseries: implement paravirt qspinlocks for SPLPAR powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the lock hint powerpc: implement smp_cond_load_relaxed arch/powerpc/Kconfig | 15 + arch/powerpc/include/asm/Kbuild | 1 + arch/powerpc/include/asm/atomic.h | 28 ++ arch/powerpc/include/asm/barrier.h| 14 + arch/powerpc/include/asm/paravirt.h | 87 + arch/powerpc/include/asm/qspinlock.h | 91 ++ arch/powerpc/include/asm/qspinlock_paravirt.h | 7 + arch/powerpc/include/asm/simple_spinlock.h| 288 .../include/asm/simple_spinlock_types.h | 21 ++ arch/powerpc/include/asm/spinlock.h | 308 +- arch/powerpc/include/asm/spinlock_types.h | 17 +- arch/powerpc/lib/Makefile | 3 + arch/powerpc/lib/locks.c | 12 +- arch/powerpc/platforms/pseries/Kconfig| 9 +- arch/powerpc/platforms/pseries/setup.c| 4 +- include/asm-generic/qspinlock.h | 4 + 16 files changed, 588 insertions(+), 321 deletions(-) create mode 100644 arch/powerpc/include/asm/paravirt.h create mode 100644 arch/powerpc/include/asm/qspinlock.h create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h create mode 100644 arch/powerpc/include/asm/simple_spinlock.h create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h That patch series looks good to me. Thanks for working on this. For the series, Acked-by: Waiman Long ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 6/6] powerpc: implement smp_cond_load_relaxed
On 7/24/20 9:14 AM, Nicholas Piggin wrote: This implements smp_cond_load_relaed with the slowpath busy loop using the Nit: "smp_cond_load_relaxed" Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/24/20 4:16 AM, Will Deacon wrote: On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. I want to say I hate it all, it adds instructions to a path we spend an aweful lot of time optimizing without really getting anything back for it. Will, how do you feel about it? I can see it potentially being useful for debugging, but I hate the limitation to 256 CPUs. Even arm64 is hitting that now. After thinking more about that, I think we can use all the remaining bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending, there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit locked_pending), we can put all possible cpu numbers into the lock. We can also just use smp_processor_id() without additional percpu data. Also, you're talking ~1% gains here. I think our collective time would be better spent off reviewing the CNA series and trying to make it more deterministic. I thought you guys are not interested in CNA. I do want to get CNA merged, if possible. Let review the current version again and see if there are ways we can further improve it. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/23/20 3:58 PM, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 03:04:13PM -0400, Waiman Long wrote: On 7/23/20 2:47 PM, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. I want to say I hate it all, it adds instructions to a path we spend an aweful lot of time optimizing without really getting anything back for it. It does add some extra instruction that may slow it down slightly, but I don't agree that it gives nothing back. The cpu lock holder information can be useful in analyzing crash dumps and in some debugging situation. I think it can be useful in RHEL for this readon. How about an x86 config option to allow distros to decide if they want to have it enabled? I will make sure that it will have no performance degradation if the option is not enabled. Config knobs suck too; they create a maintenance burden (we get to make sure all the permutations works/build/etc..) and effectively nobody uses them, since world+dog uses what distros pick. Anyway, instead of adding a second per-cpu variable, can you see how horrible something like this is: unsigned char adds(unsigned char var, unsigned char val) { unsigned short sat = 0xff, tmp = var; asm ("addb %[val], %b[var];" "cmovc%[sat], %[var];" : [var] "+r" (tmp) : [val] "ir" (val), [sat] "r" (sat) ); return tmp; } Another thing to try is, instead of threading that lockval throughout the thing, simply: #define _Q_LOCKED_VAL this_cpu_read_stable(cpu_sat) or combined with the above #define _Q_LOCKED_VAL adds(this_cpu_read_stable(cpu_number), 2) and see if the compiler really makes a mess of things. Thanks for the suggestion. I will try that out. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/23/20 2:47 PM, pet...@infradead.org wrote: On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote: BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. I want to say I hate it all, it adds instructions to a path we spend an aweful lot of time optimizing without really getting anything back for it. It does add some extra instruction that may slow it down slightly, but I don't agree that it gives nothing back. The cpu lock holder information can be useful in analyzing crash dumps and in some debugging situation. I think it can be useful in RHEL for this readon. How about an x86 config option to allow distros to decide if they want to have it enabled? I will make sure that it will have no performance degradation if the option is not enabled. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/23/20 10:00 AM, Peter Zijlstra wrote: On Thu, Jul 09, 2020 at 12:06:13PM -0400, Waiman Long wrote: We don't really need to do a pv_spinlocks_init() if pv_kick() isn't supported. Waiman, if you cannot explain how not having kick is a sane thing, what are you saying here? The current PPC paravirt spinlock code doesn't do any cpu kick. It does an equivalence of pv_wait by yielding the cpu to the lock holder only. The pv_spinlocks_init() is for setting up the hash table for doing pv_kick. If we don't need to do pv_kick, we don't need the hash table. I am not saying that pv_kick is not needed for the PPC environment. I was just trying to adapt the pvqspinlock code to such an environment first. Further investigation on how to implement some kind of pv_kick will be something that we may want to do as a follow on. BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch? I will have to update the patch to fix the reported 0-day test problem, but I want to collect other feedback before sending out v3. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/23/20 9:30 AM, Nicholas Piggin wrote: I would prefer to extract out the pending bit handling code out into a separate helper function which can be overridden by the arch code instead of breaking the slowpath into 2 pieces. You mean have the arch provide a queued_spin_lock_slowpath_pending function that the slow path calls? I would actually prefer the pending handling can be made inline in the queued_spin_lock function, especially with out-of-line locks it makes sense to put it there. We could ifdef out queued_spin_lock_slowpath_queue if it's not used, then __queued_spin_lock_slowpath_queue would be inlined into the caller so there would be no split? The pending code is an optimization for lightly contended locks. That is why I think it is appropriate to extract it into a helper function and mark it as such. You can certainly put the code in the arch's spin_lock code, you just has to override the generic pending code by a null function. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/21/20 7:08 AM, Nicholas Piggin wrote: diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index b752d34517b3..26d8766a1106 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock *lock) #else extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock); #endif static __always_inline void queued_spin_lock(struct qspinlock *lock) { - u32 val = 0; - - if (likely(atomic_try_cmpxchg_lock(>val, , _Q_LOCKED_VAL))) + atomic_t *a = >val; + u32 val; + +again: + asm volatile( +"1:\t" PPC_LWARX(%0,0,%1,1) " # queued_spin_lock \n" + : "=" (val) + : "r" (>counter) + : "memory"); + + if (likely(val == 0)) { + asm_volatile_goto( + " stwcx. %0,0,%1 \n" + " bne-%l[again] \n" + "\t" PPC_ACQUIRE_BARRIER " \n" + : + : "r"(_Q_LOCKED_VAL), "r" (>counter) + : "cr0", "memory" + : again ); return; - - queued_spin_lock_slowpath(lock, val); + } + + if (likely(val == _Q_LOCKED_VAL)) { + asm_volatile_goto( + " stwcx. %0,0,%1 \n" + " bne-%l[again] \n" + : + : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (>counter) + : "cr0", "memory" + : again ); + + atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK)); +// clear_pending_set_locked(lock); + WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL); +// lockevent_inc(lock_pending); + return; + } + + if (val == _Q_PENDING_VAL) { + int cnt = _Q_PENDING_LOOPS; + val = atomic_cond_read_relaxed(a, + (VAL != _Q_PENDING_VAL) || !cnt--); + if (!(val & ~_Q_LOCKED_MASK)) + goto again; +} + queued_spin_lock_slowpath_queue(lock); } #define queued_spin_lock queued_spin_lock I am fine with the arch code override some part of the generic code. diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c index b9515fcc9b29..ebcc6f5d99d5 100644 --- a/kernel/locking/qspinlock.c +++ b/kernel/locking/qspinlock.c @@ -287,10 +287,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock, #ifdef CONFIG_PARAVIRT_SPINLOCKS #define queued_spin_lock_slowpath native_queued_spin_lock_slowpath +#define queued_spin_lock_slowpath_queue native_queued_spin_lock_slowpath_queue #endif #endif /* _GEN_PV_LOCK_SLOWPATH */ +void queued_spin_lock_slowpath_queue(struct qspinlock *lock); +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock); + /** * queued_spin_lock_slowpath - acquire the queued spinlock * @lock: Pointer to queued spinlock structure @@ -314,12 +318,6 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock, */ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) { - struct mcs_spinlock *prev, *next, *node; - u32 old, tail; - int idx; - - BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); - if (pv_enabled()) goto pv_queue; @@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) queue: lockevent_inc(lock_slowpath); pv_queue: + __queued_spin_lock_slowpath_queue(lock); +} +EXPORT_SYMBOL(queued_spin_lock_slowpath); + +void queued_spin_lock_slowpath_queue(struct qspinlock *lock) +{ + lockevent_inc(lock_slowpath); + __queued_spin_lock_slowpath_queue(lock); +} +EXPORT_SYMBOL(queued_spin_lock_slowpath_queue); + +static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock) +{ + struct mcs_spinlock *prev, *next, *node; + u32 old, tail; + u32 val; + int idx; + + BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS)); + node = this_cpu_ptr([0].mcs); idx = node->count++; tail = encode_tail(smp_processor_id(), idx); @@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) */ __this_cpu_dec(qnodes[0].mcs.count); } -EXPORT_SYMBOL(queued_spin_lock_slowpath); /* * Generate the paravirt code for queued_spin_unlock_slowpath(). I would prefer to extract out the pending bit handling code out into a separate helper function which can be overridden by the arch code instead of breaking the slowpath
Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/9/20 6:53 AM, Michael Ellerman wrote: Nicholas Piggin writes: Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 28 arch/powerpc/include/asm/qspinlock.h | 66 +++ arch/powerpc/include/asm/qspinlock_paravirt.h | 7 ++ arch/powerpc/platforms/pseries/Kconfig| 5 ++ arch/powerpc/platforms/pseries/setup.c| 6 +- include/asm-generic/qspinlock.h | 2 + Another ack? I am OK with adding the #ifdef around queued_spin_lock(). Acked-by: Waiman Long diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 7a8546660a63..f2d51f929cf5 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { ___bad_yield_to_preempted(); /* This would be a bug */ } + +extern void ___bad_yield_to_any(void); +static inline void yield_to_any(void) +{ + ___bad_yield_to_any(); /* This would be a bug */ +} Why do we do that rather than just not defining yield_to_any() at all and letting the build fail on that? There's a condition somewhere that we know will false at compile time and drop the call before linking? diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h new file mode 100644 index ..750d1b5e0202 --- /dev/null +++ b/arch/powerpc/include/asm/qspinlock_paravirt.h @@ -0,0 +1,7 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +#ifndef __ASM_QSPINLOCK_PARAVIRT_H +#define __ASM_QSPINLOCK_PARAVIRT_H _ASM_POWERPC_QSPINLOCK_PARAVIRT_H please. + +EXPORT_SYMBOL(__pv_queued_spin_unlock); Why's that in a header? Should that (eventually) go with the generic implementation? The PV qspinlock implementation is not that generic at the moment. Even though native qspinlock is used by a number of archs, PV qspinlock is only currently used in x86. This is certainly an area that needs improvement. diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index 24c18362e5ea..756e727b383f 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -25,9 +25,14 @@ config PPC_PSERIES select SWIOTLB default y +config PARAVIRT_SPINLOCKS + bool + default n default n is the default. diff --git a/arch/powerpc/platforms/pseries/setup.c b/arch/powerpc/platforms/pseries/setup.c index 2db8469e475f..747a203d9453 100644 --- a/arch/powerpc/platforms/pseries/setup.c +++ b/arch/powerpc/platforms/pseries/setup.c @@ -771,8 +771,12 @@ static void __init pSeries_setup_arch(void) if (firmware_has_feature(FW_FEATURE_LPAR)) { vpa_init(boot_cpuid); - if (lppaca_shared_proc(get_lppaca())) + if (lppaca_shared_proc(get_lppaca())) { static_branch_enable(_processor); +#ifdef CONFIG_PARAVIRT_SPINLOCKS + pv_spinlocks_init(); +#endif + } We could avoid the ifdef with this I think? diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h index 434615f1d761..6ec72282888d 100644 --- a/arch/powerpc/include/asm/spinlock.h +++ b/arch/powerpc/include/asm/spinlock.h @@ -10,5 +10,9 @@ #include #endif +#ifndef CONFIG_PARAVIRT_SPINLOCKS +static inline void pv_spinlocks_init(void) { } +#endif + #endif /* __KERNEL__ */ #endif /* __ASM_SPINLOCK_H */ cheers We don't really need to do a pv_spinlocks_init() if pv_kick() isn't supported. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/8/20 7:50 PM, Waiman Long wrote: On 7/8/20 1:10 AM, Nicholas Piggin wrote: Excerpts from Waiman Long's message of July 8, 2020 1:33 pm: On 7/7/20 1:57 AM, Nicholas Piggin wrote: Yes, powerpc could certainly get more performance out of the slow paths, and then there are a few parameters to tune. We don't have a good alternate patching for function calls yet, but that would be something to do for native vs pv. And then there seem to be one or two tunable parameters we could experiment with. The paravirt locks may need a bit more tuning. Some simple testing under KVM shows we might be a bit slower in some cases. Whether this is fairness or something else I'm not sure. The current simple pv spinlock code can do a directed yield to the lock holder CPU, whereas the pv qspl here just does a general yield. I think we might actually be able to change that to also support directed yield. Though I'm not sure if this is actually the cause of the slowdown yet. Regarding the paravirt lock, I have taken a further look into the current PPC spinlock code. There is an equivalent of pv_wait() but no pv_kick(). Maybe PPC doesn't really need that. So powerpc has two types of wait, either undirected "all processors" or directed to a specific processor which has been preempted by the hypervisor. The simple spinlock code does a directed wait, because it knows the CPU which is holding the lock. In this case, there is a sequence that is used to ensure we don't wait if the condition has become true, and the target CPU does not need to kick the waiter it will happen automatically (see splpar_spin_yield). This is preferable because we only wait as needed and don't require the kick operation. Thanks for the explanation. The pv spinlock code I did uses the undirected wait, because we don't know the CPU number which we are waiting on. This is undesirable because it's higher overhead and the wait is not so accurate. I think perhaps we could change things so we wait on the correct CPU when queued, which might be good enough (we could also put the lock owner CPU in the spinlock word, if we add another format). The LS byte of the lock word is used to indicate locking status. If we have less than 255 cpus, we can put the (cpu_nr + 1) into the lock byte. The special 0xff value can be used to indicate a cpu number >= 255 for indirect yield. The required change to the qspinlock code will be minimal, I think. BTW, we can also keep track of the previous cpu in the waiting queue. Due to lock stealing, that may not be the cpu that is holding the lock. Maybe we can use this, if available, in case the cpu number is >= 255. Regards, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/8/20 4:41 AM, Peter Zijlstra wrote: On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote: Yes, powerpc could certainly get more performance out of the slow paths, and then there are a few parameters to tune. Can you clarify? The slow path is already in use on ARM64 which is weak, so I doubt there's superfluous serialization present. And Will spend a fair amount of time on making that thing guarantee forward progressm, so there just isn't too much room to play. We don't have a good alternate patching for function calls yet, but that would be something to do for native vs pv. Going by your jump_label implementation, support for static_call should be fairly straight forward too, no? https://lkml.kernel.org/r/20200624153024.794671...@infradead.org Speaking of static_call, I am also looking forward to it. Do you have an idea when that will be merged? Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/8/20 4:32 AM, Peter Zijlstra wrote: On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote: From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 7 Jul 2020 22:29:16 -0400 Subject: [PATCH 2/9] locking/pvqspinlock: Introduce CONFIG_PARAVIRT_QSPINLOCKS_LITE Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows architectures to use the PV qspinlock code without the need to use or implement a pv_kick() function, thus eliminating the atomic unlock overhead. The non-atomic queued_spin_unlock() can be used instead. The pv_wait() function will still be needed, but it can be a dummy function. With that option set, the hybrid PV queued/unfair locking code should still be able to make it performant enough in a paravirtualized How is this supposed to work? If there is no kick, you have no control over who wakes up and fairness goes out the window entirely. You don't even begin to explain... I don't have a full understanding of how the PPC hypervisor work myself. Apparently, a cpu kick may not be needed. This is just a test patch to see if it yields better result. It is subjected to further modifcation. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/8/20 1:10 AM, Nicholas Piggin wrote: Excerpts from Waiman Long's message of July 8, 2020 1:33 pm: On 7/7/20 1:57 AM, Nicholas Piggin wrote: Yes, powerpc could certainly get more performance out of the slow paths, and then there are a few parameters to tune. We don't have a good alternate patching for function calls yet, but that would be something to do for native vs pv. And then there seem to be one or two tunable parameters we could experiment with. The paravirt locks may need a bit more tuning. Some simple testing under KVM shows we might be a bit slower in some cases. Whether this is fairness or something else I'm not sure. The current simple pv spinlock code can do a directed yield to the lock holder CPU, whereas the pv qspl here just does a general yield. I think we might actually be able to change that to also support directed yield. Though I'm not sure if this is actually the cause of the slowdown yet. Regarding the paravirt lock, I have taken a further look into the current PPC spinlock code. There is an equivalent of pv_wait() but no pv_kick(). Maybe PPC doesn't really need that. So powerpc has two types of wait, either undirected "all processors" or directed to a specific processor which has been preempted by the hypervisor. The simple spinlock code does a directed wait, because it knows the CPU which is holding the lock. In this case, there is a sequence that is used to ensure we don't wait if the condition has become true, and the target CPU does not need to kick the waiter it will happen automatically (see splpar_spin_yield). This is preferable because we only wait as needed and don't require the kick operation. Thanks for the explanation. The pv spinlock code I did uses the undirected wait, because we don't know the CPU number which we are waiting on. This is undesirable because it's higher overhead and the wait is not so accurate. I think perhaps we could change things so we wait on the correct CPU when queued, which might be good enough (we could also put the lock owner CPU in the spinlock word, if we add another format). The LS byte of the lock word is used to indicate locking status. If we have less than 255 cpus, we can put the (cpu_nr + 1) into the lock byte. The special 0xff value can be used to indicate a cpu number >= 255 for indirect yield. The required change to the qspinlock code will be minimal, I think. Attached are two additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE option to not require pv_kick(). There is also a fixup patch to be applied after your patchset. I don't have access to a PPC LPAR with shared processor at the moment, so I can't test the performance of the paravirt code. Would you mind adding my patches and do some performance test on your end to see if it gives better result? Great, I'll do some tests. Any suggestions for what to try? I will just like to see if it will produce some better performance result compared with your current version. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks
On 7/7/20 1:57 AM, Nicholas Piggin wrote: Yes, powerpc could certainly get more performance out of the slow paths, and then there are a few parameters to tune. We don't have a good alternate patching for function calls yet, but that would be something to do for native vs pv. And then there seem to be one or two tunable parameters we could experiment with. The paravirt locks may need a bit more tuning. Some simple testing under KVM shows we might be a bit slower in some cases. Whether this is fairness or something else I'm not sure. The current simple pv spinlock code can do a directed yield to the lock holder CPU, whereas the pv qspl here just does a general yield. I think we might actually be able to change that to also support directed yield. Though I'm not sure if this is actually the cause of the slowdown yet. Regarding the paravirt lock, I have taken a further look into the current PPC spinlock code. There is an equivalent of pv_wait() but no pv_kick(). Maybe PPC doesn't really need that. Attached are two additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE option to not require pv_kick(). There is also a fixup patch to be applied after your patchset. I don't have access to a PPC LPAR with shared processor at the moment, so I can't test the performance of the paravirt code. Would you mind adding my patches and do some performance test on your end to see if it gives better result? Thanks, Longman >From 161e545523a7eb4c42c145c04e9a5a15903ba3d9 Mon Sep 17 00:00:00 2001 From: Waiman Long Date: Tue, 7 Jul 2020 20:46:51 -0400 Subject: [PATCH 1/9] locking/pvqspinlock: Code relocation and extraction Move pv_kick_node() and the unlock functions up and extract out the hash and lock code from pv_wait_head_or_lock() into pv_hash_lock(). There is no functional change. Signed-off-by: Waiman Long --- kernel/locking/qspinlock_paravirt.h | 302 ++-- 1 file changed, 156 insertions(+), 146 deletions(-) diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h index e84d21aa0722..8eec58320b85 100644 --- a/kernel/locking/qspinlock_paravirt.h +++ b/kernel/locking/qspinlock_paravirt.h @@ -55,6 +55,7 @@ struct pv_node { /* * Hybrid PV queued/unfair lock + * * * By replacing the regular queued_spin_trylock() with the function below, * it will be called once when a lock waiter enter the PV slowpath before @@ -259,6 +260,156 @@ static struct pv_node *pv_unhash(struct qspinlock *lock) BUG(); } +/* + * Insert lock into hash and set _Q_SLOW_VAL. + * Return true if lock acquired. + */ +static inline bool pv_hash_lock(struct qspinlock *lock, struct pv_node *node) +{ + struct qspinlock **lp = pv_hash(lock, node); + + /* + * We must hash before setting _Q_SLOW_VAL, such that + * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock() + * we'll be sure to be able to observe our hash entry. + * + * [S] [Rmw] l->locked == _Q_SLOW_VAL + * MB RMB + * [RmW] l->locked = _Q_SLOW_VAL [L] + * + * Matches the smp_rmb() in __pv_queued_spin_unlock(). + */ + if (xchg(>locked, _Q_SLOW_VAL) == 0) { + /* + * The lock was free and now we own the lock. + * Change the lock value back to _Q_LOCKED_VAL + * and unhash the table. + */ + WRITE_ONCE(lock->locked, _Q_LOCKED_VAL); + WRITE_ONCE(*lp, NULL); + return true; + } + return false; +} + +/* + * Called after setting next->locked = 1 when we're the lock owner. + * + * Instead of waking the waiters stuck in pv_wait_node() advance their state + * such that they're waiting in pv_wait_head_or_lock(), this avoids a + * wake/sleep cycle. + */ +static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node) +{ + struct pv_node *pn = (struct pv_node *)node; + + /* + * If the vCPU is indeed halted, advance its state to match that of + * pv_wait_node(). If OTOH this fails, the vCPU was running and will + * observe its next->locked value and advance itself. + * + * Matches with smp_store_mb() and cmpxchg() in pv_wait_node() + * + * The write to next->locked in arch_mcs_spin_unlock_contended() + * must be ordered before the read of pn->state in the cmpxchg() + * below for the code to work correctly. To guarantee full ordering + * irrespective of the success or failure of the cmpxchg(), + * a relaxed version with explicit barrier is used. The control + * dependency will order the reading of pn->state before any + * subsequent writes. + */ + smp_mb__before_atomic(); + if (cmpxchg_relaxed(>state, vcpu_halted, vcpu_hashed) + != vcpu_halted) + return; + + /* + * Put the lock into the hash table and set the _Q_SLOW_VAL. + * + * As this is the same vCPU that will check the _Q_SLOW_VAL value and + * the hash table later on at unlock time, no atomic instruction is + * needed. + */ + WRITE_ONCE(lock->locked, _Q_SLOW_VAL); + (void)pv_hash
Re: [PATCH v2 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/3/20 3:35 AM, Nicholas Piggin wrote: Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 28 ++ arch/powerpc/include/asm/qspinlock.h | 55 +++ arch/powerpc/include/asm/qspinlock_paravirt.h | 5 ++ arch/powerpc/platforms/pseries/Kconfig| 5 ++ arch/powerpc/platforms/pseries/setup.c| 6 +- include/asm-generic/qspinlock.h | 2 + 6 files changed, 100 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 7a8546660a63..f2d51f929cf5 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); } + +static inline void prod_cpu(int cpu) +{ + plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); +} + +static inline void yield_to_any(void) +{ + plpar_hcall_norets(H_CONFER, -1, 0); +} #else static inline bool is_shared_processor(void) { @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { ___bad_yield_to_preempted(); /* This would be a bug */ } + +extern void ___bad_yield_to_any(void); +static inline void yield_to_any(void) +{ + ___bad_yield_to_any(); /* This would be a bug */ +} + +extern void ___bad_prod_cpu(void); +static inline void prod_cpu(int cpu) +{ + ___bad_prod_cpu(); /* This would be a bug */ +} + #endif #define vcpu_is_preempted vcpu_is_preempted @@ -57,5 +80,10 @@ static inline bool vcpu_is_preempted(int cpu) return false; } +static inline bool pv_is_native_spin_unlock(void) +{ + return !is_shared_processor(); +} + #endif /* __KERNEL__ */ #endif /* __ASM_PARAVIRT_H */ diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index c49e33e24edd..0960a0de2467 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -3,9 +3,36 @@ #define _ASM_POWERPC_QSPINLOCK_H #include +#include #define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + if (!is_shared_processor()) + native_queued_spin_lock_slowpath(lock, val); + else + __pv_queued_spin_lock_slowpath(lock, val); +} In a previous mail, I said that: You may need to match the use of __pv_queued_spin_lock_slowpath() with the corresponding __pv_queued_spin_unlock(), e.g. #define queued_spin_unlock queued_spin_unlock static inline queued_spin_unlock(struct qspinlock *lock) { if (!is_shared_processor()) smp_store_release(>locked, 0); else __pv_queued_spin_unlock(lock); } Otherwise, pv_kick() will never be called. Maybe PowerPC HMT is different that the shared cpus can still process instruction, though slower, that cpu kicking like what was done in kvm is not really necessary. If that is the case, I think we should document that. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 6/8] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/2/20 12:15 PM, kernel test robot wrote: Hi Nicholas, I love your patch! Yet something to improve: [auto build test ERROR on powerpc/next] [also build test ERROR on tip/locking/core v5.8-rc3 next-20200702] [If your patch is applied to the wrong git tree, kindly drop us a note. And when submitting patch, we suggest to use as documented in https://git-scm.com/docs/git-format-patch] url: https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-queued-spinlocks-and-rwlocks/20200702-155158 base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next config: powerpc-allyesconfig (attached as .config) compiler: powerpc64-linux-gcc (GCC) 9.3.0 reproduce (this is a W=1 build): wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=powerpc If you fix the issue, kindly add following tag as appropriate Reported-by: kernel test robot All errors (new ones prefixed by >>): kernel/locking/lock_events.c:61:16: warning: no previous prototype for 'lockevent_read' [-Wmissing-prototypes] 61 | ssize_t __weak lockevent_read(struct file *file, char __user *user_buf, |^~ kernel/locking/lock_events.c: In function 'skip_lockevent': kernel/locking/lock_events.c:126:12: error: implicit declaration of function 'pv_is_native_spin_unlock' [-Werror=implicit-function-declaration] 126 | pv_on = !pv_is_native_spin_unlock(); |^~~~ cc1: some warnings being treated as errors vim +/pv_is_native_spin_unlock +126 kernel/locking/lock_events.c I think you will need to add the following into arch/powerpc/include/asm/qspinlock_paravirt.h: static inline pv_is_native_spin_unlock(void) { return !is_shared_processor(); } Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 6/8] powerpc/pseries: implement paravirt qspinlocks for SPLPAR
On 7/2/20 3:48 AM, Nicholas Piggin wrote: Signed-off-by: Nicholas Piggin --- arch/powerpc/include/asm/paravirt.h | 23 arch/powerpc/include/asm/qspinlock.h | 55 +++ arch/powerpc/include/asm/qspinlock_paravirt.h | 5 ++ arch/powerpc/platforms/pseries/Kconfig| 5 ++ arch/powerpc/platforms/pseries/setup.c| 6 +- include/asm-generic/qspinlock.h | 2 + 6 files changed, 95 insertions(+), 1 deletion(-) create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h diff --git a/arch/powerpc/include/asm/paravirt.h b/arch/powerpc/include/asm/paravirt.h index 7a8546660a63..5fae9dfa6fe9 100644 --- a/arch/powerpc/include/asm/paravirt.h +++ b/arch/powerpc/include/asm/paravirt.h @@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), yield_count); } + +static inline void prod_cpu(int cpu) +{ + plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu)); +} + +static inline void yield_to_any(void) +{ + plpar_hcall_norets(H_CONFER, -1, 0); +} #else static inline bool is_shared_processor(void) { @@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 yield_count) { ___bad_yield_to_preempted(); /* This would be a bug */ } + +extern void ___bad_yield_to_any(void); +static inline void yield_to_any(void) +{ + ___bad_yield_to_any(); /* This would be a bug */ +} + +extern void ___bad_prod_cpu(void); +static inline void prod_cpu(int cpu) +{ + ___bad_prod_cpu(); /* This would be a bug */ +} + #endif #define vcpu_is_preempted vcpu_is_preempted diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h index f84da77b6bb7..997a9a32df77 100644 --- a/arch/powerpc/include/asm/qspinlock.h +++ b/arch/powerpc/include/asm/qspinlock.h @@ -3,9 +3,36 @@ #define _ASM_POWERPC_QSPINLOCK_H #include +#include #define _Q_PENDING_LOOPS (1 << 9) /* not tuned */ +#ifdef CONFIG_PARAVIRT_SPINLOCKS +extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); +extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val); + +static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val) +{ + if (!is_shared_processor()) + native_queued_spin_lock_slowpath(lock, val); + else + __pv_queued_spin_lock_slowpath(lock, val); +} You may need to match the use of __pv_queued_spin_lock_slowpath() with the corresponding __pv_queued_spin_unlock(), e.g. #define queued_spin_unlock queued_spin_unlock static inline queued_spin_unlock(struct qspinlock *lock) { if (!is_shared_processor()) smp_store_release(>locked, 0); else __pv_queued_spin_unlock(lock); } Otherwise, pv_kick() will never be called. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()
On 6/16/20 2:53 PM, Joe Perches wrote: On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote: v4: - Break out the memzero_explicit() change as suggested by Dan Carpenter so that it can be backported to stable. - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for now as there can be a bit more discussion on what is best. It will be introduced as a separate patch later on after this one is merged. To this larger audience and last week without reply: https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/ Are there _any_ fastpath uses of kfree or vfree? Many patches have been posted recently to fix mispairings of specific types of alloc and free functions. To eliminate these mispairings at a runtime cost of four comparisons, should the kfree/vfree/kvfree/kfree_const functions be consolidated into a single kfree? Something like the below: void kfree(const void *addr) { if (is_kernel_rodata((unsigned long)addr)) return; if (is_vmalloc_addr(addr)) _vfree(addr); else _kfree(addr); } #define kvfree kfree #define vfree kfree #define kfree_const kfree How about adding CONFIG_DEBUG_VM code to check for invalid address ranges in kfree() and vfree()? By doing this, we can catch unmatched pairing in debug mode, but won't have the overhead when debug mode is off. Thought? Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()
On 6/16/20 2:53 PM, Joe Perches wrote: On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote: v4: - Break out the memzero_explicit() change as suggested by Dan Carpenter so that it can be backported to stable. - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for now as there can be a bit more discussion on what is best. It will be introduced as a separate patch later on after this one is merged. To this larger audience and last week without reply: https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/ Are there _any_ fastpath uses of kfree or vfree? I am not sure about that, but both of them can be slow. Many patches have been posted recently to fix mispairings of specific types of alloc and free functions. To eliminate these mispairings at a runtime cost of four comparisons, should the kfree/vfree/kvfree/kfree_const functions be consolidated into a single kfree? Something like the below: void kfree(const void *addr) { if (is_kernel_rodata((unsigned long)addr)) return; if (is_vmalloc_addr(addr)) _vfree(addr); else _kfree(addr); } is_kernel_rodata() is inlined, but is_vmalloc_addr() isn't. So the overhead can be a bit bigger. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()
On 6/16/20 2:09 PM, Andrew Morton wrote: On Tue, 16 Jun 2020 11:43:11 -0400 Waiman Long wrote: As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/\bkzfree\b/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and adding a kzfree backward compatibility macro in slab.h. ... --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -186,10 +186,12 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *, struct mem_cgroup *); */ void * __must_check krealloc(const void *, size_t, gfp_t); void kfree(const void *); -void kzfree(const void *); +void kfree_sensitive(const void *); size_t __ksize(const void *); size_t ksize(const void *); +#define kzfree(x) kfree_sensitive(x) /* For backward compatibility */ + What was the thinking here? Is this really necessary? I suppose we could keep this around for a while to ease migration. But not for too long, please. It should be there just for 1 release cycle. I have broken out the btrfs patch to the btrfs list and I didn't make the kzfree to kfree_sensitive conversion there as that patch was in front in my patch list. So depending on which one lands first, there can be a window where the compilation may fail without this workaround. I am going to send out another patch in the next release cycle to remove it. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()
As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/\bkzfree\b/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and adding a kzfree backward compatibility macro in slab.h. Suggested-by: Joe Perches Acked-by: David Howells Acked-by: Michal Hocko Acked-by: Johannes Weiner Signed-off-by: Waiman Long --- arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 2 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 2 +- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 4 +-- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 32 - drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c | 4 +-- fs/ecryptfs/keystore.c
[PATCH v5 1/2] mm/slab: Use memzero_explicit() in kzfree()
The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for buffer clearing. However unlikely, there is still a non-zero probability that the compiler may choose to optimize away the memory clearing especially if LTO is being used in the future. To make sure that this optimization will never happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to future-proof it. Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()") Cc: sta...@vger.kernel.org Acked-by: Michal Hocko Signed-off-by: Waiman Long --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 9e72ba224175..37d48a56431d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1726,7 +1726,7 @@ void kzfree(const void *p) if (unlikely(ZERO_OR_NULL_PTR(mem))) return; ks = ksize(mem); - memset(mem, 0, ks); + memzero_explicit(mem, ks); kfree(mem); } EXPORT_SYMBOL(kzfree); -- 2.18.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v5 0/2] mm, treewide: Rename kzfree() to kfree_sensitive()
v5: - Break the btrfs patch out as a separate patch to be processed independently. - Update the commit log of patch 1 to make it less scary. - Add a kzfree backward compatibility macro in patch 2. v4: - Break out the memzero_explicit() change as suggested by Dan Carpenter so that it can be backported to stable. - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for now as there can be a bit more discussion on what is best. It will be introduced as a separate patch later on after this one is merged. This patchset makes a global rename of the kzfree() to kfree_sensitive() to highlight the fact buffer clearing is only needed if the data objects contain sensitive information like encrpytion key. The fact that kzfree() uses memset() to do the clearing isn't totally safe either as compiler may compile out the clearing in their optimizer especially if LTO is used. Instead, the new kfree_sensitive() uses memzero_explicit() which won't get compiled out. Waiman Long (2): mm/slab: Use memzero_explicit() in kzfree() mm, treewide: Rename kzfree() to kfree_sensitive() arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 2 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 2 +- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 4 +-- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 32 - drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c | 4 +-- fs/ecryptfs/keystore.c| 4 +-- fs/ecryptfs/messaging.c | 2 +- include/crypto/aead.h | 2 +- include/crypto/
Re: [PATCH v4 2/3] mm, treewide: Rename kzfree() to kfree_sensitive()
On 6/16/20 10:26 AM, Dan Carpenter wrote: Last time you sent this we couldn't decide which tree it should go through. Either the crypto tree or through Andrew seems like the right thing to me. Also the other issue is that it risks breaking things if people add new kzfree() instances while we are doing the transition. Could you just add a "#define kzfree kfree_sensitive" so that things continue to compile and we can remove it in the next kernel release? regards, dan carpenter Yes, that make sure sense. Will send out v5 later today. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 3/3] btrfs: Use kfree() in btrfs_ioctl_get_subvol_info()
On 6/16/20 10:48 AM, David Sterba wrote: On Mon, Jun 15, 2020 at 09:57:18PM -0400, Waiman Long wrote: In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kfree_sensitive() isn't really needed, use kfree() instead. Reported-by: David Sterba Signed-off-by: Waiman Long --- fs/btrfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index f1dd9e4271e9..e8f7c5f00894 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2692,7 +2692,7 @@ static int btrfs_ioctl_get_subvol_info(struct file *file, void __user *argp) btrfs_put_root(root); out_free: btrfs_free_path(path); - kfree_sensitive(subvol_info); + kfree(subvol_info); I would rather merge a patch doing to kzfree -> kfree instead of doing the middle step to switch it to kfree_sensitive. If it would help integration of your patchset I can push it to the next rc so there are no kzfree left in the btrfs code. Treewide change like that can take time so it would be one less problem to care about for you. Sure, I will move it forward in the patch series. Thanks, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()
On 6/15/20 11:30 PM, Eric Biggers wrote: On Mon, Jun 15, 2020 at 09:57:16PM -0400, Waiman Long wrote: The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for the buffer clearing. However, it is entirely possible that the compiler may choose to optimize away the memory clearing especially if LTO is being used. To make sure that this optimization will not happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to do the clearing. Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()") Cc: sta...@vger.kernel.org Signed-off-by: Waiman Long --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 9e72ba224175..37d48a56431d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1726,7 +1726,7 @@ void kzfree(const void *p) if (unlikely(ZERO_OR_NULL_PTR(mem))) return; ks = ksize(mem); - memset(mem, 0, ks); + memzero_explicit(mem, ks); kfree(mem); } EXPORT_SYMBOL(kzfree); This is a good change, but the commit message isn't really accurate. AFAIK, no one has found any case where this memset() gets optimized out. And even with LTO, it would be virtually impossible due to all the synchronization and global data structures that kfree() uses. (Remember that this isn't the C standard function "free()", so the compiler can't assign it any special meaning.) Not to mention that LTO support isn't actually upstream yet. I still agree with the change, but it might be helpful if the commit message were honest that this is really a hardening measure and about properly conveying the intent. As-is this sounds like a critical fix, which might confuse people. Yes, I agree that the commit log may look a bit scary. How about the following: The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for buffer clearing. However unlikely, there is still a non-zero probability that the compiler may choose to optimize away the memory clearing especially if LTO is being used in the future. To make sure that this optimization will never happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to future-proof it. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 3/3] btrfs: Use kfree() in btrfs_ioctl_get_subvol_info()
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc() was incorrectly paired with kzfree(). According to David Sterba, there isn't any sensitive information in the subvol_info that needs to be cleared before freeing. So kfree_sensitive() isn't really needed, use kfree() instead. Reported-by: David Sterba Signed-off-by: Waiman Long --- fs/btrfs/ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index f1dd9e4271e9..e8f7c5f00894 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -2692,7 +2692,7 @@ static int btrfs_ioctl_get_subvol_info(struct file *file, void __user *argp) btrfs_put_root(root); out_free: btrfs_free_path(path); - kfree_sensitive(subvol_info); + kfree(subvol_info); return ret; } -- 2.18.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/3] mm, treewide: Rename kzfree() to kfree_sensitive()
As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/\bkzfree\b/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and the use of memzero_explicit() instead of memset(). Suggested-by: Joe Perches Acked-by: David Howells Acked-by: Michal Hocko Acked-by: Johannes Weiner Signed-off-by: Waiman Long --- arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 2 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 2 +- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 4 +-- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 32 - drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c
[PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()
v4: - Break out the memzero_explicit() change as suggested by Dan Carpenter so that it can be backported to stable. - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for now as there can be a bit more discussion on what is best. It will be introduced as a separate patch later on after this one is merged. This patchset makes a global rename of the kzfree() to kfree_sensitive() to highlight the fact buffer clearing is only needed if the data objects contain sensitive information like encrpytion key. The fact that kzfree() uses memset() to do the clearing isn't totally safe either as compiler may compile out the clearing in their optimizer especially if LTO is used. Instead, the new kfree_sensitive() uses memzero_explicit() which won't get compiled out. Waiman Long (3): mm/slab: Use memzero_explicit() in kzfree() mm, treewide: Rename kzfree() to kfree_sensitive() btrfs: Use kfree() in btrfs_ioctl_get_subvol_info() arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 2 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 2 +- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 4 +-- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 32 - drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c | 4 +-- fs/ecryptfs/keystore.c| 4 +-- fs/ecryptfs/messaging.c | 2 +- include/crypto/aead.h | 2 +- include/crypto/akcipher.h | 2 +- include/crypto/gf128mul.h | 2 +- include/cry
[PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()
The kzfree() function is normally used to clear some sensitive information, like encryption keys, in the buffer before freeing it back to the pool. Memset() is currently used for the buffer clearing. However, it is entirely possible that the compiler may choose to optimize away the memory clearing especially if LTO is being used. To make sure that this optimization will not happen, memzero_explicit(), which is introduced in v3.18, is now used in kzfree() to do the clearing. Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()") Cc: sta...@vger.kernel.org Signed-off-by: Waiman Long --- mm/slab_common.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/slab_common.c b/mm/slab_common.c index 9e72ba224175..37d48a56431d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1726,7 +1726,7 @@ void kzfree(const void *p) if (unlikely(ZERO_OR_NULL_PTR(mem))) return; ks = ksize(mem); - memset(mem, 0, ks); + memzero_explicit(mem, ks); kfree(mem); } EXPORT_SYMBOL(kzfree); -- 2.18.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()
On 6/15/20 2:07 PM, Dan Carpenter wrote: On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote: diff --git a/mm/slab_common.c b/mm/slab_common.c index 23c7500eea7d..c08bc7eb20bd 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -1707,17 +1707,17 @@ void *krealloc(const void *p, size_t new_size, gfp_t flags) EXPORT_SYMBOL(krealloc); /** - * kzfree - like kfree but zero memory + * kfree_sensitive - Clear sensitive information in memory before freeing * @p: object to free memory of * * The memory of the object @p points to is zeroed before freed. - * If @p is %NULL, kzfree() does nothing. + * If @p is %NULL, kfree_sensitive() does nothing. * * Note: this function zeroes the whole allocated buffer which can be a good * deal bigger than the requested buffer size passed to kmalloc(). So be * careful when using this function in performance sensitive code. */ -void kzfree(const void *p) +void kfree_sensitive(const void *p) { size_t ks; void *mem = (void *)p; @@ -1725,10 +1725,10 @@ void kzfree(const void *p) if (unlikely(ZERO_OR_NULL_PTR(mem))) return; ks = ksize(mem); - memset(mem, 0, ks); + memzero_explicit(mem, ks); ^ This is an unrelated bug fix. It really needs to be pulled into a separate patch by itself and back ported to stable kernels. kfree(mem); } -EXPORT_SYMBOL(kzfree); +EXPORT_SYMBOL(kfree_sensitive); /** * ksize - get the actual amount of memory allocated for a given object regards, dan carpenter Thanks for the suggestion. I will break it out and post a version soon. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()
On 4/14/20 3:16 PM, Michal Suchánek wrote: > On Tue, Apr 14, 2020 at 12:24:36PM -0400, Waiman Long wrote: >> On 4/14/20 2:08 AM, Christophe Leroy wrote: >>> >>> Le 14/04/2020 à 00:28, Waiman Long a écrit : >>>> Since kfree_sensitive() will do an implicit memzero_explicit(), there >>>> is no need to call memzero_explicit() before it. Eliminate those >>>> memzero_explicit() and simplify the call sites. For better correctness, >>>> the setting of keylen is also moved down after the key pointer check. >>>> >>>> Signed-off-by: Waiman Long >>>> --- >>>> .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 19 +- >>>> .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 20 +-- >>>> drivers/crypto/amlogic/amlogic-gxl-cipher.c | 12 +++ >>>> drivers/crypto/inside-secure/safexcel_hash.c | 3 +-- >>>> 4 files changed, 14 insertions(+), 40 deletions(-) >>>> >>>> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >>>> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >>>> index aa4e8fdc2b32..8358fac98719 100644 >>>> --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >>>> +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >>>> @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm) >>>> { >>>> struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); >>>> - if (op->key) { >>>> - memzero_explicit(op->key, op->keylen); >>>> - kfree(op->key); >>>> - } >>>> + kfree_sensitive(op->key); >>>> crypto_free_sync_skcipher(op->fallback_tfm); >>>> pm_runtime_put_sync_suspend(op->ce->dev); >>>> } >>>> @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher >>>> *tfm, const u8 *key, >>>> dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen); >>>> return -EINVAL; >>>> } >>>> - if (op->key) { >>>> - memzero_explicit(op->key, op->keylen); >>>> - kfree(op->key); >>>> - } >>>> - op->keylen = keylen; >>>> + kfree_sensitive(op->key); >>>> op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); >>>> if (!op->key) >>>> return -ENOMEM; >>>> + op->keylen = keylen; >>> Does it matter at all to ensure op->keylen is not set when of->key is >>> NULL ? I'm not sure. >>> >>> But if it does, then op->keylen should be set to 0 when freeing op->key. >> My thinking is that if memory allocation fails, we just don't touch >> anything and return an error code. I will not explicitly set keylen to 0 >> in this case unless it is specified in the API documentation. > You already freed the key by now so not touching anything is not > possible. The key is set to NULL on allocation failure so setting keylen > to 0 should be redundant. However, setting keylen to 0 is consisent with > not having a key, and it avoids the possibility of leaking the length > later should that ever cause any problem. OK, I can change it to clear the key length when the allocation failed which isn't likely. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()
On 4/14/20 8:48 AM, David Sterba wrote: > On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote: >> fs/btrfs/ioctl.c | 2 +- > >> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c >> index 40b729dce91c..eab3f8510426 100644 >> --- a/fs/btrfs/ioctl.c >> +++ b/fs/btrfs/ioctl.c >> @@ -2691,7 +2691,7 @@ static int btrfs_ioctl_get_subvol_info(struct file >> *file, void __user *argp) >> btrfs_put_root(root); >> out_free: >> btrfs_free_path(path); >> -kzfree(subvol_info); >> +kfree_sensitive(subvol_info); > This is not in a sensitive context so please switch it to plain kfree. > With that you have my acked-by. Thanks. > Thanks for letting me know about. I think I will send it out as a separate patch. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()
On 4/14/20 2:08 AM, Christophe Leroy wrote: > > > Le 14/04/2020 à 00:28, Waiman Long a écrit : >> Since kfree_sensitive() will do an implicit memzero_explicit(), there >> is no need to call memzero_explicit() before it. Eliminate those >> memzero_explicit() and simplify the call sites. For better correctness, >> the setting of keylen is also moved down after the key pointer check. >> >> Signed-off-by: Waiman Long >> --- >> .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 19 +- >> .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 20 +-- >> drivers/crypto/amlogic/amlogic-gxl-cipher.c | 12 +++ >> drivers/crypto/inside-secure/safexcel_hash.c | 3 +-- >> 4 files changed, 14 insertions(+), 40 deletions(-) >> >> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >> index aa4e8fdc2b32..8358fac98719 100644 >> --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >> +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >> @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm) >> { >> struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); >> - if (op->key) { >> - memzero_explicit(op->key, op->keylen); >> - kfree(op->key); >> - } >> + kfree_sensitive(op->key); >> crypto_free_sync_skcipher(op->fallback_tfm); >> pm_runtime_put_sync_suspend(op->ce->dev); >> } >> @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher >> *tfm, const u8 *key, >> dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen); >> return -EINVAL; >> } >> - if (op->key) { >> - memzero_explicit(op->key, op->keylen); >> - kfree(op->key); >> - } >> - op->keylen = keylen; >> + kfree_sensitive(op->key); >> op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); >> if (!op->key) >> return -ENOMEM; >> + op->keylen = keylen; > > Does it matter at all to ensure op->keylen is not set when of->key is > NULL ? I'm not sure. > > But if it does, then op->keylen should be set to 0 when freeing op->key. My thinking is that if memory allocation fails, we just don't touch anything and return an error code. I will not explicitly set keylen to 0 in this case unless it is specified in the API documentation. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()
Since kfree_sensitive() will do an implicit memzero_explicit(), there is no need to call memzero_explicit() before it. Eliminate those memzero_explicit() and simplify the call sites. For better correctness, the setting of keylen is also moved down after the key pointer check. Signed-off-by: Waiman Long --- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 19 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 20 +-- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 12 +++ drivers/crypto/inside-secure/safexcel_hash.c | 3 +-- 4 files changed, 14 insertions(+), 40 deletions(-) diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c index aa4e8fdc2b32..8358fac98719 100644 --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm) { struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); crypto_free_sync_skcipher(op->fallback_tfm); pm_runtime_put_sync_suspend(op->ce->dev); } @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, const u8 *key, dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen); return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } - op->keylen = keylen; + kfree_sensitive(op->key); op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) return -ENOMEM; + op->keylen = keylen; crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK); crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & CRYPTO_TFM_REQ_MASK); @@ -416,14 +410,11 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, const u8 *key, if (err) return err; - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } - op->keylen = keylen; + kfree_sensitive(op->key); op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) return -ENOMEM; + op->keylen = keylen; crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK); crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & CRYPTO_TFM_REQ_MASK); diff --git a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c index 5246ef4f5430..0495fbc27fcc 100644 --- a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c +++ b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c @@ -249,7 +249,6 @@ static int sun8i_ss_cipher(struct skcipher_request *areq) offset = areq->cryptlen - ivsize; if (rctx->op_dir & SS_DECRYPTION) { memcpy(areq->iv, backup_iv, ivsize); - memzero_explicit(backup_iv, ivsize); kfree_sensitive(backup_iv); } else { scatterwalk_map_and_copy(areq->iv, areq->dst, offset, @@ -367,10 +366,7 @@ void sun8i_ss_cipher_exit(struct crypto_tfm *tfm) { struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); crypto_free_sync_skcipher(op->fallback_tfm); pm_runtime_put_sync(op->ss->dev); } @@ -392,14 +388,11 @@ int sun8i_ss_aes_setkey(struct crypto_skcipher *tfm, const u8 *key, dev_dbg(ss->dev, "ERROR: Invalid keylen %u\n", keylen); return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } - op->keylen = keylen; + kfree_sensitive(op->key); op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) return -ENOMEM; + op->keylen = keylen; crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK); crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & CRYPTO_TFM_REQ_MASK); @@ -418,14 +411,11 @@ int sun8i_ss_des3_setkey(struct crypto_skcipher *tfm, const u8 *key, return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op-&g
Re: [PATCH 2/2] crypto: Remove unnecessary memzero_explicit()
On 4/13/20 5:31 PM, Joe Perches wrote: > On Mon, 2020-04-13 at 17:15 -0400, Waiman Long wrote: >> Since kfree_sensitive() will do an implicit memzero_explicit(), there >> is no need to call memzero_explicit() before it. Eliminate those >> memzero_explicit() and simplify the call sites. > 2 bits of trivia: > >> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c >> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c > [] >> @@ -391,10 +388,7 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, >> const u8 *key, >> dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen); >> return -EINVAL; >> } >> -if (op->key) { >> -memzero_explicit(op->key, op->keylen); >> -kfree(op->key); >> -} >> +kfree_sensitive(op->key); >> op->keylen = keylen; >> op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); >> if (!op->key) > It might be a defect to set op->keylen before the kmemdup succeeds. It could be. I can move it down after the op->key check. >> @@ -416,10 +410,7 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, >> const u8 *key, >> if (err) >> return err; >> >> -if (op->key) { >> -memzero_explicit(op->key, op->keylen); >> -kfree(op->key); >> -} >> +free_sensitive(op->key, op->keylen); > Why not kfree_sensitive(op->key) ? Oh, it is a bug. I will send out v2 to fix that. Thanks for spotting it. Cheers, Longman > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()
As said by Linus: A symmetric naming is only helpful if it implies symmetries in use. Otherwise it's actively misleading. In "kzalloc()", the z is meaningful and an important part of what the caller wants. In "kzfree()", the z is actively detrimental, because maybe in the future we really _might_ want to use that "memfill(0xdeadbeef)" or something. The "zero" part of the interface isn't even _relevant_. The main reason that kzfree() exists is to clear sensitive information that should not be leaked to other future users of the same memory objects. Rename kzfree() to kfree_sensitive() to follow the example of the recently added kvfree_sensitive() and make the intention of the API more explicit. In addition, memzero_explicit() is used to clear the memory to make sure that it won't get optimized away by the compiler. The renaming is done by using the command sequence: git grep -w --name-only kzfree |\ xargs sed -i 's/\bkzfree\b/kfree_sensitive/' followed by some editing of the kfree_sensitive() kerneldoc and the use of memzero_explicit() instead of memset(). Suggested-by: Joe Perches Signed-off-by: Waiman Long --- arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 2 +- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 2 +- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 4 +-- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 34 +-- drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c | 4 +-- fs/ecryptfs/keystore.c| 4 +-- fs/ecryptf
[PATCH 2/2] crypto: Remove unnecessary memzero_explicit()
Since kfree_sensitive() will do an implicit memzero_explicit(), there is no need to call memzero_explicit() before it. Eliminate those memzero_explicit() and simplify the call sites. Signed-off-by: Waiman Long --- .../crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c | 15 +++ .../crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c | 16 +++- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 10 ++ drivers/crypto/inside-secure/safexcel_hash.c | 3 +-- 4 files changed, 9 insertions(+), 35 deletions(-) diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c index aa4e8fdc2b32..46c10c7ca6d0 100644 --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm) { struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); crypto_free_sync_skcipher(op->fallback_tfm); pm_runtime_put_sync_suspend(op->ce->dev); } @@ -391,10 +388,7 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, const u8 *key, dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen); return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); op->keylen = keylen; op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) @@ -416,10 +410,7 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, const u8 *key, if (err) return err; - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + free_sensitive(op->key, op->keylen); op->keylen = keylen; op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) diff --git a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c index 5246ef4f5430..7e09a923cbaf 100644 --- a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c +++ b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c @@ -249,7 +249,6 @@ static int sun8i_ss_cipher(struct skcipher_request *areq) offset = areq->cryptlen - ivsize; if (rctx->op_dir & SS_DECRYPTION) { memcpy(areq->iv, backup_iv, ivsize); - memzero_explicit(backup_iv, ivsize); kfree_sensitive(backup_iv); } else { scatterwalk_map_and_copy(areq->iv, areq->dst, offset, @@ -367,10 +366,7 @@ void sun8i_ss_cipher_exit(struct crypto_tfm *tfm) { struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); crypto_free_sync_skcipher(op->fallback_tfm); pm_runtime_put_sync(op->ss->dev); } @@ -392,10 +388,7 @@ int sun8i_ss_aes_setkey(struct crypto_skcipher *tfm, const u8 *key, dev_dbg(ss->dev, "ERROR: Invalid keylen %u\n", keylen); return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); op->keylen = keylen; op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) @@ -418,10 +411,7 @@ int sun8i_ss_des3_setkey(struct crypto_skcipher *tfm, const u8 *key, return -EINVAL; } - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key); op->keylen = keylen; op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA); if (!op->key) diff --git a/drivers/crypto/amlogic/amlogic-gxl-cipher.c b/drivers/crypto/amlogic/amlogic-gxl-cipher.c index fd1269900d67..f424397fbba4 100644 --- a/drivers/crypto/amlogic/amlogic-gxl-cipher.c +++ b/drivers/crypto/amlogic/amlogic-gxl-cipher.c @@ -341,10 +341,7 @@ void meson_cipher_exit(struct crypto_tfm *tfm) { struct meson_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm); - if (op->key) { - memzero_explicit(op->key, op->keylen); - kfree(op->key); - } + kfree_sensitive(op->key) crypto_free_sync_skcipher(op->fallback_tfm); } @@ -368,10
[PATCH 0/2] mm, treewide: Rename kzfree() to kfree_sensitive()
This patchset makes a global rename of the kzfree() to kfree_sensitive() to highlight the fact buffer clearing is only needed if the data objects contain sensitive information like encrpytion key. The fact that kzfree() uses memset() to do the clearing isn't totally safe either as compiler may compile out the clearing in their optimizer. Instead, the new kfree_sensitive() uses memzero_explicit() which won't get compiled out. Waiman Long (2): mm, treewide: Rename kzfree() to kfree_sensitive() crypto: Remove unnecessary memzero_explicit() arch/s390/crypto/prng.c | 4 +-- arch/x86/power/hibernate.c| 2 +- crypto/adiantum.c | 2 +- crypto/ahash.c| 4 +-- crypto/api.c | 2 +- crypto/asymmetric_keys/verify_pefile.c| 4 +-- crypto/deflate.c | 2 +- crypto/drbg.c | 10 +++--- crypto/ecc.c | 8 ++--- crypto/ecdh.c | 2 +- crypto/gcm.c | 2 +- crypto/gf128mul.c | 4 +-- crypto/jitterentropy-kcapi.c | 2 +- crypto/rng.c | 2 +- crypto/rsa-pkcs1pad.c | 6 ++-- crypto/seqiv.c| 2 +- crypto/shash.c| 2 +- crypto/skcipher.c | 2 +- crypto/testmgr.c | 6 ++-- crypto/zstd.c | 2 +- .../allwinner/sun8i-ce/sun8i-ce-cipher.c | 17 +++--- .../allwinner/sun8i-ss/sun8i-ss-cipher.c | 18 +++--- drivers/crypto/amlogic/amlogic-gxl-cipher.c | 14 +++- drivers/crypto/atmel-ecc.c| 2 +- drivers/crypto/caam/caampkc.c | 28 +++ drivers/crypto/cavium/cpt/cptvf_main.c| 6 ++-- drivers/crypto/cavium/cpt/cptvf_reqmanager.c | 12 +++ drivers/crypto/cavium/nitrox/nitrox_lib.c | 4 +-- drivers/crypto/cavium/zip/zip_crypto.c| 6 ++-- drivers/crypto/ccp/ccp-crypto-rsa.c | 6 ++-- drivers/crypto/ccree/cc_aead.c| 4 +-- drivers/crypto/ccree/cc_buffer_mgr.c | 4 +-- drivers/crypto/ccree/cc_cipher.c | 6 ++-- drivers/crypto/ccree/cc_hash.c| 8 ++--- drivers/crypto/ccree/cc_request_mgr.c | 2 +- drivers/crypto/inside-secure/safexcel_hash.c | 3 +- drivers/crypto/marvell/cesa/hash.c| 2 +- .../crypto/marvell/octeontx/otx_cptvf_main.c | 6 ++-- .../marvell/octeontx/otx_cptvf_reqmgr.h | 2 +- drivers/crypto/mediatek/mtk-aes.c | 2 +- drivers/crypto/nx/nx.c| 4 +-- drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++ drivers/crypto/virtio/virtio_crypto_core.c| 2 +- drivers/md/dm-crypt.c | 34 +-- drivers/md/dm-integrity.c | 6 ++-- drivers/misc/ibmvmc.c | 6 ++-- .../hisilicon/hns3/hns3pf/hclge_mbx.c | 2 +- .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 6 ++-- drivers/net/ppp/ppp_mppe.c| 6 ++-- drivers/net/wireguard/noise.c | 4 +-- drivers/net/wireguard/peer.c | 2 +- drivers/net/wireless/intel/iwlwifi/pcie/rx.c | 2 +- .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c | 6 ++-- drivers/net/wireless/intel/iwlwifi/pcie/tx.c | 6 ++-- drivers/net/wireless/intersil/orinoco/wext.c | 4 +-- drivers/s390/crypto/ap_bus.h | 4 +-- drivers/staging/ks7010/ks_hostif.c| 2 +- drivers/staging/rtl8723bs/core/rtw_security.c | 2 +- drivers/staging/wlan-ng/p80211netdev.c| 2 +- drivers/target/iscsi/iscsi_target_auth.c | 2 +- fs/btrfs/ioctl.c | 2 +- fs/cifs/cifsencrypt.c | 2 +- fs/cifs/connect.c | 10 +++--- fs/cifs/dfs_cache.c | 2 +- fs/cifs/misc.c| 8 ++--- fs/crypto/keyring.c | 6 ++-- fs/crypto/keysetup_v1.c | 4 +-- fs/ecryptfs/keystore.c| 4 +-- fs/ecryptfs/messaging.c | 2 +- include/crypto/aead.h | 2 +- include/crypto/akcipher.h | 2 +- include/crypto/gf128mul.h | 2 +- include/crypto/hash.h | 2 +- include/crypto/internal/acompress.h | 2 +- include/crypto/kpp.h | 2 +- include/crypto/skcipher.h | 2 +- include/linux/slab.h | 2 +- lib/mpi/mpiutil.c | 6 ++-- lib/test_kasan.c
Re: [PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()
On 04/01/2019 02:38 AM, Juergen Gross wrote: > On 25/03/2019 19:03, Waiman Long wrote: >> On 03/25/2019 12:40 PM, Juergen Gross wrote: >>> On 25/03/2019 16:57, Waiman Long wrote: >>>> It was found that passing an invalid cpu number to pv_vcpu_is_preempted() >>>> might panic the kernel in a VM guest. For example, >>>> >>>> [2.531077] Oops: [#1] SMP PTI >>>> : >>>> [2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 >>>> [2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20 >>>> >>>> To guard against this kind of kernel panic, check is added to >>>> pv_vcpu_is_preempted() to make sure that no invalid cpu number will >>>> be used. >>>> >>>> Signed-off-by: Waiman Long >>>> --- >>>> arch/x86/include/asm/paravirt.h | 6 ++ >>>> 1 file changed, 6 insertions(+) >>>> >>>> diff --git a/arch/x86/include/asm/paravirt.h >>>> b/arch/x86/include/asm/paravirt.h >>>> index c25c38a05c1c..4cfb465dcde4 100644 >>>> --- a/arch/x86/include/asm/paravirt.h >>>> +++ b/arch/x86/include/asm/paravirt.h >>>> @@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu) >>>> >>>> static __always_inline bool pv_vcpu_is_preempted(long cpu) >>>> { >>>> + /* >>>> + * Guard against invalid cpu number or the kernel might panic. >>>> + */ >>>> + if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids)) >>>> + return false; >>>> + >>>>return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu); >>>> } >>> Can this really happen without being a programming error? >> This shouldn't happen without a programming error, I think. In my case, >> it was caused by a race condition leading to use-after-free of the cpu >> number. However, my point is that error like that shouldn't cause the >> kernel to panic. >> >>> Basically you'd need to guard all percpu area accesses to foreign cpus >>> this way. Why is this one special? >> It depends. If out-of-bound access can only happen with obvious >> programming error, I don't think we need to guard against them. In this >> case, I am not totally sure if the race condition that I found may >> happen with existing code or not. To be prudent, I decide to send this >> patch out. >> >> The race condition that I am looking at is as follows: >> >> CPU 0 CPU 1 >> - - >> up_write: >> owner = NULL; >> >> count = 0; >> >> >> >> rwsem_can_spin_on_owner: >> rcu_read_lock(); >> read owner; >> : >> vcpu_is_preempted(owner->cpu); >> : >> rcu_read_unlock() >> >> When I tried to merge the owner into the count (clear the owner after >> the barrier), I can reproduce the crash 100% when booting up the kernel >> in a VM guest. However, I am not sure if the configuration above is safe >> and is just very hard to reproduce. >> >> Alternatively, I can also do the cpu check before calling >> vcpu_is_preempted(). > I think I'd prefer that. > > > Juergen > It turns out that it may be caused by a software bug after all. You can ignore this patch for now. Thanks, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()
On 03/25/2019 12:40 PM, Juergen Gross wrote: > On 25/03/2019 16:57, Waiman Long wrote: >> It was found that passing an invalid cpu number to pv_vcpu_is_preempted() >> might panic the kernel in a VM guest. For example, >> >> [2.531077] Oops: [#1] SMP PTI >> : >> [2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 >> [2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20 >> >> To guard against this kind of kernel panic, check is added to >> pv_vcpu_is_preempted() to make sure that no invalid cpu number will >> be used. >> >> Signed-off-by: Waiman Long >> --- >> arch/x86/include/asm/paravirt.h | 6 ++ >> 1 file changed, 6 insertions(+) >> >> diff --git a/arch/x86/include/asm/paravirt.h >> b/arch/x86/include/asm/paravirt.h >> index c25c38a05c1c..4cfb465dcde4 100644 >> --- a/arch/x86/include/asm/paravirt.h >> +++ b/arch/x86/include/asm/paravirt.h >> @@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu) >> >> static __always_inline bool pv_vcpu_is_preempted(long cpu) >> { >> +/* >> + * Guard against invalid cpu number or the kernel might panic. >> + */ >> +if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids)) >> +return false; >> + >> return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu); >> } > Can this really happen without being a programming error? This shouldn't happen without a programming error, I think. In my case, it was caused by a race condition leading to use-after-free of the cpu number. However, my point is that error like that shouldn't cause the kernel to panic. > Basically you'd need to guard all percpu area accesses to foreign cpus > this way. Why is this one special? It depends. If out-of-bound access can only happen with obvious programming error, I don't think we need to guard against them. In this case, I am not totally sure if the race condition that I found may happen with existing code or not. To be prudent, I decide to send this patch out. The race condition that I am looking at is as follows: CPU 0 CPU 1 - - up_write: owner = NULL; count = 0; rwsem_can_spin_on_owner: rcu_read_lock(); read owner; : vcpu_is_preempted(owner->cpu); : rcu_read_unlock() When I tried to merge the owner into the count (clear the owner after the barrier), I can reproduce the crash 100% when booting up the kernel in a VM guest. However, I am not sure if the configuration above is safe and is just very hard to reproduce. Alternatively, I can also do the cpu check before calling vcpu_is_preempted(). Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()
It was found that passing an invalid cpu number to pv_vcpu_is_preempted() might panic the kernel in a VM guest. For example, [2.531077] Oops: [#1] SMP PTI : [2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011 [2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20 To guard against this kind of kernel panic, check is added to pv_vcpu_is_preempted() to make sure that no invalid cpu number will be used. Signed-off-by: Waiman Long --- arch/x86/include/asm/paravirt.h | 6 ++ 1 file changed, 6 insertions(+) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index c25c38a05c1c..4cfb465dcde4 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu) static __always_inline bool pv_vcpu_is_preempted(long cpu) { + /* +* Guard against invalid cpu number or the kernel might panic. +*/ + if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids)) + return false; + return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu); } -- 2.18.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin
On 11/01/2017 06:01 PM, Boris Ostrovsky wrote: > On 11/01/2017 04:58 PM, Waiman Long wrote: >> +/* TODO: To be removed in a future kernel version */ >> static __init int xen_parse_nopvspin(char *arg) >> { >> -xen_pvspin = false; >> +pr_warn("xen_nopvspin is deprecated, replace it with >> \"pvlock_type=queued\"!\n"); >> +if (!pv_spinlock_type) >> +pv_spinlock_type = locktype_queued; > Since we currently end up using unfair locks and because you are > deprecating xen_nopvspin I wonder whether it would be better to set this > to locktype_unfair so that current behavior doesn't change. (Sorry, I > haven't responded to your earlier message before you posted this). Juergen? I think the latest patch from Juergen in tip is to use native qspinlock when xen_nopvspin is specified. Right? That is why I made the current choice. I can certainly change to unfair if it is what you guys want. > I am also not sure I agree with making pv_spinlock an enum *and* a > bitmask at the same time. I understand that it makes checks easier but I > think not assuming a value or a pattern would be better, especially > since none of the uses is on a critical path. > > (For example, !pv_spinlock_type is the same as locktype_auto, which is > defined but never used) OK, I will take out the enum and make explicit use of locktype_auto. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin
With the new pvlock_type kernel parameter, xen_nopvspin is no longer needed. This patch deprecates the xen_nopvspin parameter by removing its documentation and treating it as an alias of "pvlock_type=queued". Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 4 arch/x86/xen/spinlock.c | 19 +++ 2 files changed, 7 insertions(+), 16 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index c98d9c7..683a817 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -4596,10 +4596,6 @@ the unplug protocol never -- do not unplug even if version check succeeds - xen_nopvspin[X86,XEN] - Disables the ticketlock slowpath using Xen PV - optimizations. - xen_nopv[X86] Disables the PV optimizations forcing the HVM guest to run as generic HVM guest with no PV drivers. diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index d5f79ac..19e2e75 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -20,7 +20,6 @@ static DEFINE_PER_CPU(int, lock_kicker_irq) = -1; static DEFINE_PER_CPU(char *, irq_name); -static bool xen_pvspin = true; #include @@ -81,12 +80,8 @@ void xen_init_lock_cpu(int cpu) int irq; char *name; - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) { - if ((cpu == 0) && !pv_spinlock_type) - static_branch_disable(_spin_lock_key); + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) return; - } WARN(per_cpu(lock_kicker_irq, cpu) >= 0, "spinlock on CPU%d exists on IRQ%d!\n", cpu, per_cpu(lock_kicker_irq, cpu)); @@ -110,8 +105,7 @@ void xen_init_lock_cpu(int cpu) void xen_uninit_lock_cpu(int cpu) { - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) return; unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL); @@ -132,8 +126,7 @@ void xen_uninit_lock_cpu(int cpu) */ void __init xen_init_spinlocks(void) { - if (!xen_pvspin || - (pv_spinlock_type & (locktype_queued|locktype_unfair))) { + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) { printk(KERN_DEBUG "xen: PV spinlocks disabled\n"); return; } @@ -147,10 +140,12 @@ void __init xen_init_spinlocks(void) pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen); } +/* TODO: To be removed in a future kernel version */ static __init int xen_parse_nopvspin(char *arg) { - xen_pvspin = false; + pr_warn("xen_nopvspin is deprecated, replace it with \"pvlock_type=queued\"!\n"); + if (!pv_spinlock_type) + pv_spinlock_type = locktype_queued; return 0; } early_param("xen_nopvspin", xen_parse_nopvspin); - -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH-tip v2 0/2] x86/paravirt: Enable users to choose PV lock type
v1->v2: - Make pv_spinlock_type a bit mask for easier checking. - Add patch 2 to deprecate xen_nopvspin v1 - https://lkml.org/lkml/2017/11/1/381 Patch 1 adds a new pvlock_type parameter for the administrators to specify the type of lock to be used in a para-virtualized kernel. Patch 2 deprecates Xen's xen_nopvspin parameter as it is no longer needed. Waiman Long (2): x86/paravirt: Add kernel parameter to choose paravirt lock type x86/xen: Deprecate xen_nopvspin Documentation/admin-guide/kernel-parameters.txt | 11 --- arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 3 ++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 17 +-- 5 files changed, 65 insertions(+), 15 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH-tip v2 1/2] x86/paravirt: Add kernel parameter to choose paravirt lock type
Currently, there are 3 different lock types that can be chosen for the x86 architecture: - qspinlock - pvqspinlock - unfair lock One of the above lock types will be chosen at boot time depending on a number of different factors. Ideally, the hypervisors should be able to pick the best performing lock type for the current VM configuration. That is not currently the case as the performance of each lock type are affected by many different factors like the number of vCPUs in the VM, the amount vCPU overcommitment, the CPU type and so on. Generally speaking, unfair lock performs well for VMs with a small number of vCPUs. Native qspinlock may perform better than pvqspinlock if there is vCPU pinning and there is no vCPU over-commitment. This patch adds a new kernel parameter to allow administrator to choose the paravirt spinlock type to be used. VM administrators can experiment with the different lock types and choose one that can best suit their need, if they want to. Hypervisor developers can also use that to experiment with different lock types so that they can come up with a better algorithm to pick the best lock type. The hypervisor paravirt spinlock code will override this new parameter in determining if pvqspinlock should be used. The parameter, however, will override Xen's xen_nopvspin in term of disabling unfair lock. Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 7 + arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 3 ++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 12 5 files changed, 65 insertions(+), 6 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f7df49d..c98d9c7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3275,6 +3275,13 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + pvlock_type=[X86,PV_OPS] + Specify the paravirt spinlock type to be used. + Options are: + queued - native queued spinlock + pv - paravirt queued spinlock + unfair - simple TATAS unfair lock + quiet [KNL] Disable most log messages r128= [HW,DRM] diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 12deec7..c8f9ad9 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu) #endif /* SMP && PARAVIRT_SPINLOCKS */ +enum pv_spinlock_type { + locktype_auto = 0, + locktype_queued = 1, + locktype_paravirt = 2, + locktype_unfair = 4, +}; + +extern enum pv_spinlock_type pv_spinlock_type; + #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" #define PV_RESTORE_REGS "popl %edx; popl %ecx;" diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..3faee63 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -646,6 +646,9 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; + if (pv_spinlock_type & (locktype_queued|locktype_unfair)) + return; + __pv_init_lock_hash(); pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 041096b..aee6756 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target, return 5; } +/* + * The kernel argument "pvlock_type=" can be used to explicitly specify + * which type of spinlocks to be used. Currently, there are 3 options: + * 1) queued - the native queued spinlock + * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS) + * 3) unfair - the simple TATAS unfair lock + * + * If this argument is not specified, the kernel will automatically choose + * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor + * specific settings. + */ +enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto; + +static int __init pvlock_setup(char *s) +{ + if (!s) + return -EINVAL; + + if (!strcmp(s, "queued")) + pv_spinlock_type = locktype_queued; + else if (!strcmp(s, "pv")) + pv_spinlock_type = locktype_paravirt; + else if (!strcmp(s, "u
Re: [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
On 11/01/2017 03:01 PM, Boris Ostrovsky wrote: > On 11/01/2017 12:28 PM, Waiman Long wrote: >> On 11/01/2017 11:51 AM, Juergen Gross wrote: >>> On 01/11/17 16:32, Waiman Long wrote: >>>> Currently, there are 3 different lock types that can be chosen for >>>> the x86 architecture: >>>> >>>> - qspinlock >>>> - pvqspinlock >>>> - unfair lock >>>> >>>> One of the above lock types will be chosen at boot time depending on >>>> a number of different factors. >>>> >>>> Ideally, the hypervisors should be able to pick the best performing >>>> lock type for the current VM configuration. That is not currently >>>> the case as the performance of each lock type are affected by many >>>> different factors like the number of vCPUs in the VM, the amount vCPU >>>> overcommitment, the CPU type and so on. >>>> >>>> Generally speaking, unfair lock performs well for VMs with a small >>>> number of vCPUs. Native qspinlock may perform better than pvqspinlock >>>> if there is vCPU pinning and there is no vCPU over-commitment. >>>> >>>> This patch adds a new kernel parameter to allow administrator to >>>> choose the paravirt spinlock type to be used. VM administrators can >>>> experiment with the different lock types and choose one that can best >>>> suit their need, if they want to. Hypervisor developers can also use >>>> that to experiment with different lock types so that they can come >>>> up with a better algorithm to pick the best lock type. >>>> >>>> The hypervisor paravirt spinlock code will override this new parameter >>>> in determining if pvqspinlock should be used. The parameter, however, >>>> will override Xen's xen_nopvspin in term of disabling unfair lock. >>> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others >>> think? >> I don't think we need xen_nopvspin, but I don't want to remove that >> without agreement from the community. > I also don't think xen_nopvspin will be needed after pvlock_type is > introduced. > > -boris Another reason that I didn't try to remove xen_nopvspin is backward compatibility concern. One way to handle it is to depreciate it and treat it as an alias to pvlock_type=queued. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
On 11/01/2017 11:51 AM, Juergen Gross wrote: > On 01/11/17 16:32, Waiman Long wrote: >> Currently, there are 3 different lock types that can be chosen for >> the x86 architecture: >> >> - qspinlock >> - pvqspinlock >> - unfair lock >> >> One of the above lock types will be chosen at boot time depending on >> a number of different factors. >> >> Ideally, the hypervisors should be able to pick the best performing >> lock type for the current VM configuration. That is not currently >> the case as the performance of each lock type are affected by many >> different factors like the number of vCPUs in the VM, the amount vCPU >> overcommitment, the CPU type and so on. >> >> Generally speaking, unfair lock performs well for VMs with a small >> number of vCPUs. Native qspinlock may perform better than pvqspinlock >> if there is vCPU pinning and there is no vCPU over-commitment. >> >> This patch adds a new kernel parameter to allow administrator to >> choose the paravirt spinlock type to be used. VM administrators can >> experiment with the different lock types and choose one that can best >> suit their need, if they want to. Hypervisor developers can also use >> that to experiment with different lock types so that they can come >> up with a better algorithm to pick the best lock type. >> >> The hypervisor paravirt spinlock code will override this new parameter >> in determining if pvqspinlock should be used. The parameter, however, >> will override Xen's xen_nopvspin in term of disabling unfair lock. > Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others > think? I don't think we need xen_nopvspin, but I don't want to remove that without agreement from the community. >> DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); >> >> void __init native_pv_lock_init(void) >> { >> -if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >> +if (pv_spinlock_type == locktype_unfair) >> +return; >> + >> +if (!static_cpu_has(X86_FEATURE_HYPERVISOR) || >> + (pv_spinlock_type != locktype_auto)) >> static_branch_disable(_spin_lock_key); > Really? I don't think locktype_paravirt should disable the static key. With paravirt spinlock, it doesn't matter if the static key is disabled or not. Without CONFIG_PARAVIRT_SPINLOCKS, however, it does degenerate into the native qspinlock. So you are right, I should check for paravirt type as well. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type
Currently, there are 3 different lock types that can be chosen for the x86 architecture: - qspinlock - pvqspinlock - unfair lock One of the above lock types will be chosen at boot time depending on a number of different factors. Ideally, the hypervisors should be able to pick the best performing lock type for the current VM configuration. That is not currently the case as the performance of each lock type are affected by many different factors like the number of vCPUs in the VM, the amount vCPU overcommitment, the CPU type and so on. Generally speaking, unfair lock performs well for VMs with a small number of vCPUs. Native qspinlock may perform better than pvqspinlock if there is vCPU pinning and there is no vCPU over-commitment. This patch adds a new kernel parameter to allow administrator to choose the paravirt spinlock type to be used. VM administrators can experiment with the different lock types and choose one that can best suit their need, if they want to. Hypervisor developers can also use that to experiment with different lock types so that they can come up with a better algorithm to pick the best lock type. The hypervisor paravirt spinlock code will override this new parameter in determining if pvqspinlock should be used. The parameter, however, will override Xen's xen_nopvspin in term of disabling unfair lock. Signed-off-by: Waiman Long <long...@redhat.com> --- Documentation/admin-guide/kernel-parameters.txt | 7 + arch/x86/include/asm/paravirt.h | 9 ++ arch/x86/kernel/kvm.c | 4 +++ arch/x86/kernel/paravirt.c | 40 - arch/x86/xen/spinlock.c | 6 ++-- 5 files changed, 62 insertions(+), 4 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index f7df49d..c98d9c7 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -3275,6 +3275,13 @@ [KNL] Number of legacy pty's. Overwrites compiled-in default number. + pvlock_type=[X86,PV_OPS] + Specify the paravirt spinlock type to be used. + Options are: + queued - native queued spinlock + pv - paravirt queued spinlock + unfair - simple TATAS unfair lock + quiet [KNL] Disable most log messages r128= [HW,DRM] diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 12deec7..941a046 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu) #endif /* SMP && PARAVIRT_SPINLOCKS */ +enum pv_spinlock_type { + locktype_auto, + locktype_queued, + locktype_paravirt, + locktype_unfair, +}; + +extern enum pv_spinlock_type pv_spinlock_type; + #ifdef CONFIG_X86_32 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;" #define PV_RESTORE_REGS "popl %edx; popl %ecx;" diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 8bb9594..3a5d3ec4 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -646,6 +646,10 @@ void __init kvm_spinlock_init(void) if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT)) return; + if ((pv_spinlock_type == locktype_queued) || + (pv_spinlock_type == locktype_unfair)) + return; + __pv_init_lock_hash(); pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath; pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c index 041096b..ca35cd3 100644 --- a/arch/x86/kernel/paravirt.c +++ b/arch/x86/kernel/paravirt.c @@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void *target, return 5; } +/* + * The kernel argument "pvlock_type=" can be used to explicitly specify + * which type of spinlocks to be used. Currently, there are 3 options: + * 1) queued - the native queued spinlock + * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS) + * 3) unfair - the simple TATAS unfair lock + * + * If this argument is not specified, the kernel will automatically choose + * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor + * specific settings. + */ +enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto; + +static int __init pvlock_setup(char *s) +{ + if (!s) + return -EINVAL; + + if (!strcmp(s, "queued")) + pv_spinlock_type = locktype_queued; + else if (!strcmp(s, "pv")) + pv_spinlock_type = locktype_paravirt; + else if (!strcmp(
Re: [PATCH v3 0/2] guard virt_spin_lock() with a static key
On 09/25/2017 09:59 AM, Juergen Gross wrote: > Ping? > > On 06/09/17 19:36, Juergen Gross wrote: >> With virt_spin_lock() being guarded by a static key the bare metal case >> can be optimized by patching the call away completely. In case a kernel >> running as a guest it can decide whether to use paravitualized >> spinlocks, the current fallback to the unfair test-and-set scheme, or >> to mimic the bare metal behavior. >> >> V3: >> - remove test for hypervisor environment from virt_spin_lock(9 as >> suggested by Waiman Long >> >> V2: >> - use static key instead of making virt_spin_lock() a pvops function >> >> Juergen Gross (2): >> paravirt/locks: use new static key for controlling call of >> virt_spin_lock() >> paravirt,xen: correct xen_nopvspin case >> >> arch/x86/include/asm/qspinlock.h | 11 ++- >> arch/x86/kernel/paravirt-spinlocks.c | 6 ++ >> arch/x86/kernel/smpboot.c| 2 ++ >> arch/x86/xen/spinlock.c | 2 ++ >> kernel/locking/qspinlock.c | 4 >> 5 files changed, 24 insertions(+), 1 deletion(-) >> Acked-by: Waiman Long <long...@redhat.com> ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()
On 09/06/2017 12:04 PM, Peter Zijlstra wrote: > On Wed, Sep 06, 2017 at 11:49:49AM -0400, Waiman Long wrote: >>> #define virt_spin_lock virt_spin_lock >>> static inline bool virt_spin_lock(struct qspinlock *lock) >>> { >>> + if (!static_branch_likely(_spin_lock_key)) >>> + return false; >>> if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> return false; >>> > Now native has two NOPs instead of one. Can't we merge these two static > branches? I guess we can remove the static_cpu_has() call. Just that any spin_lock calls before native_pv_lock_init() will use the virt_spin_lock(). That is still OK as the init call is done before SMP starts. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 2/2] paravirt,xen: correct xen_nopvspin case
On 09/06/2017 11:29 AM, Juergen Gross wrote: > With the boot parameter "xen_nopvspin" specified a Xen guest should not > make use of paravirt spinlocks, but behave as if running on bare > metal. This is not true, however, as the qspinlock code will fall back > to a test-and-set scheme when it is detecting a hypervisor. > > In order to avoid this disable the virt_spin_lock_key. > > Signed-off-by: Juergen Gross <jgr...@suse.com> > --- > arch/x86/xen/spinlock.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c > index 25a7c4302ce7..e8ab80ad7a6f 100644 > --- a/arch/x86/xen/spinlock.c > +++ b/arch/x86/xen/spinlock.c > @@ -10,6 +10,7 @@ > #include > > #include > +#include > > #include > #include > @@ -129,6 +130,7 @@ void __init xen_init_spinlocks(void) > > if (!xen_pvspin) { > printk(KERN_DEBUG "xen: PV spinlocks disabled\n"); > + static_branch_disable(_spin_lock_key); > return; > } > printk(KERN_DEBUG "xen: PV spinlocks enabled\n"); Acked-by: Waiman Long <long...@redhat.com> ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()
On 09/06/2017 11:29 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Add a static key controlling whether virt_spin_lock() should be > called or not. When running on bare metal set the new key to false. > > Signed-off-by: Juergen Gross <jgr...@suse.com> > --- > arch/x86/include/asm/qspinlock.h | 11 +++ > arch/x86/kernel/paravirt-spinlocks.c | 6 ++ > arch/x86/kernel/smpboot.c| 2 ++ > kernel/locking/qspinlock.c | 4 > 4 files changed, 23 insertions(+) > > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fc39389f196b 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -1,6 +1,7 @@ > #ifndef _ASM_X86_QSPINLOCK_H > #define _ASM_X86_QSPINLOCK_H > > +#include > #include > #include > #include > @@ -46,9 +47,15 @@ static inline void queued_spin_unlock(struct qspinlock > *lock) > #endif > > #ifdef CONFIG_PARAVIRT > +DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key); > + > +void native_pv_lock_init(void) __init; > + > #define virt_spin_lock virt_spin_lock > static inline bool virt_spin_lock(struct qspinlock *lock) > { > + if (!static_branch_likely(_spin_lock_key)) > + return false; > if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > return false; > > @@ -65,6 +72,10 @@ static inline bool virt_spin_lock(struct qspinlock *lock) > > return true; > } > +#else > +static inline void native_pv_lock_init(void) > +{ > +} > #endif /* CONFIG_PARAVIRT */ > > #include > diff --git a/arch/x86/kernel/paravirt-spinlocks.c > b/arch/x86/kernel/paravirt-spinlocks.c > index 8f2d1c9d43a8..2fc65ddea40d 100644 > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -42,3 +42,9 @@ struct pv_lock_ops pv_lock_ops = { > #endif /* SMP */ > }; > EXPORT_SYMBOL(pv_lock_ops); > + > +void __init native_pv_lock_init(void) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + static_branch_disable(_spin_lock_key); > +} > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 54b9e89d4d6b..21500d3ba359 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -77,6 +77,7 @@ > #include > #include > #include > +#include > > /* Number of siblings per CPU package */ > int smp_num_siblings = 1; > @@ -1381,6 +1382,7 @@ void __init native_smp_prepare_boot_cpu(void) > /* already set me in cpu_online_mask in boot_cpu_init() */ > cpumask_set_cpu(me, cpu_callout_mask); > cpu_set_state_online(me); > + native_pv_lock_init(); > } > > void __init native_smp_cpus_done(unsigned int max_cpus) > diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c > index 294294c71ba4..838d235b87ef 100644 > --- a/kernel/locking/qspinlock.c > +++ b/kernel/locking/qspinlock.c > @@ -76,6 +76,10 @@ > #define MAX_NODES4 > #endif > > +#ifdef CONFIG_PARAVIRT > +DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key); > +#endif > + > /* > * Per-CPU queue node structures; we can never have more than 4 nested > * contexts: task, softirq, hardirq, nmi. Acked-by: Waiman Long <long...@redhat.com> ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/06/2017 03:08 AM, Peter Zijlstra wrote: > Guys, please trim email. > > On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote: >> For clarification, I was actually asking if you consider just adding one >> more jump label to skip it for Xen/KVM instead of making >> virt_spin_lock() a pv-op. > I don't understand. What performance are you worried about. Native will > now do: "xor rax,rax; jnz some_cold_label" that's fairly trival code. It is not native that I am talking about. I am worry about VM with non-Xen/KVM hypervisor where virt_spin_lock() will actually be called. Now that function will become a callee-saved function call instead of being inlined into the native slowpath function. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:24 AM, Waiman Long wrote: > On 09/05/2017 10:18 AM, Juergen Gross wrote: >> On 05/09/17 16:10, Waiman Long wrote: >>> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>>> There are cases where a guest tries to switch spinlocks to bare metal >>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this >>>> has the downside of falling back to unfair test and set scheme for >>>> qspinlocks due to virt_spin_lock() detecting the virtualized >>>> environment. >>>> >>>> Make virt_spin_lock() a paravirt operation in order to enable users >>>> to select an explicit behavior like bare metal. >>>> >>>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>>> --- >>>> arch/x86/include/asm/paravirt.h | 5 >>>> arch/x86/include/asm/paravirt_types.h | 1 + >>>> arch/x86/include/asm/qspinlock.h | 48 >>>> --- >>>> arch/x86/kernel/paravirt-spinlocks.c | 14 ++ >>>> arch/x86/kernel/smpboot.c | 2 ++ >>>> 5 files changed, 55 insertions(+), 15 deletions(-) >>>> >>>> diff --git a/arch/x86/include/asm/paravirt.h >>>> b/arch/x86/include/asm/paravirt.h >>>> index c25dd22f7c70..d9e954fb37df 100644 >>>> --- a/arch/x86/include/asm/paravirt.h >>>> +++ b/arch/x86/include/asm/paravirt.h >>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long >>>> cpu) >>>>return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); >>>> } >>>> >>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) >>>> +{ >>>> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); >>>> +} >>>> + >>>> #endif /* SMP && PARAVIRT_SPINLOCKS */ >>>> >>>> #ifdef CONFIG_X86_32 >>>> diff --git a/arch/x86/include/asm/paravirt_types.h >>>> b/arch/x86/include/asm/paravirt_types.h >>>> index 19efefc0e27e..928f5e7953a7 100644 >>>> --- a/arch/x86/include/asm/paravirt_types.h >>>> +++ b/arch/x86/include/asm/paravirt_types.h >>>> @@ -319,6 +319,7 @@ struct pv_lock_ops { >>>>void (*kick)(int cpu); >>>> >>>>struct paravirt_callee_save vcpu_is_preempted; >>>> + struct paravirt_callee_save virt_spin_lock; >>>> } __no_randomize_layout; >>>> >>>> /* This contains all the paravirt structures: we get a convenient >>>> diff --git a/arch/x86/include/asm/qspinlock.h >>>> b/arch/x86/include/asm/qspinlock.h >>>> index 48a706f641f2..fbd98896385c 100644 >>>> --- a/arch/x86/include/asm/qspinlock.h >>>> +++ b/arch/x86/include/asm/qspinlock.h >>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct >>>> qspinlock *lock) >>>>smp_store_release((u8 *)lock, 0); >>>> } >>>> >>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>>> +{ >>>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>>> + return false; >>>> + >>>> + /* >>>> + * On hypervisors without PARAVIRT_SPINLOCKS support we fall >>>> + * back to a Test-and-Set spinlock, because fair locks have >>>> + * horrible lock 'holder' preemption issues. >>>> + */ >>>> + >>>> + do { >>>> + while (atomic_read(>val) != 0) >>>> + cpu_relax(); >>>> + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); >>>> + >>>> + return true; >>>> +} >>>> + >>>> #ifdef CONFIG_PARAVIRT_SPINLOCKS >>>> extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 >>>> val); >>>> extern void __pv_init_lock_hash(void); >>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) >>>> { >>>>return pv_vcpu_is_preempted(cpu); >>>> } >>>> + >>>> +void native_pv_lock_init(void) __init; >>>> #else >>>> static inline void queued_spin_unlock(struct qspinlock *lock) >>>> { >>>>native_queued_spin_unlock(lock); >>>> } >>>> + >>>> +static inline void native_pv_lock_init(void) >>>> +{ >>>> +} >>>> #endif >>>> >>>> #ifdef CONFIG_PARAVIRT >>>> #define virt_spin_lock virt_spin_lock >>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS >>>> static inline bool virt_spin_lock(struct qspinlock *lock) >>>> { >>>> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>>> - return false; >>> Have you consider just add one more jump label here to skip >>> virt_spin_lock when KVM or Xen want to do so? >> Why? Did you look at patch 4? This is the way to do it... > I asked this because of my performance concern as stated later in the email. For clarification, I was actually asking if you consider just adding one more jump label to skip it for Xen/KVM instead of making virt_spin_lock() a pv-op. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:18 AM, Juergen Gross wrote: > On 05/09/17 16:10, Waiman Long wrote: >> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>> There are cases where a guest tries to switch spinlocks to bare metal >>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this >>> has the downside of falling back to unfair test and set scheme for >>> qspinlocks due to virt_spin_lock() detecting the virtualized >>> environment. >>> >>> Make virt_spin_lock() a paravirt operation in order to enable users >>> to select an explicit behavior like bare metal. >>> >>> Signed-off-by: Juergen Gross <jgr...@suse.com> >>> --- >>> arch/x86/include/asm/paravirt.h | 5 >>> arch/x86/include/asm/paravirt_types.h | 1 + >>> arch/x86/include/asm/qspinlock.h | 48 >>> --- >>> arch/x86/kernel/paravirt-spinlocks.c | 14 ++ >>> arch/x86/kernel/smpboot.c | 2 ++ >>> 5 files changed, 55 insertions(+), 15 deletions(-) >>> >>> diff --git a/arch/x86/include/asm/paravirt.h >>> b/arch/x86/include/asm/paravirt.h >>> index c25dd22f7c70..d9e954fb37df 100644 >>> --- a/arch/x86/include/asm/paravirt.h >>> +++ b/arch/x86/include/asm/paravirt.h >>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long >>> cpu) >>> return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); >>> } >>> >>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); >>> +} >>> + >>> #endif /* SMP && PARAVIRT_SPINLOCKS */ >>> >>> #ifdef CONFIG_X86_32 >>> diff --git a/arch/x86/include/asm/paravirt_types.h >>> b/arch/x86/include/asm/paravirt_types.h >>> index 19efefc0e27e..928f5e7953a7 100644 >>> --- a/arch/x86/include/asm/paravirt_types.h >>> +++ b/arch/x86/include/asm/paravirt_types.h >>> @@ -319,6 +319,7 @@ struct pv_lock_ops { >>> void (*kick)(int cpu); >>> >>> struct paravirt_callee_save vcpu_is_preempted; >>> + struct paravirt_callee_save virt_spin_lock; >>> } __no_randomize_layout; >>> >>> /* This contains all the paravirt structures: we get a convenient >>> diff --git a/arch/x86/include/asm/qspinlock.h >>> b/arch/x86/include/asm/qspinlock.h >>> index 48a706f641f2..fbd98896385c 100644 >>> --- a/arch/x86/include/asm/qspinlock.h >>> +++ b/arch/x86/include/asm/qspinlock.h >>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct >>> qspinlock *lock) >>> smp_store_release((u8 *)lock, 0); >>> } >>> >>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> + return false; >>> + >>> + /* >>> +* On hypervisors without PARAVIRT_SPINLOCKS support we fall >>> +* back to a Test-and-Set spinlock, because fair locks have >>> +* horrible lock 'holder' preemption issues. >>> +*/ >>> + >>> + do { >>> + while (atomic_read(>val) != 0) >>> + cpu_relax(); >>> + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); >>> + >>> + return true; >>> +} >>> + >>> #ifdef CONFIG_PARAVIRT_SPINLOCKS >>> extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 >>> val); >>> extern void __pv_init_lock_hash(void); >>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) >>> { >>> return pv_vcpu_is_preempted(cpu); >>> } >>> + >>> +void native_pv_lock_init(void) __init; >>> #else >>> static inline void queued_spin_unlock(struct qspinlock *lock) >>> { >>> native_queued_spin_unlock(lock); >>> } >>> + >>> +static inline void native_pv_lock_init(void) >>> +{ >>> +} >>> #endif >>> >>> #ifdef CONFIG_PARAVIRT >>> #define virt_spin_lock virt_spin_lock >>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS >>> static inline bool virt_spin_lock(struct qspinlock *lock) >>> { >>> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> - return false; >> Have you consider just add one more j
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 10:08 AM, Peter Zijlstra wrote: > On Tue, Sep 05, 2017 at 10:02:57AM -0400, Waiman Long wrote: >> On 09/05/2017 09:24 AM, Juergen Gross wrote: >>> +static inline bool native_virt_spin_lock(struct qspinlock *lock) >>> +{ >>> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) >>> + return false; >>> + >> I think you can take the above if statement out as you has done test in >> native_pv_lock_init(). So the test will also be false here. > That does mean we'll run a test-and-set spinlock until paravirt patching > happens though. I prefer to not do that. > > One important point.. we must not be holding any locks when we switch > over between the two locks. Back then I spend some time making sure that > didn't happen with the X86 feature flag muck. AFAICT, native_pv_lock_init() is called before SMP init. So it shouldn't matter. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 09:24 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Make virt_spin_lock() a paravirt operation in order to enable users > to select an explicit behavior like bare metal. > > Signed-off-by: Juergen Gross> --- > arch/x86/include/asm/paravirt.h | 5 > arch/x86/include/asm/paravirt_types.h | 1 + > arch/x86/include/asm/qspinlock.h | 48 > --- > arch/x86/kernel/paravirt-spinlocks.c | 14 ++ > arch/x86/kernel/smpboot.c | 2 ++ > 5 files changed, 55 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index c25dd22f7c70..d9e954fb37df 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long > cpu) > return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); > } > > +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) > +{ > + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); > +} > + > #endif /* SMP && PARAVIRT_SPINLOCKS */ > > #ifdef CONFIG_X86_32 > diff --git a/arch/x86/include/asm/paravirt_types.h > b/arch/x86/include/asm/paravirt_types.h > index 19efefc0e27e..928f5e7953a7 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -319,6 +319,7 @@ struct pv_lock_ops { > void (*kick)(int cpu); > > struct paravirt_callee_save vcpu_is_preempted; > + struct paravirt_callee_save virt_spin_lock; > } __no_randomize_layout; > > /* This contains all the paravirt structures: we get a convenient > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fbd98896385c 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct > qspinlock *lock) > smp_store_release((u8 *)lock, 0); > } > > +static inline bool native_virt_spin_lock(struct qspinlock *lock) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + return false; > + > + /* > + * On hypervisors without PARAVIRT_SPINLOCKS support we fall > + * back to a Test-and-Set spinlock, because fair locks have > + * horrible lock 'holder' preemption issues. > + */ > + > + do { > + while (atomic_read(>val) != 0) > + cpu_relax(); > + } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); > + > + return true; > +} > + > #ifdef CONFIG_PARAVIRT_SPINLOCKS > extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 > val); > extern void __pv_init_lock_hash(void); > @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu) > { > return pv_vcpu_is_preempted(cpu); > } > + > +void native_pv_lock_init(void) __init; > #else > static inline void queued_spin_unlock(struct qspinlock *lock) > { > native_queued_spin_unlock(lock); > } > + > +static inline void native_pv_lock_init(void) > +{ > +} > #endif > > #ifdef CONFIG_PARAVIRT > #define virt_spin_lock virt_spin_lock > +#ifdef CONFIG_PARAVIRT_SPINLOCKS > static inline bool virt_spin_lock(struct qspinlock *lock) > { > - if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > - return false; Have you consider just add one more jump label here to skip virt_spin_lock when KVM or Xen want to do so? > - > - /* > - * On hypervisors without PARAVIRT_SPINLOCKS support we fall > - * back to a Test-and-Set spinlock, because fair locks have > - * horrible lock 'holder' preemption issues. > - */ > - > - do { > - while (atomic_read(>val) != 0) > - cpu_relax(); > - } while (atomic_cmpxchg(>val, 0, _Q_LOCKED_VAL) != 0); > - > - return true; > + return pv_virt_spin_lock(lock); > +} > +#else > +static inline bool virt_spin_lock(struct qspinlock *lock) > +{ > + return native_virt_spin_lock(lock); > } > +#endif /* CONFIG_PARAVIRT_SPINLOCKS */ > #endif /* CONFIG_PARAVIRT */ > > #include > diff --git a/arch/x86/kernel/paravirt-spinlocks.c > b/arch/x86/kernel/paravirt-spinlocks.c > index 26e4bd92f309..1be187ef8a38 100644 > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -20,6 +20,12 @@ bool pv_is_native_spin_unlock(void) > __raw_callee_save___native_queued_spin_unlock; > } > > +__visible bool __native_virt_spin_lock(struct qspinlock *lock) > +{ > + return native_virt_spin_lock(lock); > +} > +PV_CALLEE_SAVE_REGS_THUNK(__native_virt_spin_lock); I have some
Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function
On 09/05/2017 09:24 AM, Juergen Gross wrote: > There are cases where a guest tries to switch spinlocks to bare metal > behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this > has the downside of falling back to unfair test and set scheme for > qspinlocks due to virt_spin_lock() detecting the virtualized > environment. > > Make virt_spin_lock() a paravirt operation in order to enable users > to select an explicit behavior like bare metal. > > Signed-off-by: Juergen Gross> --- > arch/x86/include/asm/paravirt.h | 5 > arch/x86/include/asm/paravirt_types.h | 1 + > arch/x86/include/asm/qspinlock.h | 48 > --- > arch/x86/kernel/paravirt-spinlocks.c | 14 ++ > arch/x86/kernel/smpboot.c | 2 ++ > 5 files changed, 55 insertions(+), 15 deletions(-) > > diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h > index c25dd22f7c70..d9e954fb37df 100644 > --- a/arch/x86/include/asm/paravirt.h > +++ b/arch/x86/include/asm/paravirt.h > @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long > cpu) > return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); > } > > +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock) > +{ > + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock); > +} > + > #endif /* SMP && PARAVIRT_SPINLOCKS */ > > #ifdef CONFIG_X86_32 > diff --git a/arch/x86/include/asm/paravirt_types.h > b/arch/x86/include/asm/paravirt_types.h > index 19efefc0e27e..928f5e7953a7 100644 > --- a/arch/x86/include/asm/paravirt_types.h > +++ b/arch/x86/include/asm/paravirt_types.h > @@ -319,6 +319,7 @@ struct pv_lock_ops { > void (*kick)(int cpu); > > struct paravirt_callee_save vcpu_is_preempted; > + struct paravirt_callee_save virt_spin_lock; > } __no_randomize_layout; > > /* This contains all the paravirt structures: we get a convenient > diff --git a/arch/x86/include/asm/qspinlock.h > b/arch/x86/include/asm/qspinlock.h > index 48a706f641f2..fbd98896385c 100644 > --- a/arch/x86/include/asm/qspinlock.h > +++ b/arch/x86/include/asm/qspinlock.h > @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct > qspinlock *lock) > smp_store_release((u8 *)lock, 0); > } > > +static inline bool native_virt_spin_lock(struct qspinlock *lock) > +{ > + if (!static_cpu_has(X86_FEATURE_HYPERVISOR)) > + return false; > + I think you can take the above if statement out as you has done test in native_pv_lock_init(). So the test will also be false here. As this patch series is x86 specific, you should probably add "x86/" in front of paravirt in the patch titles. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v5 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/asm-offsets_64.c | 9 + arch/x86/kernel/kvm.c| 24 2 files changed, 33 insertions(+) diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index 210927e..99332f5 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -13,6 +13,10 @@ #include }; +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) +#include +#endif + int main(void) { #ifdef CONFIG_PARAVIRT @@ -22,6 +26,11 @@ int main(void) BLANK(); #endif +#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS) + OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted); + BLANK(); +#endif + #define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry) ENTRY(bx); ENTRY(cx); diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..14f65a5 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,6 +598,29 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +#include + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. + */ +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v5 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v4->v5: - As suggested by PeterZ, use the asm-offsets header file generation mechanism to get the offset of the preempted field in kvm_steal_time instead of hardcoding it. v3->v4: - Fix x86-32 build error. v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/asm-offsets_64.c | 9 + arch/x86/kernel/kvm.c| 26 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 5 files changed, 37 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v5 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 1eea6ca..f75fbfe 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
On 02/16/2017 11:09 AM, Peter Zijlstra wrote: > On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote: >> The cpu argument in the function prototype of vcpu_is_preempted() >> is changed from int to long. That makes it easier to provide a better >> optimized assembly version of that function. >> >> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the >> downcast from long to int is not a problem as vCPU number won't exceed >> 32 bits. >> > Note that because of the cast in PVOP_CALL_ARG1() this patch is > pointless. > > Then again, it doesn't seem to affect code generation, so why not. Takes > away the reliance on that weird cast. I add this patch because I am a bit uneasy about clearing the upper 32 bits of rdi and assuming that the compiler won't have a previous use of those bits. It gives me peace of mind. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
On 02/16/2017 11:48 AM, Peter Zijlstra wrote: > On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote: >> +/* >> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and >> + * restoring to/from the stack. It is assumed that the preempted value >> + * is at an offset of 16 from the beginning of the kvm_steal_time structure >> + * which is verified by the BUILD_BUG_ON() macro below. >> + */ >> +#define PREEMPTED_OFFSET16 > As per Andrew's suggestion, the 'right' way is something like so. Thanks for the tip. I was not aware of the asm-offsets stuff. I will update the patch to use it. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/kvm.c | 30 ++ 1 file changed, 30 insertions(+) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..e423435 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,11 +598,40 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. It is assumed that the preempted value + * is at an offset of 16 from the beginning of the kvm_steal_time structure + * which is verified by the BUILD_BUG_ON() macro below. + */ +#define PREEMPTED_OFFSET 16 +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ void __init kvm_spinlock_init(void) { +#ifdef CONFIG_X86_64 + BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted) + != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1)); +#endif + if (!kvm_para_available()) return; /* Does host kernel support KVM_FEATURE_PV_UNHALT? */ -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 1eea6ca..f75fbfe 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v4 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v3->v4: - Fix x86-32 build error. v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 32 +++- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 34 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long
The cpu argument in the function prototype of vcpu_is_preempted() is changed from int to long. That makes it easier to provide a better optimized assembly version of that function. For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the downcast from long to int is not a problem as vCPU number won't exceed 32 bits. Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 2 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2a46f73 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -674,7 +674,7 @@ static __always_inline void pv_kick(int cpu) PVOP_VCALL1(pv_lock_ops.kick, cpu); } -static __always_inline bool pv_vcpu_is_preempted(int cpu) +static __always_inline bool pv_vcpu_is_preempted(long cpu) { return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h index c343ab5..48a706f 100644 --- a/arch/x86/include/asm/qspinlock.h +++ b/arch/x86/include/asm/qspinlock.h @@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock) } #define vcpu_is_preempted vcpu_is_preempted -static inline bool vcpu_is_preempted(int cpu) +static inline bool vcpu_is_preempted(long cpu) { return pv_vcpu_is_preempted(cpu); } diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..85ed343 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } -__visible bool __kvm_vcpu_is_preempted(int cpu) +__visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..8f2d1c9 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void) __raw_callee_save___native_queued_spin_unlock; } -__visible bool __native_vcpu_is_preempted(int cpu) +__visible bool __native_vcpu_is_preempted(long cpu) { return false; } -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead
v2->v3: - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted() in assembly as suggested by PeterZ. - Add a new patch to change vcpu_is_preempted() argument type to long to ease the writing of the assembly code. v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. As it was found that the overhead of callee-save vcpu_is_preempted() can have some impact on system performance on a VM guest, especially of x86-64 guest, this patch set intends to reduce this performance overhead by replacing the C __kvm_vcpu_is_preempted() function by an optimized version of __raw_callee_save___kvm_vcpu_is_preempted() written in assembly. Waiman Long (2): x86/paravirt: Change vcp_is_preempted() arg type to long x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64 arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/qspinlock.h | 2 +- arch/x86/kernel/kvm.c| 30 +- arch/x86/kernel/paravirt-spinlocks.c | 2 +- 4 files changed, 32 insertions(+), 4 deletions(-) -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v3 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64
It was found when running fio sequential write test with a XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. To reduce this performance overhead, an optimized assembly version of the the __raw_callee_save___kvm_vcpu_is_preempt() function is provided for x86-64. With this patch applied on a KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: I/O Type w/o patchwith patch --- random read 8141.2 MB/s 8497.1 MB/s seq read 8229.4 MB/s 8304.2 MB/s random write 1675.5 MB/s 1701.5 MB/s seq write 1681.3 MB/s 1699.9 MB/s There are some increases in the aggregated bandwidth because of the patch. The perf data now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted The assembly code was verified by using a test kernel module to compare the output of C __kvm_vcpu_is_preempted() and that of assembly __raw_callee_save___kvm_vcpu_is_preempt() to verify that they matched. Suggested-by: Peter Zijlstra <pet...@infradead.org> Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/kernel/kvm.c | 28 1 file changed, 28 insertions(+) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 85ed343..83e22c1 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) local_irq_restore(flags); } +#ifdef CONFIG_X86_32 __visible bool __kvm_vcpu_is_preempted(long cpu) { struct kvm_steal_time *src = _cpu(steal_time, cpu); @@ -597,11 +598,38 @@ __visible bool __kvm_vcpu_is_preempted(long cpu) } PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); +#else + +extern bool __raw_callee_save___kvm_vcpu_is_preempted(long); + +/* + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and + * restoring to/from the stack. It is assumed that the preempted value + * is at an offset of 16 from the beginning of the kvm_steal_time structure + * which is verified by the BUILD_BUG_ON() macro below. + */ +#define PREEMPTED_OFFSET 16 +asm( +".pushsection .text;" +".global __raw_callee_save___kvm_vcpu_is_preempted;" +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" +"__raw_callee_save___kvm_vcpu_is_preempted:" +"movq __per_cpu_offset(,%rdi,8), %rax;" +"cmpb $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);" +"setne %al;" +"ret;" +".popsection"); + +#endif + /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. */ void __init kvm_spinlock_init(void) { + BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted) + != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1)); + if (!kvm_para_available()) return; /* Does host kernel support KVM_FEATURE_PV_UNHALT? */ -- 1.8.3.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/14/2017 04:39 AM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote: >> It is the address of _time that will exceed the 32-bit limit. > That seems extremely unlikely. That would mean we have more than 4G > worth of per-cpu variables declared in the kernel. I have some doubt about if the compiler is able to properly use RIP-relative addressing for this. Anyway, it seems like constraints aren't allowed for asm() when not in the function context, at least for the the compiler that I am using (4.8.5). So it is a moot point. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 04:52 PM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote: >> On 02/13/2017 02:42 PM, Waiman Long wrote: >>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote: >>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>>>> That way we'd end up with something like: >>>>> >>>>> asm(" >>>>> push %rdi; >>>>> movslq %edi, %rdi; >>>>> movq __per_cpu_offset(,%rdi,8), %rax; >>>>> cmpb $0, %[offset](%rax); >>>>> setne %al; >>>>> pop %rdi; >>>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >>>>> steal_time, preempted))); >>>>> >>>>> And if we could get rid of the sign extend on edi we could avoid all the >>>>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>>>> this asm foo isn't my strongest point). >>>> Maybe: >>>> >>>> movsql %edi, %rax; >>>> movq __per_cpu_offset(,%rax,8), %rax; >>>> cmpb $0, %[offset](%rax); >>>> setne %al; >>>> >>>> ? >>> Yes, that looks good to me. >>> >>> Cheers, >>> Longman >>> >> Sorry, I am going to take it back. The displacement or offset can only >> be up to 32-bit. So we will still need to use at least one more >> register, I think. > I don't think that would be a problem, I very much doubt we declare more > than 4G worth of per-cpu variables in the kernel. > > In any case, use "e" or "Z" as constraint (I never quite know when to > use which). That are s32 and u32 displacement immediates resp. and > should fail compile with a semi-sensible failure if the displacement is > too big. > It is the address of _time that will exceed the 32-bit limit. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 03:06 PM, h...@zytor.com wrote: > On February 13, 2017 2:53:43 AM PST, Peter Zijlstra> wrote: >> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>> That way we'd end up with something like: >>> >>> asm(" >>> push %rdi; >>> movslq %edi, %rdi; >>> movq __per_cpu_offset(,%rdi,8), %rax; >>> cmpb $0, %[offset](%rax); >>> setne %al; >>> pop %rdi; >>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >> steal_time, preempted))); >>> And if we could get rid of the sign extend on edi we could avoid all >> the >>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>> this asm foo isn't my strongest point). >> Maybe: >> >> movsql %edi, %rax; >> movq __per_cpu_offset(,%rax,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> >> ? > We could kill the zero or sign extend by changing the calling interface to > pass an unsigned long instead of an int. It is much more likely that a zero > extend is free for the caller than a sign extend. I have thought of that too. However, the goal is to eliminate memory read/write from/to stack. Eliminating a register sign-extend instruction won't help much in term of performance. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 02:42 PM, Waiman Long wrote: > On 02/13/2017 05:53 AM, Peter Zijlstra wrote: >> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >>> That way we'd end up with something like: >>> >>> asm(" >>> push %rdi; >>> movslq %edi, %rdi; >>> movq __per_cpu_offset(,%rdi,8), %rax; >>> cmpb $0, %[offset](%rax); >>> setne %al; >>> pop %rdi; >>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >>> steal_time, preempted))); >>> >>> And if we could get rid of the sign extend on edi we could avoid all the >>> push-pop nonsense, but I'm not sure I see how to do that (then again, >>> this asm foo isn't my strongest point). >> Maybe: >> >> movsql %edi, %rax; >> movq __per_cpu_offset(,%rax,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> >> ? > Yes, that looks good to me. > > Cheers, > Longman > Sorry, I am going to take it back. The displacement or offset can only be up to 32-bit. So we will still need to use at least one more register, I think. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 05:53 AM, Peter Zijlstra wrote: > On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote: >> That way we'd end up with something like: >> >> asm(" >> push %rdi; >> movslq %edi, %rdi; >> movq __per_cpu_offset(,%rdi,8), %rax; >> cmpb $0, %[offset](%rax); >> setne %al; >> pop %rdi; >> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct >> steal_time, preempted))); >> >> And if we could get rid of the sign extend on edi we could avoid all the >> push-pop nonsense, but I'm not sure I see how to do that (then again, >> this asm foo isn't my strongest point). > Maybe: > > movsql %edi, %rax; > movq __per_cpu_offset(,%rax,8), %rax; > cmpb $0, %[offset](%rax); > setne %al; > > ? Yes, that looks good to me. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/13/2017 05:47 AM, Peter Zijlstra wrote: > On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote: > >>>> +asm( >>>> +".pushsection .text;" >>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;" >>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" >>>> +"__raw_callee_save___kvm_vcpu_is_preempted:" >>>> +FRAME_BEGIN >>>> +"push %rdi;" >>>> +"push %rdx;" >>>> +"movslq %edi, %rdi;" >>>> +"movq$steal_time+16, %rax;" >>>> +"movq__per_cpu_offset(,%rdi,8), %rdx;" >>>> +"cmpb$0, (%rdx,%rax);" > Could we not put the $steal_time+16 displacement as an immediate in the > cmpb and save a whole register here? > > That way we'd end up with something like: > > asm(" > push %rdi; > movslq %edi, %rdi; > movq __per_cpu_offset(,%rdi,8), %rax; > cmpb $0, %[offset](%rax); > setne %al; > pop %rdi; > " : : [offset] "i" (((unsigned long)_time) + offsetof(struct > steal_time, preempted))); > > And if we could get rid of the sign extend on edi we could avoid all the > push-pop nonsense, but I'm not sure I see how to do that (then again, > this asm foo isn't my strongest point). Yes, I think that can work. I will try to ran this patch to see how thing goes. >>>> +"setne %al;" >>>> +"pop %rdx;" >>>> +"pop %rdi;" >>>> +FRAME_END >>>> +"ret;" >>>> +".popsection"); >>>> + >>>> +#endif >>>> + >>>> /* >>>> * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. >>>> */ >>> That should work for now. I have done something similar for >>> __pv_queued_spin_unlock. However, this has the problem of creating a >>> dependency on the exact layout of the steal_time structure. Maybe the >>> constant 16 can be passed in as a parameter offsetof(struct >>> kvm_steal_time, preempted) to the asm call. > Yeah it should be well possible to pass that in. But ideally we'd have > GCC grow something like __attribute__((callee_saved)) or somesuch and it > would do all this for us. That will be really nice too. I am not too fond of working in assembly. >> One more thing, that will improve KVM performance, but it won't help Xen. > People still use Xen? ;-) In any case, their implementation looks very > similar and could easily crib this. In Red Hat, my focus will be on KVM performance. I do believe that there are still Xen users out there. So we still need to keep their interest into consideration. Given that, I am OK to make it work better in KVM first and then think about Xen later. >> I looked into the assembly code for rwsem_spin_on_owner, It need to save >> and restore 2 additional registers with my patch. Doing it your way, >> will transfer the save and restore overhead to the assembly code. >> However, __kvm_vcpu_is_preempted() is called multiple times per >> invocation of rwsem_spin_on_owner. That function is simple enough that >> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler >> optimization opportunity. > This is because of that noinline, right? Otherwise it would've been > folded and register pressure would be much higher. Yes, I guess so. The noinline is there so that we know what the CPU time is for spinning rather than other activities within the slowpath. > >> The outer function rwsem_down_write_failed() >> does appear to be a bit bigger (from 866 bytes to 884 bytes) though. > I suspect GCC is being clever and since all this is static it plays > games with the calling convention and pushes these clobbers out. > > Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:35 AM, Waiman Long wrote: > On 02/10/2017 11:19 AM, Peter Zijlstra wrote: >> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote: >>> It was found when running fio sequential write test with a XFS ramdisk >>> on a VM running on a 2-socket x86-64 system, the %CPU times as reported >>> by perf were as follows: >>> >>> 69.75% 0.59% fio [k] down_write >>> 69.15% 0.01% fio [k] call_rwsem_down_write_failed >>> 67.12% 1.12% fio [k] rwsem_down_write_failed >>> 63.48% 52.77% fio [k] osq_lock >>> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt >>> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted >>> >> Thinking about this again, wouldn't something like the below also work? >> >> >> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c >> index 099fcba4981d..6aa33702c15c 100644 >> --- a/arch/x86/kernel/kvm.c >> +++ b/arch/x86/kernel/kvm.c >> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) >> local_irq_restore(flags); >> } >> >> +#ifdef CONFIG_X86_32 >> __visible bool __kvm_vcpu_is_preempted(int cpu) >> { >> struct kvm_steal_time *src = _cpu(steal_time, cpu); >> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) >> } >> PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); >> >> +#else >> + >> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int); >> + >> +asm( >> +".pushsection .text;" >> +".global __raw_callee_save___kvm_vcpu_is_preempted;" >> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" >> +"__raw_callee_save___kvm_vcpu_is_preempted:" >> +FRAME_BEGIN >> +"push %rdi;" >> +"push %rdx;" >> +"movslq %edi, %rdi;" >> +"movq$steal_time+16, %rax;" >> +"movq__per_cpu_offset(,%rdi,8), %rdx;" >> +"cmpb$0, (%rdx,%rax);" >> +"setne %al;" >> +"pop %rdx;" >> +"pop %rdi;" >> +FRAME_END >> +"ret;" >> +".popsection"); >> + >> +#endif >> + >> /* >> * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. >> */ > That should work for now. I have done something similar for > __pv_queued_spin_unlock. However, this has the problem of creating a > dependency on the exact layout of the steal_time structure. Maybe the > constant 16 can be passed in as a parameter offsetof(struct > kvm_steal_time, preempted) to the asm call. > > Cheers, > Longman One more thing, that will improve KVM performance, but it won't help Xen. I looked into the assembly code for rwsem_spin_on_owner, It need to save and restore 2 additional registers with my patch. Doing it your way, will transfer the save and restore overhead to the assembly code. However, __kvm_vcpu_is_preempted() is called multiple times per invocation of rwsem_spin_on_owner. That function is simple enough that making __kvm_vcpu_is_preempted() callee-save won't produce much compiler optimization opportunity. The outer function rwsem_down_write_failed() does appear to be a bit bigger (from 866 bytes to 884 bytes) though. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/10/2017 11:19 AM, Peter Zijlstra wrote: > On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote: >> It was found when running fio sequential write test with a XFS ramdisk >> on a VM running on a 2-socket x86-64 system, the %CPU times as reported >> by perf were as follows: >> >> 69.75% 0.59% fio [k] down_write >> 69.15% 0.01% fio [k] call_rwsem_down_write_failed >> 67.12% 1.12% fio [k] rwsem_down_write_failed >> 63.48% 52.77% fio [k] osq_lock >> 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt >> 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted >> > Thinking about this again, wouldn't something like the below also work? > > > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c > index 099fcba4981d..6aa33702c15c 100644 > --- a/arch/x86/kernel/kvm.c > +++ b/arch/x86/kernel/kvm.c > @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val) > local_irq_restore(flags); > } > > +#ifdef CONFIG_X86_32 > __visible bool __kvm_vcpu_is_preempted(int cpu) > { > struct kvm_steal_time *src = _cpu(steal_time, cpu); > @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) > } > PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); > > +#else > + > +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int); > + > +asm( > +".pushsection .text;" > +".global __raw_callee_save___kvm_vcpu_is_preempted;" > +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;" > +"__raw_callee_save___kvm_vcpu_is_preempted:" > +FRAME_BEGIN > +"push %rdi;" > +"push %rdx;" > +"movslq %edi, %rdi;" > +"movq$steal_time+16, %rax;" > +"movq__per_cpu_offset(,%rdi,8), %rdx;" > +"cmpb$0, (%rdx,%rax);" > +"setne %al;" > +"pop %rdx;" > +"pop %rdi;" > +FRAME_END > +"ret;" > +".popsection"); > + > +#endif > + > /* > * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. > */ That should work for now. I have done something similar for __pv_queued_spin_unlock. However, this has the problem of creating a dependency on the exact layout of the steal_time structure. Maybe the constant 16 can be passed in as a parameter offsetof(struct kvm_steal_time, preempted) to the asm call. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
It was found when running fio sequential write test with a XFS ramdisk on a VM running on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 69.75% 0.59% fio [k] down_write 69.15% 0.01% fio [k] call_rwsem_down_write_failed 67.12% 1.12% fio [k] rwsem_down_write_failed 63.48% 52.77% fio [k] osq_lock 9.46% 7.88% fio [k] __raw_callee_save___kvm_vcpu_is_preempt 3.93% 3.93% fio [k] __kvm_vcpu_is_preempted Making vcpu_is_preempted() a callee-save function has a relatively high cost on x86-64 primarily due to at least one more cacheline of data access from the saving and restoring of registers (8 of them) to and from stack as well as one more level of function call. As vcpu_is_preempted() is called within the spinlock, mutex and rwsem slowpaths, there isn't much to gain by making it callee-save. So it is now changed to a normal function call instead. With this patch applied on both bare-metal & KVM guest on a 2-socekt 16-core 32-thread system with 16 parallel jobs (8 on each socket), the aggregrate bandwidth of the fio test on an XFS ramdisk were as follows: Bare MetalKVM Guest I/O Type w/o patchwith patch w/o patchwith patch --- --- random read 8650.5 MB/s 8560.9 MB/s 7602.9 MB/s 8196.1 MB/s seq read 9104.8 MB/s 9397.2 MB/s 8293.7 MB/s 8566.9 MB/s random write 1623.8 MB/s 1626.7 MB/s 1590.6 MB/s 1700.7 MB/s seq write 1626.4 MB/s 1624.9 MB/s 1604.8 MB/s 1726.3 MB/s The perf data (on KVM guest) now became: 70.78% 0.58% fio [k] down_write 70.20% 0.01% fio [k] call_rwsem_down_write_failed 69.70% 1.17% fio [k] rwsem_down_write_failed 59.91% 55.42% fio [k] osq_lock 10.14% 10.14% fio [k] __kvm_vcpu_is_preempted On bare metal, the patch doesn't introduce any performance regression. On KVM guest, it produces noticeable performance improvement (up to 7%). Signed-off-by: Waiman Long <long...@redhat.com> --- v1->v2: - Rerun the fio test on a different system on both bare-metal and a KVM guest. Both sockets were utilized in this test. - The commit log was updated with new performance numbers, but the patch wasn't changed. - Drop patch 2. arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/paravirt_types.h | 2 +- arch/x86/kernel/kvm.c | 7 ++- arch/x86/kernel/paravirt-spinlocks.c | 6 ++ arch/x86/xen/spinlock.c | 4 +--- 5 files changed, 7 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2515885 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu) static __always_inline bool pv_vcpu_is_preempted(int cpu) { - return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); + return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } #endif /* SMP && PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index bb2de45..88dc852 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - struct paravirt_callee_save vcpu_is_preempted; + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..eb3753d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) return !!src->preempted; } -PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. @@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void) pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; - if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { - pv_lock_ops.vcpu_is_preempted = - PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); - } + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) + pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted; } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..da050bc 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu) { return false; } -PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted); bool pv_is_native_vcpu_is_preempted(void) { - return pv_lock_ops.vcpu_is_preempted.func == - __raw_callee_save___
Re: [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
On 02/08/2017 02:05 PM, Peter Zijlstra wrote: > On Wed, Feb 08, 2017 at 01:00:24PM -0500, Waiman Long wrote: >> It was found when running fio sequential write test with a XFS ramdisk >> on a 2-socket x86-64 system, the %CPU times as reported by perf were >> as follows: >> >> 71.27% 0.28% fio [k] down_write >> 70.99% 0.01% fio [k] call_rwsem_down_write_failed >> 69.43% 1.18% fio [k] rwsem_down_write_failed >> 65.51% 54.57% fio [k] osq_lock >> 9.72% 7.99% fio [k] __raw_callee_save___kvm_vcpu_is_preempted >> 4.16% 4.16% fio [k] __kvm_vcpu_is_preempted >> >> So making vcpu_is_preempted() a callee-save function has a pretty high >> cost associated with it. As vcpu_is_preempted() is called within the >> spinlock, mutex and rwsem slowpaths, there isn't much to gain by making >> it callee-save. So it is now changed to a normal function call instead. >> > Numbers for bare metal too please. I will run the test on bare metal, but I doubt there will be noticeable difference. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 2/2] locking/mutex,rwsem: Reduce vcpu_is_preempted() calling frequency
On 02/08/2017 02:05 PM, Peter Zijlstra wrote: > On Wed, Feb 08, 2017 at 01:00:25PM -0500, Waiman Long wrote: >> As the vcpu_is_preempted() call is pretty costly compared with other >> checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), they >> are done at a reduce frequency of once every 256 iterations. > That's just disgusting. I do have some doubt myself on the effectiveness of this patch. Anyway, it is the first patch that I think is more beneficial. Cheers, Longman ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
[PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function
It was found when running fio sequential write test with a XFS ramdisk on a 2-socket x86-64 system, the %CPU times as reported by perf were as follows: 71.27% 0.28% fio [k] down_write 70.99% 0.01% fio [k] call_rwsem_down_write_failed 69.43% 1.18% fio [k] rwsem_down_write_failed 65.51% 54.57% fio [k] osq_lock 9.72% 7.99% fio [k] __raw_callee_save___kvm_vcpu_is_preempted 4.16% 4.16% fio [k] __kvm_vcpu_is_preempted So making vcpu_is_preempted() a callee-save function has a pretty high cost associated with it. As vcpu_is_preempted() is called within the spinlock, mutex and rwsem slowpaths, there isn't much to gain by making it callee-save. So it is now changed to a normal function call instead. With this patch applied, the aggregrate bandwidth of the fio sequential write test increased slightly from 2563.3MB/s to 2588.1MB/s (about 1%). Signed-off-by: Waiman Long <long...@redhat.com> --- arch/x86/include/asm/paravirt.h | 2 +- arch/x86/include/asm/paravirt_types.h | 2 +- arch/x86/kernel/kvm.c | 7 ++- arch/x86/kernel/paravirt-spinlocks.c | 6 ++ arch/x86/xen/spinlock.c | 4 +--- 5 files changed, 7 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h index 864f57b..2515885 100644 --- a/arch/x86/include/asm/paravirt.h +++ b/arch/x86/include/asm/paravirt.h @@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu) static __always_inline bool pv_vcpu_is_preempted(int cpu) { - return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu); + return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu); } #endif /* SMP && PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h index bb2de45..88dc852 100644 --- a/arch/x86/include/asm/paravirt_types.h +++ b/arch/x86/include/asm/paravirt_types.h @@ -309,7 +309,7 @@ struct pv_lock_ops { void (*wait)(u8 *ptr, u8 val); void (*kick)(int cpu); - struct paravirt_callee_save vcpu_is_preempted; + bool (*vcpu_is_preempted)(int cpu); }; /* This contains all the paravirt structures: we get a convenient diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index 099fcba..eb3753d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu) return !!src->preempted; } -PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted); /* * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present. @@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void) pv_lock_ops.wait = kvm_wait; pv_lock_ops.kick = kvm_kick_cpu; - if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) { - pv_lock_ops.vcpu_is_preempted = - PV_CALLEE_SAVE(__kvm_vcpu_is_preempted); - } + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) + pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted; } #endif /* CONFIG_PARAVIRT_SPINLOCKS */ diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c index 6259327..da050bc 100644 --- a/arch/x86/kernel/paravirt-spinlocks.c +++ b/arch/x86/kernel/paravirt-spinlocks.c @@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu) { return false; } -PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted); bool pv_is_native_vcpu_is_preempted(void) { - return pv_lock_ops.vcpu_is_preempted.func == - __raw_callee_save___native_vcpu_is_preempted; + return pv_lock_ops.vcpu_is_preempted == __native_vcpu_is_preempted; } struct pv_lock_ops pv_lock_ops = { @@ -38,7 +36,7 @@ struct pv_lock_ops pv_lock_ops = { .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock), .wait = paravirt_nop, .kick = paravirt_nop, - .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted), + .vcpu_is_preempted = __native_vcpu_is_preempted, #endif /* SMP */ }; EXPORT_SYMBOL(pv_lock_ops); diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c index 25a7c43..c85bb8f 100644 --- a/arch/x86/xen/spinlock.c +++ b/arch/x86/xen/spinlock.c @@ -114,8 +114,6 @@ void xen_uninit_lock_cpu(int cpu) per_cpu(irq_name, cpu) = NULL; } -PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen); - /* * Our init of PV spinlocks is split in two init functions due to us * using paravirt patching and jump labels patching and having to do @@ -138,7 +136,7 @@ void __init xen_init_spinlocks(void) pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock); pv_lock_ops.wait = xen_qlock_wait; pv_lock_ops.kick = xen_qlock_kick; - pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen); + pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen; } static __init int xen_parse_nopvspin(char