Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line

2023-09-14 Thread Waiman Long

On 9/14/23 03:32, Leonardo Bras wrote:

On Tue, Sep 12, 2023 at 09:08:34AM +0800, Guo Ren wrote:

On Mon, Sep 11, 2023 at 11:34 PM Waiman Long  wrote:

On 9/10/23 04:29, guo...@kernel.org wrote:

From: Guo Ren 

Allow cmdline to force the kernel to use queued_spinlock when
CONFIG_RISCV_COMBO_SPINLOCKS=y.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
   Documentation/admin-guide/kernel-parameters.txt |  2 ++
   arch/riscv/kernel/setup.c   | 16 +++-
   2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7dfb540c4f6c..61cacb8dfd0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4693,6 +4693,8 @@
   [KNL] Number of legacy pty's. Overwrites compiled-in
   default number.

+ qspinlock   [RISCV] Force to use qspinlock or auto-detect spinlock.
+
   qspinlock.numa_spinlock_threshold_ns=   [NUMA, PV_OPS]
   Set the time threshold in nanoseconds for the
   number of intra-node lock hand-offs before the
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index a447cf360a18..0f084f037651 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -270,6 +270,15 @@ static void __init parse_dtb(void)
   }

   #ifdef CONFIG_RISCV_COMBO_SPINLOCKS
+bool enable_qspinlock_key = false;

You can use __ro_after_init qualifier for enable_qspinlock_key. BTW,
this is not a static key, just a simple flag. So what is the point of
the _key suffix?

Okay, I would change it to:
bool enable_qspinlock_flag __ro_after_init = false;

IIUC, this bool / flag is used in a single file, so it makes sense for it
to be static. Being static means it does not need to be initialized to
false, as it's standard to zero-fill these areas.

Also, since it's a bool, it does not need to be called _flag.

I would go with:

static bool enable_qspinlock __ro_after_init;


I actually was thinking about the same suggestion to add static. Then I 
realized that the flag was also used in another file in a later patch. 
Of course, if it turns out that this flag is no longer needed outside of 
this file, it should be static.
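
For reference, a minimal sketch of the shape being converged on here, assuming the flag stays local to setup.c and is set by an early_param handler (the handler name is illustrative, not taken from the patch):

#ifdef CONFIG_RISCV_COMBO_SPINLOCKS
/* Zero-initialized by default; set once during early boot, then read-only. */
static bool enable_qspinlock __ro_after_init;

static int __init queued_spinlock_setup(char *p)
{
	enable_qspinlock = true;
	return 0;
}
early_param("qspinlock", queued_spinlock_setup);
#endif

If the flag really is consumed by another file later in the series, as noted above, the static would have to be dropped again and an extern declaration added to a header.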


Cheers,
Longman


Re: [PATCH V10 07/19] riscv: qspinlock: errata: Introduce ERRATA_THEAD_QSPINLOCK

2023-09-13 Thread Waiman Long

On 9/13/23 14:54, Palmer Dabbelt wrote:

On Sun, 06 Aug 2023 22:23:34 PDT (-0700), sor...@fastmail.com wrote:

On Wed, Aug 2, 2023, at 12:46 PM, guo...@kernel.org wrote:

From: Guo Ren 

According to qspinlock requirements, RISC-V gives out a weak LR/SC
forward progress guarantee which does not satisfy qspinlock. But
many vendors could produce stronger forward guarantee LR/SC to
ensure the xchg_tail could be finished in time on any kind of
hart. T-HEAD is the vendor which implements strong forward
guarantee LR/SC instruction pairs, so enable qspinlock for T-HEAD
with errata help.

Early versions of T-HEAD processors have the merge buffer delay
problem, so we need ERRATA_WRITEONCE to support qspinlock.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
 arch/riscv/Kconfig.errata  | 13 +
 arch/riscv/errata/thead/errata.c   | 24 
 arch/riscv/include/asm/errata_list.h   | 20 
 arch/riscv/include/asm/vendorid_list.h |  3 ++-
 arch/riscv/kernel/cpufeature.c |  3 ++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig.errata b/arch/riscv/Kconfig.errata
index 4745a5c57e7c..eb43677b13cc 100644
--- a/arch/riscv/Kconfig.errata
+++ b/arch/riscv/Kconfig.errata
@@ -96,4 +96,17 @@ config ERRATA_THEAD_WRITE_ONCE

   If you don't know what to do here, say "Y".

+config ERRATA_THEAD_QSPINLOCK
+    bool "Apply T-Head queued spinlock errata"
+    depends on ERRATA_THEAD
+    default y
+    help
+  The T-HEAD C9xx processors implement strong fwd guarantee LR/SC to
+  match the xchg_tail requirement of qspinlock.
+
+  This will apply the QSPINLOCK errata to handle the non-standard
+  behavior via using qspinlock instead of ticket_lock.
+
+  If you don't know what to do here, say "Y".


If this is to be applied, I would like to see a detailed explanation
somewhere, preferably with citations, of:

(a) The memory model requirements for qspinlock


The part of qspinlock that causes problems with many RISC architectures
is its use of a 16-bit xchg() function call, which many RISC
architectures cannot do natively and have to emulate with, hopefully,
some forward progress guarantee. Except for that one call, the
other atomic operations are all 32 bits in size.
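
For context, this is roughly what mainline kernel/locking/qspinlock.c does today (simplified): with fewer than 16K possible CPUs the tail update is a single 16-bit xchg() on the tail halfword, otherwise it falls back to a 32-bit cmpxchg() retry loop, and it is the 16-bit form that leans on the architecture's forward-progress behavior:

#if _Q_PENDING_BITS == 8			/* NR_CPUS < 16K */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	/* 16-bit exchange on the tail halfword only. */
	return (u32)xchg_relaxed(&lock->tail,
				 tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}
#else
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
	u32 old, new, val = atomic_read(&lock->val);

	for (;;) {				/* 32-bit cmpxchg retry loop */
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		old = atomic_cmpxchg_relaxed(&lock->val, val, new);
		if (old == val)
			break;
		val = old;
	}
	return old;
}
#endif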


Cheers,
Longman


Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k

2023-09-13 Thread Waiman Long

On 9/13/23 08:52, Guo Ren wrote:

On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras  wrote:

On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:

On Mon, Sep 11, 2023 at 9:03 PM Waiman Long  wrote:

On 9/10/23 23:09, Guo Ren wrote:

On Mon, Sep 11, 2023 at 10:35 AM Waiman Long  wrote:

On 9/10/23 04:28, guo...@kernel.org wrote:

From: Guo Ren 

The target of xchg_tail is to write the tail to the lock value, so
adding prefetchw could help the next cmpxchg step, which may
decrease the cmpxchg retry loops of xchg_tail. Some processors may
utilize this feature to give a forward guarantee; e.g., RISC-V
XuanTie processors block the snoop channel & irq for several
cycles when a prefetch.w instruction (from the Zicbop extension) retires,
which guarantees the next cmpxchg succeeds.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
kernel/locking/qspinlock.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d3f99060b60f..96b54e2ade86 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,10 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
 */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
- u32 old, new, val = atomic_read(&lock->val);
+ u32 old, new, val;
+
+ prefetchw(&lock->val);
+ val = atomic_read(&lock->val);

for (;;) {
new = (val & _Q_LOCKED_PENDING_MASK) | tail;

That looks a bit weird. You pre-fetch and then immediately read it. How
much performance gain do you get from this change alone?

Maybe you can define an arch-specific primitive that defaults back to
atomic_read() if not defined.

Thx for the reply. This is a generic optimization point I would like
to discuss with you.

First, prefetchw() makes the cacheline exclusive and serves the next
cmpxchg loop, which writes the idx_tail part of arch_spin_lock.
atomic_read() only brings the cacheline in in the shared state, which
can't give any guarantee for the next cmpxchg loop. A micro-architecture
could utilize prefetchw() to provide a strong forward progress guarantee
for xchg_tail; e.g., the T-HEAD XuanTie processor holds the exclusive
cacheline state until the next cmpxchg write succeeds.

In the end, let's go back to the principle: xchg_tail is an atomic
swap operation that eventually contains a write, so issuing a prefetchw()
at the beginning is acceptable for all architectures.


I did realize afterward that prefetchw gets the cacheline in the exclusive
state. I would suggest you mention that in your commit log, as well as
adding a comment about its purpose in the code.

Okay, I would do that in v12, thx.

I would suggest adding a snippet from the ISA Extension doc:

"A prefetch.w instruction indicates to hardware that the cache block whose
effective address is the sum of the base address specified in rs1 and the
sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b00000,
is likely to be accessed by a data write (i.e. store) in the near future."

Good point, thx.


qspinlock is generic code. I suppose this is for the RISC-V architecture.
You can mention that in the commit log as an example, but I would prefer a
more generic comment, especially in the code.
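
A sketch of how both requests could be satisfied at once: a generic, commented hook that defaults to a no-op, with the prefetch supplied by the architecture (the helper name is hypothetical; no such primitive exists in mainline):

/* Generic side, e.g. kernel/locking/qspinlock.c: */
#ifndef queued_spin_prefetchw
#define queued_spin_prefetchw(lock)	do { } while (0)
#endif

	/* In xchg_tail(), before the initial read of the lock word: */
	/*
	 * Hint that the lock word is about to be written: this can pull the
	 * cacheline in exclusive state so the cmpxchg loop below is less
	 * likely to retry.
	 */
	queued_spin_prefetchw(lock);
	val = atomic_read(&lock->val);

/* Arch side, e.g. arch/riscv/include/asm/qspinlock.h: */
#define queued_spin_prefetchw(lock)	prefetchw(&(lock)->val)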


Cheers,
Longman


Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line

2023-09-11 Thread Waiman Long

On 9/10/23 04:29, guo...@kernel.org wrote:

From: Guo Ren 

Allow cmdline to force the kernel to use queued_spinlock when
CONFIG_RISCV_COMBO_SPINLOCKS=y.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
  Documentation/admin-guide/kernel-parameters.txt |  2 ++
  arch/riscv/kernel/setup.c   | 16 +++-
  2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7dfb540c4f6c..61cacb8dfd0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4693,6 +4693,8 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
  
+	qspinlock	[RISCV] Force to use qspinlock or auto-detect spinlock.

+
qspinlock.numa_spinlock_threshold_ns=   [NUMA, PV_OPS]
Set the time threshold in nanoseconds for the
number of intra-node lock hand-offs before the
diff --git a/arch/riscv/kernel/setup.c b/arch/riscv/kernel/setup.c
index a447cf360a18..0f084f037651 100644
--- a/arch/riscv/kernel/setup.c
+++ b/arch/riscv/kernel/setup.c
@@ -270,6 +270,15 @@ static void __init parse_dtb(void)
  }
  
  #ifdef CONFIG_RISCV_COMBO_SPINLOCKS

+bool enable_qspinlock_key = false;


You can use __ro_after_init qualifier for enable_qspinlock_key. BTW, 
this is not a static key, just a simple flag. So what is the point of 
the _key suffix?


Cheers,
Longman



Re: [PATCH V11 07/17] riscv: qspinlock: Introduce qspinlock param for command line

2023-09-11 Thread Waiman Long

On 9/10/23 04:29, guo...@kernel.org wrote:

From: Guo Ren 

Allow cmdline to force the kernel to use queued_spinlock when
CONFIG_RISCV_COMBO_SPINLOCKS=y.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
  Documentation/admin-guide/kernel-parameters.txt |  2 ++
  arch/riscv/kernel/setup.c   | 16 +++-
  2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 7dfb540c4f6c..61cacb8dfd0e 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4693,6 +4693,8 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
  
+	qspinlock	[RISCV] Force to use qspinlock or auto-detect spinlock.

+
qspinlock.numa_spinlock_threshold_ns=   [NUMA, PV_OPS]
Set the time threshold in nanoseconds for the
number of intra-node lock hand-offs before the


Your patch series is still based on top of the NUMA-aware qspinlock patchset,
which isn't upstream yet. Please rebase without it, as it will cause a merge
conflict during the upstream merge.


Cheers,
Longman



Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k

2023-09-11 Thread Waiman Long

On 9/10/23 23:09, Guo Ren wrote:

On Mon, Sep 11, 2023 at 10:35 AM Waiman Long  wrote:


On 9/10/23 04:28, guo...@kernel.org wrote:

From: Guo Ren 

The target of xchg_tail is to write the tail to the lock value, so
adding prefetchw could help the next cmpxchg step, which may
decrease the cmpxchg retry loops of xchg_tail. Some processors may
utilize this feature to give a forward guarantee; e.g., RISC-V
XuanTie processors block the snoop channel & irq for several
cycles when a prefetch.w instruction (from the Zicbop extension) retires,
which guarantees the next cmpxchg succeeds.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
   kernel/locking/qspinlock.c | 5 -
   1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d3f99060b60f..96b54e2ade86 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,10 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
*/
   static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
   {
- u32 old, new, val = atomic_read(&lock->val);
+ u32 old, new, val;
+
+ prefetchw(&lock->val);
+ val = atomic_read(&lock->val);

   for (;;) {
   new = (val & _Q_LOCKED_PENDING_MASK) | tail;

That looks a bit weird. You pre-fetch and then immediately read it. How
much performance gain do you get from this change alone?

Maybe you can define an arch-specific primitive that defaults back to
atomic_read() if not defined.

Thx for the reply. This is a generic optimization point I would like
to discuss with you.

First, prefetchw() makes the cacheline exclusive and serves the next
cmpxchg loop, which writes the idx_tail part of arch_spin_lock.
atomic_read() only brings the cacheline in in the shared state, which
can't give any guarantee for the next cmpxchg loop. A micro-architecture
could utilize prefetchw() to provide a strong forward progress guarantee
for xchg_tail; e.g., the T-HEAD XuanTie processor holds the exclusive
cacheline state until the next cmpxchg write succeeds.

In the end, let's go back to the principle: xchg_tail is an atomic
swap operation that eventually contains a write, so issuing a prefetchw()
at the beginning is acceptable for all architectures.



I did realize afterward that prefetchw gets the cacheline in the exclusive
state. I would suggest you mention that in your commit log, as well as
adding a comment about its purpose in the code.


Thanks,
Longman


Cheers,
Longman






Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k

2023-09-10 Thread Waiman Long



On 9/10/23 04:28, guo...@kernel.org wrote:

From: Guo Ren 

The target of xchg_tail is to write the tail to the lock value, so
adding prefetchw could help the next cmpxchg step, which may
decrease the cmpxchg retry loops of xchg_tail. Some processors may
utilize this feature to give a forward guarantee; e.g., RISC-V
XuanTie processors block the snoop channel & irq for several
cycles when a prefetch.w instruction (from the Zicbop extension) retires,
which guarantees the next cmpxchg succeeds.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
  kernel/locking/qspinlock.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d3f99060b60f..96b54e2ade86 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,10 @@ static __always_inline void 
clear_pending_set_locked(struct qspinlock *lock)
   */
  static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
  {
-   u32 old, new, val = atomic_read(&lock->val);
+   u32 old, new, val;
+
+   prefetchw(&lock->val);
+   val = atomic_read(&lock->val);
  
  	for (;;) {

new = (val & _Q_LOCKED_PENDING_MASK) | tail;


That looks a bit weird. You pre-fetch and then immediately read it. How
much performance gain do you get from this change alone?

Maybe you can define an arch-specific primitive that defaults back to
atomic_read() if not defined.


Cheers,
Longman



Re: [PATCH V10 18/19] locking/qspinlock: Move pv_ops into x86 directory

2023-08-11 Thread Waiman Long


On 8/11/23 20:24, Guo Ren wrote:

On Sat, Aug 12, 2023 at 4:42 AM Waiman Long  wrote:

On 8/2/23 12:47, guo...@kernel.org wrote:

From: Guo Ren 

The pv_ops belongs to the x86 custom infrastructure, and this cleans up
cna_configure_spin_lock_slowpath() with standard code. This is
preparation for riscv to support the CNA qspinlock.

The CNA qspinlock has not been merged into mainline yet. I would suggest you
drop the last 2 patches for now. Of course, you can provide benchmark
data to boost the case for the inclusion of the CNA qspinlock patch into
the mainline.

Yes, that was lazy of me; I will separate paravirt and CNA from this series.


paravirt is OK, it is just that CNA hasn't been merged yet.

Cheers,
Longman




Cheers,
Longman






Re: [PATCH V10 18/19] locking/qspinlock: Move pv_ops into x86 directory

2023-08-11 Thread Waiman Long

On 8/2/23 12:47, guo...@kernel.org wrote:

From: Guo Ren 

The pv_ops belongs to the x86 custom infrastructure, and this cleans up
cna_configure_spin_lock_slowpath() with standard code. This is
preparation for riscv to support the CNA qspinlock.


The CNA qspinlock has not been merged into mainline yet. I would suggest you
drop the last 2 patches for now. Of course, you can provide benchmark
data to boost the case for the inclusion of the CNA qspinlock patch into
the mainline.


Cheers,
Longman



Re: [PATCH V10 05/19] riscv: qspinlock: Introduce combo spinlock

2023-08-11 Thread Waiman Long

On 8/2/23 12:46, guo...@kernel.org wrote:

From: Guo Ren 

Combo spinlock could support queued and ticket spinlocks in one Linux image
and select between them at boot time via the errata mechanism. Here is the
function size (bytes) comparison table:

TYPE                     : COMBO | TICKET | QUEUED
arch_spin_lock           : 106   | 60     | 50
arch_spin_unlock         : 54    | 36     | 26
arch_spin_trylock        : 110   | 72     | 54
arch_spin_is_locked      : 48    | 34     | 20
arch_spin_is_contended   : 56    | 40     | 24
arch_spin_value_unlocked : 48    | 34     | 24

One example of disassembled combo arch_spin_unlock:
0x8000409c <+14>:    nop                    # detour slot
0x800040a0 <+18>:    fence   rw,w           # queued spinlock start
0x800040a4 <+22>:    sb      zero,0(a4)     # queued spinlock end
0x800040a8 <+26>:    ld      s0,8(sp)
0x800040aa <+28>:    addi    sp,sp,16
0x800040ac <+30>:    ret
0x800040ae <+32>:    lw      a5,0(a4)       # ticket spinlock start
0x800040b0 <+34>:    sext.w  a5,a5
0x800040b2 <+36>:    fence   rw,w
0x800040b6 <+40>:    addiw   a5,a5,1
0x800040b8 <+42>:    slli    a5,a5,0x30
0x800040ba <+44>:    srli    a5,a5,0x30
0x800040bc <+46>:    sh      a5,0(a4)       # ticket spinlock end
0x800040c0 <+50>:    ld      s0,8(sp)
0x800040c2 <+52>:    addi    sp,sp,16
0x800040c4 <+54>:    ret

The qspinlock is smaller and faster than the ticket lock when everything is
in the fast path, and the combo spinlock can provide a compatible Linux image
for processors with different micro-arch designs (weak/strict fwd guarantee).

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
  arch/riscv/Kconfig|  9 +++-
  arch/riscv/include/asm/hwcap.h|  1 +
  arch/riscv/include/asm/spinlock.h | 87 ++-
  arch/riscv/kernel/cpufeature.c| 10 
  4 files changed, 104 insertions(+), 3 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e89a3bea3dc1..119e774a3dcf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -440,7 +440,7 @@ config NODES_SHIFT
  
  choice

prompt "RISC-V spinlock type"
-   default RISCV_TICKET_SPINLOCKS
+   default RISCV_COMBO_SPINLOCKS
  
  config RISCV_TICKET_SPINLOCKS

bool "Using ticket spinlock"
@@ -452,6 +452,13 @@ config RISCV_QUEUED_SPINLOCKS
help
  Make sure your micro arch LL/SC has a strong forward progress 
guarantee.
  Otherwise, stay at ticket-lock.
+
+config RISCV_COMBO_SPINLOCKS
+   bool "Using combo spinlock"
+   depends on SMP && MMU
+   select ARCH_USE_QUEUED_SPINLOCKS
+   help
+ Select queued spinlock or ticket-lock via errata.
  endchoice
  
  config RISCV_ALTERNATIVE

diff --git a/arch/riscv/include/asm/hwcap.h b/arch/riscv/include/asm/hwcap.h
index f041bfa7f6a0..08ae75a694c2 100644
--- a/arch/riscv/include/asm/hwcap.h
+++ b/arch/riscv/include/asm/hwcap.h
@@ -54,6 +54,7 @@
  #define RISCV_ISA_EXT_ZIFENCEI41
  #define RISCV_ISA_EXT_ZIHPM   42
  
+#define RISCV_ISA_EXT_XTICKETLOCK	63

  #define RISCV_ISA_EXT_MAX 64
  #define RISCV_ISA_EXT_NAME_LEN_MAX32
  
diff --git a/arch/riscv/include/asm/spinlock.h b/arch/riscv/include/asm/spinlock.h

index c644a92d4548..9eb3ad31e564 100644
--- a/arch/riscv/include/asm/spinlock.h
+++ b/arch/riscv/include/asm/spinlock.h
@@ -7,11 +7,94 @@
  #define _Q_PENDING_LOOPS  (1 << 9)
  #endif
  


I see why you separated the _Q_PENDING_LOOPS out.



+#ifdef CONFIG_RISCV_COMBO_SPINLOCKS
+#include 
+
+#undef arch_spin_is_locked
+#undef arch_spin_is_contended
+#undef arch_spin_value_unlocked
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_unlock
+
+#include 
+#include 
+
+#undef arch_spin_is_locked
+#undef arch_spin_is_contended
+#undef arch_spin_value_unlocked
+#undef arch_spin_lock
+#undef arch_spin_trylock
+#undef arch_spin_unlock
Perhaps you can add a macro like __no_arch_spinlock_redefine to disable
the various arch_spin_* definitions in qspinlock.h and ticket_spinlock.h.
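
A sketch of what that guard could look like, assuming the generic headers are taught to skip their arch_spin_* defines (the macro and its placement are illustrative only):

/* include/asm-generic/qspinlock.h -- ticket_spinlock.h would do the same: */
#ifndef __no_arch_spinlock_redefine
#define arch_spin_is_locked(l)		queued_spin_is_locked(l)
#define arch_spin_is_contended(l)	queued_spin_is_contended(l)
#define arch_spin_value_unlocked(l)	queued_spin_value_unlocked(l)
#define arch_spin_lock(l)		queued_spin_lock(l)
#define arch_spin_trylock(l)		queued_spin_trylock(l)
#define arch_spin_unlock(l)		queued_spin_unlock(l)
#endif

/* arch/riscv/include/asm/spinlock.h then avoids the #undef lists entirely: */
#define __no_arch_spinlock_redefine
#include <asm/ticket_spinlock.h>
#include <asm/qspinlock.h>
/* ...followed by the combo arch_spin_*() wrappers quoted below. */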

+
+#define COMBO_DETOUR   \
+   asm_volatile_goto(ALTERNATIVE(  \
+   "nop",\
+   "j %l[ticket_spin_lock]", \
+   0,  \
+   RISCV_ISA_EXT_XTICKETLOCK,  \
+   CONFIG_RISCV_COMBO_SPINLOCKS)   \
+   : : : : ticket_spin_lock);
+
+static __always_inline void arch_spin_lock(arch_spinlock_t *lock)
+{
+   COMBO_DETOUR
+   queued_spin_lock(lock);
+   return;
+ticket_spin_lock:
+   ticket_spin_lock(lock);
+}
+
+static __always_inline bool arch_spin_trylock(arch_spinlock_t *lock)
+{
+   COMBO_DETOUR
+   return queued_spin_trylock(lock);
+ticket_spin_lock:
+   return ticket_spin_trylock(lock);
+}
+
+static __always_inline 

Re: [PATCH V10 04/19] riscv: qspinlock: Add basic queued_spinlock support

2023-08-11 Thread Waiman Long



On 8/2/23 12:46, guo...@kernel.org wrote:

\
diff --git a/arch/riscv/include/asm/spinlock.h 
b/arch/riscv/include/asm/spinlock.h
new file mode 100644
index ..c644a92d4548
--- /dev/null
+++ b/arch/riscv/include/asm/spinlock.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ASM_RISCV_SPINLOCK_H
+#define __ASM_RISCV_SPINLOCK_H
+
+#ifdef CONFIG_QUEUED_SPINLOCKS
+#define _Q_PENDING_LOOPS   (1 << 9)
+#endif
+
+#ifdef CONFIG_QUEUED_SPINLOCKS


You can merge the two "#ifdef CONFIG_QUEUED_SPINLOCKS" blocks into a single
one to avoid the duplication.


Cheers,
Longman


+#include 
+#include 
+#else
+#include 
+#endif
+
+#endif /* __ASM_RISCV_SPINLOCK_H */
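
With the two blocks merged, the header would collapse to something like the following (the include names are assumed here, since the archive stripped them from the quoted patch):

/* SPDX-License-Identifier: GPL-2.0 */
#ifndef __ASM_RISCV_SPINLOCK_H
#define __ASM_RISCV_SPINLOCK_H

#ifdef CONFIG_QUEUED_SPINLOCKS
#define _Q_PENDING_LOOPS	(1 << 9)
#include <asm/qspinlock.h>
#include <asm/qrwlock.h>
#else
#include <asm-generic/spinlock.h>
#endif

#endif /* __ASM_RISCV_SPINLOCK_H */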




Re: [PATCH v7] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-28 Thread Waiman Long
File Copy 1024 bufsize 2000 maxblocks       3960.0     301279.3    760.8
File Copy 256 bufsize 500 maxblocks         1655.0      79291.3    479.1
File Copy 4096 bufsize 8000 maxblocks       5800.0    1039755.2   1792.7
Pipe Throughput                            12440.0  118701468.1  95419.2
Pipe-based Context Switching                4000.0    8073453.3  20183.6
Process Creation                             126.0      33440.9   2654.0
Shell Scripts (1 concurrent)                  42.4      52722.6  12434.6
Shell Scripts (8 concurrent)                   6.0       7050.4  11750.6
System Call Overhead                       15000.0    6834371.5   4556.2

System Benchmarks Index Score                                      8157.8

Signed-off-by: Guo Hui 
---
  arch/x86/kernel/paravirt-spinlocks.c |  4 
  kernel/locking/osq_lock.c| 19 ++-
  2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 9e1ea99ad..a2eb375e2 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
__raw_callee_save___native_vcpu_is_preempted;
  }
  
+DECLARE_STATIC_KEY_TRUE(vcpu_has_preemption);

+
  void __init paravirt_set_cap(void)
  {
if (!pv_is_native_spin_unlock())
@@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
  
  	if (!pv_is_native_vcpu_is_preempted())

setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
+   else
+   static_branch_disable(&vcpu_has_preemption);
  }
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52..f521b0f6d 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -27,6 +27,23 @@ static inline int node_cpu(struct optimistic_spin_node *node)
return node->cpu - 1;
  }
  
+#ifdef vcpu_is_preempted

+DEFINE_STATIC_KEY_TRUE(vcpu_has_preemption);
+
+static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
+{
+   if (static_branch_likely(&vcpu_has_preemption))
+   return vcpu_is_preempted(node_cpu(node->prev));
+
+   return false;
+}
+#else
+static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
+{
+   return false;
+}
+#endif
+
  static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
  {
int cpu_nr = encoded_cpu_val - 1;
@@ -141,7 +158,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 * polling, be careful.
 */
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+   vcpu_is_preempted_node(node)))
return true;
  
  	/* unqueue */

Reviewed-by: Waiman Long 



Re: [PATCH v6] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-28 Thread Waiman Long

On 6/28/22 08:54, Guo Hui wrote:

The instructions that compute the vcpu_is_preempted function parameter
are redundant on X86 physical machines,
causing the multi-core performance of Unixbench to drop by about 4% to 5%.
The C function is as follows:
static bool vcpu_is_preempted(long vcpu);

The parameter 'vcpu' in the function osq_lock
that calls the function vcpu_is_preempted is assigned as follows:

The C code is in the function node_cpu:
cpu = node->cpu - 1;

The instructions corresponding to the C code are:
mov 0x14(%rax),%edi
sub $0x1,%edi

The above instructions are unnecessary
in the X86 Native operating environment,
causing high cache-misses and degrading performance.

This patch uses static_key to not execute this instruction
in the Native runtime environment.

The patch effect is as follows two machines,
Unixbench runs with full core score:

1. Machine configuration:
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU core: 40
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  948326591.2  81261.9
Double-Precision Whetstone                    55.0     211986.3  38543.0
Execl Throughput                              43.0      43453.2  10105.4
File Copy 1024 bufsize 2000 maxblocks       3960.0     438936.2   1108.4
File Copy 256 bufsize 500 maxblocks         1655.0     118197.4    714.2
File Copy 4096 bufsize 8000 maxblocks       5800.0    1534674.7   2646.0
Pipe Throughput                            12440.0   46482107.6  37365.0
Pipe-based Context Switching                4000.0    1915094.2   4787.7
Process Creation                             126.0      85442.2   6781.1
Shell Scripts (1 concurrent)                  42.4      69400.7  16368.1
Shell Scripts (8 concurrent)                   6.0       8877.2  14795.3
System Call Overhead                       15000.0    4714906.1   3143.3

System Benchmarks Index Score                                      7923.3

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  947032915.5  81151.1
Double-Precision Whetstone                    55.0     211971.2  38540.2
Execl Throughput                              43.0      45054.8  10477.9
File Copy 1024 bufsize 2000 maxblocks       3960.0     515024.9   1300.6
File Copy 256 bufsize 500 maxblocks         1655.0     146354.6    884.3
File Copy 4096 bufsize 8000 maxblocks       5800.0    1679995.9   2896.5
Pipe Throughput                            12440.0   46466394.2  37352.4
Pipe-based Context Switching                4000.0    1898221.4   4745.6
Process Creation                             126.0      85653.1   6797.9
Shell Scripts (1 concurrent)                  42.4      69437.3  16376.7
Shell Scripts (8 concurrent)                   6.0       8898.9  14831.4
System Call Overhead                       15000.0    4658746.7   3105.8

System Benchmarks Index Score                                      8248.8

2. Machine configuration:
Hygon C86 7185 32-core Processor
CPU core: 128
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0 2256644068.3 193371.4
Double-Precision Whetstone                    55.0     438969.9  79812.7
Execl Throughput                              43.0      10108.6   2350.8
File Copy 1024 bufsize 2000 maxblocks       3960.0     275892.8    696.7
File Copy 256 bufsize 500 maxblocks         1655.0      72082.7    435.5
File Copy 4096 bufsize 8000 maxblocks       5800.0     925043.4   1594.9
Pipe Throughput                            12440.0  118905512.5  95583.2
Pipe-based Context Switching                4000.0    7820945.7  19552.4
Process Creation                             126.0      31233.3   2478.8
Shell Scripts (1 concurrent)                  42.4      49042.8  11566.7
Shell Scripts (8 concurrent)                   6.0       6656.0  11093.3
System Call Overhead                       15000.0    6816047.5   4544.0

System Benchmarks Index Score                                      7756.6

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0 2252272929.4 192996.8
Double-Precision Whetstone                    55.0     451847.2  82154.0
Execl Throughput                              43.0      10595.1   2464.0
File Copy 1024 bufsize 2000 maxblocks  3960.0 

Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Waiman Long

On 6/27/22 10:27, Guo Hui wrote:

The instructions that compute the vcpu_is_preempted function parameter
are redundant on X86 physical machines,
causing the multi-core performance of Unixbench to drop by about 4% to 5%.
The C function is as follows:
static bool vcpu_is_preempted(long vcpu);

The parameter 'vcpu' in the function osq_lock
that calls the function vcpu_is_preempted is assigned as follows:

The C code is in the function node_cpu:
cpu = node->cpu - 1;

The instructions corresponding to the C code are:
mov 0x14(%rax),%edi
sub $0x1,%edi

The above instructions are unnecessary
in the X86 Native operating environment,
causing high cache-misses and degrading performance.

This patch uses static_key to not execute this instruction
in the Native runtime environment.

The patch effect is as follows two machines,
Unixbench runs with full core score:

1. Machine configuration:
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU core: 40
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  948326591.2  81261.9
Double-Precision Whetstone                    55.0     211986.3  38543.0
Execl Throughput                              43.0      43453.2  10105.4
File Copy 1024 bufsize 2000 maxblocks       3960.0     438936.2   1108.4
File Copy 256 bufsize 500 maxblocks         1655.0     118197.4    714.2
File Copy 4096 bufsize 8000 maxblocks       5800.0    1534674.7   2646.0
Pipe Throughput                            12440.0   46482107.6  37365.0
Pipe-based Context Switching                4000.0    1915094.2   4787.7
Process Creation                             126.0      85442.2   6781.1
Shell Scripts (1 concurrent)                  42.4      69400.7  16368.1
Shell Scripts (8 concurrent)                   6.0       8877.2  14795.3
System Call Overhead                       15000.0    4714906.1   3143.3

System Benchmarks Index Score                                      7923.3

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  947032915.5  81151.1
Double-Precision Whetstone                    55.0     211971.2  38540.2
Execl Throughput                              43.0      45054.8  10477.9
File Copy 1024 bufsize 2000 maxblocks       3960.0     515024.9   1300.6
File Copy 256 bufsize 500 maxblocks         1655.0     146354.6    884.3
File Copy 4096 bufsize 8000 maxblocks       5800.0    1679995.9   2896.5
Pipe Throughput                            12440.0   46466394.2  37352.4
Pipe-based Context Switching                4000.0    1898221.4   4745.6
Process Creation                             126.0      85653.1   6797.9
Shell Scripts (1 concurrent)                  42.4      69437.3  16376.7
Shell Scripts (8 concurrent)                   6.0       8898.9  14831.4
System Call Overhead                       15000.0    4658746.7   3105.8

System Benchmarks Index Score                                      8248.8

2. Machine configuration:
Hygon C86 7185 32-core Processor
CPU core: 128
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0 2256644068.3 193371.4
Double-Precision Whetstone                    55.0     438969.9  79812.7
Execl Throughput                              43.0      10108.6   2350.8
File Copy 1024 bufsize 2000 maxblocks       3960.0     275892.8    696.7
File Copy 256 bufsize 500 maxblocks         1655.0      72082.7    435.5
File Copy 4096 bufsize 8000 maxblocks       5800.0     925043.4   1594.9
Pipe Throughput                            12440.0  118905512.5  95583.2
Pipe-based Context Switching                4000.0    7820945.7  19552.4
Process Creation                             126.0      31233.3   2478.8
Shell Scripts (1 concurrent)                  42.4      49042.8  11566.7
Shell Scripts (8 concurrent)                   6.0       6656.0  11093.3
System Call Overhead                       15000.0    6816047.5   4544.0

System Benchmarks Index Score                                      7756.6

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0 2252272929.4 192996.8
Double-Precision Whetstone                    55.0     451847.2  82154.0
Execl Throughput                              43.0      10595.1   2464.0
File Copy 1024 bufsize 2000 maxblocks  3960.0 

Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Waiman Long

On 6/27/22 01:54, Guo Hui wrote:
Thank you very much, Longman. My patch is as you said: it only disables
node_cpu on X86 and enables node_cpu on the arm64, powerpc, and s390 architectures;

the code is in file arch/x86/kernel/paravirt-spinlocks.c:
    DECLARE_STATIC_KEY_FALSE(preemted_key);
    static_branch_enable(&preemted_key);

the default value of preemted_key is false and the if conditional 
statement is reversed,

the code is in file kernel/locking/osq_lock.c:
    DEFINE_STATIC_KEY_FALSE(preemted_key);

    static inline int node_cpu(struct optimistic_spin_node *node)
    {
    int cpu = 0;

    if (!static_branch_unlikely(&preemted_key))
    cpu = node->cpu - 1;

    return cpu;
  }

In this way, only one nop instruction is added to architectures arm64, 
powerpc and s390, including virtual machines, without any other changes.


You are right. I was probably too tired last night to read the patch more
carefully.


Cheers,
Longman


Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-26 Thread Waiman Long

On 6/26/22 22:13, Guo Hui wrote:

The instructions that compute the vcpu_is_preempted function parameter
are redundant on X86 physical machines,
causing the multi-core performance of Unixbench to drop by about 4% to 5%.
The C function is as follows:
static bool vcpu_is_preempted(long vcpu);

The parameter 'vcpu' in the function osq_lock
that calls the function vcpu_is_preempted is assigned as follows:

The C code is in the function node_cpu:
cpu = node->cpu - 1;

The instructions corresponding to the C code are:
mov 0x14(%rax),%edi
sub $0x1,%edi

The above instructions are unnecessary
in the X86 Native operating environment,
causing high cache-misses and degrading performance.

This patch uses static_key to not execute this instruction
in the Native runtime environment.

The code is as follows:

DEFINE_STATIC_KEY_FALSE(preemted_key);

static inline int node_cpu(struct optimistic_spin_node *node)
{
  int cpu = 0;

  if (!static_branch_unlikely(&preemted_key))
  cpu = node->cpu - 1;

  return cpu;
}

The patch effect is as follows two machines,
Unixbench runs with full core score:

1. Machine configuration:
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU core: 40
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  948326591.2  81261.9
Double-Precision Whetstone                    55.0     211986.3  38543.0
Execl Throughput                              43.0      43453.2  10105.4
File Copy 1024 bufsize 2000 maxblocks       3960.0     438936.2   1108.4
File Copy 256 bufsize 500 maxblocks         1655.0     118197.4    714.2
File Copy 4096 bufsize 8000 maxblocks       5800.0    1534674.7   2646.0
Pipe Throughput                            12440.0   46482107.6  37365.0
Pipe-based Context Switching                4000.0    1915094.2   4787.7
Process Creation                             126.0      85442.2   6781.1
Shell Scripts (1 concurrent)                  42.4      69400.7  16368.1
Shell Scripts (8 concurrent)                   6.0       8877.2  14795.3
System Call Overhead                       15000.0    4714906.1   3143.3

System Benchmarks Index Score                                      7923.3

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0  947032915.5  81151.1
Double-Precision Whetstone                    55.0     211971.2  38540.2
Execl Throughput                              43.0      45054.8  10477.9
File Copy 1024 bufsize 2000 maxblocks       3960.0     515024.9   1300.6
File Copy 256 bufsize 500 maxblocks         1655.0     146354.6    884.3
File Copy 4096 bufsize 8000 maxblocks       5800.0    1679995.9   2896.5
Pipe Throughput                            12440.0   46466394.2  37352.4
Pipe-based Context Switching                4000.0    1898221.4   4745.6
Process Creation                             126.0      85653.1   6797.9
Shell Scripts (1 concurrent)                  42.4      69437.3  16376.7
Shell Scripts (8 concurrent)                   6.0       8898.9  14831.4
System Call Overhead                       15000.0    4658746.7   3105.8

System Benchmarks Index Score                                      8248.8

2. Machine configuration:
Hygon C86 7185 32-core Processor
CPU core: 128
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables      116700.0 2256644068.3 193371.4
Double-Precision Whetstone                    55.0     438969.9  79812.7
Execl Throughput                              43.0      10108.6   2350.8
File Copy 1024 bufsize 2000 maxblocks       3960.0     275892.8    696.7
File Copy 256 bufsize 500 maxblocks         1655.0      72082.7    435.5
File Copy 4096 bufsize 8000 maxblocks       5800.0     925043.4   1594.9
Pipe Throughput                            12440.0  118905512.5  95583.2
Pipe-based Context Switching                4000.0    7820945.7  19552.4
Process Creation                             126.0      31233.3   2478.8
Shell Scripts (1 concurrent)                  42.4      49042.8  11566.7
Shell Scripts (8 concurrent)                   6.0       6656.0  11093.3
System Call Overhead                       15000.0    6816047.5   4544.0

System Benchmarks Index Score                                      7756.6

After using the patch:
System Benchmarks Index Values            BASELINE       RESULT    INDEX
Dhrystone 2 using register variables   

Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-25 Thread Waiman Long

On 7/25/20 1:26 PM, Peter Zijlstra wrote:

On Fri, Jul 24, 2020 at 03:10:59PM -0400, Waiman Long wrote:

On 7/24/20 4:16 AM, Will Deacon wrote:

On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:

BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch?
I will have to update the patch to fix the reported 0-day test problem, but
I want to collect other feedback before sending out v3.

I want to say I hate it all, it adds instructions to a path we spend an
awful lot of time optimizing without really getting anything back for
it.

Will, how do you feel about it?

I can see it potentially being useful for debugging, but I hate the
limitation to 256 CPUs. Even arm64 is hitting that now.

After thinking more about that, I think we can use all the remaining bits in
the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit for pending,
there are 14 bits left. So as long as NR_CPUS < 16k (requirement for 16-bit
locked_pending), we can put all possible cpu numbers into the lock. We can
also just use smp_processor_id() without additional percpu data.

That sounds horrific, wouldn't that destroy the whole point of using a
byte for pending?
You are right. I realized that later on and had sent a follow-up mail to 
correct that.

Also, you're talking ~1% gains here. I think our collective time would
be better spent off reviewing the CNA series and trying to make it more
deterministic.

I thought you guys were not interested in CNA. I do want to get CNA merged,
if possible. Let me review the current version again and see if there are ways
we can further improve it.

It's not a lack of interest. We were struggling with the fairness
issues and the complexity of the thing. I forgot the current state of
matters, but at one point UNLOCK was O(n) in waiters, which is, of
course, 'unfortunate'.

I'll have to look up whatever notes remain, but the basic idea of
keeping remote nodes on a secondary list is obviously breaking all sorts
of fairness. After that they pile on a bunch of hacks to fix the worst
of them, but it feels exactly like that, a bunch of hacks.

One of the things I suppose we ought to do is see if some of the ideas
of phase-fair locks can be applied to this.

That could be a possible solution to ensure better fairness.


That coupled with a chronic lack of time for anything :-(


That is always true and I feel this way too:-)

Cheers,
Longman



Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-24 Thread Waiman Long

On 7/24/20 3:10 PM, Waiman Long wrote:

On 7/24/20 4:16 AM, Will Deacon wrote:

On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:
BTW, do you have any comment on my v2 lock holder cpu info 
qspinlock patch?
I will have to update the patch to fix the reported 0-day test 
problem, but

I want to collect other feedback before sending out v3.

I want to say I hate it all, it adds instructions to a path we spend an
awful lot of time optimizing without really getting anything back for
it.

Will, how do you feel about it?

I can see it potentially being useful for debugging, but I hate the
limitation to 256 CPUs. Even arm64 is hitting that now.


After thinking more about that, I think we can use all the remaining 
bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 
bit for pending, there are 14 bits left. So as long as NR_CPUS < 16k 
(requirement for 16-bit locked_pending), we can put all possible cpu 
numbers into the lock. We can also just use smp_processor_id() without 
additional percpu data. 


Sorry, that doesn't work. The extra bits in the pending byte won't get
cleared on unlock. That will have a noticeable performance impact.
Clearing the pending byte on unlock will cause other performance
problems. So I guess we will have to limit the cpu number to the locked byte.
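
For readers following along, a sketch of the 32-bit lock word layout this discussion is about (the NR_CPUS < 16K case from qspinlock_types.h), and why the spare bits are awkward:

/*
 *  0- 7: locked byte   -- unlock is a single byte store of 0 here
 *     8: pending bit
 *  9-15: unused
 * 16-17: tail index (per-CPU MCS node index)
 * 18-31: tail CPU (+1)
 *
 * Parking a CPU number in bits 9-15 looks free, but the fast-path unlock
 * only clears bits 0-7, so those bits would survive the unlock unless the
 * unlock path were widened -- which is the extra cost referred to above.
 */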


Regards,
Longman



Re: [PATCH v4 0/6] powerpc: queued spinlocks and rwlocks

2020-07-24 Thread Waiman Long

On 7/24/20 9:14 AM, Nicholas Piggin wrote:

Updated with everybody's feedback (thanks all), and more performance
results.

What I've found is I might have been measuring the worst load point for
the paravirt case, and by looking at a range of loads it's clear that
queued spinlocks are overall better even on PV, doubly so when you look
at the generally much improved worst case latencies.

I have defaulted it to N even though I'm less concerned about the PV
numbers now, just because I think it needs more stress testing. But
it's very nicely selectable so should be low risk to include.

All in all this is a very cool technology and great results especially
on the big systems but even on smaller ones there are nice gains. Thanks
Waiman and everyone who developed it.

Thanks,
Nick

Nicholas Piggin (6):
   powerpc/pseries: move some PAPR paravirt functions to their own file
   powerpc: move spinlock implementation to simple_spinlock
   powerpc/64s: implement queued spinlocks and rwlocks
   powerpc/pseries: implement paravirt qspinlocks for SPLPAR
   powerpc/qspinlock: optimised atomic_try_cmpxchg_lock that adds the
 lock hint
   powerpc: implement smp_cond_load_relaxed

  arch/powerpc/Kconfig  |  15 +
  arch/powerpc/include/asm/Kbuild   |   1 +
  arch/powerpc/include/asm/atomic.h |  28 ++
  arch/powerpc/include/asm/barrier.h|  14 +
  arch/powerpc/include/asm/paravirt.h   |  87 +
  arch/powerpc/include/asm/qspinlock.h  |  91 ++
  arch/powerpc/include/asm/qspinlock_paravirt.h |   7 +
  arch/powerpc/include/asm/simple_spinlock.h| 288 
  .../include/asm/simple_spinlock_types.h   |  21 ++
  arch/powerpc/include/asm/spinlock.h   | 308 +-
  arch/powerpc/include/asm/spinlock_types.h |  17 +-
  arch/powerpc/lib/Makefile |   3 +
  arch/powerpc/lib/locks.c  |  12 +-
  arch/powerpc/platforms/pseries/Kconfig|   9 +-
  arch/powerpc/platforms/pseries/setup.c|   4 +-
  include/asm-generic/qspinlock.h   |   4 +
  16 files changed, 588 insertions(+), 321 deletions(-)
  create mode 100644 arch/powerpc/include/asm/paravirt.h
  create mode 100644 arch/powerpc/include/asm/qspinlock.h
  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
  create mode 100644 arch/powerpc/include/asm/simple_spinlock.h
  create mode 100644 arch/powerpc/include/asm/simple_spinlock_types.h


That patch series looks good to me. Thanks for working on this.

For the series,

Acked-by: Waiman Long 



Re: [PATCH v4 6/6] powerpc: implement smp_cond_load_relaxed

2020-07-24 Thread Waiman Long

On 7/24/20 9:14 AM, Nicholas Piggin wrote:

This implements smp_cond_load_relaed with the slowpath busy loop using the


Nit: "smp_cond_load_relaxed"

Cheers,
Longman



Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-24 Thread Waiman Long

On 7/24/20 4:16 AM, Will Deacon wrote:

On Thu, Jul 23, 2020 at 08:47:59PM +0200, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:

BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch?
I will have to update the patch to fix the reported 0-day test problem, but
I want to collect other feedback before sending out v3.

I want to say I hate it all, it adds instructions to a path we spend an
awful lot of time optimizing without really getting anything back for
it.

Will, how do you feel about it?

I can see it potentially being useful for debugging, but I hate the
limitation to 256 CPUs. Even arm64 is hitting that now.


After thinking more about that, I think we can use all the remaining 
bits in the 16-bit locked_pending. Reserving 1 bit for locked and 1 bit 
for pending, there are 14 bits left. So as long as NR_CPUS < 16k 
(requirement for 16-bit locked_pending), we can put all possible cpu 
numbers into the lock. We can also just use smp_processor_id() without 
additional percpu data.




Also, you're talking ~1% gains here. I think our collective time would
be better spent off reviewing the CNA series and trying to make it more
deterministic.


I thought you guys were not interested in CNA. I do want to get CNA
merged, if possible. Let me review the current version again and see if
there are ways we can further improve it.


Cheers,
Longman



Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-23 Thread Waiman Long

On 7/23/20 3:58 PM, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 03:04:13PM -0400, Waiman Long wrote:

On 7/23/20 2:47 PM, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:

BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch?
I will have to update the patch to fix the reported 0-day test problem, but
I want to collect other feedback before sending out v3.

I want to say I hate it all, it adds instructions to a path we spend an
awful lot of time optimizing without really getting anything back for
it.

It does add some extra instructions that may slow it down slightly, but I
don't agree that it gives nothing back. The cpu lock holder information can
be useful in analyzing crash dumps and in some debugging situations. I think
it can be useful in RHEL for this reason. How about an x86 config option to
allow distros to decide if they want to have it enabled? I will make sure
that it will have no performance degradation if the option is not enabled.

Config knobs suck too; they create a maintenance burden (we get to make
sure all the permutations work/build/etc.) and effectively nobody uses
them, since world+dog uses what distros pick.

Anyway, instead of adding a second per-cpu variable, can you see how
horrible something like this is:

unsigned char adds(unsigned char var, unsigned char val)
{
unsigned short sat = 0xff, tmp = var;

asm ("addb %[val], %b[var];"
 "cmovc%[sat], %[var];"
 : [var] "+r" (tmp)
 : [val] "ir" (val), [sat] "r" (sat)
 );

return tmp;
}

Another thing to try is, instead of threading that lockval throughout
the thing, simply:

#define _Q_LOCKED_VAL   this_cpu_read_stable(cpu_sat)

or combined with the above

#define _Q_LOCKED_VAL   adds(this_cpu_read_stable(cpu_number), 2)

and see if the compiler really makes a mess of things.


Thanks for the suggestion. I will try that out.

Cheers,
Longman



Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-23 Thread Waiman Long

On 7/23/20 2:47 PM, pet...@infradead.org wrote:

On Thu, Jul 23, 2020 at 02:32:36PM -0400, Waiman Long wrote:

BTW, do you have any comment on my v2 lock holder cpu info qspinlock patch?
I will have to update the patch to fix the reported 0-day test problem, but
I want to collect other feedback before sending out v3.

I want to say I hate it all, it adds instructions to a path we spend an
awful lot of time optimizing without really getting anything back for
it.


It does add some extra instructions that may slow it down slightly, but I
don't agree that it gives nothing back. The cpu lock holder information
can be useful in analyzing crash dumps and in some debugging situations.
I think it can be useful in RHEL for this reason. How about an x86
config option to allow distros to decide if they want to have it
enabled? I will make sure that it will have no performance degradation
if the option is not enabled.


Cheers,
Longman




Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-23 Thread Waiman Long

On 7/23/20 10:00 AM, Peter Zijlstra wrote:

On Thu, Jul 09, 2020 at 12:06:13PM -0400, Waiman Long wrote:

We don't really need to do a pv_spinlocks_init() if pv_kick() isn't
supported.

Waiman, if you cannot explain how not having kick is a sane thing, what
are you saying here?

The current PPC paravirt spinlock code doesn't do any cpu kick. It does
an equivalent of pv_wait by yielding the cpu to the lock holder only.
The pv_spinlocks_init() is for setting up the hash table used for
pv_kick. If we don't need to do pv_kick, we don't need the hash table.


I am not saying that pv_kick is not needed for the PPC environment. I 
was just trying to adapt the pvqspinlock code to such an environment 
first. Further investigation on how to implement some kind of pv_kick 
will be something that we may want to do as a follow on.
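
For readers unfamiliar with the x86 scheme being contrasted here, a sketch of the contract that makes the hash table necessary there (semantics only, not the real implementation):

/*
 * pv_wait(ptr, val): park this vCPU in the hypervisor while *ptr == val.
 * pv_kick(cpu):      wake the vCPU that parked itself in pv_wait().
 *
 * Because the unlocker must know *which* CPU to kick, the lock address is
 * hashed to a (lock, waiter) entry before the waiter parks; sizing that
 * hash table is what pv_spinlocks_init() does. A yield-to-holder scheme,
 * as on PPC here, never performs that reverse lookup, so it needs no table.
 */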


BTW, do you have any comment on my v2 lock holder cpu info qspinlock 
patch? I will have to update the patch to fix the reported 0-day test 
problem, but I want to collect other feedback before sending out v3.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-23 Thread Waiman Long

On 7/23/20 9:30 AM, Nicholas Piggin wrote:

I would prefer to extract the pending bit handling code out into a
separate helper function which can be overridden by the arch code
instead of breaking the slowpath into 2 pieces.

You mean have the arch provide a queued_spin_lock_slowpath_pending
function that the slow path calls?

I would actually prefer the pending handling can be made inline in
the queued_spin_lock function, especially with out-of-line locks it
makes sense to put it there.

We could ifdef out queued_spin_lock_slowpath_queue if it's not used,
then __queued_spin_lock_slowpath_queue would be inlined into the
caller so there would be no split?


The pending code is an optimization for lightly contended locks. That is 
why I think it is appropriate to extract it into a helper function and 
mark it as such.


You can certainly put the code in the arch's spin_lock code; you just
have to override the generic pending code with a null function.
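
An illustrative shape of that split, using the queued_spin_lock_slowpath_pending() name from Nick's question above (not the actual patch):

/* Generic code -- the pending-bit fast path becomes an overridable helper: */
#ifndef queued_spin_lock_slowpath_pending
static inline bool queued_spin_lock_slowpath_pending(struct qspinlock *lock, u32 *val)
{
	/* ...the existing pending-bit handling, returning true on acquire... */
	return false;
}
#endif

void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	if (queued_spin_lock_slowpath_pending(lock, &val))
		return;
	/* MCS queueing path, unchanged. */
	queued_spin_lock_slowpath_queue(lock);
}

/* An arch that handles pending itself in queued_spin_lock() just stubs the
 * helper out in its own asm/qspinlock.h: */
#define queued_spin_lock_slowpath_pending(lock, val)	false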


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-21 Thread Waiman Long

On 7/21/20 7:08 AM, Nicholas Piggin wrote:

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index b752d34517b3..26d8766a1106 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -31,16 +31,57 @@ static inline void queued_spin_unlock(struct qspinlock 
*lock)
  
  #else

  extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void queued_spin_lock_slowpath_queue(struct qspinlock *lock);
  #endif
  
  static __always_inline void queued_spin_lock(struct qspinlock *lock)

  {
-   u32 val = 0;
-
-   if (likely(atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+   atomic_t *a = &lock->val;
+   u32 val;
+
+again:
+   asm volatile(
+"1:\t"   PPC_LWARX(%0,0,%1,1) " # queued_spin_lock  
\n"
+   : "=&r" (val)
+   : "r" (&a->counter)
+   : "memory");
+
+   if (likely(val == 0)) {
+   asm_volatile_goto(
+   "  stwcx.  %0,0,%1 \n"
+   "  bne-%l[again]   \n"
+   "\t"  PPC_ACQUIRE_BARRIER "  
\n"
+   :
+   : "r"(_Q_LOCKED_VAL), "r" (&a->counter)
+   : "cr0", "memory"
+   : again );
return;
-
-   queued_spin_lock_slowpath(lock, val);
+   }
+
+   if (likely(val == _Q_LOCKED_VAL)) {
+   asm_volatile_goto(
+   "  stwcx.  %0,0,%1 \n"
+   "  bne-%l[again]   \n"
+   :
+   : "r"(_Q_LOCKED_VAL | _Q_PENDING_VAL), "r" (&a->counter)
+   : "cr0", "memory"
+   : again );
+
+   atomic_cond_read_acquire(a, !(VAL & _Q_LOCKED_MASK));
+// clear_pending_set_locked(lock);
+   WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+// lockevent_inc(lock_pending);
+   return;
+   }
+
+   if (val == _Q_PENDING_VAL) {
+   int cnt = _Q_PENDING_LOOPS;
+   val = atomic_cond_read_relaxed(a,
+  (VAL != _Q_PENDING_VAL) || 
!cnt--);
+   if (!(val & ~_Q_LOCKED_MASK))
+   goto again;
+}
+   queued_spin_lock_slowpath_queue(lock);
  }
  #define queued_spin_lock queued_spin_lock
  


I am fine with the arch code override some part of the generic code.



diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b9515fcc9b29..ebcc6f5d99d5 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -287,10 +287,14 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
  
  #ifdef CONFIG_PARAVIRT_SPINLOCKS

  #define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
+#define queued_spin_lock_slowpath_queue
native_queued_spin_lock_slowpath_queue
  #endif
  
  #endif /* _GEN_PV_LOCK_SLOWPATH */
  
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock);

+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock);
+
  /**
   * queued_spin_lock_slowpath - acquire the queued spinlock
   * @lock: Pointer to queued spinlock structure
@@ -314,12 +318,6 @@ static __always_inline u32  __pv_wait_head_or_lock(struct 
qspinlock *lock,
   */
  void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
  {
-   struct mcs_spinlock *prev, *next, *node;
-   u32 old, tail;
-   int idx;
-
-   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
-
if (pv_enabled())
goto pv_queue;
  
@@ -397,6 +395,26 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)

  queue:
lockevent_inc(lock_slowpath);
  pv_queue:
+   __queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath);
+
+void queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+   lockevent_inc(lock_slowpath);
+   __queued_spin_lock_slowpath_queue(lock);
+}
+EXPORT_SYMBOL(queued_spin_lock_slowpath_queue);
+
+static void __queued_spin_lock_slowpath_queue(struct qspinlock *lock)
+{
+   struct mcs_spinlock *prev, *next, *node;
+   u32 old, tail;
+   u32 val;
+   int idx;
+
+   BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
+
	node = this_cpu_ptr(&qnodes[0].mcs);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -559,7 +577,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 
val)
 */
__this_cpu_dec(qnodes[0].mcs.count);
  }
-EXPORT_SYMBOL(queued_spin_lock_slowpath);
  
  /*

   * Generate the paravirt code for queued_spin_unlock_slowpath().

I would prefer to extract the pending bit handling code out into a 
separate helper function which can be overridden by the arch code 
instead of breaking the slowpath into two pieces.
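
A minimal sketch of one way such a helper could be carved out, assuming the
generic pending-bit logic is simply moved into an overridable function (the
name queued_spin_lock_pending() and the #ifndef override hook are
illustrative assumptions, not something from the posted patches):

#ifndef queued_spin_lock_pending
/*
 * Sketch: generic pending-bit handling pulled out of the slowpath so an
 * architecture can override just this part.  Returns true if the lock was
 * taken via the pending path, false if the caller must fall back to the
 * MCS queue.
 */
static __always_inline bool queued_spin_lock_pending(struct qspinlock *lock, u32 val)
{
	if (val == _Q_PENDING_VAL) {
		int cnt = _Q_PENDING_LOOPS;

		/* Wait a bounded time for a pending-only owner to finish. */
		val = atomic_cond_read_relaxed(&lock->val,
					       (VAL != _Q_PENDING_VAL) || !cnt--);
	}

	/* Any contention beyond the locked bit: let the caller queue. */
	if (val & ~_Q_LOCKED_MASK)
		return false;

	/* trylock || pending */
	val = queued_fetch_set_pending_acquire(lock);
	if (unlikely(val & ~_Q_LOCKED_MASK)) {
		/* Undo the pending bit if we were the one to set it. */
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);
		return false;
	}

	/* We are the pending waiter; wait for the owner to go away. */
	if (val & _Q_LOCKED_MASK)
		atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));

	clear_pending_set_locked(lock);
	return true;
}
#endif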

Re: [PATCH v3 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-09 Thread Waiman Long

On 7/9/20 6:53 AM, Michael Ellerman wrote:

Nicholas Piggin  writes:


Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/paravirt.h   | 28 
  arch/powerpc/include/asm/qspinlock.h  | 66 +++
  arch/powerpc/include/asm/qspinlock_paravirt.h |  7 ++
  arch/powerpc/platforms/pseries/Kconfig|  5 ++
  arch/powerpc/platforms/pseries/setup.c|  6 +-
  include/asm-generic/qspinlock.h   |  2 +

Another ack?


I am OK with adding the #ifdef around queued_spin_lock().

Acked-by: Waiman Long 
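
For reference, the #ifdef being acked guards the generic fast path in
include/asm-generic/qspinlock.h, roughly as sketched below (reconstructed
from the generic code, not the exact hunk in this patch):

#ifndef queued_spin_lock
static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
	u32 val = 0;

	/* Uncontended fast path: 0 -> _Q_LOCKED_VAL. */
	if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
		return;

	queued_spin_lock_slowpath(lock, val);
}
#endif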


diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 7a8546660a63..f2d51f929cf5 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
___bad_yield_to_preempted(); /* This would be a bug */
  }
+
+extern void ___bad_yield_to_any(void);
+static inline void yield_to_any(void)
+{
+   ___bad_yield_to_any(); /* This would be a bug */
+}

Why do we do that rather than just not defining yield_to_any() at all
and letting the build fail on that?

There's a condition somewhere that we know will be false at compile time
and that drops the call before linking?


diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h 
b/arch/powerpc/include/asm/qspinlock_paravirt.h
new file mode 100644
index ..750d1b5e0202
--- /dev/null
+++ b/arch/powerpc/include/asm/qspinlock_paravirt.h
@@ -0,0 +1,7 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+#ifndef __ASM_QSPINLOCK_PARAVIRT_H
+#define __ASM_QSPINLOCK_PARAVIRT_H

_ASM_POWERPC_QSPINLOCK_PARAVIRT_H please.


+
+EXPORT_SYMBOL(__pv_queued_spin_unlock);

Why's that in a header? Should that (eventually) go with the generic 
implementation?
The PV qspinlock implementation is not that generic at the moment. Even 
though native qspinlock is used by a number of archs, PV qspinlock is 
only currently used in x86. This is certainly an area that needs 
improvement.

diff --git a/arch/powerpc/platforms/pseries/Kconfig 
b/arch/powerpc/platforms/pseries/Kconfig
index 24c18362e5ea..756e727b383f 100644
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -25,9 +25,14 @@ config PPC_PSERIES
select SWIOTLB
default y
  
+config PARAVIRT_SPINLOCKS

+   bool
+   default n

default n is the default.


diff --git a/arch/powerpc/platforms/pseries/setup.c 
b/arch/powerpc/platforms/pseries/setup.c
index 2db8469e475f..747a203d9453 100644
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -771,8 +771,12 @@ static void __init pSeries_setup_arch(void)
if (firmware_has_feature(FW_FEATURE_LPAR)) {
vpa_init(boot_cpuid);
  
-		if (lppaca_shared_proc(get_lppaca()))

+   if (lppaca_shared_proc(get_lppaca())) {
static_branch_enable(_processor);
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+   pv_spinlocks_init();
+#endif
+   }

We could avoid the ifdef with this I think?

diff --git a/arch/powerpc/include/asm/spinlock.h 
b/arch/powerpc/include/asm/spinlock.h
index 434615f1d761..6ec72282888d 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -10,5 +10,9 @@
  #include 
  #endif

+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+static inline void pv_spinlocks_init(void) { }
+#endif
+
  #endif /* __KERNEL__ */
  #endif /* __ASM_SPINLOCK_H */


cheers

We don't really need to do a pv_spinlocks_init() if pv_kick() isn't 
supported.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 7:50 PM, Waiman Long wrote:

On 7/8/20 1:10 AM, Nicholas Piggin wrote:

Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.

Regarding the paravirt lock, I have taken a further look into the
current PPC spinlock code. There is an equivalent of pv_wait() but no
pv_kick(). Maybe PPC doesn't really need that.

So powerpc has two types of wait, either undirected "all processors" or
directed to a specific processor which has been preempted by the
hypervisor.

The simple spinlock code does a directed wait, because it knows the CPU
which is holding the lock. In this case, there is a sequence that is
used to ensure we don't wait if the condition has become true, and the
target CPU does not need to kick the waiter it will happen automatically
(see splpar_spin_yield). This is preferable because we only wait as
needed and don't require the kick operation.

Thanks for the explanation.


The pv spinlock code I did uses the undirected wait, because we don't
know the CPU number which we are waiting on. This is undesirable because
it's higher overhead and the wait is not so accurate.

I think perhaps we could change things so we wait on the correct CPU
when queued, which might be good enough (we could also put the lock
owner CPU in the spinlock word, if we add another format).


The LS byte of the lock word is used to indicate locking status. If we 
have less than 255 cpus, we can put the (cpu_nr + 1) into the lock 
byte. The special 0xff value can be used to indicate a cpu number >= 
255 for indirect yield. The required change to the qspinlock code will 
be minimal, I think. 


BTW, we can also keep track of the previous cpu in the waiting queue. 
Due to lock stealing, that may not be the cpu that is holding the lock. 
Maybe we can use this, if available, in case the cpu number is >= 255.
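
A rough sketch of that encoding, purely for illustration (the helper names
and the exact reserved value below are assumptions, not from any posted
patch):

#define _Q_OWNER_INDIRECT	0xff	/* cpu number too large to encode directly */

/* Store (cpu + 1) in the locked byte so that 0 still means "unlocked". */
static inline void set_lock_owner(struct qspinlock *lock, int cpu)
{
	u8 val = (cpu + 1 < _Q_OWNER_INDIRECT) ? (u8)(cpu + 1) : _Q_OWNER_INDIRECT;

	WRITE_ONCE(lock->locked, val);
}

/* Return the owner cpu, or -1 if unlocked or only an indirect yield is possible. */
static inline int get_lock_owner(struct qspinlock *lock)
{
	u8 val = READ_ONCE(lock->locked);

	if (val == 0 || val == _Q_OWNER_INDIRECT)
		return -1;
	return val - 1;
}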


Regards,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 4:41 AM, Peter Zijlstra wrote:

On Tue, Jul 07, 2020 at 03:57:06PM +1000, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

Can you clarify? The slow path is already in use on ARM64 which is weak,
so I doubt there's superfluous serialization present. And Will spent a
fair amount of time on making that thing guarantee forward progress, so
there just isn't too much room to play.


We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

Going by your jump_label implementation, support for static_call should
be fairly straight forward too, no?

   https://lkml.kernel.org/r/20200624153024.794671...@infradead.org

Speaking of static_call, I am also looking forward to it. Do you have an 
idea when that will be merged?


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 4:32 AM, Peter Zijlstra wrote:

On Tue, Jul 07, 2020 at 11:33:45PM -0400, Waiman Long wrote:

 From 5d7941a498935fb225b2c7a3108cbf590114c3db Mon Sep 17 00:00:00 2001
From: Waiman Long 
Date: Tue, 7 Jul 2020 22:29:16 -0400
Subject: [PATCH 2/9] locking/pvqspinlock: Introduce
  CONFIG_PARAVIRT_QSPINLOCKS_LITE

Add a new PARAVIRT_QSPINLOCKS_LITE config option that allows
architectures to use the PV qspinlock code without the need to use or
implement a pv_kick() function, thus eliminating the atomic unlock
overhead. The non-atomic queued_spin_unlock() can be used instead.
The pv_wait() function will still be needed, but it can be a dummy
function.

With that option set, the hybrid PV queued/unfair locking code should
still be able to make it performant enough in a paravirtualized

How is this supposed to work? If there is no kick, you have no control
over who wakes up and fairness goes out the window entirely.

You don't even begin to explain...

I don't have a full understanding of how the PPC hypervisor works myself. 
Apparently, a cpu kick may not be needed.


This is just a test patch to see if it yields a better result. It is 
subject to further modification.
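
As a rough illustration only, the "lite" option amounts to letting an arch
plug in something like the sketch below (an assumed shape that reuses the
yield_to_any() helper from this series, not the attached patch itself):

#ifdef CONFIG_PARAVIRT_QSPINLOCKS_LITE
/* No pv_kick(): the preempted vCPU resumes on its own, so this is a no-op. */
static inline void pv_kick(int cpu) { }

/* pv_wait() just yields to the hypervisor; no wake-up from the unlocker. */
static inline void pv_wait(u8 *ptr, u8 val)
{
	if (READ_ONCE(*ptr) == val)
		yield_to_any();
}

/* The cheap non-atomic unlock is enough since nobody needs to be kicked. */
#define queued_spin_unlock queued_spin_unlock
static inline void queued_spin_unlock(struct qspinlock *lock)
{
	smp_store_release(&lock->locked, 0);
}
#endif /* CONFIG_PARAVIRT_QSPINLOCKS_LITE */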


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-08 Thread Waiman Long

On 7/8/20 1:10 AM, Nicholas Piggin wrote:

Excerpts from Waiman Long's message of July 8, 2020 1:33 pm:

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.

Regarding the paravirt lock, I have taken a further look into the
current PPC spinlock code. There is an equivalent of pv_wait() but no
pv_kick(). Maybe PPC doesn't really need that.

So powerpc has two types of wait, either undirected "all processors" or
directed to a specific processor which has been preempted by the
hypervisor.

The simple spinlock code does a directed wait, because it knows the CPU
which is holding the lock. In this case, there is a sequence that is
used to ensure we don't wait if the condition has become true, and the
target CPU does not need to kick the waiter it will happen automatically
(see splpar_spin_yield). This is preferable because we only wait as
needed and don't require the kick operation.

Thanks for the explanation.


The pv spinlock code I did uses the undirected wait, because we don't
know the CPU number which we are waiting on. This is undesirable because
it's higher overhead and the wait is not so accurate.

I think perhaps we could change things so we wait on the correct CPU
when queued, which might be good enough (we could also put the lock
owner CPU in the spinlock word, if we add another format).


The LS byte of the lock word is used to indicate locking status. If we 
have less than 255 cpus, we can put the (cpu_nr + 1) into the lock byte. 
The special 0xff value can be used to indicate a cpu number >= 255 for 
indirect yield. The required change to the qspinlock code will be 
minimal, I think.




Attached are two
additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE
option to not require pv_kick(). There is also a fixup patch to be
applied after your patchset.

I don't have access to a PPC LPAR with shared processor at the moment,
so I can't test the performance of the paravirt code. Would you mind
adding my patches and do some performance test on your end to see if it
gives better result?

Great, I'll do some tests. Any suggestions for what to try?


I would just like to see if it produces a better performance result 
compared with your current version.


Cheers,
Longman



Re: [PATCH v3 0/6] powerpc: queued spinlocks and rwlocks

2020-07-07 Thread Waiman Long

On 7/7/20 1:57 AM, Nicholas Piggin wrote:

Yes, powerpc could certainly get more performance out of the slow
paths, and then there are a few parameters to tune.

We don't have a good alternate patching for function calls yet, but
that would be something to do for native vs pv.

And then there seem to be one or two tunable parameters we could
experiment with.

The paravirt locks may need a bit more tuning. Some simple testing
under KVM shows we might be a bit slower in some cases. Whether this
is fairness or something else I'm not sure. The current simple pv
spinlock code can do a directed yield to the lock holder CPU, whereas
the pv qspl here just does a general yield. I think we might actually
be able to change that to also support directed yield. Though I'm
not sure if this is actually the cause of the slowdown yet.


Regarding the paravirt lock, I have taken a further look into the 
current PPC spinlock code. There is an equivalent of pv_wait() but no 
pv_kick(). Maybe PPC doesn't really need that. Attached are two 
additional qspinlock patches that adds a CONFIG_PARAVIRT_QSPINLOCKS_LITE 
option to not require pv_kick(). There is also a fixup patch to be 
applied after your patchset.


I don't have access to a PPC LPAR with shared processor at the moment, 
so I can't test the performance of the paravirt code. Would you mind 
adding my patches and do some performance test on your end to see if it 
gives better result?


Thanks,
Longman

>From 161e545523a7eb4c42c145c04e9a5a15903ba3d9 Mon Sep 17 00:00:00 2001
From: Waiman Long 
Date: Tue, 7 Jul 2020 20:46:51 -0400
Subject: [PATCH 1/9] locking/pvqspinlock: Code relocation and extraction

Move pv_kick_node() and the unlock functions up and extract out the hash
and lock code from pv_wait_head_or_lock() into pv_hash_lock(). There
is no functional change.

Signed-off-by: Waiman Long 
---
 kernel/locking/qspinlock_paravirt.h | 302 ++--
 1 file changed, 156 insertions(+), 146 deletions(-)

diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..8eec58320b85 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -55,6 +55,7 @@ struct pv_node {
 
 /*
  * Hybrid PV queued/unfair lock
+ * 
  *
  * By replacing the regular queued_spin_trylock() with the function below,
  * it will be called once when a lock waiter enter the PV slowpath before
@@ -259,6 +260,156 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
 	BUG();
 }
 
+/*
+ * Insert lock into hash and set _Q_SLOW_VAL.
+ * Return true if lock acquired.
+ */
+static inline bool pv_hash_lock(struct qspinlock *lock, struct pv_node *node)
+{
+	struct qspinlock **lp = pv_hash(lock, node);
+
+	/*
+	 * We must hash before setting _Q_SLOW_VAL, such that
+	 * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
+	 * we'll be sure to be able to observe our hash entry.
+	 *
+	 *   [S] <hash>                 [Rmw] l->locked == _Q_SLOW_VAL
+	 *       MB                           RMB
+	 * [RmW] l->locked = _Q_SLOW_VAL  [L] <unhash>
+	 *
+	 * Matches the smp_rmb() in __pv_queued_spin_unlock().
+	 */
+	if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
+		/*
+		 * The lock was free and now we own the lock.
+		 * Change the lock value back to _Q_LOCKED_VAL
+		 * and unhash the table.
+		 */
+		WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+		WRITE_ONCE(*lp, NULL);
+		return true;
+	}
+	return false;
+}
+
+/*
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_or_lock(), this avoids a
+ * wake/sleep cycle.
+ */
+static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
+{
+	struct pv_node *pn = (struct pv_node *)node;
+
+	/*
+	 * If the vCPU is indeed halted, advance its state to match that of
+	 * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+	 * observe its next->locked value and advance itself.
+	 *
+	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+	 *
+	 * The write to next->locked in arch_mcs_spin_unlock_contended()
+	 * must be ordered before the read of pn->state in the cmpxchg()
+	 * below for the code to work correctly. To guarantee full ordering
+	 * irrespective of the success or failure of the cmpxchg(),
+	 * a relaxed version with explicit barrier is used. The control
+	 * dependency will order the reading of pn->state before any
+	 * subsequent writes.
+	 */
+	smp_mb__before_atomic();
+	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
+	!= vcpu_halted)
+		return;
+
+	/*
+	 * Put the lock into the hash table and set the _Q_SLOW_VAL.
+	 *
+	 * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+	 * the hash table later on at unlock time, no atomic instruction is
+	 * needed.
+	 */
+	WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
+	(void)pv_hash

Re: [PATCH v2 5/6] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-05 Thread Waiman Long

On 7/3/20 3:35 AM, Nicholas Piggin wrote:

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/paravirt.h   | 28 ++
  arch/powerpc/include/asm/qspinlock.h  | 55 +++
  arch/powerpc/include/asm/qspinlock_paravirt.h |  5 ++
  arch/powerpc/platforms/pseries/Kconfig|  5 ++
  arch/powerpc/platforms/pseries/setup.c|  6 +-
  include/asm-generic/qspinlock.h   |  2 +
  6 files changed, 100 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h

diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 7a8546660a63..f2d51f929cf5 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
yield_count);
  }
+
+static inline void prod_cpu(int cpu)
+{
+   plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
+}
+
+static inline void yield_to_any(void)
+{
+   plpar_hcall_norets(H_CONFER, -1, 0);
+}
  #else
  static inline bool is_shared_processor(void)
  {
@@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
___bad_yield_to_preempted(); /* This would be a bug */
  }
+
+extern void ___bad_yield_to_any(void);
+static inline void yield_to_any(void)
+{
+   ___bad_yield_to_any(); /* This would be a bug */
+}
+
+extern void ___bad_prod_cpu(void);
+static inline void prod_cpu(int cpu)
+{
+   ___bad_prod_cpu(); /* This would be a bug */
+}
+
  #endif
  
  #define vcpu_is_preempted vcpu_is_preempted

@@ -57,5 +80,10 @@ static inline bool vcpu_is_preempted(int cpu)
return false;
  }
  
+static inline bool pv_is_native_spin_unlock(void)

+{
+ return !is_shared_processor();
+}
+
  #endif /* __KERNEL__ */
  #endif /* __ASM_PARAVIRT_H */
diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index c49e33e24edd..0960a0de2467 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -3,9 +3,36 @@
  #define _ASM_POWERPC_QSPINLOCK_H
  
  #include 

+#include <asm/paravirt.h>
  
  #define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
  
+#ifdef CONFIG_PARAVIRT_SPINLOCKS

+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, 
u32 val)
+{
+   if (!is_shared_processor())
+   native_queued_spin_lock_slowpath(lock, val);
+   else
+   __pv_queued_spin_lock_slowpath(lock, val);
+}


In a previous mail, I said that:

You may need to match the use of __pv_queued_spin_lock_slowpath() with 
the corresponding __pv_queued_spin_unlock(), e.g.


#define queued_spin_unlock queued_spin_unlock
static inline void queued_spin_unlock(struct qspinlock *lock)
{
	if (!is_shared_processor())
		smp_store_release(&lock->locked, 0);
	else
		__pv_queued_spin_unlock(lock);
}

Otherwise, pv_kick() will never be called.

Maybe PowerPC HMT is different in that the shared cpus can still process 
instructions, though more slowly, so that cpu kicking like what was done in 
kvm is not really necessary. If that is the case, I think we should document 
that.


Cheers,
Longman


Re: [PATCH 6/8] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-02 Thread Waiman Long

On 7/2/20 12:15 PM, kernel test robot wrote:

Hi Nicholas,

I love your patch! Yet something to improve:

[auto build test ERROR on powerpc/next]
[also build test ERROR on tip/locking/core v5.8-rc3 next-20200702]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-queued-spinlocks-and-rwlocks/20200702-155158
base:   https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
config: powerpc-allyesconfig (attached as .config)
compiler: powerpc64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
 wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
 chmod +x ~/bin/make.cross
 # save the attached .config to linux build tree
 COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross 
ARCH=powerpc

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

kernel/locking/lock_events.c:61:16: warning: no previous prototype for 
'lockevent_read' [-Wmissing-prototypes]
   61 | ssize_t __weak lockevent_read(struct file *file, char __user 
*user_buf,
  |^~
kernel/locking/lock_events.c: In function 'skip_lockevent':

kernel/locking/lock_events.c:126:12: error: implicit declaration of function 
'pv_is_native_spin_unlock' [-Werror=implicit-function-declaration]

  126 |   pv_on = !pv_is_native_spin_unlock();
  |^~~~
cc1: some warnings being treated as errors

vim +/pv_is_native_spin_unlock +126 kernel/locking/lock_events.c


I think you will need to add the following into 
arch/powerpc/include/asm/qspinlock_paravirt.h:


static inline bool pv_is_native_spin_unlock(void)
{
    return !is_shared_processor();
}

Cheers,
Longman



Re: [PATCH 6/8] powerpc/pseries: implement paravirt qspinlocks for SPLPAR

2020-07-02 Thread Waiman Long

On 7/2/20 3:48 AM, Nicholas Piggin wrote:

Signed-off-by: Nicholas Piggin 
---
  arch/powerpc/include/asm/paravirt.h   | 23 
  arch/powerpc/include/asm/qspinlock.h  | 55 +++
  arch/powerpc/include/asm/qspinlock_paravirt.h |  5 ++
  arch/powerpc/platforms/pseries/Kconfig|  5 ++
  arch/powerpc/platforms/pseries/setup.c|  6 +-
  include/asm-generic/qspinlock.h   |  2 +
  6 files changed, 95 insertions(+), 1 deletion(-)
  create mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h

diff --git a/arch/powerpc/include/asm/paravirt.h 
b/arch/powerpc/include/asm/paravirt.h
index 7a8546660a63..5fae9dfa6fe9 100644
--- a/arch/powerpc/include/asm/paravirt.h
+++ b/arch/powerpc/include/asm/paravirt.h
@@ -29,6 +29,16 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
plpar_hcall_norets(H_CONFER, get_hard_smp_processor_id(cpu), 
yield_count);
  }
+
+static inline void prod_cpu(int cpu)
+{
+   plpar_hcall_norets(H_PROD, get_hard_smp_processor_id(cpu));
+}
+
+static inline void yield_to_any(void)
+{
+   plpar_hcall_norets(H_CONFER, -1, 0);
+}
  #else
  static inline bool is_shared_processor(void)
  {
@@ -45,6 +55,19 @@ static inline void yield_to_preempted(int cpu, u32 
yield_count)
  {
___bad_yield_to_preempted(); /* This would be a bug */
  }
+
+extern void ___bad_yield_to_any(void);
+static inline void yield_to_any(void)
+{
+   ___bad_yield_to_any(); /* This would be a bug */
+}
+
+extern void ___bad_prod_cpu(void);
+static inline void prod_cpu(int cpu)
+{
+   ___bad_prod_cpu(); /* This would be a bug */
+}
+
  #endif
  
  #define vcpu_is_preempted vcpu_is_preempted

diff --git a/arch/powerpc/include/asm/qspinlock.h 
b/arch/powerpc/include/asm/qspinlock.h
index f84da77b6bb7..997a9a32df77 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -3,9 +3,36 @@
  #define _ASM_POWERPC_QSPINLOCK_H
  
  #include 

+#include <asm/paravirt.h>
  
  #define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
  
+#ifdef CONFIG_PARAVIRT_SPINLOCKS

+extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+
+static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, 
u32 val)
+{
+   if (!is_shared_processor())
+   native_queued_spin_lock_slowpath(lock, val);
+   else
+   __pv_queued_spin_lock_slowpath(lock, val);
+}


You may need to match the use of __pv_queued_spin_lock_slowpath() with 
the corresponding __pv_queued_spin_unlock(), e.g.


#define queued_spin_unlock queued_spin_unlock
static inline void queued_spin_unlock(struct qspinlock *lock)
{
	if (!is_shared_processor())
		smp_store_release(&lock->locked, 0);
	else
		__pv_queued_spin_unlock(lock);
}

Otherwise, pv_kick() will never be called.

Cheers,
Longman


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long

On 6/16/20 2:53 PM, Joe Perches wrote:

On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote:

  v4:
   - Break out the memzero_explicit() change as suggested by Dan Carpenter
 so that it can be backported to stable.
   - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
 now as there can be a bit more discussion on what is best. It will be
 introduced as a separate patch later on after this one is merged.

To this larger audience and last week without reply:
https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/

Are there _any_ fastpath uses of kfree or vfree?

Many patches have been posted recently to fix mispairings
of specific types of alloc and free functions.

To eliminate these mispairings at a runtime cost of four
comparisons, should the kfree/vfree/kvfree/kfree_const
functions be consolidated into a single kfree?

Something like the below:

void kfree(const void *addr)
{
if (is_kernel_rodata((unsigned long)addr))
return;

if (is_vmalloc_addr(addr))
_vfree(addr);
else
_kfree(addr);
}

#define kvfree  kfree
#define vfree   kfree
#define kfree_const kfree


How about adding CONFIG_DEBUG_VM code to check for invalid address 
ranges in kfree() and vfree()? By doing this, we can catch unmatched 
pairing in debug mode, but won't have the overhead when debug mode is off.
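
A minimal sketch of the kind of check meant here, assuming a
CONFIG_DEBUG_VM-only warning (illustrative, not an actual patch):

/* Debug-only sanity check: a kfree()'d pointer must not be a vmalloc address. */
static inline void kfree_debug_check(const void *addr)
{
#ifdef CONFIG_DEBUG_VM
	VM_WARN_ON_ONCE(is_vmalloc_addr(addr));
#endif
}

/* Mirror check for vfree(): a non-NULL pointer should be a vmalloc address. */
static inline void vfree_debug_check(const void *addr)
{
#ifdef CONFIG_DEBUG_VM
	VM_WARN_ON_ONCE(addr && !is_vmalloc_addr(addr));
#endif
}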


Thought?

Cheers,
Longman



Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long

On 6/16/20 2:53 PM, Joe Perches wrote:

On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote:

  v4:
   - Break out the memzero_explicit() change as suggested by Dan Carpenter
 so that it can be backported to stable.
   - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
 now as there can be a bit more discussion on what is best. It will be
 introduced as a separate patch later on after this one is merged.

To this larger audience and last week without reply:
https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/

Are there _any_ fastpath uses of kfree or vfree?


I am not sure about that, but both of them can be slow.




Many patches have been posted recently to fix mispairings
of specific types of alloc and free functions.

To eliminate these mispairings at a runtime cost of four
comparisons, should the kfree/vfree/kvfree/kfree_const
functions be consolidated into a single kfree?

Something like the below:

void kfree(const void *addr)
{
if (is_kernel_rodata((unsigned long)addr))
return;

if (is_vmalloc_addr(addr))
_vfree(addr);
else
_kfree(addr);
}

is_kernel_rodata() is inlined, but is_vmalloc_addr() isn't. So the 
overhead can be a bit bigger.


Cheers,
Longman



Re: [PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long

On 6/16/20 2:09 PM, Andrew Morton wrote:

On Tue, 16 Jun 2020 11:43:11 -0400 Waiman Long  wrote:


As said by Linus:

   A symmetric naming is only helpful if it implies symmetries in use.
   Otherwise it's actively misleading.

   In "kzalloc()", the z is meaningful and an important part of what the
   caller wants.

   In "kzfree()", the z is actively detrimental, because maybe in the
   future we really _might_ want to use that "memfill(0xdeadbeef)" or
   something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the
recently added kvfree_sensitive() and make the intention of the API
more explicit. In addition, memzero_explicit() is used to clear the
memory to make sure that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

   git grep -w --name-only kzfree |\
   xargs sed -i 's/\bkzfree\b/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.

...

--- a/include/linux/slab.h
+++ b/include/linux/slab.h
@@ -186,10 +186,12 @@ void memcg_deactivate_kmem_caches(struct mem_cgroup *, 
struct mem_cgroup *);
   */
  void * __must_check krealloc(const void *, size_t, gfp_t);
  void kfree(const void *);
-void kzfree(const void *);
+void kfree_sensitive(const void *);
  size_t __ksize(const void *);
  size_t ksize(const void *);
  
+#define kzfree(x)	kfree_sensitive(x)	/* For backward compatibility */

+

What was the thinking here?  Is this really necessary?

I suppose we could keep this around for a while to ease migration.  But
not for too long, please.

It should be there just for 1 release cycle. I have broken out the btrfs 
patch to the btrfs list and I didn't make the kzfree to kfree_sensitive 
conversion there as that patch was in front in my patch list. So 
depending on which one lands first, there can be a window where the 
compilation may fail without this workaround. I am going to send out 
another patch in the next release cycle to remove it.


Cheers,
Longman



[PATCH v5 2/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long
As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the
recently added kvfree_sensitive() and make the intention of the API
more explicit. In addition, memzero_explicit() is used to clear the
memory to make sure that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/\bkzfree\b/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.

Suggested-by: Joe Perches 
Acked-by: David Howells 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Signed-off-by: Waiman Long 
---
 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  |  2 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  |  2 +-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   |  4 +--
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 32 -
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   |  4 +--
 fs/ecryptfs/keystore.c 

[PATCH v5 1/2] mm/slab: Use memzero_explicit() in kzfree()

2020-06-16 Thread Waiman Long
The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for buffer clearing. However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future. To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used
in kzfree() to future-proof it.

Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Cc: sta...@vger.kernel.org
Acked-by: Michal Hocko 
Signed-off-by: Waiman Long 
---
 mm/slab_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..37d48a56431d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
-   memset(mem, 0, ks);
+   memzero_explicit(mem, ks);
kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.18.1



[PATCH v5 0/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long
 v5:
  - Break the btrfs patch out as a separate patch to be processed
independently.
  - Update the commit log of patch 1 to make it less scary.
  - Add a kzfree backward compatibility macro in patch 2.

 v4:
  - Break out the memzero_explicit() change as suggested by Dan Carpenter
so that it can be backported to stable.
  - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
now as there can be a bit more discussion on what is best. It will be
introduced as a separate patch later on after this one is merged.

This patchset makes a global rename of kzfree() to kfree_sensitive()
to highlight the fact that buffer clearing is only needed if the data objects
contain sensitive information like encryption keys. The fact that kzfree()
uses memset() to do the clearing isn't totally safe either, as the compiler
may optimize out the clearing, especially if LTO is used. Instead, the new
kfree_sensitive() uses memzero_explicit(), which won't get compiled out.


Waiman Long (2):
  mm/slab: Use memzero_explicit() in kzfree()
  mm, treewide: Rename kzfree() to kfree_sensitive()

 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  |  2 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  |  2 +-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   |  4 +--
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 32 -
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   |  4 +--
 fs/ecryptfs/keystore.c|  4 +--
 fs/ecryptfs/messaging.c   |  2 +-
 include/crypto/aead.h |  2 +-
 include/crypto/

Re: [PATCH v4 2/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Waiman Long

On 6/16/20 10:26 AM, Dan Carpenter wrote:

Last time you sent this we couldn't decide which tree it should go
through.  Either the crypto tree or through Andrew seems like the right
thing to me.

Also the other issue is that it risks breaking things if people add
new kzfree() instances while we are doing the transition.  Could you
just add a "#define kzfree kfree_sensitive" so that things continue to
compile and we can remove it in the next kernel release?

regards,
dan carpenter


Yes, that makes sense. Will send out v5 later today.

Cheers,
Longman



Re: [PATCH v4 3/3] btrfs: Use kfree() in btrfs_ioctl_get_subvol_info()

2020-06-16 Thread Waiman Long

On 6/16/20 10:48 AM, David Sterba wrote:

On Mon, Jun 15, 2020 at 09:57:18PM -0400, Waiman Long wrote:

In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kfree_sensitive() isn't really needed,
use kfree() instead.

Reported-by: David Sterba 
Signed-off-by: Waiman Long 
---
  fs/btrfs/ioctl.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f1dd9e4271e9..e8f7c5f00894 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2692,7 +2692,7 @@ static int btrfs_ioctl_get_subvol_info(struct file *file, 
void __user *argp)
btrfs_put_root(root);
  out_free:
btrfs_free_path(path);
-   kfree_sensitive(subvol_info);
+   kfree(subvol_info);

I would rather merge a patch doing the kzfree -> kfree change instead of doing
the middle step to switch it to kfree_sensitive. If it would help
integration of your patchset I can push it to the next rc so there are
no kzfree left in the btrfs code. Treewide change like that can take
time so it would be one less problem to care about for you.


Sure, I will move it forward in the patch series.

Thanks,
Longman



Re: [PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()

2020-06-16 Thread Waiman Long

On 6/15/20 11:30 PM, Eric Biggers wrote:

On Mon, Jun 15, 2020 at 09:57:16PM -0400, Waiman Long wrote:

The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for the buffer clearing. However,
it is entirely possible that the compiler may choose to optimize away the
memory clearing especially if LTO is being used. To make sure that this
optimization will not happen, memzero_explicit(), which is introduced
in v3.18, is now used in kzfree() to do the clearing.

Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Cc: sta...@vger.kernel.org
Signed-off-by: Waiman Long 
---
  mm/slab_common.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..37d48a56431d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
-   memset(mem, 0, ks);
+   memzero_explicit(mem, ks);
kfree(mem);
  }
  EXPORT_SYMBOL(kzfree);

This is a good change, but the commit message isn't really accurate.  AFAIK, no
one has found any case where this memset() gets optimized out.  And even with
LTO, it would be virtually impossible due to all the synchronization and global
data structures that kfree() uses.  (Remember that this isn't the C standard
function "free()", so the compiler can't assign it any special meaning.)
Not to mention that LTO support isn't actually upstream yet.

I still agree with the change, but it might be helpful if the commit message
were honest that this is really a hardening measure and about properly conveying
the intent.  As-is this sounds like a critical fix, which might confuse people.


Yes, I agree that the commit log may look a bit scary. How about the 
following:


The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for buffer clearing. However
unlikely, there is still a non-zero probability that the compiler may
choose to optimize away the memory clearing especially if LTO is being
used in the future. To make sure that this optimization will never
happen, memzero_explicit(), which is introduced in v3.18, is now used
in kzfree() to future-proof it.

Cheers,
Longman



[PATCH v4 3/3] btrfs: Use kfree() in btrfs_ioctl_get_subvol_info()

2020-06-15 Thread Waiman Long
In btrfs_ioctl_get_subvol_info(), there is a classic case where kzalloc()
was incorrectly paired with kzfree(). According to David Sterba, there
isn't any sensitive information in the subvol_info that needs to be
cleared before freeing. So kfree_sensitive() isn't really needed,
use kfree() instead.

Reported-by: David Sterba 
Signed-off-by: Waiman Long 
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f1dd9e4271e9..e8f7c5f00894 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2692,7 +2692,7 @@ static int btrfs_ioctl_get_subvol_info(struct file *file, 
void __user *argp)
btrfs_put_root(root);
 out_free:
btrfs_free_path(path);
-   kfree_sensitive(subvol_info);
+   kfree(subvol_info);
return ret;
 }
 
-- 
2.18.1



[PATCH v4 2/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-15 Thread Waiman Long
As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the
recently added kvfree_sensitive() and make the intention of the API
more explicit. In addition, memzero_explicit() is used to clear the
memory to make sure that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/\bkzfree\b/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and the
use of memzero_explicit() instead of memset().

Suggested-by: Joe Perches 
Acked-by: David Howells 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Signed-off-by: Waiman Long 
---
 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  |  2 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  |  2 +-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   |  4 +--
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 32 -
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   

[PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-15 Thread Waiman Long
 v4:
  - Break out the memzero_explicit() change as suggested by Dan Carpenter
so that it can be backported to stable.
  - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
now as there can be a bit more discussion on what is best. It will be
introduced as a separate patch later on after this one is merged.

This patchset makes a global rename of kzfree() to kfree_sensitive()
to highlight the fact that buffer clearing is only needed if the data objects
contain sensitive information like encryption keys. The fact that kzfree()
uses memset() to do the clearing isn't totally safe either, as the compiler
may optimize out the clearing, especially if LTO is used. Instead, the new
kfree_sensitive() uses memzero_explicit(), which won't get compiled out.

Waiman Long (3):
  mm/slab: Use memzero_explicit() in kzfree()
  mm, treewide: Rename kzfree() to kfree_sensitive()
  btrfs: Use kfree() in btrfs_ioctl_get_subvol_info()

 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  |  2 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  |  2 +-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   |  4 +--
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 32 -
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   |  4 +--
 fs/ecryptfs/keystore.c|  4 +--
 fs/ecryptfs/messaging.c   |  2 +-
 include/crypto/aead.h |  2 +-
 include/crypto/akcipher.h |  2 +-
 include/crypto/gf128mul.h |  2 +-
 include/cry

[PATCH v4 1/3] mm/slab: Use memzero_explicit() in kzfree()

2020-06-15 Thread Waiman Long
The kzfree() function is normally used to clear some sensitive
information, like encryption keys, in the buffer before freeing it back
to the pool. Memset() is currently used for the buffer clearing. However,
it is entirely possible that the compiler may choose to optimize away the
memory clearing especially if LTO is being used. To make sure that this
optimization will not happen, memzero_explicit(), which is introduced
in v3.18, is now used in kzfree() to do the clearing.

Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
Cc: sta...@vger.kernel.org
Signed-off-by: Waiman Long 
---
 mm/slab_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 9e72ba224175..37d48a56431d 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1726,7 +1726,7 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
-   memset(mem, 0, ks);
+   memzero_explicit(mem, ks);
kfree(mem);
 }
 EXPORT_SYMBOL(kzfree);
-- 
2.18.1



Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-15 Thread Waiman Long

On 6/15/20 2:07 PM, Dan Carpenter wrote:

On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote:

diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23c7500eea7d..c08bc7eb20bd 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -1707,17 +1707,17 @@ void *krealloc(const void *p, size_t new_size, gfp_t 
flags)
  EXPORT_SYMBOL(krealloc);
  
  /**

- * kzfree - like kfree but zero memory
+ * kfree_sensitive - Clear sensitive information in memory before freeing
   * @p: object to free memory of
   *
   * The memory of the object @p points to is zeroed before freed.
- * If @p is %NULL, kzfree() does nothing.
+ * If @p is %NULL, kfree_sensitive() does nothing.
   *
   * Note: this function zeroes the whole allocated buffer which can be a good
   * deal bigger than the requested buffer size passed to kmalloc(). So be
   * careful when using this function in performance sensitive code.
   */
-void kzfree(const void *p)
+void kfree_sensitive(const void *p)
  {
size_t ks;
void *mem = (void *)p;
@@ -1725,10 +1725,10 @@ void kzfree(const void *p)
if (unlikely(ZERO_OR_NULL_PTR(mem)))
return;
ks = ksize(mem);
-   memset(mem, 0, ks);
+   memzero_explicit(mem, ks);

 ^
This is an unrelated bug fix.  It really needs to be pulled into a
separate patch by itself and back ported to stable kernels.


kfree(mem);
  }
-EXPORT_SYMBOL(kzfree);
+EXPORT_SYMBOL(kfree_sensitive);
  
  /**

   * ksize - get the actual amount of memory allocated for a given object

regards,
dan carpenter


Thanks for the suggestion. I will break it out and post a version soon.

Cheers,
Longman



Re: [PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-14 Thread Waiman Long
On 4/14/20 3:16 PM, Michal Suchánek wrote:
> On Tue, Apr 14, 2020 at 12:24:36PM -0400, Waiman Long wrote:
>> On 4/14/20 2:08 AM, Christophe Leroy wrote:
>>>
>>> Le 14/04/2020 à 00:28, Waiman Long a écrit :
>>>> Since kfree_sensitive() will do an implicit memzero_explicit(), there
>>>> is no need to call memzero_explicit() before it. Eliminate those
>>>> memzero_explicit() and simplify the call sites. For better correctness,
>>>> the setting of keylen is also moved down after the key pointer check.
>>>>
>>>> Signed-off-by: Waiman Long 
>>>> ---
>>>>   .../allwinner/sun8i-ce/sun8i-ce-cipher.c  | 19 +-
>>>>   .../allwinner/sun8i-ss/sun8i-ss-cipher.c  | 20 +--
>>>>   drivers/crypto/amlogic/amlogic-gxl-cipher.c   | 12 +++
>>>>   drivers/crypto/inside-secure/safexcel_hash.c  |  3 +--
>>>>   4 files changed, 14 insertions(+), 40 deletions(-)
>>>>
>>>> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>>>> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>>>> index aa4e8fdc2b32..8358fac98719 100644
>>>> --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>>>> +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>>>> @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm)
>>>>   {
>>>>   struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
>>>>   -    if (op->key) {
>>>> -    memzero_explicit(op->key, op->keylen);
>>>> -    kfree(op->key);
>>>> -    }
>>>> +    kfree_sensitive(op->key);
>>>>   crypto_free_sync_skcipher(op->fallback_tfm);
>>>>   pm_runtime_put_sync_suspend(op->ce->dev);
>>>>   }
>>>> @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher
>>>> *tfm, const u8 *key,
>>>>   dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
>>>>   return -EINVAL;
>>>>   }
>>>> -    if (op->key) {
>>>> -    memzero_explicit(op->key, op->keylen);
>>>> -    kfree(op->key);
>>>> -    }
>>>> -    op->keylen = keylen;
>>>> +    kfree_sensitive(op->key);
>>>>   op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
>>>>   if (!op->key)
>>>>   return -ENOMEM;
>>>> +    op->keylen = keylen;
>>> Does it matter at all to ensure op->keylen is not set when of->key is
>>> NULL ? I'm not sure.
>>>
>>> But if it does, then op->keylen should be set to 0 when freeing op->key. 
>> My thinking is that if memory allocation fails, we just don't touch
>> anything and return an error code. I will not explicitly set keylen to 0
>> in this case unless it is specified in the API documentation.
> You already freed the key by now so not touching anything is not
> possible. The key is set to NULL on allocation failure so setting keylen
> to 0 should be redundant. However, setting keylen to 0 is consistent with
> not having a key, and it avoids the possibility of leaking the length
> later should that ever cause any problem.

OK, I can change it to clear the key length when the allocation fails,
which isn't likely.
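
Something like the following for the setkey path (a sketch only; op, key and
keylen are the driver context discussed above):

	kfree_sensitive(op->key);
	op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
	if (!op->key) {
		op->keylen = 0;	/* stay consistent with "no key present" */
		return -ENOMEM;
	}
	op->keylen = keylen;	/* set only after the copy succeeds */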

Cheers,
Longman




Re: [PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-14 Thread Waiman Long
On 4/14/20 8:48 AM, David Sterba wrote:
> On Mon, Apr 13, 2020 at 05:15:49PM -0400, Waiman Long wrote:
>>  fs/btrfs/ioctl.c  |  2 +-
>
>> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
>> index 40b729dce91c..eab3f8510426 100644
>> --- a/fs/btrfs/ioctl.c
>> +++ b/fs/btrfs/ioctl.c
>> @@ -2691,7 +2691,7 @@ static int btrfs_ioctl_get_subvol_info(struct file 
>> *file, void __user *argp)
>>  btrfs_put_root(root);
>>  out_free:
>>  btrfs_free_path(path);
>> -kzfree(subvol_info);
>> +kfree_sensitive(subvol_info);
> This is not in a sensitive context so please switch it to plain kfree.
> With that you have my acked-by. Thanks.
>
Thanks for letting me know about it. I think I will send it out as a
separate patch.

Cheers,
Longman



Re: [PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-14 Thread Waiman Long
On 4/14/20 2:08 AM, Christophe Leroy wrote:
>
>
> Le 14/04/2020 à 00:28, Waiman Long a écrit :
>> Since kfree_sensitive() will do an implicit memzero_explicit(), there
>> is no need to call memzero_explicit() before it. Eliminate those
>> memzero_explicit() and simplify the call sites. For better correctness,
>> the setting of keylen is also moved down after the key pointer check.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>   .../allwinner/sun8i-ce/sun8i-ce-cipher.c  | 19 +-
>>   .../allwinner/sun8i-ss/sun8i-ss-cipher.c  | 20 +--
>>   drivers/crypto/amlogic/amlogic-gxl-cipher.c   | 12 +++
>>   drivers/crypto/inside-secure/safexcel_hash.c  |  3 +--
>>   4 files changed, 14 insertions(+), 40 deletions(-)
>>
>> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>> index aa4e8fdc2b32..8358fac98719 100644
>> --- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>> +++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
>> @@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm)
>>   {
>>   struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
>>   -    if (op->key) {
>> -    memzero_explicit(op->key, op->keylen);
>> -    kfree(op->key);
>> -    }
>> +    kfree_sensitive(op->key);
>>   crypto_free_sync_skcipher(op->fallback_tfm);
>>   pm_runtime_put_sync_suspend(op->ce->dev);
>>   }
>> @@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher
>> *tfm, const u8 *key,
>>   dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
>>   return -EINVAL;
>>   }
>> -    if (op->key) {
>> -    memzero_explicit(op->key, op->keylen);
>> -    kfree(op->key);
>> -    }
>> -    op->keylen = keylen;
>> +    kfree_sensitive(op->key);
>>   op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
>>   if (!op->key)
>>   return -ENOMEM;
>> +    op->keylen = keylen;
>
> Does it matter at all to ensure op->keylen is not set when of->key is
> NULL ? I'm not sure.
>
> But if it does, then op->keylen should be set to 0 when freeing op->key. 

My thinking is that if memory allocation fails, we just don't touch
anything and return an error code. I will not explicitly set keylen to 0
in this case unless it is specified in the API documentation.

Cheers,
Longman


[PATCH v2 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-13 Thread Waiman Long
Since kfree_sensitive() will do an implicit memzero_explicit(), there
is no need to call memzero_explicit() before it. Eliminate those
memzero_explicit() and simplify the call sites. For better correctness,
the setting of keylen is also moved down after the key pointer check.

Signed-off-by: Waiman Long 
---
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  | 19 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  | 20 +--
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   | 12 +++
 drivers/crypto/inside-secure/safexcel_hash.c  |  3 +--
 4 files changed, 14 insertions(+), 40 deletions(-)

diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c 
b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
index aa4e8fdc2b32..8358fac98719 100644
--- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
+++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
@@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm)
 {
struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
crypto_free_sync_skcipher(op->fallback_tfm);
pm_runtime_put_sync_suspend(op->ce->dev);
 }
@@ -391,14 +388,11 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
return -EINVAL;
}
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
-   op->keylen = keylen;
+   kfree_sensitive(op->key);
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
return -ENOMEM;
+   op->keylen = keylen;
 
crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK);
crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & 
CRYPTO_TFM_REQ_MASK);
@@ -416,14 +410,11 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
if (err)
return err;
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
-   op->keylen = keylen;
+   kfree_sensitive(op->key);
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
return -ENOMEM;
+   op->keylen = keylen;
 
crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK);
crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & 
CRYPTO_TFM_REQ_MASK);
diff --git a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c 
b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
index 5246ef4f5430..0495fbc27fcc 100644
--- a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
+++ b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
@@ -249,7 +249,6 @@ static int sun8i_ss_cipher(struct skcipher_request *areq)
offset = areq->cryptlen - ivsize;
if (rctx->op_dir & SS_DECRYPTION) {
memcpy(areq->iv, backup_iv, ivsize);
-   memzero_explicit(backup_iv, ivsize);
kfree_sensitive(backup_iv);
} else {
scatterwalk_map_and_copy(areq->iv, areq->dst, 
offset,
@@ -367,10 +366,7 @@ void sun8i_ss_cipher_exit(struct crypto_tfm *tfm)
 {
struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
crypto_free_sync_skcipher(op->fallback_tfm);
pm_runtime_put_sync(op->ss->dev);
 }
@@ -392,14 +388,11 @@ int sun8i_ss_aes_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
dev_dbg(ss->dev, "ERROR: Invalid keylen %u\n", keylen);
return -EINVAL;
}
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
-   op->keylen = keylen;
+   kfree_sensitive(op->key);
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
return -ENOMEM;
+   op->keylen = keylen;
 
crypto_sync_skcipher_clear_flags(op->fallback_tfm, CRYPTO_TFM_REQ_MASK);
crypto_sync_skcipher_set_flags(op->fallback_tfm, tfm->base.crt_flags & 
CRYPTO_TFM_REQ_MASK);
@@ -418,14 +411,11 @@ int sun8i_ss_des3_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
return -EINVAL;
}
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);

Re: [PATCH 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-13 Thread Waiman Long
On 4/13/20 5:31 PM, Joe Perches wrote:
> On Mon, 2020-04-13 at 17:15 -0400, Waiman Long wrote:
>> Since kfree_sensitive() will do an implicit memzero_explicit(), there
>> is no need to call memzero_explicit() before it. Eliminate those
>> memzero_explicit() and simplify the call sites.
> 2 bits of trivia:
>
>> diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c 
>> b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
> []
>> @@ -391,10 +388,7 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, 
>> const u8 *key,
>>  dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
>>  return -EINVAL;
>>  }
>> -if (op->key) {
>> -memzero_explicit(op->key, op->keylen);
>> -kfree(op->key);
>> -}
>> +kfree_sensitive(op->key);
>>  op->keylen = keylen;
>>  op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
>>  if (!op->key)
> It might be a defect to set op->keylen before the kmemdup succeeds.
It could be. I can move it down after the op->key check.
>> @@ -416,10 +410,7 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, 
>> const u8 *key,
>>  if (err)
>>  return err;
>>  
>> -if (op->key) {
>> -memzero_explicit(op->key, op->keylen);
>> -kfree(op->key);
>> -}
>> +free_sensitive(op->key, op->keylen);
> Why not kfree_sensitive(op->key) ?

Oh, it is a bug. I will send out v2 to fix that.

Thanks for spotting it.

Cheers,
Longman


>
>



[PATCH 1/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-13 Thread Waiman Long
As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the
recently added kvfree_sensitive() and make the intention of the API
more explicit. In addition, memzero_explicit() is used to clear the
memory to make sure that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/\bkzfree\b/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and the
use of memzero_explicit() instead of memset().

Suggested-by: Joe Perches 
Signed-off-by: Waiman Long 
---
 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  |  2 +-
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  |  2 +-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   |  4 +--
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 34 +--
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   |  4 +--
 fs/ecryptfs/keystore.c|  4 +--
 fs/ecryptf

[PATCH 2/2] crypto: Remove unnecessary memzero_explicit()

2020-04-13 Thread Waiman Long
Since kfree_sensitive() will do an implicit memzero_explicit(), there
is no need to call memzero_explicit() before it. Eliminate those
memzero_explicit() and simplify the call sites.

Signed-off-by: Waiman Long 
---
 .../crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c  | 15 +++
 .../crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c  | 16 +++-
 drivers/crypto/amlogic/amlogic-gxl-cipher.c  | 10 ++
 drivers/crypto/inside-secure/safexcel_hash.c |  3 +--
 4 files changed, 9 insertions(+), 35 deletions(-)

diff --git a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c 
b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
index aa4e8fdc2b32..46c10c7ca6d0 100644
--- a/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
+++ b/drivers/crypto/allwinner/sun8i-ce/sun8i-ce-cipher.c
@@ -366,10 +366,7 @@ void sun8i_ce_cipher_exit(struct crypto_tfm *tfm)
 {
struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
crypto_free_sync_skcipher(op->fallback_tfm);
pm_runtime_put_sync_suspend(op->ce->dev);
 }
@@ -391,10 +388,7 @@ int sun8i_ce_aes_setkey(struct crypto_skcipher *tfm, const 
u8 *key,
dev_dbg(ce->dev, "ERROR: Invalid keylen %u\n", keylen);
return -EINVAL;
}
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
op->keylen = keylen;
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
@@ -416,10 +410,7 @@ int sun8i_ce_des3_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
if (err)
return err;
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   free_sensitive(op->key, op->keylen);
op->keylen = keylen;
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
diff --git a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c 
b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
index 5246ef4f5430..7e09a923cbaf 100644
--- a/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
+++ b/drivers/crypto/allwinner/sun8i-ss/sun8i-ss-cipher.c
@@ -249,7 +249,6 @@ static int sun8i_ss_cipher(struct skcipher_request *areq)
offset = areq->cryptlen - ivsize;
if (rctx->op_dir & SS_DECRYPTION) {
memcpy(areq->iv, backup_iv, ivsize);
-   memzero_explicit(backup_iv, ivsize);
kfree_sensitive(backup_iv);
} else {
scatterwalk_map_and_copy(areq->iv, areq->dst, 
offset,
@@ -367,10 +366,7 @@ void sun8i_ss_cipher_exit(struct crypto_tfm *tfm)
 {
struct sun8i_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
crypto_free_sync_skcipher(op->fallback_tfm);
pm_runtime_put_sync(op->ss->dev);
 }
@@ -392,10 +388,7 @@ int sun8i_ss_aes_setkey(struct crypto_skcipher *tfm, const 
u8 *key,
dev_dbg(ss->dev, "ERROR: Invalid keylen %u\n", keylen);
return -EINVAL;
}
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
op->keylen = keylen;
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
@@ -418,10 +411,7 @@ int sun8i_ss_des3_setkey(struct crypto_skcipher *tfm, 
const u8 *key,
return -EINVAL;
}
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key);
op->keylen = keylen;
op->key = kmemdup(key, keylen, GFP_KERNEL | GFP_DMA);
if (!op->key)
diff --git a/drivers/crypto/amlogic/amlogic-gxl-cipher.c 
b/drivers/crypto/amlogic/amlogic-gxl-cipher.c
index fd1269900d67..f424397fbba4 100644
--- a/drivers/crypto/amlogic/amlogic-gxl-cipher.c
+++ b/drivers/crypto/amlogic/amlogic-gxl-cipher.c
@@ -341,10 +341,7 @@ void meson_cipher_exit(struct crypto_tfm *tfm)
 {
struct meson_cipher_tfm_ctx *op = crypto_tfm_ctx(tfm);
 
-   if (op->key) {
-   memzero_explicit(op->key, op->keylen);
-   kfree(op->key);
-   }
+   kfree_sensitive(op->key)
crypto_free_sync_skcipher(op->fallback_tfm);
 }
 
@@ -368,10

[PATCH 0/2] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-04-13 Thread Waiman Long
This patchset makes a global rename of kzfree() to kfree_sensitive()
to highlight the fact that buffer clearing is only needed if the data
objects contain sensitive information like encryption keys. The fact that
kzfree() uses memset() to do the clearing isn't totally safe either, as the
compiler may optimize the clearing away. Instead, the new kfree_sensitive()
uses memzero_explicit(), which won't get compiled out.

Waiman Long (2):
  mm, treewide: Rename kzfree() to kfree_sensitive()
  crypto: Remove unnecessary memzero_explicit()

 arch/s390/crypto/prng.c   |  4 +--
 arch/x86/power/hibernate.c|  2 +-
 crypto/adiantum.c |  2 +-
 crypto/ahash.c|  4 +--
 crypto/api.c  |  2 +-
 crypto/asymmetric_keys/verify_pefile.c|  4 +--
 crypto/deflate.c  |  2 +-
 crypto/drbg.c | 10 +++---
 crypto/ecc.c  |  8 ++---
 crypto/ecdh.c |  2 +-
 crypto/gcm.c  |  2 +-
 crypto/gf128mul.c |  4 +--
 crypto/jitterentropy-kcapi.c  |  2 +-
 crypto/rng.c  |  2 +-
 crypto/rsa-pkcs1pad.c |  6 ++--
 crypto/seqiv.c|  2 +-
 crypto/shash.c|  2 +-
 crypto/skcipher.c |  2 +-
 crypto/testmgr.c  |  6 ++--
 crypto/zstd.c |  2 +-
 .../allwinner/sun8i-ce/sun8i-ce-cipher.c  | 17 +++---
 .../allwinner/sun8i-ss/sun8i-ss-cipher.c  | 18 +++---
 drivers/crypto/amlogic/amlogic-gxl-cipher.c   | 14 +++-
 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/caam/caampkc.c | 28 +++
 drivers/crypto/cavium/cpt/cptvf_main.c|  6 ++--
 drivers/crypto/cavium/cpt/cptvf_reqmanager.c  | 12 +++
 drivers/crypto/cavium/nitrox/nitrox_lib.c |  4 +--
 drivers/crypto/cavium/zip/zip_crypto.c|  6 ++--
 drivers/crypto/ccp/ccp-crypto-rsa.c   |  6 ++--
 drivers/crypto/ccree/cc_aead.c|  4 +--
 drivers/crypto/ccree/cc_buffer_mgr.c  |  4 +--
 drivers/crypto/ccree/cc_cipher.c  |  6 ++--
 drivers/crypto/ccree/cc_hash.c|  8 ++---
 drivers/crypto/ccree/cc_request_mgr.c |  2 +-
 drivers/crypto/inside-secure/safexcel_hash.c  |  3 +-
 drivers/crypto/marvell/cesa/hash.c|  2 +-
 .../crypto/marvell/octeontx/otx_cptvf_main.c  |  6 ++--
 .../marvell/octeontx/otx_cptvf_reqmgr.h   |  2 +-
 drivers/crypto/mediatek/mtk-aes.c |  2 +-
 drivers/crypto/nx/nx.c|  4 +--
 drivers/crypto/virtio/virtio_crypto_algs.c| 12 +++
 drivers/crypto/virtio/virtio_crypto_core.c|  2 +-
 drivers/md/dm-crypt.c | 34 +--
 drivers/md/dm-integrity.c |  6 ++--
 drivers/misc/ibmvmc.c |  6 ++--
 .../hisilicon/hns3/hns3pf/hclge_mbx.c |  2 +-
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c|  6 ++--
 drivers/net/ppp/ppp_mppe.c|  6 ++--
 drivers/net/wireguard/noise.c |  4 +--
 drivers/net/wireguard/peer.c  |  2 +-
 drivers/net/wireless/intel/iwlwifi/pcie/rx.c  |  2 +-
 .../net/wireless/intel/iwlwifi/pcie/tx-gen2.c |  6 ++--
 drivers/net/wireless/intel/iwlwifi/pcie/tx.c  |  6 ++--
 drivers/net/wireless/intersil/orinoco/wext.c  |  4 +--
 drivers/s390/crypto/ap_bus.h  |  4 +--
 drivers/staging/ks7010/ks_hostif.c|  2 +-
 drivers/staging/rtl8723bs/core/rtw_security.c |  2 +-
 drivers/staging/wlan-ng/p80211netdev.c|  2 +-
 drivers/target/iscsi/iscsi_target_auth.c  |  2 +-
 fs/btrfs/ioctl.c  |  2 +-
 fs/cifs/cifsencrypt.c |  2 +-
 fs/cifs/connect.c | 10 +++---
 fs/cifs/dfs_cache.c   |  2 +-
 fs/cifs/misc.c|  8 ++---
 fs/crypto/keyring.c   |  6 ++--
 fs/crypto/keysetup_v1.c   |  4 +--
 fs/ecryptfs/keystore.c|  4 +--
 fs/ecryptfs/messaging.c   |  2 +-
 include/crypto/aead.h |  2 +-
 include/crypto/akcipher.h |  2 +-
 include/crypto/gf128mul.h |  2 +-
 include/crypto/hash.h |  2 +-
 include/crypto/internal/acompress.h   |  2 +-
 include/crypto/kpp.h  |  2 +-
 include/crypto/skcipher.h |  2 +-
 include/linux/slab.h  |  2 +-
 lib/mpi/mpiutil.c |  6 ++--
 lib/test_kasan.c

Re: [PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()

2019-04-01 Thread Waiman Long
On 04/01/2019 02:38 AM, Juergen Gross wrote:
> On 25/03/2019 19:03, Waiman Long wrote:
>> On 03/25/2019 12:40 PM, Juergen Gross wrote:
>>> On 25/03/2019 16:57, Waiman Long wrote:
>>>> It was found that passing an invalid cpu number to pv_vcpu_is_preempted()
>>>> might panic the kernel in a VM guest. For example,
>>>>
>>>> [2.531077] Oops:  [#1] SMP PTI
>>>>   :
>>>> [2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>>>> [2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20
>>>>
>>>> To guard against this kind of kernel panic, check is added to
>>>> pv_vcpu_is_preempted() to make sure that no invalid cpu number will
>>>> be used.
>>>>
>>>> Signed-off-by: Waiman Long 
>>>> ---
>>>>  arch/x86/include/asm/paravirt.h | 6 ++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/arch/x86/include/asm/paravirt.h 
>>>> b/arch/x86/include/asm/paravirt.h
>>>> index c25c38a05c1c..4cfb465dcde4 100644
>>>> --- a/arch/x86/include/asm/paravirt.h
>>>> +++ b/arch/x86/include/asm/paravirt.h
>>>> @@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu)
>>>>  
>>>>  static __always_inline bool pv_vcpu_is_preempted(long cpu)
>>>>  {
>>>> +  /*
>>>> +   * Guard against invalid cpu number or the kernel might panic.
>>>> +   */
>>>> +  if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids))
>>>> +  return false;
>>>> +
>>>>return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
>>>>  }
>>> Can this really happen without being a programming error?
>> This shouldn't happen without a programming error, I think. In my case,
>> it was caused by a race condition leading to use-after-free of the cpu
>> number. However, my point is that error like that shouldn't cause the
>> kernel to panic.
>>
>>> Basically you'd need to guard all percpu area accesses to foreign cpus
>>> this way. Why is this one special?
>> It depends. If out-of-bound access can only happen with obvious
>> programming error, I don't think we need to guard against them. In this
>> case, I am not totally sure if the race condition that I found may
>> happen with existing code or not. To be prudent, I decide to send this
>> patch out.
>>
>> The race condition that I am looking at is as follows:
>>
>>   CPU 0 CPU 1
>>   - -
>> up_write:
>>   owner = NULL;
>>   
>>   count = 0;
>>
>> 
>>  
>>   rwsem_can_spin_on_owner:
>>     rcu_read_lock();
>>     read owner;
>>   :
>>     vcpu_is_preempted(owner->cpu);
>>   :
>>     rcu_read_unlock()
>>
>> When I tried to merge the owner into the count (clear the owner after
>> the barrier), I can reproduce the crash 100% when booting up the kernel
>> in a VM guest. However, I am not sure if the configuration above is safe
>> and is just very hard to reproduce.
>>
>> Alternatively, I can also do the cpu check before calling
>> vcpu_is_preempted().
> I think I'd prefer that.
>
>
> Juergen
>
It turns out that it may be caused by a software bug after all. You can
ignore this patch for now.

Thanks,
Longman


Re: [PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()

2019-03-25 Thread Waiman Long
On 03/25/2019 12:40 PM, Juergen Gross wrote:
> On 25/03/2019 16:57, Waiman Long wrote:
>> It was found that passing an invalid cpu number to pv_vcpu_is_preempted()
>> might panic the kernel in a VM guest. For example,
>>
>> [2.531077] Oops:  [#1] SMP PTI
>>   :
>> [2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> [2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20
>>
>> To guard against this kind of kernel panic, check is added to
>> pv_vcpu_is_preempted() to make sure that no invalid cpu number will
>> be used.
>>
>> Signed-off-by: Waiman Long 
>> ---
>>  arch/x86/include/asm/paravirt.h | 6 ++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h 
>> b/arch/x86/include/asm/paravirt.h
>> index c25c38a05c1c..4cfb465dcde4 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu)
>>  
>>  static __always_inline bool pv_vcpu_is_preempted(long cpu)
>>  {
>> +/*
>> + * Guard against invalid cpu number or the kernel might panic.
>> + */
>> +if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids))
>> +return false;
>> +
>>  return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
>>  }
> Can this really happen without being a programming error?

This shouldn't happen without a programming error, I think. In my case,
it was caused by a race condition leading to a use-after-free of the cpu
number. However, my point is that an error like that shouldn't cause the
kernel to panic.

> Basically you'd need to guard all percpu area accesses to foreign cpus
> this way. Why is this one special?

It depends. If an out-of-bounds access can only happen with an obvious
programming error, I don't think we need to guard against it. In this
case, I am not totally sure whether the race condition that I found can
happen with the existing code or not. To be prudent, I decided to send
this patch out.

The race condition that I am looking at is as follows:

  CPU 0 CPU 1
  - -
up_write:
  owner = NULL;
  
  count = 0;


 
  rwsem_can_spin_on_owner:
    rcu_read_lock();
    read owner;
  :
    vcpu_is_preempted(owner->cpu);
  :
    rcu_read_unlock()

When I tried to merge the owner into the count (clearing the owner after
the barrier), I could reproduce the crash 100% of the time when booting up
the kernel in a VM guest. However, I am not sure whether the configuration
above is safe or just very hard to reproduce.

Alternatively, I can also do the cpu check before calling
vcpu_is_preempted().
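
A rough sketch of that alternative, with the guard done at the call site (the
rwsem details are paraphrased from the diagram above, not taken verbatim from
the kernel source):

	bool preempted = false;

	rcu_read_lock();
	owner = READ_ONCE(sem->owner);
	if (owner) {
		long cpu = READ_ONCE(owner->cpu);

		/* A stale owner may yield a garbage cpu number. */
		if ((unsigned long)cpu < nr_cpu_ids)
			preempted = vcpu_is_preempted(cpu);
	}
	rcu_read_unlock();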

Cheers,
Longman



[PATCH] x86/paravirt: Guard against invalid cpu # in pv_vcpu_is_preempted()

2019-03-25 Thread Waiman Long
It was found that passing an invalid cpu number to pv_vcpu_is_preempted()
might panic the kernel in a VM guest. For example,

[2.531077] Oops:  [#1] SMP PTI
  :
[2.532545] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[2.533321] RIP: 0010:__raw_callee_save___kvm_vcpu_is_preempted+0x0/0x20

To guard against this kind of kernel panic, a check is added to
pv_vcpu_is_preempted() to make sure that no invalid cpu number will
be used.

Signed-off-by: Waiman Long 
---
 arch/x86/include/asm/paravirt.h | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index c25c38a05c1c..4cfb465dcde4 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -671,6 +671,12 @@ static __always_inline void pv_kick(int cpu)
 
 static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
+   /*
+* Guard against invalid cpu number or the kernel might panic.
+*/
+   if (WARN_ON_ONCE((unsigned long)cpu >= nr_cpu_ids))
+   return false;
+
return PVOP_CALLEE1(bool, lock.vcpu_is_preempted, cpu);
 }
 
-- 
2.18.1



Re: [PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin

2017-11-02 Thread Waiman Long
On 11/01/2017 06:01 PM, Boris Ostrovsky wrote:
> On 11/01/2017 04:58 PM, Waiman Long wrote:
>> +/* TODO: To be removed in a future kernel version */
>>  static __init int xen_parse_nopvspin(char *arg)
>>  {
>> -xen_pvspin = false;
>> +pr_warn("xen_nopvspin is deprecated, replace it with 
>> \"pvlock_type=queued\"!\n");
>> +if (!pv_spinlock_type)
>> +pv_spinlock_type = locktype_queued;
> Since we currently end up using unfair locks and because you are
> deprecating xen_nopvspin I wonder whether it would be better to set this
> to locktype_unfair so that current behavior doesn't change. (Sorry, I
> haven't responded to your earlier message before you posted this). Juergen?

I think the latest patch from Juergen in tip is to use the native qspinlock
when xen_nopvspin is specified, right? That is why I made the current
choice. I can certainly change it to unfair if that is what you guys want.

> I am also not sure I agree with making pv_spinlock an enum *and* a
> bitmask at the same time. I understand that it makes checks easier but I
> think not assuming a value or a pattern would be better, especially
> since none of the uses is on a critical path.
>
> (For example, !pv_spinlock_type is the same as locktype_auto, which is
> defined but never used)

OK, I will take out the enum and make explicit use of locktype_auto.
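
A rough sketch of that direction, i.e. plain flag defines plus an explicit
check against the "auto" value (names below are illustrative, not the final
patch):

	#define LOCKTYPE_AUTO		0x00
	#define LOCKTYPE_QUEUED		0x01
	#define LOCKTYPE_PARAVIRT	0x02
	#define LOCKTYPE_UNFAIR		0x04

	static unsigned int pv_spinlock_type = LOCKTYPE_AUTO;

	static bool pvlock_explicitly_chosen(void)
	{
		/* spelled out instead of relying on !pv_spinlock_type */
		return pv_spinlock_type != LOCKTYPE_AUTO;
	}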

Cheers,
Longman


[PATCH-tip v2 2/2] x86/xen: Deprecate xen_nopvspin

2017-11-01 Thread Waiman Long
With the new pvlock_type kernel parameter, xen_nopvspin is no longer
needed. This patch deprecates the xen_nopvspin parameter by removing
its documentation and treating it as an alias of "pvlock_type=queued".

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  4 
 arch/x86/xen/spinlock.c | 19 +++
 2 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index c98d9c7..683a817 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4596,10 +4596,6 @@
the unplug protocol
never -- do not unplug even if version check succeeds
 
-   xen_nopvspin[X86,XEN]
-   Disables the ticketlock slowpath using Xen PV
-   optimizations.
-
xen_nopv[X86]
Disables the PV optimizations forcing the HVM guest to
run as generic HVM guest with no PV drivers.
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index d5f79ac..19e2e75 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -20,7 +20,6 @@
 
 static DEFINE_PER_CPU(int, lock_kicker_irq) = -1;
 static DEFINE_PER_CPU(char *, irq_name);
-static bool xen_pvspin = true;
 
 #include 
 
@@ -81,12 +80,8 @@ void xen_init_lock_cpu(int cpu)
int irq;
char *name;
 
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair))) {
-   if ((cpu == 0) && !pv_spinlock_type)
-   static_branch_disable(&virt_spin_lock_key);
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
return;
-   }
 
WARN(per_cpu(lock_kicker_irq, cpu) >= 0, "spinlock on CPU%d exists on 
IRQ%d!\n",
 cpu, per_cpu(lock_kicker_irq, cpu));
@@ -110,8 +105,7 @@ void xen_init_lock_cpu(int cpu)
 
 void xen_uninit_lock_cpu(int cpu)
 {
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair)))
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
return;
 
unbind_from_irqhandler(per_cpu(lock_kicker_irq, cpu), NULL);
@@ -132,8 +126,7 @@ void xen_uninit_lock_cpu(int cpu)
  */
 void __init xen_init_spinlocks(void)
 {
-   if (!xen_pvspin ||
-  (pv_spinlock_type & (locktype_queued|locktype_unfair))) {
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair)) {
printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
return;
}
@@ -147,10 +140,12 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
 }
 
+/* TODO: To be removed in a future kernel version */
 static __init int xen_parse_nopvspin(char *arg)
 {
-   xen_pvspin = false;
+   pr_warn("xen_nopvspin is deprecated, replace it with 
\"pvlock_type=queued\"!\n");
+   if (!pv_spinlock_type)
+   pv_spinlock_type = locktype_queued;
return 0;
 }
 early_param("xen_nopvspin", xen_parse_nopvspin);
-
-- 
1.8.3.1



[PATCH-tip v2 0/2] x86/paravirt: Enable users to choose PV lock type

2017-11-01 Thread Waiman Long
v1->v2:
 - Make pv_spinlock_type a bit mask for easier checking.
 - Add patch 2 to deprecate xen_nopvspin

v1 - https://lkml.org/lkml/2017/11/1/381

Patch 1 adds a new pvlock_type parameter for the administrators to
specify the type of lock to be used in a para-virtualized kernel.

Patch 2 deprecates Xen's xen_nopvspin parameter as it is no longer
needed.

Waiman Long (2):
  x86/paravirt: Add kernel parameter to choose paravirt lock type
  x86/xen: Deprecate xen_nopvspin

 Documentation/admin-guide/kernel-parameters.txt | 11 ---
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  3 ++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c | 17 +--
 5 files changed, 65 insertions(+), 15 deletions(-)

-- 
1.8.3.1



[PATCH-tip v2 1/2] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
Currently, there are 3 different lock types that can be chosen for
the x86 architecture:

 - qspinlock
 - pvqspinlock
 - unfair lock

One of the above lock types will be chosen at boot time depending on
a number of different factors.

Ideally, the hypervisors should be able to pick the best performing
lock type for the current VM configuration. That is not currently
the case as the performance of each lock type are affected by many
different factors like the number of vCPUs in the VM, the amount vCPU
overcommitment, the CPU type and so on.

Generally speaking, unfair lock performs well for VMs with a small
number of vCPUs. Native qspinlock may perform better than pvqspinlock
if there is vCPU pinning and there is no vCPU over-commitment.
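
For reference, the "unfair lock" here means a simple test-and-test-and-set
(TATAS) lock, roughly like the standalone C11 sketch below (the names are
made up for illustration; this is not the kernel's implementation):

	#include <stdatomic.h>

	struct tatas_lock {
		atomic_int locked;
	};

	static void tatas_lock_acquire(struct tatas_lock *l)
	{
		for (;;) {
			/* Test: spin on a plain load to limit cache-line bouncing. */
			while (atomic_load_explicit(&l->locked, memory_order_relaxed))
				;
			/* Test-and-set: try to grab the lock atomically. */
			if (!atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
				return;	/* latecomers can barge ahead of waiters: unfair */
		}
	}

	static void tatas_lock_release(struct tatas_lock *l)
	{
		atomic_store_explicit(&l->locked, 0, memory_order_release);
	}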

This patch adds a new kernel parameter to allow administrators to
choose the paravirt spinlock type to be used. VM administrators can
experiment with the different lock types and choose one that can best
suit their need, if they want to. Hypervisor developers can also use
that to experiment with different lock types so that they can come
up with a better algorithm to pick the best lock type.

The hypervisor paravirt spinlock code will override this new parameter
in determining if pvqspinlock should be used. The parameter, however,
will override Xen's xen_nopvspin in terms of disabling the unfair lock.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  3 ++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c | 12 
 5 files changed, 65 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index f7df49d..c98d9c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3275,6 +3275,13 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
 
+   pvlock_type=[X86,PV_OPS]
+   Specify the paravirt spinlock type to be used.
+   Options are:
+   queued - native queued spinlock
+   pv - paravirt queued spinlock
+   unfair - simple TATAS unfair lock
+
quiet   [KNL] Disable most log messages
 
r128=   [HW,DRM]
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 12deec7..c8f9ad9 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
 
+enum pv_spinlock_type {
+   locktype_auto = 0,
+   locktype_queued   = 1,
+   locktype_paravirt = 2,
+   locktype_unfair   = 4,
+};
+
+extern enum pv_spinlock_type pv_spinlock_type;
+
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
 #define PV_RESTORE_REGS "popl %edx; popl %ecx;"
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..3faee63 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -646,6 +646,9 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+   if (pv_spinlock_type & (locktype_queued|locktype_unfair))
+   return;
+
__pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 041096b..aee6756 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void 
*target,
return 5;
 }
 
+/*
+ * The kernel argument "pvlock_type=" can be used to explicitly specify
+ * which type of spinlocks to be used. Currently, there are 3 options:
+ * 1) queued - the native queued spinlock
+ * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS)
+ * 3) unfair - the simple TATAS unfair lock
+ *
+ * If this argument is not specified, the kernel will automatically choose
+ * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor
+ * specific settings.
+ */
+enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto;
+
+static int __init pvlock_setup(char *s)
+{
+   if (!s)
+   return -EINVAL;
+
+   if (!strcmp(s, "queued"))
+   pv_spinlock_type = locktype_queued;
+   else if (!strcmp(s, "pv"))
+   pv_spinlock_type = locktype_paravirt;
+   else if (!strcmp(s, "u

Re: [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
On 11/01/2017 03:01 PM, Boris Ostrovsky wrote:
> On 11/01/2017 12:28 PM, Waiman Long wrote:
>> On 11/01/2017 11:51 AM, Juergen Gross wrote:
>>> On 01/11/17 16:32, Waiman Long wrote:
>>>> Currently, there are 3 different lock types that can be chosen for
>>>> the x86 architecture:
>>>>
>>>>  - qspinlock
>>>>  - pvqspinlock
>>>>  - unfair lock
>>>>
>>>> One of the above lock types will be chosen at boot time depending on
>>>> a number of different factors.
>>>>
>>>> Ideally, the hypervisors should be able to pick the best performing
>>>> lock type for the current VM configuration. That is not currently
>>>> the case as the performance of each lock type are affected by many
>>>> different factors like the number of vCPUs in the VM, the amount vCPU
>>>> overcommitment, the CPU type and so on.
>>>>
>>>> Generally speaking, unfair lock performs well for VMs with a small
>>>> number of vCPUs. Native qspinlock may perform better than pvqspinlock
>>>> if there is vCPU pinning and there is no vCPU over-commitment.
>>>>
>>>> This patch adds a new kernel parameter to allow administrator to
>>>> choose the paravirt spinlock type to be used. VM administrators can
>>>> experiment with the different lock types and choose one that can best
>>>> suit their need, if they want to. Hypervisor developers can also use
>>>> that to experiment with different lock types so that they can come
>>>> up with a better algorithm to pick the best lock type.
>>>>
>>>> The hypervisor paravirt spinlock code will override this new parameter
>>>> in determining if pvqspinlock should be used. The parameter, however,
>>>> will override Xen's xen_nopvspin in term of disabling unfair lock.
>>> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others
>>> think?
>> I don't think we need xen_nopvspin, but I don't want to remove that
>> without agreement from the community.
> I also don't think xen_nopvspin will be needed after pvlock_type is
> introduced.
>
> -boris

Another reason that I didn't try to remove xen_nopvspin is the backward
compatibility concern. One way to handle it is to deprecate it and
treat it as an alias of pvlock_type=queued.

Cheers,
Longman
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
On 11/01/2017 11:51 AM, Juergen Gross wrote:
> On 01/11/17 16:32, Waiman Long wrote:
>> Currently, there are 3 different lock types that can be chosen for
>> the x86 architecture:
>>
>>  - qspinlock
>>  - pvqspinlock
>>  - unfair lock
>>
>> One of the above lock types will be chosen at boot time depending on
>> a number of different factors.
>>
>> Ideally, the hypervisors should be able to pick the best performing
>> lock type for the current VM configuration. That is not currently
>> the case as the performance of each lock type are affected by many
>> different factors like the number of vCPUs in the VM, the amount vCPU
>> overcommitment, the CPU type and so on.
>>
>> Generally speaking, unfair lock performs well for VMs with a small
>> number of vCPUs. Native qspinlock may perform better than pvqspinlock
>> if there is vCPU pinning and there is no vCPU over-commitment.
>>
>> This patch adds a new kernel parameter to allow administrator to
>> choose the paravirt spinlock type to be used. VM administrators can
>> experiment with the different lock types and choose one that can best
>> suit their need, if they want to. Hypervisor developers can also use
>> that to experiment with different lock types so that they can come
>> up with a better algorithm to pick the best lock type.
>>
>> The hypervisor paravirt spinlock code will override this new parameter
>> in determining if pvqspinlock should be used. The parameter, however,
>> will override Xen's xen_nopvspin in term of disabling unfair lock.
> Hmm, I'm not sure we need pvlock_type _and_ xen_nopvspin. What do others
> think?

I don't think we need xen_nopvspin, but I don't want to remove that
without agreement from the community.
>>  DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
>>  
>>  void __init native_pv_lock_init(void)
>>  {
>> -if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>> +if (pv_spinlock_type == locktype_unfair)
>> +return;
>> +
>> +if (!static_cpu_has(X86_FEATURE_HYPERVISOR) ||
>> +   (pv_spinlock_type != locktype_auto))
>>  static_branch_disable(&virt_spin_lock_key);
> Really? I don't think locktype_paravirt should disable the static key.

With the paravirt spinlock, it doesn't matter if the static key is disabled
or not. Without CONFIG_PARAVIRT_SPINLOCKS, however, it does degenerate
into the native qspinlock. So you are right, I should check for the paravirt
type as well.

Cheers,
Longman



[PATCH] x86/paravirt: Add kernel parameter to choose paravirt lock type

2017-11-01 Thread Waiman Long
Currently, there are 3 different lock types that can be chosen for
the x86 architecture:

 - qspinlock
 - pvqspinlock
 - unfair lock

One of the above lock types will be chosen at boot time depending on
a number of different factors.

Ideally, the hypervisors should be able to pick the best performing
lock type for the current VM configuration. That is not currently
the case as the performance of each lock type are affected by many
different factors like the number of vCPUs in the VM, the amount vCPU
overcommitment, the CPU type and so on.

Generally speaking, unfair lock performs well for VMs with a small
number of vCPUs. Native qspinlock may perform better than pvqspinlock
if there is vCPU pinning and there is no vCPU over-commitment.

This patch adds a new kernel parameter to allow administrators to
choose the paravirt spinlock type to be used. VM administrators can
experiment with the different lock types and choose one that can best
suit their need, if they want to. Hypervisor developers can also use
that to experiment with different lock types so that they can come
up with a better algorithm to pick the best lock type.

The hypervisor paravirt spinlock code will override this new parameter
in determining if pvqspinlock should be used. The parameter, however,
will override Xen's xen_nopvspin in terms of disabling the unfair lock.
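
For example, an administrator experimenting with the lock types could then
boot the guest with one of the following (illustrative command-line fragments
matching the options documented above):

	pvlock_type=queued	# native queued spinlock
	pvlock_type=pv		# paravirt queued spinlock
	pvlock_type=unfair	# simple TATAS unfair lock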

Signed-off-by: Waiman Long <long...@redhat.com>
---
 Documentation/admin-guide/kernel-parameters.txt |  7 +
 arch/x86/include/asm/paravirt.h |  9 ++
 arch/x86/kernel/kvm.c   |  4 +++
 arch/x86/kernel/paravirt.c  | 40 -
 arch/x86/xen/spinlock.c |  6 ++--
 5 files changed, 62 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index f7df49d..c98d9c7 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3275,6 +3275,13 @@
[KNL] Number of legacy pty's. Overwrites compiled-in
default number.
 
+   pvlock_type=[X86,PV_OPS]
+   Specify the paravirt spinlock type to be used.
+   Options are:
+   queued - native queued spinlock
+   pv - paravirt queued spinlock
+   unfair - simple TATAS unfair lock
+
quiet   [KNL] Disable most log messages
 
r128=   [HW,DRM]
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 12deec7..941a046 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -690,6 +690,15 @@ static __always_inline bool pv_vcpu_is_preempted(long cpu)
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
 
+enum pv_spinlock_type {
+   locktype_auto,
+   locktype_queued,
+   locktype_paravirt,
+   locktype_unfair,
+};
+
+extern enum pv_spinlock_type pv_spinlock_type;
+
 #ifdef CONFIG_X86_32
 #define PV_SAVE_REGS "pushl %ecx; pushl %edx;"
 #define PV_RESTORE_REGS "popl %edx; popl %ecx;"
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 8bb9594..3a5d3ec4 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -646,6 +646,10 @@ void __init kvm_spinlock_init(void)
if (!kvm_para_has_feature(KVM_FEATURE_PV_UNHALT))
return;
 
+   if ((pv_spinlock_type == locktype_queued) ||
+   (pv_spinlock_type == locktype_unfair))
+   return;
+
__pv_init_lock_hash();
pv_lock_ops.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 041096b..ca35cd3 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -115,11 +115,48 @@ unsigned paravirt_patch_jmp(void *insnbuf, const void 
*target,
return 5;
 }
 
+/*
+ * The kernel argument "pvlock_type=" can be used to explicitly specify
+ * which type of spinlocks to be used. Currently, there are 3 options:
+ * 1) queued - the native queued spinlock
+ * 2) pv - the paravirt queued spinlock (if CONFIG_PARAVIRT_SPINLOCKS)
+ * 3) unfair - the simple TATAS unfair lock
+ *
+ * If this argument is not specified, the kernel will automatically choose
+ * an appropriate one depending on X86_FEATURE_HYPERVISOR and hypervisor
+ * specific settings.
+ */
+enum pv_spinlock_type __read_mostly pv_spinlock_type = locktype_auto;
+
+static int __init pvlock_setup(char *s)
+{
+   if (!s)
+   return -EINVAL;
+
+   if (!strcmp(s, "queued"))
+   pv_spinlock_type = locktype_queued;
+   else if (!strcmp(s, "pv"))
+   pv_spinlock_type = locktype_paravirt;
+   else if (!strcmp(

Re: [PATCH v3 0/2] guard virt_spin_lock() with a static key

2017-09-25 Thread Waiman Long
On 09/25/2017 09:59 AM, Juergen Gross wrote:
> Ping?
>
> On 06/09/17 19:36, Juergen Gross wrote:
>> With virt_spin_lock() being guarded by a static key the bare metal case
>> can be optimized by patching the call away completely. In case a kernel
>> running as a guest it can decide whether to use paravirtualized
>> spinlocks, the current fallback to the unfair test-and-set scheme, or
>> to mimic the bare metal behavior.
>>
>> V3:
>> - remove test for hypervisor environment from virt_spin_lock() as
>>   suggested by Waiman Long
>>
>> V2:
>> - use static key instead of making virt_spin_lock() a pvops function
>>
>> Juergen Gross (2):
>>   paravirt/locks: use new static key for controlling call of
>> virt_spin_lock()
>>   paravirt,xen: correct xen_nopvspin case
>>
>>  arch/x86/include/asm/qspinlock.h | 11 ++-
>>  arch/x86/kernel/paravirt-spinlocks.c |  6 ++
>>  arch/x86/kernel/smpboot.c|  2 ++
>>  arch/x86/xen/spinlock.c  |  2 ++
>>  kernel/locking/qspinlock.c   |  4 
>>  5 files changed, 24 insertions(+), 1 deletion(-)
>>
Acked-by: Waiman Long <long...@redhat.com>



Re: [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()

2017-09-06 Thread Waiman Long
On 09/06/2017 12:04 PM, Peter Zijlstra wrote:
> On Wed, Sep 06, 2017 at 11:49:49AM -0400, Waiman Long wrote:
>>>  #define virt_spin_lock virt_spin_lock
>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>  {
>>> +   if (!static_branch_likely(&virt_spin_lock_key))
>>> +   return false;
>>> if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> return false;
>>>  
> Now native has two NOPs instead of one. Can't we merge these two static
> branches?


I guess we can remove the static_cpu_has() call. The only difference is that
any spin_lock calls before native_pv_lock_init() will use virt_spin_lock().
That is still OK, as the init call is done before SMP starts.
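
The merged form would then look roughly like this (a sketch only; the
test-and-set fallback body is the existing one from the header):

	static inline bool virt_spin_lock(struct qspinlock *lock)
	{
		if (!static_branch_likely(&virt_spin_lock_key))
			return false;

		/*
		 * On hypervisors without PV spinlock support, fall back to a
		 * simple test-and-set lock.
		 */
		do {
			while (atomic_read(&lock->val) != 0)
				cpu_relax();
		} while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);

		return true;
	}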

Cheers,
Longman



Re: [PATCH v2 2/2] paravirt,xen: correct xen_nopvspin case

2017-09-06 Thread Waiman Long
On 09/06/2017 11:29 AM, Juergen Gross wrote:
> With the boot parameter "xen_nopvspin" specified a Xen guest should not
> make use of paravirt spinlocks, but behave as if running on bare
> metal. This is not true, however, as the qspinlock code will fall back
> to a test-and-set scheme when it is detecting a hypervisor.
>
> In order to avoid this disable the virt_spin_lock_key.
>
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
>  arch/x86/xen/spinlock.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
> index 25a7c4302ce7..e8ab80ad7a6f 100644
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -10,6 +10,7 @@
>  #include 
>  
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -129,6 +130,7 @@ void __init xen_init_spinlocks(void)
>  
>   if (!xen_pvspin) {
>   printk(KERN_DEBUG "xen: PV spinlocks disabled\n");
> + static_branch_disable(&virt_spin_lock_key);
>   return;
>   }
>   printk(KERN_DEBUG "xen: PV spinlocks enabled\n");

Acked-by: Waiman Long <long...@redhat.com>



Re: [PATCH v2 1/2] paravirt/locks: use new static key for controlling call of virt_spin_lock()

2017-09-06 Thread Waiman Long
On 09/06/2017 11:29 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Add a static key controlling whether virt_spin_lock() should be
> called or not. When running on bare metal set the new key to false.
>
> Signed-off-by: Juergen Gross <jgr...@suse.com>
> ---
>  arch/x86/include/asm/qspinlock.h | 11 +++
>  arch/x86/kernel/paravirt-spinlocks.c |  6 ++
>  arch/x86/kernel/smpboot.c|  2 ++
>  kernel/locking/qspinlock.c   |  4 
>  4 files changed, 23 insertions(+)
>
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fc39389f196b 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -1,6 +1,7 @@
>  #ifndef _ASM_X86_QSPINLOCK_H
>  #define _ASM_X86_QSPINLOCK_H
>  
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -46,9 +47,15 @@ static inline void queued_spin_unlock(struct qspinlock 
> *lock)
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
> +DECLARE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +
> +void native_pv_lock_init(void) __init;
> +
>  #define virt_spin_lock virt_spin_lock
>  static inline bool virt_spin_lock(struct qspinlock *lock)
>  {
> + if (!static_branch_likely(&virt_spin_lock_key))
> + return false;
>   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>   return false;
>  
> @@ -65,6 +72,10 @@ static inline bool virt_spin_lock(struct qspinlock *lock)
>  
>   return true;
>  }
> +#else
> +static inline void native_pv_lock_init(void)
> +{
> +}
>  #endif /* CONFIG_PARAVIRT */
>  
>  #include 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
> b/arch/x86/kernel/paravirt-spinlocks.c
> index 8f2d1c9d43a8..2fc65ddea40d 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -42,3 +42,9 @@ struct pv_lock_ops pv_lock_ops = {
>  #endif /* SMP */
>  };
>  EXPORT_SYMBOL(pv_lock_ops);
> +
> +void __init native_pv_lock_init(void)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + static_branch_disable(&virt_spin_lock_key);
> +}
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 54b9e89d4d6b..21500d3ba359 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -77,6 +77,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* Number of siblings per CPU package */
>  int smp_num_siblings = 1;
> @@ -1381,6 +1382,7 @@ void __init native_smp_prepare_boot_cpu(void)
>   /* already set me in cpu_online_mask in boot_cpu_init() */
>   cpumask_set_cpu(me, cpu_callout_mask);
>   cpu_set_state_online(me);
> + native_pv_lock_init();
>  }
>  
>  void __init native_smp_cpus_done(unsigned int max_cpus)
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 294294c71ba4..838d235b87ef 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -76,6 +76,10 @@
>  #define MAX_NODES4
>  #endif
>  
> +#ifdef CONFIG_PARAVIRT
> +DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);
> +#endif
> +
>  /*
>   * Per-CPU queue node structures; we can never have more than 4 nested
>   * contexts: task, softirq, hardirq, nmi.

Acked-by: Waiman Long <long...@redhat.com>



Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-06 Thread Waiman Long
On 09/06/2017 03:08 AM, Peter Zijlstra wrote:
> Guys, please trim email.
>
> On Tue, Sep 05, 2017 at 10:31:46AM -0400, Waiman Long wrote:
>> For clarification, I was actually asking if you had considered just
>> adding one more jump label to skip it for Xen/KVM instead of making
>> virt_spin_lock() a pv-op.
> I don't understand. What performance are you worried about. Native will
> now do: "xor rax,rax; jnz some_cold_label" that's fairly trivial code.

It is not native that I am talking about. I am worried about a VM with
a non-Xen/KVM hypervisor where virt_spin_lock() will actually be called.
Now that function will become a callee-saved function call instead of
being inlined into the native slowpath function.

Cheers,
Longman



Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:24 AM, Waiman Long wrote:
> On 09/05/2017 10:18 AM, Juergen Gross wrote:
>> On 05/09/17 16:10, Waiman Long wrote:
>>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>>> There are cases where a guest tries to switch spinlocks to bare metal
>>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
>>>> has the downside of falling back to unfair test and set scheme for
>>>> qspinlocks due to virt_spin_lock() detecting the virtualized
>>>> environment.
>>>>
>>>> Make virt_spin_lock() a paravirt operation in order to enable users
>>>> to select an explicit behavior like bare metal.
>>>>
>>>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>>>> ---
>>>>  arch/x86/include/asm/paravirt.h   |  5 
>>>>  arch/x86/include/asm/paravirt_types.h |  1 +
>>>>  arch/x86/include/asm/qspinlock.h  | 48 
>>>> ---
>>>>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>>>>  arch/x86/kernel/smpboot.c |  2 ++
>>>>  5 files changed, 55 insertions(+), 15 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/paravirt.h 
>>>> b/arch/x86/include/asm/paravirt.h
>>>> index c25dd22f7c70..d9e954fb37df 100644
>>>> --- a/arch/x86/include/asm/paravirt.h
>>>> +++ b/arch/x86/include/asm/paravirt.h
>>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
>>>> cpu)
>>>>return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>>>>  }
>>>>  
>>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
>>>> +{
>>>> +  return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
>>>> +}
>>>> +
>>>>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>>>>  
>>>>  #ifdef CONFIG_X86_32
>>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>>> b/arch/x86/include/asm/paravirt_types.h
>>>> index 19efefc0e27e..928f5e7953a7 100644
>>>> --- a/arch/x86/include/asm/paravirt_types.h
>>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>>> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>>>>void (*kick)(int cpu);
>>>>  
>>>>struct paravirt_callee_save vcpu_is_preempted;
>>>> +  struct paravirt_callee_save virt_spin_lock;
>>>>  } __no_randomize_layout;
>>>>  
>>>>  /* This contains all the paravirt structures: we get a convenient
>>>> diff --git a/arch/x86/include/asm/qspinlock.h 
>>>> b/arch/x86/include/asm/qspinlock.h
>>>> index 48a706f641f2..fbd98896385c 100644
>>>> --- a/arch/x86/include/asm/qspinlock.h
>>>> +++ b/arch/x86/include/asm/qspinlock.h
>>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
>>>> qspinlock *lock)
>>>>smp_store_release((u8 *)lock, 0);
>>>>  }
>>>>  
>>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>>> +{
>>>> +  if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>>> +  return false;
>>>> +
>>>> +  /*
>>>> +   * On hypervisors without PARAVIRT_SPINLOCKS support we fall
>>>> +   * back to a Test-and-Set spinlock, because fair locks have
>>>> +   * horrible lock 'holder' preemption issues.
>>>> +   */
>>>> +
>>>> +  do {
>>>> +  while (atomic_read(&lock->val) != 0)
>>>> +  cpu_relax();
>>>> +  } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
>>>> +
>>>> +  return true;
>>>> +}
>>>> +
>>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
>>>> val);
>>>>  extern void __pv_init_lock_hash(void);
>>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>>>>  {
>>>>return pv_vcpu_is_preempted(cpu);
>>>>  }
>>>> +
>>>> +void native_pv_lock_init(void) __init;
>>>>  #else
>>>>  static inline void queued_spin_unlock(struct qspinlock *lock)
>>>>  {
>>>>native_queued_spin_unlock(lock);
>>>>  }
>>>> +
>>>> +static inline void native_pv_lock_init(void)
>>>> +{
>>>> +}
>>>>  #endif
>>>>  
>>>>  #ifdef CONFIG_PARAVIRT
>>>>  #define virt_spin_lock virt_spin_lock
>>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>>  {
>>>> -  if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>>> -  return false;
>>> Have you considered just adding one more jump label here to skip
>>> virt_spin_lock() when KVM or Xen wants to do so?
>> Why? Did you look at patch 4? This is the way to do it...
> I asked this because of my performance concern as stated later in the email.

For clarification, I was actually asking if you had considered just
adding one more jump label to skip it for Xen/KVM instead of making
virt_spin_lock() a pv-op.
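
A rough sketch of that alternative (the key name matches the v2 posting
earlier in this archive; illustration only): one extra static key,
default true, that any guest wanting bare-metal behavior simply disables
at boot, instead of routing virt_spin_lock() through a pv-op.

DEFINE_STATIC_KEY_TRUE(virt_spin_lock_key);

/* Bare metal: disable the key before SMP is brought up. */
void __init native_pv_lock_init(void)
{
	if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
		static_branch_disable(&virt_spin_lock_key);
}

/* Xen with "xen_nopvspin" (or KVM, if it wanted to) does the same. */
static void __init xen_nopvspin_demo(void)
{
	static_branch_disable(&virt_spin_lock_key);
}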

Cheers,
Longman


Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:18 AM, Juergen Gross wrote:
> On 05/09/17 16:10, Waiman Long wrote:
>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>> There are cases where a guest tries to switch spinlocks to bare metal
>>> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
>>> has the downside of falling back to unfair test and set scheme for
>>> qspinlocks due to virt_spin_lock() detecting the virtualized
>>> environment.
>>>
>>> Make virt_spin_lock() a paravirt operation in order to enable users
>>> to select an explicit behavior like bare metal.
>>>
>>> Signed-off-by: Juergen Gross <jgr...@suse.com>
>>> ---
>>>  arch/x86/include/asm/paravirt.h   |  5 
>>>  arch/x86/include/asm/paravirt_types.h |  1 +
>>>  arch/x86/include/asm/qspinlock.h  | 48 
>>> ---
>>>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>>>  arch/x86/kernel/smpboot.c |  2 ++
>>>  5 files changed, 55 insertions(+), 15 deletions(-)
>>>
>>> diff --git a/arch/x86/include/asm/paravirt.h 
>>> b/arch/x86/include/asm/paravirt.h
>>> index c25dd22f7c70..d9e954fb37df 100644
>>> --- a/arch/x86/include/asm/paravirt.h
>>> +++ b/arch/x86/include/asm/paravirt.h
>>> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
>>> cpu)
>>> return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>>>  }
>>>  
>>> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
>>> +}
>>> +
>>>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>>>  
>>>  #ifdef CONFIG_X86_32
>>> diff --git a/arch/x86/include/asm/paravirt_types.h 
>>> b/arch/x86/include/asm/paravirt_types.h
>>> index 19efefc0e27e..928f5e7953a7 100644
>>> --- a/arch/x86/include/asm/paravirt_types.h
>>> +++ b/arch/x86/include/asm/paravirt_types.h
>>> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>>> void (*kick)(int cpu);
>>>  
>>> struct paravirt_callee_save vcpu_is_preempted;
>>> +   struct paravirt_callee_save virt_spin_lock;
>>>  } __no_randomize_layout;
>>>  
>>>  /* This contains all the paravirt structures: we get a convenient
>>> diff --git a/arch/x86/include/asm/qspinlock.h 
>>> b/arch/x86/include/asm/qspinlock.h
>>> index 48a706f641f2..fbd98896385c 100644
>>> --- a/arch/x86/include/asm/qspinlock.h
>>> +++ b/arch/x86/include/asm/qspinlock.h
>>> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
>>> qspinlock *lock)
>>> smp_store_release((u8 *)lock, 0);
>>>  }
>>>  
>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> +   return false;
>>> +
>>> +   /*
>>> +* On hypervisors without PARAVIRT_SPINLOCKS support we fall
>>> +* back to a Test-and-Set spinlock, because fair locks have
>>> +* horrible lock 'holder' preemption issues.
>>> +*/
>>> +
>>> +   do {
>>> +   while (atomic_read(&lock->val) != 0)
>>> +   cpu_relax();
>>> +   } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
>>> +
>>> +   return true;
>>> +}
>>> +
>>>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
>>> val);
>>>  extern void __pv_init_lock_hash(void);
>>> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>>>  {
>>> return pv_vcpu_is_preempted(cpu);
>>>  }
>>> +
>>> +void native_pv_lock_init(void) __init;
>>>  #else
>>>  static inline void queued_spin_unlock(struct qspinlock *lock)
>>>  {
>>> native_queued_spin_unlock(lock);
>>>  }
>>> +
>>> +static inline void native_pv_lock_init(void)
>>> +{
>>> +}
>>>  #endif
>>>  
>>>  #ifdef CONFIG_PARAVIRT
>>>  #define virt_spin_lock virt_spin_lock
>>> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>>>  static inline bool virt_spin_lock(struct qspinlock *lock)
>>>  {
>>> -   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> -   return false;
>> Have you consider just add one more j

Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 10:08 AM, Peter Zijlstra wrote:
> On Tue, Sep 05, 2017 at 10:02:57AM -0400, Waiman Long wrote:
>> On 09/05/2017 09:24 AM, Juergen Gross wrote:
>>> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
>>> +{
>>> +   if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
>>> +   return false;
>>> +
>> I think you can take the above if statement out as you have already
>> done the test in native_pv_lock_init(). So the test will also be false
>> here.
> That does mean we'll run a test-and-set spinlock until paravirt patching
> happens though. I prefer to not do that.
>
> One important point.. we must not be holding any locks when we switch
> over between the two locks. Back then I spend some time making sure that
> didn't happen with the X86 feature flag muck.

AFAICT, native_pv_lock_init() is called before SMP init. So it shouldn't
matter.

Cheers,
Longman



Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 09:24 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Make virt_spin_lock() a paravirt operation in order to enable users
> to select an explicit behavior like bare metal.
>
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   |  5 
>  arch/x86/include/asm/paravirt_types.h |  1 +
>  arch/x86/include/asm/qspinlock.h  | 48 
> ---
>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>  arch/x86/kernel/smpboot.c |  2 ++
>  5 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25dd22f7c70..d9e954fb37df 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
> cpu)
>   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>  }
>  
> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
> +{
> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
> +}
> +
>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> index 19efefc0e27e..928f5e7953a7 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>   void (*kick)(int cpu);
>  
>   struct paravirt_callee_save vcpu_is_preempted;
> + struct paravirt_callee_save virt_spin_lock;
>  } __no_randomize_layout;
>  
>  /* This contains all the paravirt structures: we get a convenient
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fbd98896385c 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
> qspinlock *lock)
>   smp_store_release((u8 *)lock, 0);
>  }
>  
> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +
> + /*
> +  * On hypervisors without PARAVIRT_SPINLOCKS support we fall
> +  * back to a Test-and-Set spinlock, because fair locks have
> +  * horrible lock 'holder' preemption issues.
> +  */
> +
> + do {
> + while (atomic_read(&lock->val) != 0)
> + cpu_relax();
> + } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
> +
> + return true;
> +}
> +
>  #ifdef CONFIG_PARAVIRT_SPINLOCKS
>  extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 
> val);
>  extern void __pv_init_lock_hash(void);
> @@ -38,33 +57,32 @@ static inline bool vcpu_is_preempted(long cpu)
>  {
>   return pv_vcpu_is_preempted(cpu);
>  }
> +
> +void native_pv_lock_init(void) __init;
>  #else
>  static inline void queued_spin_unlock(struct qspinlock *lock)
>  {
>   native_queued_spin_unlock(lock);
>  }
> +
> +static inline void native_pv_lock_init(void)
> +{
> +}
>  #endif
>  
>  #ifdef CONFIG_PARAVIRT
>  #define virt_spin_lock virt_spin_lock
> +#ifdef CONFIG_PARAVIRT_SPINLOCKS
>  static inline bool virt_spin_lock(struct qspinlock *lock)
>  {
> - if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> - return false;

Have you considered just adding one more jump label here to skip
virt_spin_lock() when KVM or Xen wants to do so?

> -
> - /*
> -  * On hypervisors without PARAVIRT_SPINLOCKS support we fall
> -  * back to a Test-and-Set spinlock, because fair locks have
> -  * horrible lock 'holder' preemption issues.
> -  */
> -
> - do {
> - while (atomic_read(&lock->val) != 0)
> - cpu_relax();
> - } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
> -
> - return true;
> + return pv_virt_spin_lock(lock);
> +}
> +#else
> +static inline bool virt_spin_lock(struct qspinlock *lock)
> +{
> + return native_virt_spin_lock(lock);
>  }
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
>  #endif /* CONFIG_PARAVIRT */
>  
>  #include 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
> b/arch/x86/kernel/paravirt-spinlocks.c
> index 26e4bd92f309..1be187ef8a38 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -20,6 +20,12 @@ bool pv_is_native_spin_unlock(void)
>   __raw_callee_save___native_queued_spin_unlock;
>  }
>  
> +__visible bool __native_virt_spin_lock(struct qspinlock *lock)
> +{
> + return native_virt_spin_lock(lock);
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(__native_virt_spin_lock);

I have some 

Re: [PATCH 3/4] paravirt: add virt_spin_lock pvops function

2017-09-05 Thread Waiman Long
On 09/05/2017 09:24 AM, Juergen Gross wrote:
> There are cases where a guest tries to switch spinlocks to bare metal
> behavior (e.g. by setting "xen_nopvspin" boot parameter). Today this
> has the downside of falling back to unfair test and set scheme for
> qspinlocks due to virt_spin_lock() detecting the virtualized
> environment.
>
> Make virt_spin_lock() a paravirt operation in order to enable users
> to select an explicit behavior like bare metal.
>
> Signed-off-by: Juergen Gross 
> ---
>  arch/x86/include/asm/paravirt.h   |  5 
>  arch/x86/include/asm/paravirt_types.h |  1 +
>  arch/x86/include/asm/qspinlock.h  | 48 
> ---
>  arch/x86/kernel/paravirt-spinlocks.c  | 14 ++
>  arch/x86/kernel/smpboot.c |  2 ++
>  5 files changed, 55 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index c25dd22f7c70..d9e954fb37df 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -725,6 +725,11 @@ static __always_inline bool pv_vcpu_is_preempted(long 
> cpu)
>   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
>  }
>  
> +static __always_inline bool pv_virt_spin_lock(struct qspinlock *lock)
> +{
> + return PVOP_CALLEE1(bool, pv_lock_ops.virt_spin_lock, lock);
> +}
> +
>  #endif /* SMP && PARAVIRT_SPINLOCKS */
>  
>  #ifdef CONFIG_X86_32
> diff --git a/arch/x86/include/asm/paravirt_types.h 
> b/arch/x86/include/asm/paravirt_types.h
> index 19efefc0e27e..928f5e7953a7 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -319,6 +319,7 @@ struct pv_lock_ops {
>   void (*kick)(int cpu);
>  
>   struct paravirt_callee_save vcpu_is_preempted;
> + struct paravirt_callee_save virt_spin_lock;
>  } __no_randomize_layout;
>  
>  /* This contains all the paravirt structures: we get a convenient
> diff --git a/arch/x86/include/asm/qspinlock.h 
> b/arch/x86/include/asm/qspinlock.h
> index 48a706f641f2..fbd98896385c 100644
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -17,6 +17,25 @@ static inline void native_queued_spin_unlock(struct 
> qspinlock *lock)
>   smp_store_release((u8 *)lock, 0);
>  }
>  
> +static inline bool native_virt_spin_lock(struct qspinlock *lock)
> +{
> + if (!static_cpu_has(X86_FEATURE_HYPERVISOR))
> + return false;
> +

I think you can take the above if statement out as you have already done
the test in native_pv_lock_init(). So the test will also be false here.

As this patch series is x86 specific, you should probably add "x86/" in
front of paravirt in the patch titles.

Cheers,
Longman



[PATCH v5 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-20 Thread Waiman Long
It was found that when running a fio sequential write test with an
XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the
%CPU times as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from the stack, as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

   I/O Type      w/o patch    with patch
   ------------  -----------  -----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read      8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write     1681.3 MB/s  1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was checked by using a test kernel module to
compare the output of the C __kvm_vcpu_is_preempted() with that of the
assembly __raw_callee_save___kvm_vcpu_is_preempt() and verify that they
matched.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/asm-offsets_64.c |  9 +
 arch/x86/kernel/kvm.c| 24 
 2 files changed, 33 insertions(+)

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 210927e..99332f5 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -13,6 +13,10 @@
 #include 
 };
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+#include 
+#endif
+
 int main(void)
 {
 #ifdef CONFIG_PARAVIRT
@@ -22,6 +26,11 @@ int main(void)
BLANK();
 #endif
 
+#if defined(CONFIG_KVM_GUEST) && defined(CONFIG_PARAVIRT_SPINLOCKS)
+   OFFSET(KVM_STEAL_TIME_preempted, kvm_steal_time, preempted);
+   BLANK();
+#endif
+
 #define ENTRY(entry) OFFSET(pt_regs_ ## entry, pt_regs, entry)
ENTRY(bx);
ENTRY(cx);
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..14f65a5 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,6 +598,29 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+#include 
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack.
+ */
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(KVM_STEAL_TIME_preempted) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
-- 
1.8.3.1



[PATCH v5 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-20 Thread Waiman Long
 v4->v5:
  - As suggested by PeterZ, use the asm-offsets header file generation
mechanism to get the offset of the preempted field in
kvm_steal_time instead of hardcoding it.

 v3->v4:
  - Fix x86-32 build error.

 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of the callee-save vcpu_is_preempted()
can have some impact on system performance on a VM guest, especially
an x86-64 guest, this patch set intends to reduce this performance
overhead by replacing the C __kvm_vcpu_is_preempted() function with
an optimized version of __raw_callee_save___kvm_vcpu_is_preempted()
written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c |  9 +
 arch/x86/kernel/kvm.c| 26 +-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 5 files changed, 37 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH v5 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-20 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
downcast from long to int is not a problem as the vCPU number won't
exceed 32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca..f75fbfe 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1



Re: [PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-16 Thread Waiman Long
On 02/16/2017 11:09 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:49PM -0500, Waiman Long wrote:
>> The cpu argument in the function prototype of vcpu_is_preempted()
>> is changed from int to long. That makes it easier to provide a better
>> optimized assembly version of that function.
>>
>> For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int), the
>> downcast from long to int is not a problem as vCPU number won't exceed
>> 32 bits.
>>
> Note that because of the cast in PVOP_CALL_ARG1() this patch is
> pointless.
>
> Then again, it doesn't seem to affect code generation, so why not. Takes
> away the reliance on that weird cast.

I added this patch because I am a bit uneasy about clearing the upper 32
bits of rdi and assuming that the compiler won't have a previous use of
those bits. It gives me peace of mind.
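
A small illustration of the difference (hypothetical demo symbols; the
16 here mirrors the PREEMPTED_OFFSET value used in patch 2 of this
version of the series):

/* bool f(int cpu): only %edi is defined, so sign-extend before indexing. */
asm(
".pushsection .text;"
".global vcpu_is_preempted_int_demo;"
"vcpu_is_preempted_int_demo:"
"movslq %edi, %rdi;"
"movq   __per_cpu_offset(,%rdi,8), %rax;"
"cmpb   $0, steal_time+16(%rax);"
"setne  %al;"
"ret;"
".popsection");

/* bool f(long cpu): all of %rdi is already valid, no sign extension. */
asm(
".pushsection .text;"
".global vcpu_is_preempted_long_demo;"
"vcpu_is_preempted_long_demo:"
"movq   __per_cpu_offset(,%rdi,8), %rax;"
"cmpb   $0, steal_time+16(%rax);"
"setne  %al;"
"ret;"
".popsection");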

Cheers,
Longman



Re: [PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-16 Thread Waiman Long
On 02/16/2017 11:48 AM, Peter Zijlstra wrote:
> On Wed, Feb 15, 2017 at 04:37:50PM -0500, Waiman Long wrote:
>> +/*
>> + * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
>> + * restoring to/from the stack. It is assumed that the preempted value
>> + * is at an offset of 16 from the beginning of the kvm_steal_time structure
>> + * which is verified by the BUILD_BUG_ON() macro below.
>> + */
>> +#define PREEMPTED_OFFSET    16
> As per Andrew's suggestion, the 'right' way is something like so.

Thanks for the tip. I was not aware of the asm-offsets stuff. I will
update the patch to use it.
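
For reference, a simplified sketch of that asm-offsets mechanism (the
real helper macros live in include/linux/kbuild.h): a small C file is
compiled to assembly, the "->" markers are extracted from the output,
and the result becomes the generated asm-offsets.h header that
hand-written assembly can use.

#include <linux/stddef.h>

/*
 * Simplified kbuild helpers.  OFFSET(KVM_STEAL_TIME_preempted,
 * kvm_steal_time, preempted) in asm-offsets_64.c ends up as something
 * like "#define KVM_STEAL_TIME_preempted 16" in asm-offsets.h (the
 * value depends on the structure layout).
 */
#define DEFINE(sym, val) \
	asm volatile("\n.ascii \"->" #sym " %0 " #val "\"" : : "i" (val))

#define OFFSET(sym, str, mem) \
	DEFINE(sym, offsetof(struct str, mem))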

Cheers,
Longman



[PATCH v4 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-15 Thread Waiman Long
It was found that when running a fio sequential write test with an
XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the
%CPU times as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from the stack, as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

   I/O Type      w/o patch    with patch
   ------------  -----------  -----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read      8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write     1681.3 MB/s  1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was checked by using a test kernel module to
compare the output of the C __kvm_vcpu_is_preempted() with that of the
assembly __raw_callee_save___kvm_vcpu_is_preempt() and verify that they
matched.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/kvm.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..e423435 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,11 +598,40 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack. It is assumed that the preempted value
+ * is at an offset of 16 from the beginning of the kvm_steal_time structure
+ * which is verified by the BUILD_BUG_ON() macro below.
+ */
+#define PREEMPTED_OFFSET   16
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
+#ifdef CONFIG_X86_64
+   BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
+   != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
+#endif
+
if (!kvm_para_available())
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-- 
1.8.3.1



[PATCH v4 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-15 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
downcast from long to int is not a problem as the vCPU number won't
exceed 32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 1eea6ca..f75fbfe 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,7 +673,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1



[PATCH v4 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-15 Thread Waiman Long
 v3->v4:
  - Fix x86-32 build error.

 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of the callee-save vcpu_is_preempted()
can have some impact on system performance on a VM guest, especially
an x86-64 guest, this patch set intends to reduce this performance
overhead by replacing the C __kvm_vcpu_is_preempted() function with
an optimized version of __raw_callee_save___kvm_vcpu_is_preempted()
written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/kvm.c| 32 +++-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 4 files changed, 34 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH v3 1/2] x86/paravirt: Change vcp_is_preempted() arg type to long

2017-02-15 Thread Waiman Long
The cpu argument in the function prototype of vcpu_is_preempted()
is changed from int to long. That makes it easier to provide a better
optimized assembly version of that function.

For Xen, vcpu_is_preempted(long) calls xen_vcpu_stolen(int); the
downcast from long to int is not a problem as the vCPU number won't
exceed 32 bits.

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h  | 2 +-
 arch/x86/include/asm/qspinlock.h | 2 +-
 arch/x86/kernel/kvm.c| 2 +-
 arch/x86/kernel/paravirt-spinlocks.c | 2 +-
 4 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2a46f73 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -674,7 +674,7 @@ static __always_inline void pv_kick(int cpu)
PVOP_VCALL1(pv_lock_ops.kick, cpu);
 }
 
-static __always_inline bool pv_vcpu_is_preempted(int cpu)
+static __always_inline bool pv_vcpu_is_preempted(long cpu)
 {
return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index c343ab5..48a706f 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,7 +34,7 @@ static inline void queued_spin_unlock(struct qspinlock *lock)
 }
 
 #define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
+static inline bool vcpu_is_preempted(long cpu)
 {
return pv_vcpu_is_preempted(cpu);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..85ed343 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,7 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
-__visible bool __kvm_vcpu_is_preempted(int cpu)
+__visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..8f2d1c9 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -20,7 +20,7 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
 }
 
-__visible bool __native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(long cpu)
 {
return false;
 }
-- 
1.8.3.1



[PATCH v3 0/2] x86/kvm: Reduce vcpu_is_preempted() overhead

2017-02-15 Thread Waiman Long
 v2->v3:
  - Provide an optimized __raw_callee_save___kvm_vcpu_is_preempted()
in assembly as suggested by PeterZ.
  - Add a new patch to change vcpu_is_preempted() argument type to long
to ease the writing of the assembly code.

 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
patch wasn't changed.
  - Drop patch 2.

As it was found that the overhead of the callee-save vcpu_is_preempted()
can have some impact on system performance on a VM guest, especially
an x86-64 guest, this patch set intends to reduce this performance
overhead by replacing the C __kvm_vcpu_is_preempted() function with
an optimized version of __raw_callee_save___kvm_vcpu_is_preempted()
written in assembly.

Waiman Long (2):
  x86/paravirt: Change vcp_is_preempted() arg type to long
  x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

 arch/x86/include/asm/paravirt.h  |  2 +-
 arch/x86/include/asm/qspinlock.h |  2 +-
 arch/x86/kernel/kvm.c| 30 +-
 arch/x86/kernel/paravirt-spinlocks.c |  2 +-
 4 files changed, 32 insertions(+), 4 deletions(-)

-- 
1.8.3.1



[PATCH v3 2/2] x86/kvm: Provide optimized version of vcpu_is_preempted() for x86-64

2017-02-15 Thread Waiman Long
It was found that when running a fio sequential write test with an
XFS ramdisk on a KVM guest running on a 2-socket x86-64 system, the
%CPU times as reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64, primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from the stack, as well as one more level of function call.

To reduce this performance overhead, an optimized assembly version
of the __raw_callee_save___kvm_vcpu_is_preempt() function is
provided for x86-64.

With this patch applied on a KVM guest on a 2-socket 16-core 32-thread
system with 16 parallel jobs (8 on each socket), the aggregate
bandwidth of the fio test on an XFS ramdisk was as follows:

   I/O Type      w/o patch    with patch
   ------------  -----------  -----------
   random read   8141.2 MB/s  8497.1 MB/s
   seq read      8229.4 MB/s  8304.2 MB/s
   random write  1675.5 MB/s  1701.5 MB/s
   seq write     1681.3 MB/s  1699.9 MB/s

There are some increases in the aggregated bandwidth because of
the patch.

The perf data now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

The assembly code was checked by using a test kernel module to
compare the output of the C __kvm_vcpu_is_preempted() with that of the
assembly __raw_callee_save___kvm_vcpu_is_preempt() and verify that they
matched.

Suggested-by: Peter Zijlstra <pet...@infradead.org>
Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/kernel/kvm.c | 28 
 1 file changed, 28 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 85ed343..83e22c1 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
 }
 
+#ifdef CONFIG_X86_32
 __visible bool __kvm_vcpu_is_preempted(long cpu)
 {
struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
@@ -597,11 +598,38 @@ __visible bool __kvm_vcpu_is_preempted(long cpu)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
+#else
+
+extern bool __raw_callee_save___kvm_vcpu_is_preempted(long);
+
+/*
+ * Hand-optimize version for x86-64 to avoid 8 64-bit register saving and
+ * restoring to/from the stack. It is assumed that the preempted value
+ * is at an offset of 16 from the beginning of the kvm_steal_time structure
+ * which is verified by the BUILD_BUG_ON() macro below.
+ */
+#define PREEMPTED_OFFSET   16
+asm(
+".pushsection .text;"
+".global __raw_callee_save___kvm_vcpu_is_preempted;"
+".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
+"__raw_callee_save___kvm_vcpu_is_preempted:"
+"movq  __per_cpu_offset(,%rdi,8), %rax;"
+"cmpb  $0, " __stringify(PREEMPTED_OFFSET) "+steal_time(%rax);"
+"setne %al;"
+"ret;"
+".popsection");
+
+#endif
+
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
  */
 void __init kvm_spinlock_init(void)
 {
+   BUILD_BUG_ON((offsetof(struct kvm_steal_time, preempted)
+   != PREEMPTED_OFFSET) || (sizeof(steal_time.preempted) != 1));
+
if (!kvm_para_available())
return;
/* Does host kernel support KVM_FEATURE_PV_UNHALT? */
-- 
1.8.3.1



Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-14 Thread Waiman Long
On 02/14/2017 04:39 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 05:34:01PM -0500, Waiman Long wrote:
>> It is the address of &steal_time that will exceed the 32-bit limit.
> That seems extremely unlikely. That would mean we have more than 4G
> worth of per-cpu variables declared in the kernel.

I have some doubt about whether the compiler is able to properly use
RIP-relative addressing for this. Anyway, it seems like constraints
aren't allowed for asm() when not in a function context, at least for
the compiler that I am using (4.8.5). So it is a moot point.
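
A small example of the limitation being described (illustrative only):
at file scope, only basic asm is accepted, so there is no operand list
to attach an "i" constraint to; inside a function, extended asm with
constraints works as usual.

/* File scope: basic asm only -- a plain string, no constraints. */
asm(".pushsection .text; .globl asm_demo; asm_demo: ret; .popsection");

/* Function scope: extended asm with constraints is fine. */
static inline int byte_is_nonzero(const unsigned char *p)
{
	unsigned char ret;

	asm("cmpb	$0, %[val]\n\t"
	    "setne	%[ret]"
	    : [ret] "=q" (ret)
	    : [val] "m" (*p));
	return ret;
}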

Cheers,
Longman




Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 04:52 PM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 03:12:45PM -0500, Waiman Long wrote:
>> On 02/13/2017 02:42 PM, Waiman Long wrote:
>>> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>>>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>>>> That way we'd end up with something like:
>>>>>
>>>>> asm("
>>>>> push %rdi;
>>>>> movslq %edi, %rdi;
>>>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>>>> cmpb $0, %[offset](%rax);
>>>>> setne %al;
>>>>> pop %rdi;
>>>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>>>>> steal_time, preempted)));
>>>>>
>>>>> And if we could get rid of the sign extend on edi we could avoid all the
>>>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>>>> this asm foo isn't my strongest point).
>>>> Maybe:
>>>>
>>>> movsql %edi, %rax;
>>>> movq __per_cpu_offset(,%rax,8), %rax;
>>>> cmpb $0, %[offset](%rax);
>>>> setne %al;
>>>>
>>>> ?
>>> Yes, that looks good to me.
>>>
>>> Cheers,
>>> Longman
>>>
>> Sorry, I am going to take it back. The displacement or offset can only
>> be up to 32-bit. So we will still need to use at least one more
>> register, I think.
> I don't think that would be a problem, I very much doubt we declare more
> than 4G worth of per-cpu variables in the kernel.
>
> In any case, use "e" or "Z" as constraint (I never quite know when to
> use which). That are s32 and u32 displacement immediates resp. and
> should fail compile with a semi-sensible failure if the displacement is
> too big.
>
It is the address of &steal_time that will exceed the 32-bit limit.

Cheers,
Longman



___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 03:06 PM, h...@zytor.com wrote:
> On February 13, 2017 2:53:43 AM PST, Peter Zijlstra  
> wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct
>> steal_time, preempted)));
>>> And if we could get rid of the sign extend on edi we could avoid all
>> the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> We could kill the zero or sign extend by changing the calling interface to 
> pass an unsigned long instead of an int.  It is much more likely that a zero 
> extend is free for the caller than a sign extend.

I have thought of that too. However, the goal is to eliminate memory
reads/writes from/to the stack. Eliminating a register sign-extend
instruction won't help much in terms of performance.

Cheers,
Longman



Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 02:42 PM, Waiman Long wrote:
> On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
>> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>>> That way we'd end up with something like:
>>>
>>> asm("
>>> push %rdi;
>>> movslq %edi, %rdi;
>>> movq __per_cpu_offset(,%rdi,8), %rax;
>>> cmpb $0, %[offset](%rax);
>>> setne %al;
>>> pop %rdi;
>>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>>> steal_time, preempted)));
>>>
>>> And if we could get rid of the sign extend on edi we could avoid all the
>>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>>> this asm foo isn't my strongest point).
>> Maybe:
>>
>> movsql %edi, %rax;
>> movq __per_cpu_offset(,%rax,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>>
>> ?
> Yes, that looks good to me.
>
> Cheers,
> Longman
>
Sorry, I am going to take it back. The displacement or offset can only
be up to 32 bits. So we will still need to use at least one more
register, I think.

Cheers,
Longman



Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 05:53 AM, Peter Zijlstra wrote:
> On Mon, Feb 13, 2017 at 11:47:16AM +0100, Peter Zijlstra wrote:
>> That way we'd end up with something like:
>>
>> asm("
>> push %rdi;
>> movslq %edi, %rdi;
>> movq __per_cpu_offset(,%rdi,8), %rax;
>> cmpb $0, %[offset](%rax);
>> setne %al;
>> pop %rdi;
>> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
>> steal_time, preempted)));
>>
>> And if we could get rid of the sign extend on edi we could avoid all the
>> push-pop nonsense, but I'm not sure I see how to do that (then again,
>> this asm foo isn't my strongest point).
> Maybe:
>
> movsql %edi, %rax;
> movq __per_cpu_offset(,%rax,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
>
> ?

Yes, that looks good to me.

Cheers,
Longman



Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-13 Thread Waiman Long
On 02/13/2017 05:47 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 12:00:43PM -0500, Waiman Long wrote:
>
>>>> +asm(
>>>> +".pushsection .text;"
>>>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>>>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>>>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>>>> +FRAME_BEGIN
>>>> +"push %rdi;"
>>>> +"push %rdx;"
>>>> +"movslq  %edi, %rdi;"
>>>> +"movq$steal_time+16, %rax;"
>>>> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
>>>> +"cmpb$0, (%rdx,%rax);"
> Could we not put the $steal_time+16 displacement as an immediate in the
> cmpb and save a whole register here?
>
> That way we'd end up with something like:
>
> asm("
> push %rdi;
> movslq %edi, %rdi;
> movq __per_cpu_offset(,%rdi,8), %rax;
> cmpb $0, %[offset](%rax);
> setne %al;
> pop %rdi;
> " : : [offset] "i" (((unsigned long)_time) + offsetof(struct 
> steal_time, preempted)));
>
> And if we could get rid of the sign extend on edi we could avoid all the
> push-pop nonsense, but I'm not sure I see how to do that (then again,
> this asm foo isn't my strongest point).

Yes, I think that can work. I will try to run this patch to see how
things go.

>>>> +"setne   %al;"
>>>> +"pop %rdx;"
>>>> +"pop %rdi;"
>>>> +FRAME_END
>>>> +"ret;"
>>>> +".popsection");
>>>> +
>>>> +#endif
>>>> +
>>>>  /*
>>>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>>>   */
>>> That should work for now. I have done something similar for
>>> __pv_queued_spin_unlock. However, this has the problem of creating a
>>> dependency on the exact layout of the steal_time structure. Maybe the
>>> constant 16 can be passed in as a parameter offsetof(struct
>>> kvm_steal_time, preempted) to the asm call.
> Yeah it should be well possible to pass that in. But ideally we'd have
> GCC grow something like __attribute__((callee_saved)) or somesuch and it
> would do all this for us.

That will be really nice too. I am not too fond of working in assembly.

>> One more thing, that will improve KVM performance, but it won't help Xen.
> People still use Xen? ;-) In any case, their implementation looks very
> similar and could easily crib this.

At Red Hat, my focus will be on KVM performance. I do believe that there
are still Xen users out there, so we still need to take their interests
into consideration. Given that, I am OK with making it work better in KVM
first and then thinking about Xen later.

>> I looked into the assembly code for rwsem_spin_on_owner; it needs to
>> save and restore 2 additional registers with my patch. Doing it your
>> way will transfer the save and restore overhead to the assembly code.
>> However, __kvm_vcpu_is_preempted() is called multiple times per
>> invocation of rwsem_spin_on_owner. That function is simple enough that
>> making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
>> optimization opportunity.
> This is because of that noinline, right? Otherwise it would've been
> folded and register pressure would be much higher.

Yes, I guess so. The noinline is there so that we know how much CPU time
goes to spinning rather than to other activities within the slowpath.

>
>> The outer function rwsem_down_write_failed()
>> does appear to be a bit bigger (from 866 bytes to 884 bytes) though.
> I suspect GCC is being clever and since all this is static it plays
> games with the calling convention and pushes these clobbers out.
>
>

Cheers,
Longman



Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
On 02/10/2017 11:35 AM, Waiman Long wrote:
> On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
>> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>>> It was found that when running a fio sequential write test with an XFS
>>> ramdisk on a VM running on a 2-socket x86-64 system, the %CPU times as
>>> reported by perf were as follows:
>>>
>>>  69.75%  0.59%  fio  [k] down_write
>>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>>  63.48% 52.77%  fio  [k] osq_lock
>>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>>
>> Thinking about this again, wouldn't something like the below also work?
>>
>>
>> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
>> index 099fcba4981d..6aa33702c15c 100644
>> --- a/arch/x86/kernel/kvm.c
>> +++ b/arch/x86/kernel/kvm.c
>> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>>  local_irq_restore(flags);
>>  }
>>  
>> +#ifdef CONFIG_X86_32
>>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  {
>>  struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
>> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>>  }
>>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>>  
>> +#else
>> +
>> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
>> +
>> +asm(
>> +".pushsection .text;"
>> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
>> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
>> +"__raw_callee_save___kvm_vcpu_is_preempted:"
>> +FRAME_BEGIN
>> +"push %rdi;"
>> +"push %rdx;"
>> +"movslq  %edi, %rdi;"
>> +"movq$steal_time+16, %rax;"
>> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
>> +"cmpb$0, (%rdx,%rax);"
>> +"setne   %al;"
>> +"pop %rdx;"
>> +"pop %rdi;"
>> +FRAME_END
>> +"ret;"
>> +".popsection");
>> +
>> +#endif
>> +
>>  /*
>>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>>   */
> That should work for now. I have done something similar for
> __pv_queued_spin_unlock. However, this has the problem of creating a
> dependency on the exact layout of the steal_time structure. Maybe the
> constant 16 can be passed in as a parameter offsetof(struct
> kvm_steal_time, preempted) to the asm call.
>
> Cheers,
> Longman

One more thing: that will improve KVM performance, but it won't help Xen.

I looked into the assembly code for rwsem_spin_on_owner; it needs to save
and restore 2 additional registers with my patch. Doing it your way
will transfer the save and restore overhead to the assembly code.
However, __kvm_vcpu_is_preempted() is called multiple times per
invocation of rwsem_spin_on_owner. That function is simple enough that
making __kvm_vcpu_is_preempted() callee-save won't produce much compiler
optimization opportunity. The outer function rwsem_down_write_failed()
does appear to be a bit bigger (from 866 bytes to 884 bytes) though.

Cheers,
Longman




Re: [PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
On 02/10/2017 11:19 AM, Peter Zijlstra wrote:
> On Fri, Feb 10, 2017 at 10:43:09AM -0500, Waiman Long wrote:
>> It was found that when running a fio sequential write test with an XFS
>> ramdisk on a VM running on a 2-socket x86-64 system, the %CPU times as
>> reported by perf were as follows:
>>
>>  69.75%  0.59%  fio  [k] down_write
>>  69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
>>  67.12%  1.12%  fio  [k] rwsem_down_write_failed
>>  63.48% 52.77%  fio  [k] osq_lock
>>   9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
>>   3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted
>>
> Thinking about this again, wouldn't something like the below also work?
>
>
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 099fcba4981d..6aa33702c15c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -589,6 +589,7 @@ static void kvm_wait(u8 *ptr, u8 val)
>   local_irq_restore(flags);
>  }
>  
> +#ifdef CONFIG_X86_32
>  __visible bool __kvm_vcpu_is_preempted(int cpu)
>  {
>   struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
> @@ -597,6 +598,31 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
>  }
>  PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
>  
> +#else
> +
> +extern bool __raw_callee_save___kvm_vcpu_is_preempted(int);
> +
> +asm(
> +".pushsection .text;"
> +".global __raw_callee_save___kvm_vcpu_is_preempted;"
> +".type __raw_callee_save___kvm_vcpu_is_preempted, @function;"
> +"__raw_callee_save___kvm_vcpu_is_preempted:"
> +FRAME_BEGIN
> +"push %rdi;"
> +"push %rdx;"
> +"movslq  %edi, %rdi;"
> +"movq$steal_time+16, %rax;"
> +"movq__per_cpu_offset(,%rdi,8), %rdx;"
> +"cmpb$0, (%rdx,%rax);"
> +"setne   %al;"
> +"pop %rdx;"
> +"pop %rdi;"
> +FRAME_END
> +"ret;"
> +".popsection");
> +
> +#endif
> +
>  /*
>   * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
>   */

That should work for now. I have done something similar for
__pv_queued_spin_unlock. However, this has the problem of creating a
dependency on the exact layout of the steal_time structure. Maybe the
constant 16 can be passed in as a parameter offsetof(struct
kvm_steal_time, preempted) to the asm call.
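
Something along those lines seems workable in a function-scope asm(); a
minimal user-space sketch follows, with an illustrative struct layout
standing in for kvm_steal_time (the %c output modifier prints the "i"
constant without the '$' so it can be used as a displacement):

#include <stddef.h>
#include <stdio.h>

struct steal_time {
        unsigned long steal;            /* illustrative fields only */
        unsigned long version;
        unsigned char preempted;
};

static int is_preempted(const struct steal_time *st)
{
        int ret;

        asm("cmpb   $0, %c[off](%[base])\n\t"   /* displacement comes from offsetof() */
            "setne  %b[ret]\n\t"
            "movzbl %b[ret], %[ret]"
            : [ret]  "=&r" (ret)
            : [base] "r" (st),
              [off]  "i" (offsetof(struct steal_time, preempted)),
              "m" (*st)                 /* tell the compiler we read *st */
            : "cc");
        return ret;
}

int main(void)
{
        struct steal_time st = { .preempted = 1 };

        printf("%d\n", is_preempted(&st));      /* prints 1 */
        return 0;
}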

Cheers,
Longman




[PATCH v2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-10 Thread Waiman Long
It was found that when running a fio sequential write test with an XFS
ramdisk on a VM running on a 2-socket x86-64 system, the %CPU times as
reported by perf were as follows:

 69.75%  0.59%  fio  [k] down_write
 69.15%  0.01%  fio  [k] call_rwsem_down_write_failed
 67.12%  1.12%  fio  [k] rwsem_down_write_failed
 63.48% 52.77%  fio  [k] osq_lock
  9.46%  7.88%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempt
  3.93%  3.93%  fio  [k] __kvm_vcpu_is_preempted

Making vcpu_is_preempted() a callee-save function has a relatively
high cost on x86-64 primarily due to at least one more cacheline of
data access from the saving and restoring of registers (8 of them)
to and from the stack, as well as one more level of function call. As
vcpu_is_preempted() is called within the spinlock, mutex and rwsem
slowpaths, there isn't much to gain by making it callee-save. So it
is now changed to a normal function call instead.
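
To make that cost concrete, here is a rough user-space sketch of what such a
callee-save thunk has to do on x86-64: spill every caller-clobbered GPR
around a call into the plain C helper. The register list, alignment padding
and helper are illustrative only; the kernel's PV_CALLEE_SAVE_REGS_THUNK()
differs in detail (frame annotations, exact register set):

#include <stdio.h>

int helper(int cpu)                 /* plain C function with the normal ABI */
{
        return cpu & 1;             /* pretend "is this vCPU preempted?" */
}

extern int thunk(int cpu);          /* same work, but preserves caller-saved regs */

asm(".pushsection .text\n"
    ".globl thunk\n"
    "thunk:\n"
    "  push %rcx\n  push %rdx\n  push %rsi\n  push %rdi\n"
    "  push %r8\n   push %r9\n   push %r10\n  push %r11\n"
    "  sub  $8, %rsp\n"             /* keep the stack 16-byte aligned for the call */
    "  call helper\n"               /* result stays in %eax */
    "  add  $8, %rsp\n"
    "  pop  %r11\n  pop  %r10\n  pop  %r9\n   pop  %r8\n"
    "  pop  %rdi\n  pop  %rsi\n  pop  %rdx\n  pop  %rcx\n"
    "  ret\n"
    ".popsection");

int main(void)
{
        printf("%d %d\n", thunk(2), thunk(3));   /* prints "0 1" */
        return 0;
}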

With this patch applied on both bare metal and a KVM guest on a 2-socket
16-core 32-thread system with 16 parallel jobs (8 on each socket), the
aggregate bandwidth of the fio test on an XFS ramdisk was as follows:

                 Bare Metal                   KVM Guest
   I/O Type      w/o patch    with patch   w/o patch    with patch
   ------------  -----------  -----------  -----------  -----------
   random read   8650.5 MB/s  8560.9 MB/s  7602.9 MB/s  8196.1 MB/s
   seq read      9104.8 MB/s  9397.2 MB/s  8293.7 MB/s  8566.9 MB/s
   random write  1623.8 MB/s  1626.7 MB/s  1590.6 MB/s  1700.7 MB/s
   seq write     1626.4 MB/s  1624.9 MB/s  1604.8 MB/s  1726.3 MB/s

The perf data (on KVM guest) now became:

 70.78%  0.58%  fio  [k] down_write
 70.20%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.70%  1.17%  fio  [k] rwsem_down_write_failed
 59.91% 55.42%  fio  [k] osq_lock
 10.14% 10.14%  fio  [k] __kvm_vcpu_is_preempted

On bare metal, the patch doesn't introduce any performance
regression. On KVM guest, it produces noticeable performance
improvement (up to 7%).

Signed-off-by: Waiman Long <long...@redhat.com>
---
 v1->v2:
  - Rerun the fio test on a different system on both bare-metal and a
    KVM guest. Both sockets were utilized in this test.
  - The commit log was updated with new performance numbers, but the
    patch wasn't changed.
  - Drop patch 2.

 arch/x86/include/asm/paravirt.h   | 2 +-
 arch/x86/include/asm/paravirt_types.h | 2 +-
 arch/x86/kernel/kvm.c | 7 ++-
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 arch/x86/xen/spinlock.c   | 4 +---
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2515885 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu)
 
 static __always_inline bool pv_vcpu_is_preempted(int cpu)
 {
-   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
+   return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index bb2de45..88dc852 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -309,7 +309,7 @@ struct pv_lock_ops {
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
 
-   struct paravirt_callee_save vcpu_is_preempted;
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..eb3753d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
 
return !!src->preempted;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void)
pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu;
 
-   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-   pv_lock_ops.vcpu_is_preempted =
-   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-   }
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME))
+   pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted;
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..da050bc 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu)
 {
return false;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
 
 bool pv_is_native_vcpu_is_preempted(void)
 {
-   return pv_lock_ops.vcpu_is_preempted.func ==
-   __raw_callee_save___

Re: [PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-08 Thread Waiman Long
On 02/08/2017 02:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:00:24PM -0500, Waiman Long wrote:
>> It was found that when running a fio sequential write test with an XFS
>> ramdisk on a 2-socket x86-64 system, the %CPU times as reported by perf
>> were as follows:
>>
>>  71.27%  0.28%  fio  [k] down_write
>>  70.99%  0.01%  fio  [k] call_rwsem_down_write_failed
>>  69.43%  1.18%  fio  [k] rwsem_down_write_failed
>>  65.51% 54.57%  fio  [k] osq_lock
>>   9.72%  7.99%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempted
>>   4.16%  4.16%  fio  [k] __kvm_vcpu_is_preempted
>>
>> So making vcpu_is_preempted() a callee-save function has a pretty high
>> cost associated with it. As vcpu_is_preempted() is called within the
>> spinlock, mutex and rwsem slowpaths, there isn't much to gain by making
>> it callee-save. So it is now changed to a normal function call instead.
>>
> Numbers for bare metal too please.

I will run the test on bare metal, but I doubt there will be a noticeable
difference.

Cheers,
Longman



Re: [PATCH 2/2] locking/mutex,rwsem: Reduce vcpu_is_preempted() calling frequency

2017-02-08 Thread Waiman Long
On 02/08/2017 02:05 PM, Peter Zijlstra wrote:
> On Wed, Feb 08, 2017 at 01:00:25PM -0500, Waiman Long wrote:
>> As the vcpu_is_preempted() call is pretty costly compared with other
>> checks within mutex_spin_on_owner() and rwsem_spin_on_owner(), it is
>> now done at a reduced frequency of once every 256 iterations.
> That's just disgusting.

I do have some doubts myself about the effectiveness of this patch. Anyway,
it is the first patch that I think is the more beneficial one.
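
For reference, a minimal user-space sketch of the shape of that change; the
loop body and the stand-in predicate below are illustrative, not the actual
mutex_spin_on_owner()/rwsem_spin_on_owner() code:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the (comparatively expensive) vcpu_is_preempted() pvops call */
static bool vcpu_is_preempted(int cpu)
{
        (void)cpu;
        return false;
}

/* Spin while the owner is running, but only pay for the preemption check
 * once every 256 iterations instead of on every pass through the loop. */
static bool spin_on_owner(int owner_cpu, volatile int *owner_running)
{
        unsigned int loop = 0;

        while (*owner_running) {
                if (!(++loop & 0xff) && vcpu_is_preempted(owner_cpu))
                        return false;   /* owner's vCPU was preempted: stop spinning */
        }
        return true;                    /* owner released the lock */
}

int main(void)
{
        int owner_running = 0;          /* owner already gone: loop exits at once */

        printf("%d\n", spin_on_owner(0, &owner_running));   /* prints 1 */
        return 0;
}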

Cheers,
Longman



[PATCH 1/2] x86/paravirt: Don't make vcpu_is_preempted() a callee-save function

2017-02-08 Thread Waiman Long
It was found that when running a fio sequential write test with an XFS
ramdisk on a 2-socket x86-64 system, the %CPU times as reported by perf
were as follows:

 71.27%  0.28%  fio  [k] down_write
 70.99%  0.01%  fio  [k] call_rwsem_down_write_failed
 69.43%  1.18%  fio  [k] rwsem_down_write_failed
 65.51% 54.57%  fio  [k] osq_lock
  9.72%  7.99%  fio  [k] __raw_callee_save___kvm_vcpu_is_preempted
  4.16%  4.16%  fio  [k] __kvm_vcpu_is_preempted

So making vcpu_is_preempted() a callee-save function has a pretty high
cost associated with it. As vcpu_is_preempted() is called within the
spinlock, mutex and rwsem slowpaths, there isn't much to gain by making
it callee-save. So it is now changed to a normal function call instead.

With this patch applied, the aggregate bandwidth of the fio sequential
write test increased slightly from 2563.3 MB/s to 2588.1 MB/s (about 1%).

Signed-off-by: Waiman Long <long...@redhat.com>
---
 arch/x86/include/asm/paravirt.h   | 2 +-
 arch/x86/include/asm/paravirt_types.h | 2 +-
 arch/x86/kernel/kvm.c | 7 ++-
 arch/x86/kernel/paravirt-spinlocks.c  | 6 ++
 arch/x86/xen/spinlock.c   | 4 +---
 5 files changed, 7 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 864f57b..2515885 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -676,7 +676,7 @@ static __always_inline void pv_kick(int cpu)
 
 static __always_inline bool pv_vcpu_is_preempted(int cpu)
 {
-   return PVOP_CALLEE1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
+   return PVOP_CALL1(bool, pv_lock_ops.vcpu_is_preempted, cpu);
 }
 
 #endif /* SMP && PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/include/asm/paravirt_types.h 
b/arch/x86/include/asm/paravirt_types.h
index bb2de45..88dc852 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -309,7 +309,7 @@ struct pv_lock_ops {
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
 
-   struct paravirt_callee_save vcpu_is_preempted;
+   bool (*vcpu_is_preempted)(int cpu);
 };
 
 /* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 099fcba..eb3753d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -595,7 +595,6 @@ __visible bool __kvm_vcpu_is_preempted(int cpu)
 
return !!src->preempted;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
 
 /*
  * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
@@ -614,10 +613,8 @@ void __init kvm_spinlock_init(void)
pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu;
 
-   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
-   pv_lock_ops.vcpu_is_preempted =
-   PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
-   }
+   if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME))
+   pv_lock_ops.vcpu_is_preempted = __kvm_vcpu_is_preempted;
 }
 
 #endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
b/arch/x86/kernel/paravirt-spinlocks.c
index 6259327..da050bc 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -24,12 +24,10 @@ __visible bool __native_vcpu_is_preempted(int cpu)
 {
return false;
 }
-PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
 
 bool pv_is_native_vcpu_is_preempted(void)
 {
-   return pv_lock_ops.vcpu_is_preempted.func ==
-   __raw_callee_save___native_vcpu_is_preempted;
+   return pv_lock_ops.vcpu_is_preempted == __native_vcpu_is_preempted;
 }
 
 struct pv_lock_ops pv_lock_ops = {
@@ -38,7 +36,7 @@ struct pv_lock_ops pv_lock_ops = {
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
-   .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted),
+   .vcpu_is_preempted = __native_vcpu_is_preempted,
 #endif /* SMP */
 };
 EXPORT_SYMBOL(pv_lock_ops);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 25a7c43..c85bb8f 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,8 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
 }
 
-PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen);
-
 /*
  * Our init of PV spinlocks is split in two init functions due to us
  * using paravirt patching and jump labels patching and having to do
@@ -138,7 +136,7 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = 
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
-   pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
+   pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
 }
 
 static __init int xen_parse_nopvspin(char 
