[RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Peter Zijlstra
On Wed, Apr 14, 2021 at 11:05:24AM +0200, Peter Zijlstra wrote:

> That made me look at the qspinlock code, and queued_spin_*lock() uses
> atomic_try_cmpxchg_acquire(), which means any arch that uses qspinlock
> and has RCpc atomics will give us massive pain.
> 
> Current archs using qspinlock are: x86, arm64, power, sparc64, mips and
> openrisc (WTF?!).
> 
> Of those, x86 and sparc are TSO archs with SC atomics, arm64 has RCsc
> atomics, power has RCtso atomics (and is the arch we all hate for having
> RCtso locks).
> 
> Now MIPS has all sorts of ill specified barriers, but last time looked
> at it it didn't actually use any of that and stuck to using smp_mb(), so
> it will have RCsc atomics.
> 
> /me goes look at wth openrisc is..  doesn't even appear to have
> asm/barrier.h :-/ Looking at wikipedia it also doesn't appear to
> actually have hardware ...

FWIW this is broken, anything SMP *MUST* define mb(), at the very least.

> I'm thinking openrisc is a prime candidate for this ticket_lock.h we're
> all talking about.

How's this then? Compile tested only on openrisc/simple_smp_defconfig.

---
 arch/openrisc/Kconfig  |  1 -
 arch/openrisc/include/asm/Kbuild   |  5 +-
 arch/openrisc/include/asm/spinlock.h   |  3 +-
 arch/openrisc/include/asm/spinlock_types.h |  2 +-
 include/asm-generic/qspinlock.h| 30 +++
 include/asm-generic/ticket-lock-types.h| 11 
 include/asm-generic/ticket-lock.h  | 86 ++
 7 files changed, 131 insertions(+), 7 deletions(-)

diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
index 591acc5990dc..1858cf309f1f 100644
--- a/arch/openrisc/Kconfig
+++ b/arch/openrisc/Kconfig
@@ -32,7 +32,6 @@ config OPENRISC
select HAVE_DEBUG_STACKOVERFLOW
select OR1K_PIC
select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
-   select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_RWLOCKS
select OMPIC if SMP
select ARCH_WANT_FRAME_POINTERS
diff --git a/arch/openrisc/include/asm/Kbuild b/arch/openrisc/include/asm/Kbuild
index ca5987e11053..cb260e7d73db 100644
--- a/arch/openrisc/include/asm/Kbuild
+++ b/arch/openrisc/include/asm/Kbuild
@@ -1,9 +1,8 @@
 # SPDX-License-Identifier: GPL-2.0
 generic-y += extable.h
 generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
-generic-y += qspinlock_types.h
-generic-y += qspinlock.h
+generic-y += ticket-lock.h
+generic-y += ticket-lock-types.h
 generic-y += qrwlock_types.h
 generic-y += qrwlock.h
 generic-y += user.h
diff --git a/arch/openrisc/include/asm/spinlock.h 
b/arch/openrisc/include/asm/spinlock.h
index a8940bdfcb7e..0b839ed1f3a0 100644
--- a/arch/openrisc/include/asm/spinlock.h
+++ b/arch/openrisc/include/asm/spinlock.h
@@ -15,8 +15,7 @@
 #ifndef __ASM_OPENRISC_SPINLOCK_H
 #define __ASM_OPENRISC_SPINLOCK_H
 
-#include 
-
+#include 
 #include 
 
 #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
diff --git a/arch/openrisc/include/asm/spinlock_types.h 
b/arch/openrisc/include/asm/spinlock_types.h
index 7c6fb1208c88..58ea31fa65ce 100644
--- a/arch/openrisc/include/asm/spinlock_types.h
+++ b/arch/openrisc/include/asm/spinlock_types.h
@@ -1,7 +1,7 @@
 #ifndef _ASM_OPENRISC_SPINLOCK_TYPES_H
 #define _ASM_OPENRISC_SPINLOCK_TYPES_H
 
-#include 
+#include 
 #include 
 
 #endif /* _ASM_OPENRISC_SPINLOCK_TYPES_H */
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index d74b13825501..a7a1296b0b4d 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -2,6 +2,36 @@
 /*
  * Queued spinlock
  *
+ * A 'generic' spinlock implementation that is based on MCS locks. An
+ * architecture that's looking for a 'generic' spinlock, please first consider
+ * ticket-lock.h and only come looking here when you've considered all the
+ * constraints below and can show your hardware does actually perform better
+ * with qspinlock.
+ *
+ *
+ * It relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no weaker
+ * than RCtso if you're power), where regular code only expects atomic_t to be
+ * RCpc.
+ *
+ * It relies on a far greater (compared to ticket-lock.h) set of atomic
+ * operations to behave well together, please audit them carefully to ensure
+ * they all have forward progress. Many atomic operations may default to
+ * cmpxchg() loops which will not have good forward progress properties on
+ * LL/SC architectures.
+ *
+ * One notable example is atomic_fetch_or_acquire(), which x86 cannot (cheaply)
+ * do. Carefully read the patches that introduced 
queued_fetch_set_pending_acquire().
+ *
+ * It also heavily relies on mixed size atomic operations, in specific it
+ * requires architectures to have xchg16; something which many LL/SC
+ * architectures need to implement as a 32bit and+or in order to satisfy the
+ * forward progress guarantees mentioned above.
+ *
+ * Further reading on mixed size atomics that might be relevant:
+ *
+ *   http://www.cl.

Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Guo Ren
On Wed, Apr 14, 2021 at 6:16 PM Peter Zijlstra  wrote:
>
> On Wed, Apr 14, 2021 at 11:05:24AM +0200, Peter Zijlstra wrote:
>
> > That made me look at the qspinlock code, and queued_spin_*lock() uses
> > atomic_try_cmpxchg_acquire(), which means any arch that uses qspinlock
> > and has RCpc atomics will give us massive pain.
> >
> > Current archs using qspinlock are: x86, arm64, power, sparc64, mips and
> > openrisc (WTF?!).
> >
> > Of those, x86 and sparc are TSO archs with SC atomics, arm64 has RCsc
> > atomics, power has RCtso atomics (and is the arch we all hate for having
> > RCtso locks).
> >
> > Now MIPS has all sorts of ill specified barriers, but last time looked
> > at it it didn't actually use any of that and stuck to using smp_mb(), so
> > it will have RCsc atomics.
> >
> > /me goes look at wth openrisc is..  doesn't even appear to have
> > asm/barrier.h :-/ Looking at wikipedia it also doesn't appear to
> > actually have hardware ...
>
> FWIW this is broken, anything SMP *MUST* define mb(), at the very least.
>
> > I'm thinking openrisc is a prime candidate for this ticket_lock.h we're
> > all talking about.
>
> How's this then? Compile tested only on openrisc/simple_smp_defconfig.
I've tested it on csky SMP*4 hw (860) & riscv SMP*4 hw (c910) and it's okay.

Hope you can keep
typedef struct {
union {
atomic_t lock;
struct __raw_tickets {
#ifdef __BIG_ENDIAN
u16 next;
u16 owner;
#else
u16 owner;
u16 next;
#endif
} tickets;
};
} arch_spinlock_t;

Using owner & next is much more readable.

>
> ---
>  arch/openrisc/Kconfig  |  1 -
>  arch/openrisc/include/asm/Kbuild   |  5 +-
>  arch/openrisc/include/asm/spinlock.h   |  3 +-
>  arch/openrisc/include/asm/spinlock_types.h |  2 +-
>  include/asm-generic/qspinlock.h| 30 +++
>  include/asm-generic/ticket-lock-types.h| 11 
>  include/asm-generic/ticket-lock.h  | 86 
> ++
>  7 files changed, 131 insertions(+), 7 deletions(-)
>
> diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
> index 591acc5990dc..1858cf309f1f 100644
> --- a/arch/openrisc/Kconfig
> +++ b/arch/openrisc/Kconfig
> @@ -32,7 +32,6 @@ config OPENRISC
> select HAVE_DEBUG_STACKOVERFLOW
> select OR1K_PIC
> select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
> -   select ARCH_USE_QUEUED_SPINLOCKS
> select ARCH_USE_QUEUED_RWLOCKS
> select OMPIC if SMP
> select ARCH_WANT_FRAME_POINTERS
> diff --git a/arch/openrisc/include/asm/Kbuild 
> b/arch/openrisc/include/asm/Kbuild
> index ca5987e11053..cb260e7d73db 100644
> --- a/arch/openrisc/include/asm/Kbuild
> +++ b/arch/openrisc/include/asm/Kbuild
> @@ -1,9 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
>  generic-y += extable.h
>  generic-y += kvm_para.h
> -generic-y += mcs_spinlock.h
> -generic-y += qspinlock_types.h
> -generic-y += qspinlock.h
> +generic-y += ticket-lock.h
> +generic-y += ticket-lock-types.h
>  generic-y += qrwlock_types.h
>  generic-y += qrwlock.h
>  generic-y += user.h
> diff --git a/arch/openrisc/include/asm/spinlock.h 
> b/arch/openrisc/include/asm/spinlock.h
> index a8940bdfcb7e..0b839ed1f3a0 100644
> --- a/arch/openrisc/include/asm/spinlock.h
> +++ b/arch/openrisc/include/asm/spinlock.h
> @@ -15,8 +15,7 @@
>  #ifndef __ASM_OPENRISC_SPINLOCK_H
>  #define __ASM_OPENRISC_SPINLOCK_H
>
> -#include 
> -
> +#include 
>  #include 
>
>  #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
> diff --git a/arch/openrisc/include/asm/spinlock_types.h 
> b/arch/openrisc/include/asm/spinlock_types.h
> index 7c6fb1208c88..58ea31fa65ce 100644
> --- a/arch/openrisc/include/asm/spinlock_types.h
> +++ b/arch/openrisc/include/asm/spinlock_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_OPENRISC_SPINLOCK_TYPES_H
>  #define _ASM_OPENRISC_SPINLOCK_TYPES_H
>
> -#include 
> +#include 
>  #include 
>
>  #endif /* _ASM_OPENRISC_SPINLOCK_TYPES_H */
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index d74b13825501..a7a1296b0b4d 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -2,6 +2,36 @@
>  /*
>   * Queued spinlock
>   *
> + * A 'generic' spinlock implementation that is based on MCS locks. An
> + * architecture that's looking for a 'generic' spinlock, please first 
> consider
> + * ticket-lock.h and only come looking here when you've considered all the
> + * constraints below and can show your hardware does actually perform better
> + * with qspinlock.
> + *
> + *
> + * It relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no 
> weaker
> + * than RCtso if you're power), where regular code only expects atomic_t to 
> be
> + * RCpc.
> + *
> + * It relies on a far greater (compared to ticket-lock.h) set of atomic
> + * operations to behave well together,

Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Peter Zijlstra
On Wed, Apr 14, 2021 at 12:16:38PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 14, 2021 at 11:05:24AM +0200, Peter Zijlstra wrote:
> 
> > That made me look at the qspinlock code, and queued_spin_*lock() uses
> > atomic_try_cmpxchg_acquire(), which means any arch that uses qspinlock
> > and has RCpc atomics will give us massive pain.
> > 
> > Current archs using qspinlock are: x86, arm64, power, sparc64, mips and
> > openrisc (WTF?!).
> > 
> > Of those, x86 and sparc are TSO archs with SC atomics, arm64 has RCsc
> > atomics, power has RCtso atomics (and is the arch we all hate for having
> > RCtso locks).
> > 
> > Now MIPS has all sorts of ill specified barriers, but last time looked
> > at it it didn't actually use any of that and stuck to using smp_mb(), so
> > it will have RCsc atomics.
> > 
> > /me goes look at wth openrisc is..  doesn't even appear to have
> > asm/barrier.h :-/ Looking at wikipedia it also doesn't appear to
> > actually have hardware ...
> 
> FWIW this is broken, anything SMP *MUST* define mb(), at the very least.

As near as I can tell this should do. The arch spec only lists this one
instruction and the text makes it sound like a completion barrier.

---
 arch/openrisc/include/asm/barrier.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/arch/openrisc/include/asm/barrier.h 
b/arch/openrisc/include/asm/barrier.h
new file mode 100644
index ..7538294721be
--- /dev/null
+++ b/arch/openrisc/include/asm/barrier.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __ASM_BARRIER_H
+#define __ASM_BARRIER_H
+
+#define mb() asm volatile ("l.msync" ::: "memory")
+
+#include 
+
+#endif /* __ASM_BARRIER_H */


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Peter Zijlstra
On Wed, Apr 14, 2021 at 08:39:33PM +0800, Guo Ren wrote:

> I've tested it on csky SMP*4 hw (860) & riscv SMP*4 hw (c910) and it's okay.

W00t :-)

> Hope you can keep
> typedef struct {
> union {
> atomic_t lock;
> struct __raw_tickets {
> #ifdef __BIG_ENDIAN
> u16 next;
> u16 owner;
> #else
> u16 owner;
> u16 next;
> #endif
> } tickets;
> };
> } arch_spinlock_t;
> 
> Using owner & next is much more readable.

That almost doubles the line-count of the thing ;-)


> > + * It further assumes atomic_*_release() + atomic_*_acquire() is RCpc and 
> > hence
> > + * uses atomic_fetch_add() which is SC to create an RCsc lock.

This ^^^ then vvv

> > +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> > +{
> > +   u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */
> atomic_fetch_add_acquire ?

Then we must rely on the arch to implement RCsc atomics. And I for one
can never tell wth Risc-V actually does.

> > +static __always_inline int ticket_is_locked(arch_spinlock_t *lock)
> > +{
> > +   u32 val = atomic_read(lock);
> > +
> > +   return ((val >> 16) != (val & 0x));
> I perfer:
> return !arch_spin_value_unlocked(READ_ONCE(*lock));
> > +}

> > +}
> > +
> > +static __always_inline int ticket_value_unlocked(arch_spinlock_t lock)
> > +{
> > +   return !ticket_is_locked(&lock);
> Are you sure to let ticket_is_locked->atomic_read(lock) again, the
> lock has contained all information?
> 
> return lock.tickets.owner == lock.tickets.next;

Yeah, I wrote then the wrong way around. Couldn't be bothered to go back
when I figured it out.

> > +
> > +static __always_inline int ticket_is_contended(arch_spinlock_t *lock)
> > +{
> > +   u32 val = atomic_read(lock);
> > +
> > +   return (s16)((val >> 16) - (val & 0x)) > 1;
> How big-endian ?

How not? Endian-ness only matters when you go poke at sub-words, which
the above does not. Only ticket_unlock() does and cares about that.



Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Peter Zijlstra
On Wed, Apr 14, 2021 at 02:55:57PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 14, 2021 at 08:39:33PM +0800, Guo Ren wrote:

> > > + * It further assumes atomic_*_release() + atomic_*_acquire() is RCpc 
> > > and hence
> > > + * uses atomic_fetch_add() which is SC to create an RCsc lock.
> 
> This ^^^ then vvv
> 
> > > +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> > > +{
> > > +   u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */
> > atomic_fetch_add_acquire ?
> 
> Then we must rely on the arch to implement RCsc atomics. And I for one
> can never tell wth Risc-V actually does.

Anyway, if we can mandate that atomic acquire/release is RCsc, then yes
we can do that but then we also need:

#define smp_mb__after_spinlock()smp_mb__after_atomic()

But currently atomic acquire/release is not RCsc (at the very least
Power does RCtso -- and power gets to deal with any and all pain caused
by that).


RE: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread David Laight
From: Peter Zijlstra
> Sent: 14 April 2021 13:56
> 
> > I've tested it on csky SMP*4 hw (860) & riscv SMP*4 hw (c910) and it's okay.
> 
> W00t :-)
> 
> > Hope you can keep
> > typedef struct {
> > union {
> > atomic_t lock;
> > struct __raw_tickets {
> > #ifdef __BIG_ENDIAN
> > u16 next;
> > u16 owner;
> > #else
> > u16 owner;
> > u16 next;
> > #endif
> > } tickets;
> > };
> > } arch_spinlock_t;
> >
> > Using owner & next is much more readable.
> 
> That almost doubles the line-count of the thing ;-)

And relies on the compiler not ever spilling it to stack.
Which it is much more likely to do that for the version
that uses shifts.

Have you checked what happens with -O0 ?
I don't think that is banned by the build system.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Stafford Horne
On Wed, Apr 14, 2021 at 12:16:38PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 14, 2021 at 11:05:24AM +0200, Peter Zijlstra wrote:
> 
> > That made me look at the qspinlock code, and queued_spin_*lock() uses
> > atomic_try_cmpxchg_acquire(), which means any arch that uses qspinlock
> > and has RCpc atomics will give us massive pain.
> > 
> > Current archs using qspinlock are: x86, arm64, power, sparc64, mips and
> > openrisc (WTF?!).
> > 
> > Of those, x86 and sparc are TSO archs with SC atomics, arm64 has RCsc
> > atomics, power has RCtso atomics (and is the arch we all hate for having
> > RCtso locks).
> > 
> > Now MIPS has all sorts of ill specified barriers, but last time looked
> > at it it didn't actually use any of that and stuck to using smp_mb(), so
> > it will have RCsc atomics.
> > 
> > /me goes look at wth openrisc is..  doesn't even appear to have
> > asm/barrier.h :-/ Looking at wikipedia it also doesn't appear to
> > actually have hardware ...

Yes, not hardware available to consumers directoy, my development is done on
FPGAs.

> FWIW this is broken, anything SMP *MUST* define mb(), at the very least.

Oh, thats right, something missed, when we developed qspinlocks we discussed
this and my point there was that l.swa/l.lwa implied a mem flush
l.msync/barrier.  But mb still needs to be added.

> > I'm thinking openrisc is a prime candidate for this ticket_lock.h we're
> > all talking about.
> 
> How's this then? Compile tested only on openrisc/simple_smp_defconfig.

I did my testing with this FPGA build SoC:

 https://github.com/stffrdhrn/de0_nano-multicore

Note, the CPU timer sync logic uses mb() and is a bit flaky.  So missing mb()
might be a reason.  I thought we had defined mb() and l.msync, but it seems to
have gotten lost.

With that said I could test out this ticket-lock implementation.  How would I
tell if its better than qspinlock?

> ---
>  arch/openrisc/Kconfig  |  1 -
>  arch/openrisc/include/asm/Kbuild   |  5 +-
>  arch/openrisc/include/asm/spinlock.h   |  3 +-
>  arch/openrisc/include/asm/spinlock_types.h |  2 +-
>  include/asm-generic/qspinlock.h| 30 +++
>  include/asm-generic/ticket-lock-types.h| 11 
>  include/asm-generic/ticket-lock.h  | 86 
> ++
>  7 files changed, 131 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/openrisc/Kconfig b/arch/openrisc/Kconfig
> index 591acc5990dc..1858cf309f1f 100644
> --- a/arch/openrisc/Kconfig
> +++ b/arch/openrisc/Kconfig
> @@ -32,7 +32,6 @@ config OPENRISC
>   select HAVE_DEBUG_STACKOVERFLOW
>   select OR1K_PIC
>   select CPU_NO_EFFICIENT_FFS if !OPENRISC_HAVE_INST_FF1
> - select ARCH_USE_QUEUED_SPINLOCKS
>   select ARCH_USE_QUEUED_RWLOCKS
>   select OMPIC if SMP
>   select ARCH_WANT_FRAME_POINTERS
> diff --git a/arch/openrisc/include/asm/Kbuild 
> b/arch/openrisc/include/asm/Kbuild
> index ca5987e11053..cb260e7d73db 100644
> --- a/arch/openrisc/include/asm/Kbuild
> +++ b/arch/openrisc/include/asm/Kbuild
> @@ -1,9 +1,8 @@
>  # SPDX-License-Identifier: GPL-2.0
>  generic-y += extable.h
>  generic-y += kvm_para.h
> -generic-y += mcs_spinlock.h
> -generic-y += qspinlock_types.h
> -generic-y += qspinlock.h
> +generic-y += ticket-lock.h
> +generic-y += ticket-lock-types.h
>  generic-y += qrwlock_types.h
>  generic-y += qrwlock.h
>  generic-y += user.h
> diff --git a/arch/openrisc/include/asm/spinlock.h 
> b/arch/openrisc/include/asm/spinlock.h
> index a8940bdfcb7e..0b839ed1f3a0 100644
> --- a/arch/openrisc/include/asm/spinlock.h
> +++ b/arch/openrisc/include/asm/spinlock.h
> @@ -15,8 +15,7 @@
>  #ifndef __ASM_OPENRISC_SPINLOCK_H
>  #define __ASM_OPENRISC_SPINLOCK_H
>  
> -#include 
> -
> +#include 
>  #include 
>  
>  #define arch_read_lock_flags(lock, flags) arch_read_lock(lock)
> diff --git a/arch/openrisc/include/asm/spinlock_types.h 
> b/arch/openrisc/include/asm/spinlock_types.h
> index 7c6fb1208c88..58ea31fa65ce 100644
> --- a/arch/openrisc/include/asm/spinlock_types.h
> +++ b/arch/openrisc/include/asm/spinlock_types.h
> @@ -1,7 +1,7 @@
>  #ifndef _ASM_OPENRISC_SPINLOCK_TYPES_H
>  #define _ASM_OPENRISC_SPINLOCK_TYPES_H
>  
> -#include 
> +#include 
>  #include 
>  
>  #endif /* _ASM_OPENRISC_SPINLOCK_TYPES_H */
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index d74b13825501..a7a1296b0b4d 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -2,6 +2,36 @@
>  /*
>   * Queued spinlock
>   *
> + * A 'generic' spinlock implementation that is based on MCS locks. An
> + * architecture that's looking for a 'generic' spinlock, please first 
> consider
> + * ticket-lock.h and only come looking here when you've considered all the
> + * constraints below and can show your hardware does actually perform better
> + * with qspinlock.
> + *
> + *
> + * It relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no 
> weaker
> + * than RCtso if you

Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-14 Thread Stafford Horne
On Wed, Apr 14, 2021 at 02:45:43PM +0200, Peter Zijlstra wrote:
> On Wed, Apr 14, 2021 at 12:16:38PM +0200, Peter Zijlstra wrote:
> > On Wed, Apr 14, 2021 at 11:05:24AM +0200, Peter Zijlstra wrote:
> > 
> > > That made me look at the qspinlock code, and queued_spin_*lock() uses
> > > atomic_try_cmpxchg_acquire(), which means any arch that uses qspinlock
> > > and has RCpc atomics will give us massive pain.
> > > 
> > > Current archs using qspinlock are: x86, arm64, power, sparc64, mips and
> > > openrisc (WTF?!).
> > > 
> > > Of those, x86 and sparc are TSO archs with SC atomics, arm64 has RCsc
> > > atomics, power has RCtso atomics (and is the arch we all hate for having
> > > RCtso locks).
> > > 
> > > Now MIPS has all sorts of ill specified barriers, but last time looked
> > > at it it didn't actually use any of that and stuck to using smp_mb(), so
> > > it will have RCsc atomics.
> > > 
> > > /me goes look at wth openrisc is..  doesn't even appear to have
> > > asm/barrier.h :-/ Looking at wikipedia it also doesn't appear to
> > > actually have hardware ...
> > 
> > FWIW this is broken, anything SMP *MUST* define mb(), at the very least.
> 
> As near as I can tell this should do. The arch spec only lists this one
> instruction and the text makes it sound like a completion barrier.

Yes, I will submit this patch.

The l.msync instruction completes all load/store operations.
The l.lwa/l.swa pair (used for xchg/cmpxchg) also implies l.msync.

> ---
>  arch/openrisc/include/asm/barrier.h | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/arch/openrisc/include/asm/barrier.h 
> b/arch/openrisc/include/asm/barrier.h
> new file mode 100644
> index ..7538294721be
> --- /dev/null
> +++ b/arch/openrisc/include/asm/barrier.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __ASM_BARRIER_H
> +#define __ASM_BARRIER_H
> +
> +#define mb() asm volatile ("l.msync" ::: "memory")
> +
> +#include 
> +
> +#endif /* __ASM_BARRIER_H */


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Peter Zijlstra
On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:

> > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> 
> I did my testing with this FPGA build SoC:
> 
>  https://github.com/stffrdhrn/de0_nano-multicore
> 
> Note, the CPU timer sync logic uses mb() and is a bit flaky.  So missing mb()
> might be a reason.  I thought we had defined mb() and l.msync, but it seems to
> have gotten lost.
> 
> With that said I could test out this ticket-lock implementation.  How would I
> tell if its better than qspinlock?

Mostly if it isn't worse, it's better for being *much* simpler. As you
can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
only need the behaviour of atomic_fetch_add() to reason about behaviour
of the whole thing. qspinlock OTOH is mind bending painful to reason
about.

There are some spinlock tests in locktorture; but back when I had a
userspace copy of the lot and would measure min,avg,max acquire times
under various contention loads (making sure to only run a single task
per CPU etc.. to avoid lock holder preemption and other such 'fun'
things).

It took us a fair amount of work to get qspinlock to compete with ticket
for low contention cases (by far the most common in the kernel), and it
took a fairly large amount of CPUs for qspinlock to really win from
ticket on the contended case. Your hardware may vary. In particular the
access to the external cacheline (for queueing, see the queue: label in
queued_spin_lock_slowpath) is a pain-point and the relative cost of
cacheline misses for your arch determines where (and if) low contention
behaviour is competitive.

Also, less variance (the reason for the min/max measure) is better.
Large variance is typically a sign of fwd progress trouble.

That's not saying that qspinlock isn't awesome, but I'm arguing that you
should get there by first trying all the simpler things. By gradually
increasing complexity you can also find the problem spots (for your
architecture) and you have something to fall back to in case of trouble.

Now, the obvious selling point of qspinlock is that due to the MCS style
nature of the thing it doesn't bounce the lock around, but that comes at
a cost of having to use that extra cacheline (due to the kernel liking
sizeof(spinlock_t) == sizeof(u32)). But things like ARM64's WFE (see
smp_cond_load_acquire()) can shift the balance quite a bit on that front
as well (ARM has a similar thing but less useful, see it's spinlock.h
and look for wfe() and dsb_sev()).

Once your arch hits NUMA, qspinlock is probably a win. However, low
contention performance is still king for most workloads. Better high
contention behaviour is nice.


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Catalin Marinas
(fixed Will's email address)

On Thu, Apr 15, 2021 at 10:09:54AM +0200, Peter Zijlstra wrote:
> On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:
> > > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> > 
> > I did my testing with this FPGA build SoC:
> > 
> >  https://github.com/stffrdhrn/de0_nano-multicore
> > 
> > Note, the CPU timer sync logic uses mb() and is a bit flaky.  So missing 
> > mb()
> > might be a reason.  I thought we had defined mb() and l.msync, but it seems 
> > to
> > have gotten lost.
> > 
> > With that said I could test out this ticket-lock implementation.  How would 
> > I
> > tell if its better than qspinlock?
> 
> Mostly if it isn't worse, it's better for being *much* simpler. As you
> can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
> only need the behaviour of atomic_fetch_add() to reason about behaviour
> of the whole thing. qspinlock OTOH is mind bending painful to reason
> about.
> 
> There are some spinlock tests in locktorture; but back when I had a
> userspace copy of the lot and would measure min,avg,max acquire times
> under various contention loads (making sure to only run a single task
> per CPU etc.. to avoid lock holder preemption and other such 'fun'
> things).
> 
> It took us a fair amount of work to get qspinlock to compete with ticket
> for low contention cases (by far the most common in the kernel), and it
> took a fairly large amount of CPUs for qspinlock to really win from
> ticket on the contended case. Your hardware may vary. In particular the
> access to the external cacheline (for queueing, see the queue: label in
> queued_spin_lock_slowpath) is a pain-point and the relative cost of
> cacheline misses for your arch determines where (and if) low contention
> behaviour is competitive.
> 
> Also, less variance (the reason for the min/max measure) is better.
> Large variance is typically a sign of fwd progress trouble.

IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
systems where the little CPUs were always last to get a ticket when
racing with the big cores. That was with load/store exclusives (LR/SC
style) and would have probably got better with atomics but we moved to
qspinlocks eventually (the Juno board didn't have atomics).

(leaving the rest of the text below for Will's convenience)

> That's not saying that qspinlock isn't awesome, but I'm arguing that you
> should get there by first trying all the simpler things. By gradually
> increasing complexity you can also find the problem spots (for your
> architecture) and you have something to fall back to in case of trouble.
> 
> Now, the obvious selling point of qspinlock is that due to the MCS style
> nature of the thing it doesn't bounce the lock around, but that comes at
> a cost of having to use that extra cacheline (due to the kernel liking
> sizeof(spinlock_t) == sizeof(u32)). But things like ARM64's WFE (see
> smp_cond_load_acquire()) can shift the balance quite a bit on that front
> as well (ARM has a similar thing but less useful, see it's spinlock.h
> and look for wfe() and dsb_sev()).
> 
> Once your arch hits NUMA, qspinlock is probably a win. However, low
> contention performance is still king for most workloads. Better high
> contention behaviour is nice.

-- 
Catalin


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Will Deacon
On Thu, Apr 15, 2021 at 10:02:18AM +0100, Catalin Marinas wrote:
> (fixed Will's email address)
> 
> On Thu, Apr 15, 2021 at 10:09:54AM +0200, Peter Zijlstra wrote:
> > On Thu, Apr 15, 2021 at 05:47:34AM +0900, Stafford Horne wrote:
> > > > How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> > > 
> > > I did my testing with this FPGA build SoC:
> > > 
> > >  https://github.com/stffrdhrn/de0_nano-multicore
> > > 
> > > Note, the CPU timer sync logic uses mb() and is a bit flaky.  So missing 
> > > mb()
> > > might be a reason.  I thought we had defined mb() and l.msync, but it 
> > > seems to
> > > have gotten lost.
> > > 
> > > With that said I could test out this ticket-lock implementation.  How 
> > > would I
> > > tell if its better than qspinlock?
> > 
> > Mostly if it isn't worse, it's better for being *much* simpler. As you
> > can see, the guts of ticket is like 16 lines of C (lock+unlock) and you
> > only need the behaviour of atomic_fetch_add() to reason about behaviour
> > of the whole thing. qspinlock OTOH is mind bending painful to reason
> > about.
> > 
> > There are some spinlock tests in locktorture; but back when I had a
> > userspace copy of the lot and would measure min,avg,max acquire times
> > under various contention loads (making sure to only run a single task
> > per CPU etc.. to avoid lock holder preemption and other such 'fun'
> > things).
> > 
> > It took us a fair amount of work to get qspinlock to compete with ticket
> > for low contention cases (by far the most common in the kernel), and it
> > took a fairly large amount of CPUs for qspinlock to really win from
> > ticket on the contended case. Your hardware may vary. In particular the
> > access to the external cacheline (for queueing, see the queue: label in
> > queued_spin_lock_slowpath) is a pain-point and the relative cost of
> > cacheline misses for your arch determines where (and if) low contention
> > behaviour is competitive.
> > 
> > Also, less variance (the reason for the min/max measure) is better.
> > Large variance is typically a sign of fwd progress trouble.
> 
> IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
> systems where the little CPUs were always last to get a ticket when
> racing with the big cores. That was with load/store exclusives (LR/SC
> style) and would have probably got better with atomics but we moved to
> qspinlocks eventually (the Juno board didn't have atomics).
> 
> (leaving the rest of the text below for Will's convenience)

Yes, I think it was this thread:

https://lore.kernel.org/lkml/alpine.DEB.2.20.1707261548560.2186@nanos

but I don't think you can really fix such hardware by changing the locking
algorithm (although my proposed cpu_relax() hack was worryingly effective).

Will


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-15 Thread Peter Zijlstra
On Thu, Apr 15, 2021 at 10:02:18AM +0100, Catalin Marinas wrote:
> IIRC, one issue we had with ticket spinlocks on arm64 was on big.LITTLE
> systems where the little CPUs were always last to get a ticket when
> racing with the big cores. That was with load/store exclusives (LR/SC
> style) and would have probably got better with atomics but we moved to
> qspinlocks eventually (the Juno board didn't have atomics).

That sounds like a fundamental LL/SC fail, and I'm not sure qspinlock
will help with that at all. The big cores can still hog the lock word
and starve the little ones.

And those things not having AMOs there's really not much you can do. You
want the big cores to back off, but they're having success, not failure.
I suppose you can add a delay after a successful LL/SC, but that sucks.

I suppose modern big.little things will have AMOs, so maybe nobody still
cares about those systems.


Re: [RFC][PATCH] locking: Generic ticket-lock

2021-04-19 Thread Will Deacon
On Wed, Apr 14, 2021 at 12:16:38PM +0200, Peter Zijlstra wrote:
> How's this then? Compile tested only on openrisc/simple_smp_defconfig.
> 
> diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
> index d74b13825501..a7a1296b0b4d 100644
> --- a/include/asm-generic/qspinlock.h
> +++ b/include/asm-generic/qspinlock.h
> @@ -2,6 +2,36 @@
>  /*
>   * Queued spinlock
>   *
> + * A 'generic' spinlock implementation that is based on MCS locks. An
> + * architecture that's looking for a 'generic' spinlock, please first 
> consider
> + * ticket-lock.h and only come looking here when you've considered all the
> + * constraints below and can show your hardware does actually perform better
> + * with qspinlock.
> + *
> + *
> + * It relies on atomic_*_release()/atomic_*_acquire() to be RCsc (or no 
> weaker
> + * than RCtso if you're power), where regular code only expects atomic_t to 
> be
> + * RCpc.

Maybe capitalise "Power" to make it clear this about the architecture?

> + *
> + * It relies on a far greater (compared to ticket-lock.h) set of atomic
> + * operations to behave well together, please audit them carefully to ensure
> + * they all have forward progress. Many atomic operations may default to
> + * cmpxchg() loops which will not have good forward progress properties on
> + * LL/SC architectures.
> + *
> + * One notable example is atomic_fetch_or_acquire(), which x86 cannot 
> (cheaply)
> + * do. Carefully read the patches that introduced 
> queued_fetch_set_pending_acquire().
> + *
> + * It also heavily relies on mixed size atomic operations, in specific it
> + * requires architectures to have xchg16; something which many LL/SC
> + * architectures need to implement as a 32bit and+or in order to satisfy the
> + * forward progress guarantees mentioned above.
> + *
> + * Further reading on mixed size atomics that might be relevant:
> + *
> + *   http://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> + *
> + *
>   * (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
>   * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
>   *
> diff --git a/include/asm-generic/ticket-lock-types.h 
> b/include/asm-generic/ticket-lock-types.h
> new file mode 100644
> index ..829759aedda8
> --- /dev/null
> +++ b/include/asm-generic/ticket-lock-types.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __ASM_GENERIC_TICKET_LOCK_TYPES_H
> +#define __ASM_GENERIC_TICKET_LOCK_TYPES_H
> +
> +#include 
> +typedef atomic_t arch_spinlock_t;
> +
> +#define __ARCH_SPIN_LOCK_UNLOCKEDATOMIC_INIT(0)
> +
> +#endif /* __ASM_GENERIC_TICKET_LOCK_TYPES_H */
> diff --git a/include/asm-generic/ticket-lock.h 
> b/include/asm-generic/ticket-lock.h
> new file mode 100644
> index ..3f0d53e21a37
> --- /dev/null
> +++ b/include/asm-generic/ticket-lock.h
> @@ -0,0 +1,86 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * 'Generic' ticket-lock implementation.
> + *
> + * It relies on atomic_fetch_add() having well defined forward progress
> + * guarantees under contention. If your architecture cannot provide this, 
> stick
> + * to a test-and-set lock.
> + *
> + * It also relies on atomic_fetch_add() being safe vs smp_store_release() on 
> a
> + * sub-word of the value. This is generally true for anything LL/SC although
> + * you'd be hard pressed to find anything useful in architecture 
> specifications
> + * about this. If your architecture cannot do this you might be better off 
> with
> + * a test-and-set.
> + *
> + * It further assumes atomic_*_release() + atomic_*_acquire() is RCpc and 
> hence
> + * uses atomic_fetch_add() which is SC to create an RCsc lock.
> + *
> + * The implementation uses smp_cond_load_acquire() to spin, so if the
> + * architecture has WFE like instructions to sleep instead of poll for word
> + * modifications be sure to implement that (see ARM64 for example).
> + *
> + */
> +
> +#ifndef __ASM_GENERIC_TICKET_LOCK_H
> +#define __ASM_GENERIC_TICKET_LOCK_H
> +
> +#include 
> +#include 
> +
> +static __always_inline void ticket_lock(arch_spinlock_t *lock)
> +{
> + u32 val = atomic_fetch_add(1<<16, lock); /* SC, gives us RCsc */

I hate to say it, but smp_mb__after_unlock_lock() would make the intention
a lot clearer here :( That is, the implementation as you have it gives
stronger than RCsc semantics for all architectures.

Alternatively, we could write the thing RCpc and throw an smp_mb() into
the unlock path if CONFIG_ARCH_WEAK_RELEASE_ACQUIRE.

> + u16 ticket = val >> 16;
> +
> + if (ticket == (u16)val)
> + return;
> +
> + atomic_cond_read_acquire(lock, ticket == (u16)VAL);
> +}
> +
> +static __always_inline bool ticket_trylock(arch_spinlock_t *lock)
> +{
> + u32 old = atomic_read(lock);
> +
> + if ((old >> 16) != (old & 0x))
> + return false;
> +
> + return atomic_try_cmpxchg(lock, &old, old + (1<<16)); /* SC, for RCsc */
> +}
> +
> +static __always_inline void