Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On Thu Nov 10, 2022 at 4:37 PM AEST, Christophe Leroy wrote:
>
>
> On 10/11/2022 at 01:35, Jordan Niethe wrote:
> > On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> >
> >> -#define queued_spin_lock queued_spin_lock
> >>
> >> -static inline void queued_spin_unlock(struct qspinlock *lock)
> >> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >> {
> >> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> >> -		smp_store_release(&lock->locked, 0);
> >> -	else
> >> -		__pv_queued_spin_unlock(lock);
> >> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> >> +		return 1;
> >> +	return 0;
> >
> > optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> No parenthesis.
> No == 0
>
> Should be :
>
> 	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

In this case I prefer the == 0 because we're testing against the 0 old
parameter being passed in. This is the recognisable cmpxchg pattern. The
other style of cmpxchg returns true if it succeeded, so it's less clear
we're not using that version if using !.

Thanks,
Nick
Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On Thu Nov 10, 2022 at 10:35 AM AEST, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
>
> > -#define queued_spin_lock queued_spin_lock
> >
> > -static inline void queued_spin_unlock(struct qspinlock *lock)
> > +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> > {
> > -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> > -		smp_store_release(&lock->locked, 0);
> > -	else
> > -		__pv_queued_spin_unlock(lock);
> > +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> > +		return 1;
> > +	return 0;
>
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>
> [resend as utf-8, not utf-7]

Thanks for the thorough review, apologies again it took me so long to
get back to. I'm not completely sold on this. I guess it's already
side-effects in a control flow statement though... Maybe I will change
it, not sure.

Thanks,
Nick
Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On 10/11/2022 at 01:35, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
>
>> -#define queued_spin_lock queued_spin_lock
>>
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>> {
>> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -		smp_store_release(&lock->locked, 0);
>> -	else
>> -		__pv_queued_spin_unlock(lock);
>> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +		return 1;
>> +	return 0;
>
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

No parenthesis.
No == 0

Should be :

	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);

>
> [resend as utf-8, not utf-7]
>
Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> -#define queued_spin_lock queued_spin_lock
>
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> -		smp_store_release(&lock->locked, 0);
> -	else
> -		__pv_queued_spin_unlock(lock);
> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> +		return 1;
> +	return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);

[resend as utf-8, not utf-7]
Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On 10/08/2022 at 03:52, Jordan Niethe wrote:
> On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
>
>> -#define queued_spin_lock queued_spin_lock
>>
>> -static inline void queued_spin_unlock(struct qspinlock *lock)
>> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>> {
>> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
>> -		smp_store_release(&lock->locked, 0);
>> -	else
>> -		__pv_queued_spin_unlock(lock);
>> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
>> +		return 1;
>> +	return 0;
>
> optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
>

The parenthesis are pointless, and ! is usually preferred to == 0,
something like that:

	return !atomic_cmpxchg_acquire(&lock->val, 0, 1);
Re: [PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
On Thu, 2022-07-28 at 16:31 +1000, Nicholas Piggin wrote:
> -#define queued_spin_lock queued_spin_lock
>
> -static inline void queued_spin_unlock(struct qspinlock *lock)
> +static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
> -	if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
> -		smp_store_release(&lock->locked, 0);
> -	else
> -		__pv_queued_spin_unlock(lock);
> +	if (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0)
> +		return 1;
> +	return 0;

optional style nit: return (atomic_cmpxchg_acquire(&lock->val, 0, 1) == 0);
[PATCH 01/17] powerpc/qspinlock: powerpc qspinlock implementation
Add a powerpc specific implementation of queued spinlocks. This is the
build framework with a very simple (non-queued) spinlock implementation
to begin with. Later changes add queueing, and other features and
optimisations one-at-a-time. It is done this way to more easily see how
the queued spinlocks are built, and to make performance and correctness
bisects more useful.

Generic PV qspinlock code is causing latency / starvation regressions
on large systems that are resulting in hard lockups reported (mostly in
pathological cases). The generic qspinlock code has a number of issues
important for powerpc hardware and hypervisors that aren't easily
solved without changing code that would impact other architectures.
Follow s390's lead and implement our own for now.

Issues for powerpc using generic qspinlocks:
- The previous lock value should not be loaded with simple loads, and
  need not be passed around from previous loads or cmpxchg results,
  because powerpc uses ll/sc-style atomics which can perform more
  complex operations that do not require this. powerpc implementations
  tend to prefer loads use larx for improved coherency performance.
- The queueing process should absolutely minimise the number of stores
  to the lock word to reduce exclusive coherency probes, important for
  large system scalability. The pending logic is counter productive
  here.
- Non-atomic unlock for paravirt locks is important (atomic
  instructions tend to still be more expensive than x86 CPUs).
- Yielding to the lock owner is important in the oversubscribed
  paravirt case, which requires storing the owner CPU in the lock word.
- More control of lock stealing for the paravirt case is important to
  keep latency down on large systems.
- The lock acquisition operation should always be made with a special
  variant of atomic instructions with the lock hint bit set, including
  (especially) in the queueing paths.
This is more a matter of adding more arch lock helpers so not an
insurmountable problem for generic code.

Since the RFC series, I tested this on a 16-socket 1920 thread POWER10
system with some microbenchmarks, and that showed up significant
problems with the previous series. High amount of spinning on the lock
up-front (lock stealing) for SPLPAR mode (paravirt) really hurts
scalability when the guest is not overcommitted. However on smaller KVM
systems with significant overcommit (e.g., 5-10%), this spinning is
very important to avoid performance tanking due to the queueing
problem.

So rather than set STEAL_SPINS and HEAD_SPINS based on SPLPAR at
boot-time, I lowered them and do more to dynamically deal with vCPU
preemption. So behaviour of dedicated and shared LPAR mode is now the
same until there is vCPU preemption detected. This seems to be leading
to better results overall, but some worst-case latencies are
significantly up with the lockstorm test (latency is still better than
generic queued spinlocks, but not as good as it previously was or as
good as simple). Statistical fairness is still significantly better.
Thanks,
Nick

---
 arch/powerpc/Kconfig                       |  1 -
 arch/powerpc/include/asm/qspinlock.h       | 78 ++
 arch/powerpc/include/asm/qspinlock_types.h | 13
 arch/powerpc/include/asm/spinlock_types.h  |  2 +-
 arch/powerpc/lib/Makefile                  |  4 +-
 arch/powerpc/lib/qspinlock.c               | 18 +
 6 files changed, 69 insertions(+), 47 deletions(-)
 create mode 100644 arch/powerpc/include/asm/qspinlock_types.h
 create mode 100644 arch/powerpc/lib/qspinlock.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 7aa12e88c580..4838e6c96b20 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -154,7 +154,6 @@ config PPC
 	select ARCH_USE_CMPXCHG_LOCKREF		if PPC64
 	select ARCH_USE_MEMTEST
 	select ARCH_USE_QUEUED_RWLOCKS		if PPC_QUEUED_SPINLOCKS
-	select ARCH_USE_QUEUED_SPINLOCKS	if PPC_QUEUED_SPINLOCKS
 	select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
 	select ARCH_WANT_IPC_PARSE_VERSION
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index 39c1c7f80579..cb2b4f91e976 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -2,66 +2,56 @@
 #ifndef _ASM_POWERPC_QSPINLOCK_H
 #define _ASM_POWERPC_QSPINLOCK_H

-#include <asm-generic/qspinlock_types.h>
-#include <asm/paravirt.h>
+#include <linux/compiler.h>
+#include <asm/qspinlock_types.h>
+#include <asm/paravirt.h>

-#define _Q_PENDING_LOOPS	(1 << 9) /* not tuned */
-
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-void __pv_queued_spin_unlock(struct qspinlock *lock);
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
+static __always_inline int queued_spin_is_locked(struct qspinlock *lock)
 {
-	u32 val = 0;
+	return atomic_read(&lock->val);
+}

-	if (likely(arch_atomic