On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> All the locking related cmpxchg's in the following functions are
> replaced with the _acquire variants:
>  - pv_queued_spin_steal_lock()
>  - trylock_clear_pending()
>
> This change should help performance on architectures that use LL/SC.
>
> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> enabled, the performance of a locking microbenchmark with and without
> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> were as follows:
>
>   # of thread     w/o patch       with patch      % Change
>   -----------     ---------       ----------      --------
>        4         4053.3 Mop/s    4223.7 Mop/s      +4.2%
>        8         3310.4 Mop/s    3406.0 Mop/s      +2.9%
>       12         2576.4 Mop/s    2674.6 Mop/s      +3.8%
>
> Signed-off-by: Waiman Long <long...@redhat.com>
> ---
>
>  v2->v3:
>   - Reduce scope by relaxing cmpxchg's in fast path only.
>
>  v1->v2:
>   - Add comments in changelog and code for the rationale of the change.
>
>  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
>  1 file changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e6b2f7a..a59dc05 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>  	struct __qspinlock *l = (void *)lock;
>
>  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>  		qstat_inc(qstat_pv_lock_stealing, true);
>  		return true;
>  	}
> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>
>  /*
>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> - * just to be sure that it will get it.
> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> + * lock just to be sure that it will get it.
>   */
>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  {
>  	struct __qspinlock *l = (void *)lock;
>
>  	return !READ_ONCE(l->locked) &&
> -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> -			== _Q_PENDING_VAL);
> +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> +				_Q_LOCKED_VAL) == _Q_PENDING_VAL);
>  }
>  #else /* _Q_PENDING_BITS == 8 */
>  static __always_inline void set_pending(struct qspinlock *lock)
> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>  		 */
>  		old = val;
>  		new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> -		val = atomic_cmpxchg(&lock->val, old, new);
> +		val = atomic_cmpxchg_acquire(&lock->val, old, new);
>
>  		if (val == old)
>  			return 1;
> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>  	 * observe its next->locked value and advance itself.
>  	 *
>  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> +	 *
> +	 * We can't used relaxed form of cmpxchg here as the loading of
> +	 * pn->state can happen before setting next->locked in some archs.
>  	 */
>  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
Hi Waiman,

cmpxchg() does not guarantee the (here implied) smp_mb(), in general;
cf., e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets
"compiled" to something like:

  _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:

Andrea

>  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>  		return;
> --
> 1.8.3.1
>
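
For readers following the thread, here is a minimal userspace sketch of the
release/acquire pairing the patch relies on in the lock-stealing fast path.
It uses C11 atomics rather than the kernel API, and every name in it (data,
locked, owner, stealer) is made up for illustration; it is not the kernel
code, just the ordering argument: if the stealer's acquire compare-exchange
observes the unlocker's release store of 0, the previous owner's
critical-section writes are guaranteed visible, without the extra full
barriers a plain cmpxchg() would add on an LL/SC architecture.

/* build: cc -std=c11 -pthread sketch.c */
#include <stdatomic.h>
#include <pthread.h>
#include <assert.h>
#include <stdio.h>

static atomic_int locked = 1;   /* 1 = held by the "previous owner" */
static int data;                /* protected by the lock            */

static void *owner(void *arg)
{
	data = 42;                              /* critical-section store */
	atomic_store_explicit(&locked, 0,
			      memory_order_release);    /* unlock (release) */
	return NULL;
}

static void *stealer(void *arg)
{
	int expected = 0;

	/* roughly analogous to cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) */
	while (!atomic_compare_exchange_weak_explicit(&locked, &expected, 1,
						      memory_order_acquire,
						      memory_order_relaxed)) {
		expected = 0;                   /* retry until the lock is free */
	}
	assert(data == 42);     /* acquire pairs with the release store above */
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, owner, NULL);
	pthread_create(&t2, NULL, stealer, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("data = %d\n", data);
	return 0;
}

On an LL/SC machine the acquire form can typically drop the trailing full
barrier that a full-ordered cmpxchg() needs (the dmb ish in the arm64
sequence quoted above, or the equivalent sync on Power), which is presumably
where the Power8 numbers in the changelog come from.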