(Really add Will this time ...)

On Mon, Feb 20, 2017 at 12:53:58PM +0800, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
> > On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
> > > All the locking related cmpxchg's in the following functions are
> > > replaced with the _acquire variants:
> > >  - pv_queued_spin_steal_lock()
> > >  - trylock_clear_pending()
> > > 
> > > This change should help performance on architectures that use LL/SC.
> > > 
> > > On a 2-core 16-thread Power8 system with pvqspinlock explicitly
> > > enabled, the performance of a locking microbenchmark with and without
> > > this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
> > > were as follows:
> > > 
> > >   # of thread     w/o patch      with patch      % Change
> > >   -----------     ---------      ----------      --------
> > >        4         4053.3 Mop/s   4223.7 Mop/s      +4.2%
> > >        8         3310.4 Mop/s   3406.0 Mop/s      +2.9%
> > >       12         2576.4 Mop/s   2674.6 Mop/s      +3.8%
> > > 
> > > Signed-off-by: Waiman Long <long...@redhat.com>
> > > ---
> > >  v2->v3:
> > >   - Reduce scope by relaxing cmpxchg's in fast path only.
> > > 
> > >  v1->v2:
> > >   - Add comments in changelog and code for the rationale of the change.
> > > 
> > >  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
> > >  1 file changed, 9 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> > > index e6b2f7a..a59dc05 100644
> > > --- a/kernel/locking/qspinlock_paravirt.h
> > > +++ b/kernel/locking/qspinlock_paravirt.h
> > > @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
> > > -	    (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > > +	    (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
> > >  		qstat_inc(qstat_pv_lock_stealing, true);
> > >  		return true;
> > >  	}
> > > @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
> > >  
> > >  /*
> > >   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
> > > - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
> > > - * just to be sure that it will get it.
> > > + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
> > > + * lock just to be sure that it will get it.
> > >   */
> > >  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  {
> > >  	struct __qspinlock *l = (void *)lock;
> > >  
> > >  	return !READ_ONCE(l->locked) &&
> > > -	       (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
> > > -			== _Q_PENDING_VAL);
> > > +	       (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
> > > +			_Q_LOCKED_VAL) == _Q_PENDING_VAL);
> > >  }
> > >  #else /* _Q_PENDING_BITS == 8 */
> > >  static __always_inline void set_pending(struct qspinlock *lock)
> > > @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> > >  	 */
> > >  	old = val;
> > >  	new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
> > > -	val = atomic_cmpxchg(&lock->val, old, new);
> > > +	val = atomic_cmpxchg_acquire(&lock->val, old, new);
> > >  
> > >  	if (val == old)
> > >  		return 1;
> > > @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> > >  	 * observe its next->locked value and advance itself.
> > >  	 *
> > >  	 * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
> > > +	 *
> > > +	 * We can't use the relaxed form of cmpxchg here as the loading of
> > > +	 * pn->state can happen before setting next->locked on some archs.
> > >  	 */
> > >  	if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
> > 
> > Hi Waiman.
> > 
> > cmpxchg() does not guarantee the (here implied) smp_mb(), in general; c.f.,
> > e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
> > to something like:
> > 
> >   _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
> > 
> 
> Yes, sorry for being late on this one.
> 
> So Waiman, the fact is that in this case, we want the following code
> sequence:
> 
> 	CPU 0					CPU 1
> 	=================			====================
> 	{pn->state = vcpu_running, node->locked = 0}
> 
> 	smp_store_mb(pn->state, vcpu_halted):
> 	  WRITE_ONCE(pn->state, vcpu_halted);
> 	  smp_mb();
> 	r1 = READ_ONCE(node->locked);
> 
> 						arch_mcs_spin_unlock_contended();
> 						  WRITE_ONCE(node->locked, 1);
> 
> 						cmpxchg(&pn->state,
> 							vcpu_halted, vcpu_hashed);
> 
> to never end up with:
> 
> 	r1 == 0 && cmpxchg fails (i.e. the read part of cmpxchg reads the
> 	value vcpu_running).
> 
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any order guarantee if it fails, which is true on ARM64, IIUC. (Adding Will
> in Cc for his insight ;-)).
> 
> So a possible "fix" (in case ARM64 uses qspinlock some day) would be to
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
> 
> Regards,
> Boqun
> 
> > Andrea
> > 
> > 
> > >  		return;
> > > -- 
> > > 1.8.3.1
> > > 
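The forbidden outcome Boqun describes is the classic store-buffering pattern. Below is a minimal userspace C11 sketch of the same ordering, not kernel code and not from the thread: smp_mb() is modelled with atomic_thread_fence(memory_order_seq_cst), cmpxchg_relaxed() with a relaxed compare-exchange, and the variable/constant names are made up for illustration.

/*
 * Userspace analogue of the pv_wait_node()/pv_kick_node() ordering
 * discussed above (illustration only, assumed names).  The "kicker"
 * side uses the suggested smp_mb() + cmpxchg_relaxed() shape.
 * Build with: cc -std=c11 -pthread sb_litmus.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

enum { VCPU_RUNNING, VCPU_HALTED, VCPU_HASHED };

static atomic_int state;   /* stands in for pn->state     */
static atomic_int locked;  /* stands in for node->locked  */
static int r1, cas_ok;

/* "CPU 0": the waiter side, as in pv_wait_node() */
static void *waiter(void *arg)
{
	atomic_store_explicit(&state, VCPU_HALTED, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* smp_store_mb()'s barrier */
	r1 = atomic_load_explicit(&locked, memory_order_relaxed);
	return NULL;
}

/* "CPU 1": the kicker side, with an explicit fence before a relaxed CAS */
static void *kicker(void *arg)
{
	int expected = VCPU_HALTED;

	atomic_store_explicit(&locked, 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);  /* the proposed smp_mb() */
	cas_ok = atomic_compare_exchange_strong_explicit(&state, &expected,
							 VCPU_HASHED,
							 memory_order_relaxed,
							 memory_order_relaxed);
	return NULL;
}

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		pthread_t t0, t1;

		atomic_store(&state, VCPU_RUNNING);
		atomic_store(&locked, 0);
		pthread_create(&t0, NULL, waiter, NULL);
		pthread_create(&t1, NULL, kicker, NULL);
		pthread_join(t0, NULL);
		pthread_join(t1, NULL);

		/* The outcome both sides want to rule out. */
		if (r1 == 0 && !cas_ok) {
			printf("forbidden outcome hit at iteration %d\n", i);
			return 1;
		}
	}
	printf("forbidden outcome not observed\n");
	return 0;
}

With both fences present the "r1 == 0 && CAS fails" outcome is forbidden by the C11 memory model, mirroring the kernel argument; dropping the fence on the kicker side (i.e. keeping only the relaxed CAS) would permit it on weakly ordered hardware, which is the concern raised for ARM64.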