On 02/19/2017 11:53 PM, Boqun Feng wrote:
> On Mon, Feb 20, 2017 at 05:20:52AM +0100, Andrea Parri wrote:
>> On Fri, Feb 17, 2017 at 03:43:40PM -0500, Waiman Long wrote:
>>> All the locking related cmpxchg's in the following functions are
>>> replaced with the _acquire variants:
>>>  - pv_queued_spin_steal_lock()
>>>  - trylock_clear_pending()
>>>
>>> This change should help performance on architectures that use LL/SC.
>>>
>>> On a 2-core 16-thread Power8 system with pvqspinlock explicitly
>>> enabled, the performance of a locking microbenchmark with and without
>>> this patch on a 4.10-rc8 kernel with Xinhui's PPC qspinlock patch
>>> were as follows:
>>>
>>>   # of threads    w/o patch    with patch      % Change
>>>   -----------     ---------    ----------      --------
>>>        4         4053.3 Mop/s  4223.7 Mop/s     +4.2%
>>>        8         3310.4 Mop/s  3406.0 Mop/s     +2.9%
>>>       12         2576.4 Mop/s  2674.6 Mop/s     +3.8%
>>>
>>> Signed-off-by: Waiman Long <long...@redhat.com>
>>> ---
>>>
>>>  v2->v3:
>>>   - Reduce scope by relaxing cmpxchg's in fast path only.
>>>
>>>  v1->v2:
>>>   - Add comments in changelog and code for the rationale of the change.
>>>
>>>  kernel/locking/qspinlock_paravirt.h | 15 +++++++++------
>>>  1 file changed, 9 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
>>> index e6b2f7a..a59dc05 100644
>>> --- a/kernel/locking/qspinlock_paravirt.h
>>> +++ b/kernel/locking/qspinlock_paravirt.h
>>> @@ -72,7 +72,7 @@ static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
>>>     struct __qspinlock *l = (void *)lock;
>>>  
>>>     if (!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
>>> -       (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>>> +       (cmpxchg_acquire(&l->locked, 0, _Q_LOCKED_VAL) == 0)) {
>>>             qstat_inc(qstat_pv_lock_stealing, true);
>>>             return true;
>>>     }
>>> @@ -101,16 +101,16 @@ static __always_inline void clear_pending(struct qspinlock *lock)
>>>  
>>>  /*
>>>   * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
>>> - * barrier. Therefore, an atomic cmpxchg() is used to acquire the lock
>>> - * just to be sure that it will get it.
>>> + * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
>>> + * lock just to be sure that it will get it.
>>>   */
>>>  static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>>>  {
>>>     struct __qspinlock *l = (void *)lock;
>>>  
>>>     return !READ_ONCE(l->locked) &&
>>> -          (cmpxchg(&l->locked_pending, _Q_PENDING_VAL, _Q_LOCKED_VAL)
>>> -                   == _Q_PENDING_VAL);
>>> +          (cmpxchg_acquire(&l->locked_pending, _Q_PENDING_VAL,
>>> +                           _Q_LOCKED_VAL) == _Q_PENDING_VAL);
>>>  }
>>>  #else /* _Q_PENDING_BITS == 8 */
>>>  static __always_inline void set_pending(struct qspinlock *lock)
>>> @@ -138,7 +138,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
>>>              */
>>>             old = val;
>>>             new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
>>> -           val = atomic_cmpxchg(&lock->val, old, new);
>>> +           val = atomic_cmpxchg_acquire(&lock->val, old, new);
>>>  
>>>             if (val == old)
>>>                     return 1;
>>> @@ -361,6 +361,9 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>      * observe its next->locked value and advance itself.
>>>      *
>>>      * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>> +    *
>>> +    * We can't use the relaxed form of cmpxchg here as the load of
>>> +    * pn->state can happen before setting next->locked on some archs.
>>>      */
>>>     if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>> Hi Waiman.
>>
>> cmpxchg() does not guarantee the (here implied) smp_mb(), in general; cf.,
>> e.g., arch/arm64/include/asm/atomic_ll_sc.h, where cmpxchg() gets "compiled"
>> to something like:
>>
>>     _loop: ldxr; eor; cbnz _exit; stlxr; cbnz _loop; dmb ish; _exit:
>>
>> That is, on a failed compare the code branches straight to _exit and never
>> executes the dmb ish, so a failed cmpxchg() implies no barrier.
>>
> Yes, sorry for being late on this one.
>
> So Waiman, the fact is that in this case, we want to guarantee that the
> following code sequence:
>
>       CPU 0                                   CPU 1
>       =================                       ====================
>       {pn->state = vcpu_running, node->locked = 0}
>
>       smp_store_mb(pn->state, vcpu_halted):
>         WRITE_ONCE(pn->state, vcpu_halted);
>         smp_mb();
>       r1 = READ_ONCE(node->locked);
>
>                                               arch_mcs_spin_unlock_contended():
>                                                 WRITE_ONCE(node->locked, 1);
>
>                                               cmpxchg(&pn->state,
>                                                       vcpu_halted, vcpu_hashed);
>
> never ends up with:
>
>       r1 == 0 && the cmpxchg failing (i.e., the read part of the cmpxchg
>       reading the value vcpu_running).
>
> We can have such a guarantee if cmpxchg has an smp_mb() before its load
> part, which is true for PPC. But semantically, cmpxchg() doesn't provide
> any ordering guarantee if it fails, which is the case on ARM64, IIUC.
> (Adding Will in Cc for his insight ;-)).
>
> So a possible "fix"(in case ARM64 will use qspinlock some day), would be
> replace cmpxchg() with smp_mb() + cmpxchg_relaxed().
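>
> For concreteness, in pv_kick_node() that would look something like the
> following (just a sketch of the shape of the suggestion, not a tested
> patch; the surrounding code stays unchanged):
>
>	/*
>	 * Order the earlier WRITE_ONCE(next->locked, 1) in
>	 * arch_mcs_spin_unlock_contended() before the load part of the
>	 * cmpxchg, whether the cmpxchg succeeds or fails.
>	 */
>	smp_mb();
>	if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
>						!= vcpu_halted)
>		return;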
>

This is in the pvqspinlock code; the native qspinlock will not have this
problem. So we will need to make sure that the write to node->locked
always precedes the read of pn->state whether the cmpxchg succeeds or
fails. We are not going to replace cmpxchg() with smp_mb() +
cmpxchg_relaxed(), as that would hurt x86 performance. Perhaps we could
provide another cmpxchg variant that provides that memory barrier
guarantee on all archs.
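
Something along these lines, perhaps (a hypothetical sketch only; the
name cmpxchg_mb() is made up here and no such API exists today):

	/*
	 * Hypothetical: a cmpxchg that guarantees a full barrier before
	 * its load part even when the compare fails.  Generic fallback
	 * for LL/SC archs:
	 */
	#define cmpxchg_mb(ptr, old, new)			\
	({							\
		smp_mb();					\
		cmpxchg_relaxed((ptr), (old), (new));		\
	})

An arch like x86, whose cmpxchg() is already a full memory barrier, would
simply override this with a plain cmpxchg() so that its fast path is
unaffected.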

In the meantime, I am going to update the patch to document this
limitation so that we can do something about it when archs like ARM64
want pvqspinlock support.

Cheers,
Longman
