On 02/12/2017 09:24 PM, panxinhui wrote:
>
> On 2017/2/10 4:53 AM, Waiman Long wrote:
>> On 02/07/2017 10:39 PM, Xinhui Pan wrote:
>>>
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long <long...@redhat.com
>>> <mailto:long...@redhat.com>>:
>>>
>>>     A number of cmpxchg calls in qspinlock_paravirt.h were replaced by
>>>     more relaxed versions to improve performance on architectures that
>>>     use LL/SC.
>>>
>>>     All the locking related cmpxchg's are replaced with the _acquire
>>>     variants:
>>>      - pv_queued_spin_steal_lock()
>>>      - trylock_clear_pending()
>>>
>>>     The cmpxchg's related to hashing are replaced by either the _release
>>>     or the _relaxed variants. See the inline comments for details.
>>>
>>>     Signed-off-by: Waiman Long <long...@redhat.com
>>>     <mailto:long...@redhat.com>>
>>>
>>>     v1->v2:
>>>      - Add comments in changelog and code for the rationale of the change.
>>>
>>>     ---
>>>      kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++-------------
>>>      1 file changed, 33 insertions(+), 17 deletions(-)
>>>
>>>     @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>               * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>               * value so that pv_wait_head_or_lock() knows to not also try
>>>               * to hash this lock.
>>>     +         *
>>>     +         * The smp_store_mb() and control dependency above will ensure
>>>     +         * that state change won't happen before that. Synchronizing
>>>     +         * with pv_kick_node() wrt hashing by this waiter or by the
>>>     +         * lock holder is done solely by the state variable. There is
>>>     +         * no other ordering requirement.
>>>               */
>>>     -        cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>     +        cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>
>>>             /*
>>>              * If the locked flag is still not set after wakeup, it is a
>>>     @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>              * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>>              * observe its next->locked value and advance itself.
>>>              *
>>>     -        * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>     +        * Matches with smp_store_mb() and cmpxchg_relaxed() in
>>>     +        * pv_wait_node(). A release barrier is used here to ensure
>>>     +        * that node->locked is always set before changing the state.
>>>     +        * See comment in pv_wait_node().
>>>              */
>>>     -       if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>     +       if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>     +               != vcpu_halted)
>>>                    return;
>>>
>>> hi, Waiman
>>> We can't use _release here; a full barrier is needed.
>>>
>>> There is pv_kick_node vs pv_wait_head_or_lock:
>>>
>>> pv_kick_node                         pv_wait_head_or_lock
>>>
>>> [W] l->locked = _Q_SLOW_VAL // reordered up to here
>>>                                      if (READ_ONCE(pn->state) == vcpu_hashed) // False.
>>>                                              lp = (struct qspinlock **)1;
>>> [STORE] pn->state = vcpu_hashed      lp = pv_hash(lock, pn);
>>> pv_hash()                            if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>>>
>>> Then the same lock has been hashed twice but unhashed only once.
>>> Eventually, as the hash table grows big, we hit an RCU stall.
>>>
>>> I hit the RCU stall when I run the netperf benchmark.
>>>
>>> thanks
>>> xinhui
>>>
>>> --
>>> 1.8.3.1
>>>
>> Yes, I know I am being too aggressive in this patch. I am going to tone it
>> down a bit. I just don't have time to run a performance test on a PPC
>> system to verify the gain yet. I am planning to send an updated patch soon.
>>
> hi, All
>
> I guess I have found the scenario that causes the RCU stall.
>
> pv_wait_node                         pv_kick_node (previous lock holder)
>
>                                      [L] pn->state // this load is reordered
>                                                    // up from cmpxchg_release.
> smp_store_mb(pn->state, vcpu_halted);
> if (!READ_ONCE(node->locked))
>                                      arch_mcs_spin_unlock_contended(&next->locked);
>
>                                      [-L] cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>                                      // cmpxchg_release fails, so pn->state stays as it is.
>     pv_wait(&pn->state, vcpu_halted);
>     // on PPC, it will not return until pn->state != vcpu_halted.
>
> And when the RCU stall hits, I fire a BUG() and enter debug mode; it seems
> most CPUs are in pv_wait...
>
> So the solution to this problem is simple: keep the cmpxchg as it is in
> pv_kick_node. cmpxchg on PPC provides full barriers.
>
> thanks
> xinhui
>
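To make the lost-kick window concrete, here is a minimal single-threaded C sketch (my own illustration, not kernel code; `simulate_lost_kick` and the surrounding model are invented names) that replays the interleaving xinhui describes. The hoisted load is modeled explicitly with a flag rather than relying on real hardware reordering:

```c
/* Mirrors pn->state in qspinlock_paravirt.h */
enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

/*
 * Replay of the lost-kick interleaving. If reorder_load is nonzero, the
 * load half of the kicker's cmpxchg_release() is satisfied before the
 * waiter's smp_store_mb(pn->state, vcpu_halted) becomes visible -- a
 * store->load reordering that lwsync on PPC does not forbid.
 * Returns 1 if the waiter ends up parked in pv_wait() with no kick
 * coming (the RCU-stall scenario), 0 otherwise.
 */
int simulate_lost_kick(int reorder_load)
{
    enum vcpu_state state = vcpu_running;   /* pn->state */
    int locked = 0;                         /* next->locked */

    /* kicker, reordered case: the load of pn->state happens early */
    enum vcpu_state observed = state;

    /* waiter: smp_store_mb(pn->state, vcpu_halted); */
    state = vcpu_halted;

    if (!reorder_load)
        observed = state;    /* program order: the load sees vcpu_halted */

    /* waiter: if (!READ_ONCE(node->locked)) -- the unlock store is not
     * visible yet, so the waiter decides to go to sleep */
    int waiter_saw_locked = locked;

    /* kicker: arch_mcs_spin_unlock_contended(&next->locked); */
    locked = 1;

    /* kicker: cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
     * compares against the possibly stale observed value */
    int cmpxchg_ok = (observed == vcpu_halted);
    if (cmpxchg_ok)
        state = vcpu_hashed;
    /* on failure, pv_kick_node() returns early: no pv_kick() is issued */

    /* waiter: pv_wait(&pn->state, vcpu_halted) -- on PPC this does not
     * return until pn->state != vcpu_halted */
    return !waiter_saw_locked && state == vcpu_halted;
}
```

With the load performed early, the cmpxchg compares against a stale vcpu_running, fails, pv_kick_node() returns without kicking, and the waiter sleeps on a state that never changes -- matching the observation that most CPUs sit in pv_wait.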
Sorry for the late reply, and thanks for the explanation. I was not aware
that the load of cmpxchg_release could be moved up like that. I will need
to be more careful when using the relaxed forms of cmpxchg next time.

Cheers,
Longman
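For completeness, the first scenario xinhui reported (the lock hashed twice but unhashed once) can be replayed the same way. Note that release semantics only order *earlier* accesses before the release store; a *later* store such as the WRITE_ONCE of l->locked in pv_kick_node is free to move above it, which only a full-barrier cmpxchg prevents. This sketch uses invented names (`simulate_double_hash`, `SLOW_VAL`); the return value counts leaked hash-table entries:

```c
/* Mirrors pn->state in qspinlock_paravirt.h */
enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

/* Stand-ins for _Q_LOCKED_VAL and _Q_SLOW_VAL */
enum { LOCKED_VAL = 1, SLOW_VAL = 3 };

/*
 * Replay of the double-hash interleaving between pv_kick_node() (the
 * lock holder) and pv_wait_head_or_lock() (the queue-head waiter).
 * If reorder_store is nonzero, the WRITE_ONCE(l->locked, _Q_SLOW_VAL)
 * that follows cmpxchg_release() becomes visible before the store of
 * pn->state, as in xinhui's diagram. Returns hash count minus unhash
 * count, i.e. the number of leaked hash-table entries.
 */
int simulate_double_hash(int reorder_store)
{
    enum vcpu_state state = vcpu_halted;  /* waiter was halted earlier */
    int locked = LOCKED_VAL;              /* lock is currently held */
    int hash_count = 0;

    if (reorder_store) {
        /* kicker: WRITE_ONCE(l->locked, _Q_SLOW_VAL) hoisted above the
         * cmpxchg_release store of pn->state */
        locked = SLOW_VAL;
    } else {
        /* kicker in program order: state, then hash, then l->locked */
        state = vcpu_hashed;
        hash_count++;                     /* pv_hash(lock, pn) */
        locked = SLOW_VAL;
    }

    /* waiter in pv_wait_head_or_lock() */
    if (state == vcpu_hashed) {
        /* lp = (struct qspinlock **)1; already hashed, nothing to do */
    } else {
        hash_count++;                     /* lp = pv_hash(lock, pn) */
        int old = locked;                 /* xchg(&l->locked, _Q_SLOW_VAL) */
        locked = SLOW_VAL;
        if (old == 0)
            hash_count--;                 /* lock was free: take it, unhash */
    }

    if (reorder_store) {
        /* the rest of pv_kick_node() now becomes visible */
        state = vcpu_hashed;
        hash_count++;                     /* pv_hash() hashes the lock again */
    }

    /* the eventual unlock finds the lock in the hash table exactly once */
    return hash_count - 1;
}
```

In the reordered case the waiter's xchg returns SLOW_VAL (nonzero), so it does not unhash, and both sides have hashed the same lock: one entry leaks per occurrence, and the hash table grows until the RCU stall appears.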