On 02/12/2017 09:24 PM, panxinhui wrote:
>
> On 2017/2/10 4:53 AM, Waiman Long wrote:
>> On 02/07/2017 10:39 PM, Xinhui Pan wrote:
>>>
>>> 2016-12-26 4:26 GMT+08:00 Waiman Long <long...@redhat.com
>>> <mailto:long...@redhat.com>>:
>>>
>>>     A number of cmpxchg calls in qspinlock_paravirt.h were replaced by
>>>     more relaxed versions to improve performance on architectures that
>>>     use LL/SC.
>>>
>>>     All the locking related cmpxchg's are replaced with the _acquire
>>>     variants:
>>>      - pv_queued_spin_steal_lock()
>>>      - trylock_clear_pending()
>>>
>>>     The cmpxchg's related to hashing are replaced by either the _release
>>>     or the _relaxed variants. See the inline comments for details.
>>>
>>>     Signed-off-by: Waiman Long <long...@redhat.com
>>>     <mailto:long...@redhat.com>>
>>>
>>>     v1->v2:
>>>      - Add comments in changelog and code for the rationale of the change.
>>>
>>>     ---
>>>      kernel/locking/qspinlock_paravirt.h | 50 ++++++++++++++++++++++++-------------
>>>      1 file changed, 33 insertions(+), 17 deletions(-)
>>>
>>>     @@ -323,8 +329,14 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
>>>               * If pv_kick_node() changed us to vcpu_hashed, retain that
>>>               * value so that pv_wait_head_or_lock() knows to not also try
>>>               * to hash this lock.
>>>     +         *
>>>     +         * The smp_store_mb() and control dependency above will ensure
>>>     +         * that state change won't happen before that. Synchronizing
>>>     +         * with pv_kick_node() wrt hashing by this waiter or by the
>>>     +         * lock holder is done solely by the state variable. There is
>>>     +         * no other ordering requirement.
>>>               */
>>>     -        cmpxchg(&pn->state, vcpu_halted, vcpu_running);
>>>     +        cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_running);
>>>
>>>             /*
>>>              * If the locked flag is still not set after wakeup, it is a
>>>     @@ -360,9 +372,12 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
>>>              * pv_wait_node(). If OTOH this fails, the vCPU was running and will
>>>              * observe its next->locked value and advance itself.
>>>              *
>>>     -        * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
>>>     +        * Matches with smp_store_mb() and cmpxchg_relaxed() in
>>>     +        * pv_wait_node(). A release barrier is used here to ensure
>>>     +        * that node->locked is always set before changing the state.
>>>     +        * See comment in pv_wait_node().
>>>              */
>>>     -       if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
>>>     +       if (cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>>>     +               != vcpu_halted)
>>>                    return;
>>>
>>> hi, Waiman
>>> We can't use _release here; a full barrier is needed.
>>>
>>> There is pv_kick_node vs pv_wait_head_or_lock:
>>>
>>> pv_kick_node                         pv_wait_head_or_lock
>>>
>>> [W] l->locked = _Q_SLOW_VAL // reordered up to here
>>>                                      if (READ_ONCE(pn->state) == vcpu_hashed) // False.
>>>                                              lp = (struct qspinlock **)1;
>>> [STORE] pn->state = vcpu_hashed      lp = pv_hash(lock, pn);
>>> pv_hash()                            if (xchg(&l->locked, _Q_SLOW_VAL) == 0) // false, not unhashed.
>>>
>>> Then the same lock has been hashed twice but unhashed only once.
>>> Eventually, as the hash table grows big, we hit an RCU stall.
>>>
>>> I hit the RCU stall when I run the netperf benchmark.
>>>
>>> thanks
>>> xinhui
>>>
>>> --
>>> 1.8.3.1
>>>
>> Yes, I know I am being too aggressive in this patch. I am going to tone it
>> down a bit. I just don't have time to run a performance test on a PPC
>> system to verify the gain yet. I am planning to send an updated patch soon.
>>
> hi, All
>
> I guess I have found the scenario that causes the RCU stall.
>
> pv_wait_node                         pv_kick_node (previous lock holder)
>
>                                      [L] pn->state // this load is reordered
>                                                    // up from cmpxchg_release.
> smp_store_mb(pn->state, vcpu_halted);
> if (!READ_ONCE(node->locked))
>                                      arch_mcs_spin_unlock_contended(&next->locked);
>
>                                      [-L] cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
>                                      // cmpxchg_release fails, so pn->state stays as it is.
>     pv_wait(&pn->state, vcpu_halted);
>     // on PPC, it will not return until pn->state != vcpu_halted.
>
> And when the RCU stall hits, I fire a BUG() and enter debug mode; it seems
> most CPUs are in pv_wait...
>
> So the solution to this problem is simple: keep the cmpxchg as it is in
> pv_kick_node. cmpxchg on PPC provides full barriers.
>
> thanks
> xinhui
>
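To make the lost-kick window concrete, here is a minimal single-threaded C sketch (my own illustration, not kernel code; `simulate_lost_kick` and the surrounding model are invented names) that replays the interleaving xinhui describes. The hoisted load is modeled explicitly with a flag rather than relying on real hardware reordering:

```c
/* Mirrors pn->state in qspinlock_paravirt.h */
enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

/*
 * Replay of the lost-kick interleaving. If reorder_load is nonzero, the
 * load half of the kicker's cmpxchg_release() is satisfied before the
 * waiter's smp_store_mb(pn->state, vcpu_halted) becomes visible -- a
 * store->load reordering that lwsync on PPC does not forbid.
 * Returns 1 if the waiter ends up parked in pv_wait() with no kick
 * coming (the RCU-stall scenario), 0 otherwise.
 */
int simulate_lost_kick(int reorder_load)
{
    enum vcpu_state state = vcpu_running;   /* pn->state */
    int locked = 0;                         /* next->locked */

    /* kicker, reordered case: the load of pn->state happens early */
    enum vcpu_state observed = state;

    /* waiter: smp_store_mb(pn->state, vcpu_halted); */
    state = vcpu_halted;

    if (!reorder_load)
        observed = state;    /* program order: the load sees vcpu_halted */

    /* waiter: if (!READ_ONCE(node->locked)) -- the unlock store is not
     * visible yet, so the waiter decides to go to sleep */
    int waiter_saw_locked = locked;

    /* kicker: arch_mcs_spin_unlock_contended(&next->locked); */
    locked = 1;

    /* kicker: cmpxchg_release(&pn->state, vcpu_halted, vcpu_hashed)
     * compares against the possibly stale observed value */
    int cmpxchg_ok = (observed == vcpu_halted);
    if (cmpxchg_ok)
        state = vcpu_hashed;
    /* on failure, pv_kick_node() returns early: no pv_kick() is issued */

    /* waiter: pv_wait(&pn->state, vcpu_halted) -- on PPC this does not
     * return until pn->state != vcpu_halted */
    return !waiter_saw_locked && state == vcpu_halted;
}
```

With the load performed early, the cmpxchg compares against a stale vcpu_running, fails, pv_kick_node() returns without kicking, and the waiter sleeps on a state that never changes -- matching the observation that most CPUs sit in pv_wait.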
Sorry for the late reply, and thanks for the explanation. I was not aware
that the load of cmpxchg_release could be moved up like that. I will need
to be more careful when using the relaxed forms of cmpxchg next time.

Cheers,
Longman
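For completeness, the first scenario xinhui reported (the lock hashed twice but unhashed once) can be replayed the same way. Note that release semantics only order *earlier* accesses before the release store; a *later* store such as the WRITE_ONCE of l->locked in pv_kick_node is free to move above it, which only a full-barrier cmpxchg prevents. This sketch uses invented names (`simulate_double_hash`, `SLOW_VAL`); the return value counts leaked hash-table entries:

```c
/* Mirrors pn->state in qspinlock_paravirt.h */
enum vcpu_state { vcpu_running, vcpu_halted, vcpu_hashed };

/* Stand-ins for _Q_LOCKED_VAL and _Q_SLOW_VAL */
enum { LOCKED_VAL = 1, SLOW_VAL = 3 };

/*
 * Replay of the double-hash interleaving between pv_kick_node() (the
 * lock holder) and pv_wait_head_or_lock() (the queue-head waiter).
 * If reorder_store is nonzero, the WRITE_ONCE(l->locked, _Q_SLOW_VAL)
 * that follows cmpxchg_release() becomes visible before the store of
 * pn->state, as in xinhui's diagram. Returns hash count minus unhash
 * count, i.e. the number of leaked hash-table entries.
 */
int simulate_double_hash(int reorder_store)
{
    enum vcpu_state state = vcpu_halted;  /* waiter was halted earlier */
    int locked = LOCKED_VAL;              /* lock is currently held */
    int hash_count = 0;

    if (reorder_store) {
        /* kicker: WRITE_ONCE(l->locked, _Q_SLOW_VAL) hoisted above the
         * cmpxchg_release store of pn->state */
        locked = SLOW_VAL;
    } else {
        /* kicker in program order: state, then hash, then l->locked */
        state = vcpu_hashed;
        hash_count++;                     /* pv_hash(lock, pn) */
        locked = SLOW_VAL;
    }

    /* waiter in pv_wait_head_or_lock() */
    if (state == vcpu_hashed) {
        /* lp = (struct qspinlock **)1; already hashed, nothing to do */
    } else {
        hash_count++;                     /* lp = pv_hash(lock, pn) */
        int old = locked;                 /* xchg(&l->locked, _Q_SLOW_VAL) */
        locked = SLOW_VAL;
        if (old == 0)
            hash_count--;                 /* lock was free: take it, unhash */
    }

    if (reorder_store) {
        /* the rest of pv_kick_node() now becomes visible */
        state = vcpu_hashed;
        hash_count++;                     /* pv_hash() hashes the lock again */
    }

    /* the eventual unlock finds the lock in the hash table exactly once */
    return hash_count - 1;
}
```

In the reordered case the waiter's xchg returns SLOW_VAL (nonzero), so it does not unhash, and both sides have hashed the same lock: one entry leaks per occurrence, and the hash table grows until the RCU stall appears.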