Hi Diogo,
Thanks for your explanation. As documented in https://developer.arm.com/documentation/ddi0487/fc B2.9.5 Load-Exclusive and Store-Exclusive instruction usage restrictions:

"Between the Load-Exclusive and the Store-Exclusive, there are no explicit memory accesses, preloads, direct or indirect System register writes, address translation instructions, cache or TLB maintenance instructions, exception generating instructions, exception returns, or indirect branches."

[Honnappa] This is a requirement on the software, not on the micro-architecture. We are having a few discussions internally, will get back soon.

So it is not allowed to insert (1) & (4) between (2, 3). The cmpxchg operation is atomic.

Thanks,
Phil Yang

> -----Original Message-----
> From: Diogo Behrens <[email protected]>
> Sent: Thursday, August 27, 2020 4:57 PM
> To: Phil Yang <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Honnappa Nagarahalli
> <[email protected]>
> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory
>
> Hi Phil,
>
> thanks for taking a look at this. Indeed, the cmpxchg will have release
> semantics, but implicit barriers do not provide the same ordering guarantees
> as DMB ISH, i.e., atomic_thread_fence(__ATOMIC_ACQ_REL).
>
> Let me try to explain the scenario.
> Here is a pseudo-ARM assembly with the instructions involved:
>
>   STR me->locked = 1    // 1: initialize the node
>   LDAXR pred = tail     // 2: cmpxchg
>   STLXR tail = me       // 3: cmpxchg
>   STR pred->next = me   // 4: the problematic store
>
> Now, ARM allows instructions 1 and 2 to be reordered, and also instructions
> 3 and 4 to be reordered:
>
>   LDAXR pred = tail     // 2: cmpxchg (acquires can be moved up)
>   STR me->locked = 1    // 1: initialize the node
>   STR pred->next = me   // 4: the problematic store
>   STLXR tail = me       // 3: cmpxchg (releases can be moved down)
>
> And since stores 1 and 4 can be reordered too, the following execution is
> possible:
>
>   LDAXR pred = tail     // 2: cmpxchg
>   STR pred->next = me   // 4: the problematic store
>   STR me->locked = 1    // 1: initialize the node
>   STLXR tail = me       // 3: cmpxchg
>
> In this subtle situation, the locking thread can overwrite the store to
> next->locked of the unlocking thread (if it occurs between 4 and 1), and
> subsequently hang waiting for me->locked to become 0.
>
> We couldn't reproduce this with our available hardware, but we
> "reproduced" it with the RMEM model checker, which precisely models the
> ARM memory model (https://github.com/rems-project/rmem). The GenMC
> model checker (https://github.com/MPI-SWS/genmc) also finds the issue,
> but its intermediate memory model is slightly more general than the ARM
> model.
>
> The release attached to store (4) prevents (1) from reordering with it.
>
> Please, let me know if that makes sense or if I am missing something.
>
> Thanks,
> -Diogo
>
> -----Original Message-----
> From: Phil Yang [mailto:[email protected]]
> Sent: Wednesday, August 26, 2020 12:17 PM
> To: Diogo Behrens <[email protected]>
> Cc: [email protected]; nd <[email protected]>; Honnappa Nagarahalli
> <[email protected]>; nd <[email protected]>
> Subject: RE: [PATCH] librte_eal: fix mcslock hang on weak memory
>
> Diogo Behrens <[email protected]> writes:
>
> > Subject: [PATCH] librte_eal: fix mcslock hang on weak memory
> >
> > The initialization me->locked=1 in lock() must happen before
> > next->locked=0 in unlock(), otherwise a thread may hang forever,
> > waiting for me->locked to become 0. On weak memory systems (such as
> > ARMv8), the current implementation allows me->locked=1 to be reordered
> > with announcing the node (pred->next=me) and, consequently, to be
> > reordered with next->locked=0 in unlock().
> >
> > This fix adds a release barrier to pred->next=me, forcing
> > me->locked=1 to happen before this operation.
> >
> > Signed-off-by: Diogo Behrens <[email protected]>
> > ---
> >  lib/librte_eal/include/generic/rte_mcslock.h | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_eal/include/generic/rte_mcslock.h
> > b/lib/librte_eal/include/generic/rte_mcslock.h
> > index 2bef28351..ce553f547 100644
> > --- a/lib/librte_eal/include/generic/rte_mcslock.h
> > +++ b/lib/librte_eal/include/generic/rte_mcslock.h
> > @@ -68,7 +68,14 @@ rte_mcslock_lock(rte_mcslock_t **msl, rte_mcslock_t *me)
> >  		 */
> >  		return;
> >  	}
> > -	__atomic_store_n(&prev->next, me, __ATOMIC_RELAXED);
> > +	/* The store to me->next above should also complete before the node is
> > +	 * visible to predecessor thread releasing the lock. Hence, the store
> > +	 * prev->next also requires release semantics.
> > +	 * Note that, for example,
> > +	 * on ARM, the release semantics in the exchange operation is not as
> > +	 * strong as a release fence and is not sufficient to enforce the
> > +	 * desired order here.
>
> Hi Diogo,
>
> I didn't quite understand why the exchange operation with ACQ_REL
> memory ordering is not sufficient.
> It emits 'stlxr' on armv8.0-a and 'swpal' on armv8.1-a (with LSE support).
> Both of these instructions carry release semantics.
>
> Please check the URL for the details:
> https://gcc.godbolt.org/z/Mc4373
>
> BTW, if you captured a deadlock issue on your testbed, could you please
> share your test case here?
>
> Thanks,
> Phil
>
> > +	 */
> > +	__atomic_store_n(&prev->next, me, __ATOMIC_RELEASE);
> >
> >  	/* The while-load of me->locked should not move above the previous
> >  	 * store to prev->next. Otherwise it will cause a deadlock. Need a
> > --
> > 2.28.0
> >

