On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, this reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > >
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > >
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > >
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > >
> > >  - when the last queued lock holder acquires the lock (uncontended),
> > >    we clear the tail and set the lock byte. By first setting the
> > >    pending bit this cmpxchg will fail and the later load must then
> > >    see the remaining tail.
> > >
> > > Another interesting scenario is where there are only 2 threads:
> > >
> > >     lock := (0,0,0)
> > >
> > >     CPU 0                     CPU 1
> > >
> > >     lock()                    lock()
> > >       trylock(-> 0,0,1)         trylock() /* fail */
> > >         return;                 xchg_relaxed(pending, 1) (-> 0,1,1)
> > >                                 mb()
> > >                                 val = smp_load_acquire(*lock);
> > >
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> >
> > If this were true, we would have a violation of "coherence":
>
> The thing is, this is mixed size, see:

The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).

>
>   https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
>
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
>
>     old = xchg(pending, 1);
>     val = smp_load_acquire(*lock);
>
> to be re-ordered like:
>
>     val = smp_load_acquire(*lock);
>     old = xchg(pending, 1);
>
> with the exception that it will fwd the pending byte into the later
> load, so we get:
>
>     val = (val & ~_Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
>
> for 'free'.
>
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

  Andrea

>
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.
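
For readers following the bit arithmetic only, here is a minimal
single-threaded sketch of how the xchg8 + load result is composed and why,
on a quiescent lock word, it matches what fetch_or would have returned. It
is a plain user-space model, not kernel code: the names model_fetch_or_pending()
and model_xchg8_then_load() are hypothetical, the Q_* constants assume the
layout with an 8-bit locked byte at bits 0-7 and an 8-bit pending byte at
bits 8-15, and it says nothing about the memory-ordering question debated
above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: (tail, pending, locked) = bits 16-31, 8-15, 0-7. */
#define Q_PENDING_OFFSET 8
#define Q_PENDING_MASK   (0xffu << Q_PENDING_OFFSET)
#define Q_PENDING_VAL    (1u << Q_PENDING_OFFSET)

/* Reference behaviour: set pending, return the previous whole word. */
uint32_t model_fetch_or_pending(uint32_t *word)
{
    uint32_t old = *word;

    *word |= Q_PENDING_VAL;
    return old;
}

/*
 * Composite behaviour, modelled sequentially: swap 1 into the pending
 * byte, then load the whole word and splice the old pending byte back in.
 */
uint32_t model_xchg8_then_load(uint32_t *word)
{
    /* "xchg8": fetch the old pending byte and store 1 in its place. */
    uint8_t old = (uint8_t)((*word & Q_PENDING_MASK) >> Q_PENDING_OFFSET);

    assert(old <= 1); /* the pending byte only ever holds 0 or 1 */
    *word = (*word & ~Q_PENDING_MASK) | Q_PENDING_VAL;

    /* Whole-word load; it observes our own pending store. */
    uint32_t val = *word;

    /* Drop the freshly set pending byte, keep the one we swapped out. */
    return (val & ~Q_PENDING_MASK) | ((uint32_t)old << Q_PENDING_OFFSET);
}
```

Starting from (0,0,1), both helpers return 0x001 and leave the word at
0x101, i.e. (0,1,1). The interesting cases in the thread are precisely the
ones this model cannot express, where another CPU changes tail or locked
between the two steps.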