qspinlock: Optimize for x86

Andrea Parri Thu, 27 Sep 2018 00:48:10 -0700

On Thu, Sep 27, 2018 at 09:17:47AM +0200, Peter Zijlstra wrote:
> On Wed, Sep 26, 2018 at 10:52:08PM +0200, Andrea Parri wrote:
> > On Wed, Sep 26, 2018 at 01:01:20PM +0200, Peter Zijlstra wrote:
> > > On x86 we cannot do fetch_or with a single instruction and end up
> > > using a cmpxchg loop, which reduces determinism. Replace the fetch_or
> > > with a very tricky composite xchg8 + load.
> > > 
> > > The basic idea is that we use xchg8 to test-and-set the pending bit
> > > (when it is a byte) and then a load to fetch the whole word. Using
> > > two instructions of course opens a window we previously did not have.
> > > In particular the ordering between pending and tail is of interest,
> > > because that is where the split happens.
> > > 
> > > The claim is that if we order them, it all works out just fine. There
> > > are two specific cases where the pending,tail state changes:
> > > 
> > >  - when the 3rd lock(er) comes in and finds pending set, it'll queue
> > >    and set tail; since we set tail while pending is set, the ordering
> > >    of the split is not important (and not fundamentally different from
> > >    fetch_or). [*]
> > > 
> > >  - when the last queued lock holder acquires the lock (uncontended),
> > >    we clear the tail and set the lock byte. By first setting the
> > >    pending bit this cmpxchg will fail and the later load must then
> > >    see the remaining tail.
> > > 
> > > Another interesting scenario is where there are only 2 threads:
> > > 
> > >   lock := (0,0,0)
> > > 
> > >   CPU 0                   CPU 1
> > > 
> > >   lock()                  lock()
> > >     trylock(-> 0,0,1)       trylock() /* fail */
> > >       return;               xchg_relaxed(pending, 1) (-> 0,1,1)
> > >                             mb()
> > >                             val = smp_load_acquire(*lock);
> > > 
> > > Where, without the mb() the load would've been allowed to return 0 for
> > > the locked byte.
> > 
> > If this were true, we would have a violation of "coherence":
> 
> The thing is, this is mixed size, see:


The accesses to ->val are not, and those certainly have to meet the
"coherence" constraint (no matter the store to ->pending).


> 
>   https://www.cl.cam.ac.uk/~pes20/popl17/mixed-size.pdf
> 
> If I remember things correctly (I've not reread that paper recently) it
> is allowed for:
> 
>       old = xchg(pending,1);
>       val = smp_load_acquire(*lock);
> 
> to be re-ordered like:
> 
>       val = smp_load_acquire(*lock);
>       old = xchg(pending, 1);
> 
> with the exception that it will fwd the pending byte into the later
> load, so we get:
> 
>       val = (val & ~_Q_PENDING_MASK) | (old << _Q_PENDING_OFFSET);
> 
> for 'free'.
> 
> LKMM in particular does _NOT_ deal with mixed sized atomics _at_all_.

True, but it is nothing conceptually new to deal with: there're Cat
models that handle mixed-size accesses, just give it time.

  Andrea


> 
> With the addition of smp_mb__after_atomic(), we disallow the load to be
> done prior to the xchg(). It might still fwd the more recent pending
> byte from its store buffer, but at least the other bytes must not be
> earlier.
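
For readers following the thread, the two ways of setting the pending bit
discussed above can be sketched in userspace C. This is only an illustration
under stated assumptions, not the patch itself: `union qsl`, the `Q_PENDING_*`
macros and both function names are hypothetical stand-ins for the kernel's
struct qspinlock and `_Q_PENDING_*` definitions, the byte layout assumes a
little-endian machine, the GCC/Clang `__atomic` builtins replace the kernel
primitives, and the sequentially consistent fence merely stands in for
smp_mb__after_atomic().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct qspinlock (little-endian assumed):
 * byte 0 = locked, byte 1 = pending, bytes 2-3 = tail. */
union qsl {
	uint32_t val;
	struct {
		uint8_t  locked;
		uint8_t  pending;
		uint16_t tail;
	};
};

#define Q_PENDING_OFFSET 8
#define Q_PENDING_MASK   (0xffu << Q_PENDING_OFFSET)
#define Q_PENDING_VAL    (1u << Q_PENDING_OFFSET)

/* fetch_or emulated with a compare-and-swap loop: correct, but the
 * number of retries under contention is not bounded. */
static uint32_t fetch_set_pending_cas(union qsl *lock)
{
	uint32_t old = __atomic_load_n(&lock->val, __ATOMIC_RELAXED);

	while (!__atomic_compare_exchange_n(&lock->val, &old,
					    old | Q_PENDING_VAL, 0,
					    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		; /* a failed exchange refreshed 'old'; try again */
	return old;
}

/* Composite: byte-sized exchange, full fence, then a word-sized load. */
static uint32_t fetch_set_pending_xchg8(union qsl *lock)
{
	uint32_t old = __atomic_exchange_n(&lock->pending, 1, __ATOMIC_RELAXED);
	uint32_t val;

	/* Stand-in for smp_mb__after_atomic(): keep the load below from
	 * being satisfied before the exchange above. */
	__atomic_thread_fence(__ATOMIC_SEQ_CST);
	val = __atomic_load_n(&lock->val, __ATOMIC_ACQUIRE);

	/* Report the pending byte as it was before our exchange, and the
	 * remaining bytes as seen by the later load. */
	return (val & ~Q_PENDING_MASK) | (old << Q_PENDING_OFFSET);
}

/* Single-threaded sanity check: both variants agree on the old value
 * and on the resulting lock word. */
static void check_equivalence(void)
{
	union qsl a = { .val = 0x00030001u };	/* tail=3, pending=0, locked=1 */
	union qsl b = a;

	assert(fetch_set_pending_cas(&a) == fetch_set_pending_xchg8(&b));
	assert(a.val == b.val);
}
```

A single-threaded check like check_equivalence() only shows that the two
variants compute the same result in the absence of concurrency; it says
nothing about the mixed-size ordering question debated in this thread, which
needs a memory-model argument or litmus testing on real hardware.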

Re: [RFC][PATCH 3/3] locking/qspinlock: Optimize for x86
