Back when Will did his qspinlock determinism patches, we were left with one cmpxchg loop on x86 due to the use of atomic_fetch_or(). Will proposed a nifty trick:
  http://lkml.kernel.org/r/20180409145409.ga9...@arm.com

But at the time we didn't pursue it. This series implements that trick and argues for its correctness.

In particular, it places an smp_mb__after_atomic() between the two operations, which forces the load to come after the store (free on x86 anyway, where any atomic RMW already implies a full barrier). This ordering ensures that a concurrent unlock cannot trigger the uncontended handoff, and that if the xchg() happens after a (successful) trylock, we must observe that LOCKED bit.
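
For illustration, a minimal sketch of the split operation (not the actual patch: the helper name is made up, the field and macro names are the kernel's qspinlock ones for _Q_PENDING_BITS == 8, and a one-byte xchg() is assumed to be available, as it is on x86):

/*
 * Sketch only: split the old atomic_fetch_or(_Q_PENDING_VAL, &lock->val)
 * into an xchg() on the pending byte plus a separate read of the lock
 * word.  Assumes the qspinlock definitions from kernel/locking and an
 * architecture that supports a one-byte xchg().
 */
static __always_inline u32 fetch_set_pending(struct qspinlock *lock)
{
	u32 old;

	/* RMW: set PENDING, returning its old value (0 or 1). */
	old = (u32)xchg_relaxed(&lock->pending, 1) << _Q_PENDING_OFFSET;

	/*
	 * Order the load below after the store above; a no-op on x86,
	 * where the xchg() already implies a full barrier.
	 */
	smp_mb__after_atomic();

	/* Pick up the rest of the word: the tail and the LOCKED byte. */
	old |= atomic_read(&lock->val) & ~_Q_PENDING_MASK;

	return old;
}

The barrier is what the argument hinges on: without it, the load could be satisfied before the store and miss a LOCKED bit set by a concurrent (successful) trylock.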