On Thu, May 2, 2019 at 1:27 AM Jan Glauber <jglau...@marvell.com> wrote:
>
> I'll see how x86 runs the same testcase, I thought that playing
> cacheline ping-pong is not the optimal use case for any CPU.
Oh, ping-pong is always bad. But from past experience, x86 tends to be
able to do a tight cmpxchg loop without failing more than once or
twice, which is all you need for things like this.

And it's "easy" to do in hardware on a CPU: all you need to do is
guarantee that when you have a cmpxchg loop, the cacheline is sticky
enough that it stays around at the local CPU for the duration of one
loop entry (ie from one cmpxchg to the next).

Obviously you can do that wrong too, and make cachelines *too* sticky,
and then you get fairness issues.

But it really sounds like what happens in your ThunderX2 case is that
the different CPUs steal each other's cachelines so quickly that even
when you get the cacheline, you don't then get to update it.

Does ThunderX2 do LSE atomics? Are the acquire/release versions really
slow, perhaps, and more or less serializing (maybe it does the
"release" logic even when the store _fails_?), so that doing two
back-to-back cmpxchg's ends up taking the core a "long" time, and the
cache subsystem then steals the line easily in between cmpxchg's in a
loop? Does the L1 cache maybe have no way to keep a line around from
one cmpxchg to the next?

This is (one example of) where having a CPU and an interconnect that
work together matters. And yes, it probably needs a few generations of
hardware tuning where people see problems and fix them.

               Linus