https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #22 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #12)
> There are two problems here, one of which concerns me more in the real
> world, and both of which rely on races if you are in the C/C++11 model -
> there isn't much I can do about that as the __sync primitives are legacy and
> give stronger guarantees about "visible memory references" (which includes
> ALL global data).
> 
> Consider:
> 
> int x = 0;
> int y = 0;
> int z = 0;
> 
> void foo (void)
> {
>   x = 1;
>   __sync_fetch_and_add (&y, 1);
>   z = 1;
> }
> 
> My reading of what a full barrier implies would suggest that we should never
> have a modification order which is any of:
> 
>   z = 1, x = 1, y = 0+1
>   z = 1, y = 0+1, x = 1
>   x = 1, z = 1, y = 0+1

At least in C11/C++11, modification orders are per memory location.  If we
want something that orders accesses to x, y, and z relative to each other, we
need to model the observer side too (which the C11 model does, for example).
I mention this because, at the very least, compiler reordering can happen on
the observer side as well; if a programmer needed to use, for example, __sync
builtins to order the observers, then we "just" have to consider combinations
of those on the modification side and the observation side.  (And because I'm
not sure what you mean by "modification order" here, I can't comment
further.)
> GCC 5.0-ish (recent trunk) will emit:
> 
> foo:
>         adrp    x3, .LANCHOR0
>         mov     w1, 1
>         add     x0, x3, :lo12:.LANCHOR0
>         add     x2, x0, 4
>         str     w1, [x3, #:lo12:.LANCHOR0]
> .L2:
>         ldaxr   w3, [x2]
>         add     w3, w3, w1
>         stlxr   w4, w3, [x2]
>         cbnz    w4, .L2
>         str     w1, [x0, 8]
>         ret
> 
> Dropping some of the context and switching to pseudo-asm we have:
> 
>         str     1, [x]
> .L2:
>         ldaxr   tmp, [y]
>         add     tmp, tmp, 1
>         stlxr   flag, tmp, [y]
>         cbnz    flag, .L2
>         str     1, [z]
> 
> As far as I understand it, the memory ordering guarantees of the
> half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
> an execution which looks like:
> 
>         ldaxr   tmp, [y]
>         str     1, [z]
>         str     1, [x]
>         add     tmp, tmp, 1
>         stlxr   flag, tmp, [y]
>         cbnz    flag, .L2
> 
> which gives one of the modification orders we wanted to forbid above.
> Similar reordering can give the other undesirable modification orders.

(I'm not familiar with the ARM model, so please bear with me.)  Is this a
reordering the hardware can do, or are you concerned about the compiler
backend?  Would the reordered str always become visible atomically with the
stlxr?  Would it become visible even if the LL/SC fails, thus potentially
storing more than once to z?  That would be surprising, in that the ldaxr
would then have to be reloaded too, potentially after the store to z, I
believe (at least for a strong CAS) -- which would break the acquire MO on
the load.

> As mentioned, this reordering is permitted under C11, as the stores to x and
> z are racy - this permits the CPPMEM/Cambridge documentation of what an
> SEQ_CST must do.

That's true in this example, but at the instruction level (assuming HW
reordering is what you're concerned about), an atomic relaxed-MO load isn't
distinguishable from a normal memory access, right?  So this isn't strictly
about DRF, but about the difference between C11 seq_cst fences and seq_cst
RMW operations.
> However, my feeling is that it is at the very least
> *surprising* for the __sync primitives to allow this ordering, and more
> likely points to a break in the AArch64 implementation.

With the caveat that, given that __sync isn't documented in great detail, a
lot of interpretations may arise in practice -- so there might be a few
surprises for some people :)

> Fixing this requires a hard memory barrier - but I really don't want to see
> us penalise C11 SEQ_CST to get the right behaviour out of
> __sync_fetch_and_add...

I agree.
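For reference, one shape the "hard memory barrier" option could take (a pseudo-asm sketch of one possibility for the __sync expansion only, not a proposed patch) is a trailing full barrier after the LL/SC loop, which would forbid the str to z from being hoisted above the stlxr while leaving the C11 seq_cst expansion untouched:

```
        str     1, [x]
.L2:
        ldaxr   tmp, [y]
        add     tmp, tmp, 1
        stlxr   flag, tmp, [y]
        cbnz    flag, .L2
        dmb     ish             ; full barrier: later accesses cannot move up
        str     1, [z]
```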