https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #23 from James Greenhalgh <jgreenhalgh at gcc dot gnu.org> --- (In reply to torvald from comment #22) > (In reply to James Greenhalgh from comment #12) > > There are two problems here, one of which concerns me more in the real > > world, and both of which rely on races if you are in the C/C++11 model - > > there isn't much I can do about that as the __sync primitives are legacy and > > give stronger guarantees about "visible memory references" (which includes > > ALL global data). > > > > Consider: > > > > int x = 0; > > int y = 0; > > int z = 0; > > > > void foo (void) > > { > > x = 1; > > __sync_fetch_and_add (&y, 1); > > z = 1; > > } > > > > My reading of what a full barrier implies would suggest that we should never > > have a modification order which is any of: > > > > z = 1, x = 1, y = 0+1 > > z = 1, y = 0+1, x = 1 > > x = 1, z = 1, y = 0+1 > > At least in C11/C++11, modification orders are per memory location. If we > want to have something that orders accesses to x, y, and z, we need to model > the observer side too (which the C11 model does, for example). I'm > mentioning that because at the very least, we have compiler reordering > happening at the observer side; if a programmer would need to use, for > example, __sync builtins to order the observers, then this means we "just" > have to consider combinations of those on the modification and the > observation side. > > (And because I'm not sure what you think "modification order" is, I can't > comment further.) Yes. I'm sorry, I was imprecise in my language as I couldn't quite find the right phrase in specifications - what I am trying to get at is: the order in which observers in the system are able to observe the writes to x, y, z. Later in this post I'll call this "observation order", until I can find a better terminology. I believe the __sync builtins should guarantee that an observer could trust: assert (z == 1 && y == 1 && x ==1); to hold. That is, the write to z is observed if and only if the writes to y and x have also been observed. > > GCC 5.0-ish (recent trunk) will emit: > > > > foo: > > adrp x3, .LANCHOR0 > > mov w1, 1 > > add x0, x3, :lo12:.LANCHOR0 > > add x2, x0, 4 > > str w1, [x3, #:lo12:.LANCHOR0] > > .L2: > > ldaxr w3, [x2] > > add w3, w3, w1 > > stlxr w4, w3, [x2] > > cbnz w4, .L2 > > str w1, [x0, 8] > > ret > > > > Dropping some of the context and switching to pseudo-asm we have: > > > > str 1, [x] > > .L2: > > ldaxr tmp, [y] > > add tmp, tmp, 1 > > stlxr flag, tmp, [y] > > cbnz flag, .L2 > > str 1, [z] > > > > As far as I understand it, the memory ordering guarantees of the > > half-barrier LDAXR/STLXR instructions do not prevent this being reordered to > > an execution which looks like: > > > > ldaxr tmp, [y] > > str 1, [z] > > str 1, [x] > > add tmp, tmp, 1 > > stlxr flag, tmp, [y] > > cbnz flag, .L2 > > > > which gives one of the modification orders we wanted to forbid above. > > Similar reordering can give the other undesirable modification orders. > > (I'm not familiar with the ARM model, so please bear with me.) > > This is reordering the HW can do? Or are you concerned about the compiler > backend? The architectural description of the memory model - this is reordering the HW is permitted to do. > Would the reordered str always become visible atomically with the stlxr? > Would it become visible even if the LLSC fails, thus potentially storing > more than once to z? This would be surprising in that the ldaxr would then > have to be reloaded too potentially after the store to z, I believe (at > least for a strong CAS) -- which would break the acquire MO on the load. The STR would not become visible until after the stlxr had resolved as successful, however, it can take a position of the observation order ahead of the stlxr, such the the ordering of writes seen by observers would not be program order. > > As mentioned, this reordering is permitted under C11, as the stores to x and > > z are racy - this permits the CPPMEM/Cambridge documentation of what an > > SEQ_CST must do. > > That's true in this example, but at the instruction level (assuming HW > reordering is what you're concerned about), a atomic relaxed-MO load isn't > distinguishable from a normal memory access, right? So, it's not DRF what > this is strictly about, but the difference between C11 seq_cst fences and > seq_cst RMW ops. Absolutely! I was trying to preempt "That program is racy" responses :). > > However, my feeling is that it is at the very least > > *surprising* for the __sync primitives to allow this ordering, and more > > likely points to a break in the AArch64 implementation. > > With the caveat that given that __sync isn't documented in great detail, a > lot of interpretations might happen in practice, so there might be a few > surprises to some people :) :) Agreed. > > Fixing this requires a hard memory barrier - but I really don't want to see > > us penalise C11 SEQ_CST to get the right behaviour out of > > __sync_fetch_and_add... > > I agree.