https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697
--- Comment #5 from torvald at gcc dot gnu.org ---
(In reply to Matthew Wahab from comment #0)
> The __sync builtins are implemented by expanding them to the equivalent
> __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
> This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
> some data references to move across the operation (see
> https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)

I don't think that's a precise characterization.  The difference discussed in
that thread is that seq-cst fences have a stronger effect than seq-cst stores.
This is not really a question of MEMMODEL_SEQ_CST but of what you apply it to.

> while a __sync full
> barrier should block all movement
> (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.html#_005f_005fsync-Builtins).

But in your code example below, you didn't use __sync_synchronize.  The other
__sync operations do not document that they are full barriers, AFAICS.  (I'm
not sure whether there is a wide assumption that they are.)

> This is a problem for AArch64, where data references after the barrier could
> be speculated ahead of it.
>
> For example, compiling with aarch64-none-linux-gnu at -O2,
> -----
> int foo = 0;
> int bar = 0;
>
> int T1(void)
> {
>   int x = __sync_fetch_and_add(&foo, 1);
>   return bar;
> }

This test case is problematic in that it has a data race on bar, unless bar is
never modified by another thread, in which case it would be fine to load bar
before the fetch_add.  It also loads bar irrespective of which value the
fetch_add actually returns.

I'm aware that the __sync builtins offer no MEMMODEL_RELAXED, and so lots of
code simply has a plain memory access.  Nonetheless, I think you should also
look at a test case that is properly synchronized, data-race-free C11 code
using the builtins, and see whether that is compiled differently (ie, either
use a relaxed-MO load for bar or make the load of bar conditional on the
outcome of the fetch_add).
To be on the safe side, you can also use the cppmem tool to verify what the
C11 model requires the compiled code to guarantee.  Here's what you probably
want to achieve:

int main() {
  atomic_int foo = 0;
  atomic_int bar = 0;
  {{{ {
        r1 = cas_strong(foo, 0, 1);
        r2 = bar.load(memory_order_relaxed);
      }
  ||| {
        bar.store(42, memory_order_relaxed);
        foo.store(1, memory_order_release);
      }
  }}};
  return 0;
}

This tries to test whether the "ordered" stores to bar and foo are observed in
this order by the first thread (it uses a CAS instead of a fetch_add).  This
gives me 3 consistent executions, of which one is the one in which the CAS
succeeds (ie, behaves like fetch_add would), and in that one the load of bar
returns a value of 42.

> ----
> produces
> ----
> T1:
>         adrp    x0, .LANCHOR0
>         add     x0, x0, :lo12:.LANCHOR0
> .L2:
>         ldaxr   w1, [x0]        ; load-acquire (__sync_fetch_and_add)
>         add     w1, w1, 1
>         stlxr   w2, w1, [x0]    ; store-release (__sync_fetch_and_add)
>         cbnz    w2, .L2
>         ldr     w0, [x0, 4]     ; load (return bar)
>         ret
> ----
> With this code, the load can be executed ahead of the store-release which
> ends the __sync_fetch_and_add.  A correct implementation should emit a dmb
> instruction after the cbnz.

I'm not sufficiently familiar with ARM to assess whether that's correct.  Is
this ARMv8?  It seems to be consistent with the mapping to machine code that
GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html),
which does not include a DMB.  However, it also shows no difference between
an acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself
(ie, the machine code for it) prevents the reordering you are concerned about
at the HW level?