https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #5 from torvald at gcc dot gnu.org ---
(In reply to Matthew Wahab from comment #0)
> The __sync builtins are implemented by expanding them to the equivalent
> __atomic builtins, using MEMMODEL_SEQ_CST for those that are full barriers.
> This is too weak since __atomic operations with MEMMODEL_SEQ_CST still allow
> some data references to move across the operation (see
> https://gcc.gnu.org/ml/gcc/2014-02/msg00058.html)

I don't think that's a precise characterization.  The difference discussed in
that thread is that seq-cst fences have a stronger effect than seq-cst stores.
So this is not really a question of MEMMODEL_SEQ_CST as such, but of what you
apply it to.
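To illustrate the distinction, here is a minimal sketch using the C11 atomics
(function and variable names are only illustrative; the actual reordering
behaviour depends on the target's mapping):

#include <stdatomic.h>

atomic_int x, y;

int with_seqcst_store(void)
{
  /* A seq-cst store is a release store plus membership in the single total
     order of seq-cst operations; a later relaxed load of a different
     location is not necessarily ordered after it.  */
  atomic_store_explicit(&x, 1, memory_order_seq_cst);
  return atomic_load_explicit(&y, memory_order_relaxed);
}

int with_seqcst_fence(void)
{
  /* A seq-cst fence between the two accesses is stronger and is typically
     mapped to a full barrier (e.g., dmb on ARM).  */
  atomic_store_explicit(&x, 1, memory_order_relaxed);
  atomic_thread_fence(memory_order_seq_cst);
  return atomic_load_explicit(&y, memory_order_relaxed);
}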

> while a __sync full
> barrier should block all movement
> (https://gcc.gnu.org/onlinedocs/gcc/_005f_005fsync-Builtins.
> html#_005f_005fsync-Builtins).

But in your code example below, you didn't use __sync_synchronize.  The other
__sync operations do not document that they are a full barrier, AFAICS.  (I'm
not sure whether there is a wide assumption that they do.)
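For comparison, here is a sketch of how the example below would look with an
explicit full barrier (whether that is what you intend, and whether the
aarch64 back end then emits a dmb for it, is exactly what would need
checking):

int foo = 0;
int bar = 0;

int T1_explicit_barrier(void)
{
  int x = __sync_fetch_and_add(&foo, 1);
  __sync_synchronize();   /* documented as a full memory barrier */
  return bar;
}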

> This is problem for Aarch64 where data references after the barrier could be
> speculated ahead of it.
> 
> For example, compiling with aarch64-none-linux-gnu at -O2,
> -----
> int foo = 0;
> int bar = 0;
> 
> int T1(void)
> {
>   int x = __sync_fetch_and_add(&foo, 1);
>   return bar;
> }

This test case is problematic in that it has a data race on bar unless bar is
never modified by another thread, in which case it would be fine to load bar
before the fetch_add.  It also loads bar irrespective of which value the
fetch_add actually returns.  I'm aware that the __sync builtins offer no
MEMMODEL_RELAXED, and so lots of code simply has a plain memory access.  

Nonetheless, I think you should also look at a test case that is properly
synchronized, data-race-free C11 code using the builtins, and see whether that
is compiled differently (ie, either use a relaxed-MO load for bar or make the
load of bar conditional on the outcome of the fetch_add), as sketched below.
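For instance (just a sketch of the two variants I mean, written with the
__atomic builtins; the concrete memory orders and the value tested in the
condition are only illustrative):

int foo = 0;
int bar = 0;

/* Variant 1: give the load of bar an explicit (relaxed) memory order so the
   access is not a plain, potentially racy load.  */
int T1_relaxed(void)
{
  int x = __atomic_fetch_add(&foo, 1, __ATOMIC_SEQ_CST);
  (void) x;
  return __atomic_load_n(&bar, __ATOMIC_RELAXED);
}

/* Variant 2: load bar only for a particular value returned by the fetch_add,
   so the load actually depends on its outcome.  */
int T1_conditional(void)
{
  if (__atomic_fetch_add(&foo, 1, __ATOMIC_SEQ_CST) == 0)
    return __atomic_load_n(&bar, __ATOMIC_RELAXED);
  return -1;
}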

To be on the safe side, you can also use the cppmem tool to verify what the C11
model requires the compiled code to guarantee.  Here's what you probably want
to achieve:

int main() {
  atomic_int foo = 0; 
  atomic_int bar = 0; 
  {{{ {
        r1=cas_strong(foo, 0, 1);
        r2=bar.load(memory_order_relaxed);
      }
  ||| {
        bar.store(42, memory_order_relaxed);
        foo.store(1, memory_order_release);
      }
  }}};
  return 0;
}

This tries to test whether the "ordered" stores to bar and foo are observed in
that order by the first thread (it uses a CAS instead of fetch_add).  For me
this gives 3 consistent executions; in the one where the CAS succeeds (ie,
behaving like fetch_add would), the load of bar returns 42.

> ----
> produces
> ----
> T1:
>       adrp    x0, .LANCHOR0
>       add     x0, x0, :lo12:.LANCHOR0
> .L2:
>       ldaxr   w1, [x0]       ; load-acquire (__sync_fetch_and_add)
>       add     w1, w1, 1
>       stlxr   w2, w1, [x0]   ; store-release  (__sync_fetch_and_add)
>       cbnz    w2, .L2
>       ldr     w0, [x0, 4]    ; load (return bar)
>       ret
> ----
> With this code, the load can be executed ahead of the store-release which
> ends the __sync_fetch_and_add. A correct implementation should emit a dmb
> instruction after the cbnz.

I'm not sufficiently familiar with ARM to assess whether that's correct.  Is
this ARMv8?  It seems to be consistent with the mapping to machine code that
GCC has followed (http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), which
does not include a DMB.  However, it also shows no difference between an
acq_rel CAS and a seq_cst CAS.  Could it be that the CAS sequence itself (ie,
the machine code for it) prevents the reordering you are concerned about at the
HW level?
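One way to check would be to compare the code generated for the __sync builtin
with the code for the corresponding explicit seq-cst __atomic builtin, e.g.
(just a sketch; compile for aarch64 at -O2 and diff the two sequences):

int v;

/* Currently expanded via MEMMODEL_SEQ_CST, which is what this PR argues is
   too weak for the documented __sync semantics.  */
int add_sync(void)   { return __sync_fetch_and_add(&v, 1); }

/* The explicit seq-cst __atomic form, for comparison.  */
int add_atomic(void) { return __atomic_fetch_add(&v, 1, __ATOMIC_SEQ_CST); }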
