https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65697

--- Comment #22 from torvald at gcc dot gnu.org ---
(In reply to James Greenhalgh from comment #12)
> There are two problems here, one of which concerns me more in the real
> world, and both of which rely on races if you are in the C/C++11 model -
> there isn't much I can do about that as the __sync primitives are legacy and
> give stronger guarantees about "visible memory references" (which includes
> ALL global data).
> 
> Consider:
> 
> int x = 0;
> int y = 0;
> int z = 0;
> 
> void foo (void)
> {
>   x = 1;
>   __sync_fetch_and_add (&y, 1);
>   z = 1;
> }
> 
> My reading of what a full barrier implies would suggest that we should never
> have a modification order which is any of:
> 
>   z = 1, x = 1, y = 0+1
>   z = 1, y = 0+1, x = 1
>   x = 1, z = 1, y = 0+1

At least in C11/C++11, modification orders are per memory location.  If we want
something that orders accesses to x, y, and z, we need to model the observer
side too (which the C11 model does, for example).  I mention this because, at
the very least, compiler reordering can also happen on the observer side; if a
programmer needed to use, for example, __sync builtins to order the observers,
then we would "just" have to consider the combinations of those on the
modification side and the observation side.

(And because I'm not sure what you think "modification order" is, I can't
comment further.)
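
Just to make the observer side concrete (this is my own illustration, not
something from the example above; the fence and relaxed loads are an assumption
about how one would order the observation side, written with the __atomic
builtins):

extern int x, y, z;

int observer (void)
{
  /* Observe z first, then x.  Without the fence (or acquire loads), the
     compiler and the HW may reorder these loads as well.  */
  int z1 = __atomic_load_n (&z, __ATOMIC_RELAXED);
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
  int x1 = __atomic_load_n (&x, __ATOMIC_RELAXED);
  /* Seeing z1 == 1 && x1 == 0 would witness the "z = 1 before x = 1"
     ordering that the example wants to forbid.  */
  return z1 && !x1;
}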

> GCC 5.0-ish (recent trunk) will emit:
> 
> foo:
>       adrp    x3, .LANCHOR0
>       mov     w1, 1
>       add     x0, x3, :lo12:.LANCHOR0
>       add     x2, x0, 4
>       str     w1, [x3, #:lo12:.LANCHOR0]
> .L2:
>       ldaxr   w3, [x2]
>       add     w3, w3, w1
>       stlxr   w4, w3, [x2]
>       cbnz    w4, .L2
>       str     w1, [x0, 8]
>       ret
> 
> Dropping some of the context and switching to pseudo-asm we have:
> 
>       str     1, [x]
> .L2:
>       ldaxr   tmp, [y]
>       add     tmp, tmp, 1
>       stlxr   flag, tmp, [y]
>       cbnz    flag, .L2
>       str     1, [z]
> 
> As far as I understand it, the memory ordering guarantees of the
> half-barrier LDAXR/STLXR instructions do not prevent this being reordered to
> an execution which looks like:
> 
>       ldaxr   tmp, [y]
>       str     1, [z]
>       str     1, [x]
>       add     tmp, tmp, 1
>       stlxr   flag, tmp, [y]
>       cbnz    flag, .L2
>    
> which gives one of the modification orders we wanted to forbid above.
> Similar reordering can give the other undesirable modification orders.

(I'm not familiar with the ARM model, so please bear with me.)

Is this reordering that the HW can do, or are you concerned about the compiler
backend?
Would the reordered str always become visible atomically with the stlxr?
Would it become visible even if the LL/SC fails, thus potentially storing more
than once to z?  That would be surprising, because the ldaxr would then
potentially have to be reloaded after the store to z as well, I believe (at
least for a strong CAS) -- which would break the acquire MO on the load.

> As mentioned, this reordering is permitted under C11, as the stores to x and
> z are racy - this is consistent with the CPPMEM/Cambridge documentation of
> what an SEQ_CST must do.

That's true in this example, but at the instruction level (assuming HW
reordering is what you're concerned about), an atomic relaxed-MO load isn't
distinguishable from a normal memory access, right?  So this isn't strictly
about DRF, but about the difference between C11 seq_cst fences and seq_cst
RMW ops.
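
For instance, the two candidate semantics could be written like this (just a
sketch using the __atomic builtins; whether the fence-based form is what
__sync is supposed to mean is exactly the open question, and the function
names are mine):

extern int x, y, z;

void add_with_seq_cst_rmw (void)
{
  x = 1;
  __atomic_fetch_add (&y, 1, __ATOMIC_SEQ_CST);  /* seq_cst RMW only */
  z = 1;
}

void add_with_seq_cst_fences (void)
{
  x = 1;
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
  __atomic_fetch_add (&y, 1, __ATOMIC_RELAXED);
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
  z = 1;
}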

> However, my feeling is that it is at the very least
> *surprising* for the __sync primitives to allow this ordering, and more
> likely points to a break in the AArch64 implementation.

With the caveat that, since __sync isn't documented in great detail, many
interpretations are possible in practice -- so there might be a few surprises
for some people :)

> Fixing this requires a hard memory barrier - but I really don't want to see
> us penalise C11 SEQ_CST to get the right behaviour out of
> __sync_fetch_and_add...

I agree.
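
One way to make that split concrete (purely illustrative, and only a guess at
one possible approach; these wrappers are mine, not GCC's lowering): if __sync
really does require a full barrier here, it could be strengthened separately,
leaving the C11 path unchanged:

static inline int
sync_style_fetch_add (int *p, int v)
{
  int old = __atomic_fetch_add (p, v, __ATOMIC_SEQ_CST);
  /* Extra full barrier for the legacy "all visible memory references"
     guarantee; this is the part that would not be added for C11 seq_cst.  */
  __atomic_thread_fence (__ATOMIC_SEQ_CST);
  return old;
}

static inline int
c11_style_fetch_add (int *p, int v)
{
  return __atomic_fetch_add (p, v, __ATOMIC_SEQ_CST);
}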
