On Wed, Oct 31, 2018 at 04:38:53PM +0000, Richard Henderson wrote:
> On 10/31/18 3:04 PM, Will Deacon wrote:
> > The example test above uses relaxed atomics in conjunction with an acquire
> > fence, so I don't think we can actually use ST<op> at all without a change
> > to the language specification. I previously allocated P0861 for this purpose
> > but never got a chance to write it up...
> >
> > Perhaps the issue is a bit clearer with an additional thread (not often I
> > say that!):
> >
> >
> > P0 (atomic_int* y,atomic_int* x) {
> >   atomic_store_explicit(x,1,memory_order_relaxed);
> >   atomic_thread_fence(memory_order_release);
> >   atomic_store_explicit(y,1,memory_order_relaxed);
> > }
> >
> > P1 (atomic_int* y,atomic_int* x) {
> >   atomic_fetch_add_explicit(y,1,memory_order_relaxed);  // STADD
> >   atomic_thread_fence(memory_order_acquire);
> >   int r0 = atomic_load_explicit(x,memory_order_relaxed);
> > }
> >
> > P2 (atomic_int* y) {
> >   int r1 = atomic_load_explicit(y,memory_order_relaxed);
> > }
> >
> >
> > My understanding is that it is forbidden for r0 == 0 and r1 == 2 after
> > this test has executed. However, if the relaxed add in P1 compiles to
> > STADD and the subsequent acquire fence is compiled as DMB LD, then we
> > don't have any ordering guarantees in P1 and the forbidden result could
> > be observed.
>
> I suppose I don't understand exactly what you're saying.

Apologies, I'm probably not explaining things very well. I'm trying to avoid
getting into the C11 memory model relations if I can help it, hence the
example.

> I can see that, yes, if you split the fetch-add from the acquire in P1 you get
> the incorrect results you describe. But isn't that a bug in the test itself?

Per the C11 memory model, the test above is well-defined and if r1 == 2
then it is required that r0 == 1. With your proposal, this is not guaranteed
for AArch64, and it would be possible to end up with r1 == 2 and r0 == 0.

> Why would not the only correct version have
>
> P1 (atomic_int* y, atomic_int* x) {
>   atomic_fetch_add_explicit(y, 1, memory_order_acquire);
>   int r0 = atomic_load_explicit(x, memory_order_relaxed);
> }
>
> at which point we won't use STADD for the fetch-add, but LDADDA.

That would indeed work correctly, but the problem is that the C11 memory
model doesn't rule out the previous test as something which isn't portable.

> If the problem is more fundamental than this, would you have another go at
> explaining? In particular, I don't see the difference between
>
>   ldadd val, scratch, [base]
> vs
>   stadd val, [base]
>
> and
>
>   ldaddl val, scratch, [base]
> vs
>   staddl val, [base]
>
> where both pairs of instructions have the same memory ordering semantics.
> Currently we are always producing the ld version of each pair.

Aha, maybe this is the problem. An acquire fence on AArch64 is implemented
using a DMB LD instruction, which orders prior reads against subsequent
reads and writes. However, the architecture says:

  | The ST<OP> instructions, and LD<OP> instructions where the destination
  | register is WZR or XZR, are not regarded as doing a read for the purpose
  | of a DMB LD barrier.

and so therefore an ST atomic is not affected by a subsequent acquire fence,
whereas an LD atomic is.

Does that help at all?

Will