On Fri, 2017-02-03 at 13:44 +0000, Ramana Radhakrishnan wrote:
> __atomic_load on ARM appears to be ok as well
> 
> except for
> 
> __atomic_load_di which should really be the ldrexd / strexd loop, but we
> could ameliorate that similarly to your option 3b.

This uses just ldrexd now, and thus is not guaranteed to be atomic?

> On AArch64
> 
> * <16 byte loads have always been fine. The architecture allows single 
> copy atomic loads using single load instructions for all other sizes and 
> memory models, so we are fine there.
> 
> * we have gone through the libatomic locks from day one of the port for 
> 16 byte loads.  This has been a bit of a bugbear for a number of users 
> within ARM who would really like to get performance without heavyweight 
> locks for 16 byte atomic ops.

Would it be acceptable for those users to have loads that perform like
CAS loops, especially under contention?  Or are these users more
concerned about aarch64 not offering a true atomic 16-byte load?
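
For reference, such a load boils down to roughly the following at the
builtin level (just a sketch; the wrapper name is made up, and whether
the CAS is inlined or becomes a libatomic call depends on the target):

  __int128
  load_16 (__int128 *p)
  {
    __int128 expected = 0;
    /* If *p != 0 the CAS fails but atomically copies the current value
       into EXPECTED; if *p == 0 it stores 0 back, which leaves the
       value unchanged but is still a write to the memory.  */
    __atomic_compare_exchange_n (p, &expected, 0, 0 /* weak=false */,
                                 __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
  }

Every reader then needs write access to the cache line, so concurrent
loads serialize under contention, and the store is also what causes the
rodata / volatile problem discussed further down.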

> We could never switch this around on AArch64 to using the loop (or CAS) 
> by default as this would be the reverse issue (i.e. old compilers 
> calling libatomic locks and new code using the inlined instruction 
> sequence).

I think there's an implicit assumption that whenever one uses code
generated with a particular compiler, the libatomic version being used
at runtime is at least as recent as that particular compiler.  In a mix
of old and new code, this would mean that the new libatomic has to be used.
However, AFAIK we haven't specified that explicitly yet.  (Richard
Henderson said we'd make similar implicit assumptions for libgcc, if I
understood him correctly.)

If that assumption holds, then switching from locks in libatomic to
inlined instructions is possible, provided that libatomic switches too.

> 
> My interest in this patch was piqued because if we were to switch 
> aarch64 to not use the locks in libatomic and go to a CAS or the loop 
> sequence I showed in my reply to Jakub, I don't believe we have an ABI 
> issue as I would expect there to be a single copy of libatomic on the 
> system and everyone would simply be using that.

Right, and it would have to be at least as recent as the most recent
compiler being used (e.g., the system compiler on that system).

> However we would end up with the situation of generating stores for 
> __atomic_loads as you described above.
> 
> We could still in theory do that and gain the same bug for rodata and 
> volatile atomic pointers on AArch64 - I don't see a way of avoiding that 
> on aarch32 as we have a similar ABI issue to you.

Agreed.
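
Just to spell out the rodata case for the archives, a sketch (the
object name is made up): if the load is expanded to a CAS or LL/SC
loop, code like the following faults because the "load" performs a
store into a read-only section:

  static const __int128 ro_value = 42;   /* typically placed in .rodata */

  __int128
  read_ro (void)
  {
    /* If this expands to a CAS or LL/SC loop instead of a genuine
       atomic load, it performs a store to ro_value and the process
       faults, even though the source only reads.  */
    return __atomic_load_n (&ro_value, __ATOMIC_SEQ_CST);
  }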

> 
> > The patch introduces a can_atomic_load_p, which checks that an entry
> > exists in optabs and the machine descriptions for an atomic load of the
> > particular mode.
> >
> > IIRC, arm and aarch64 had atomic load patterns defined whenever they had
> > a CAS defined.  It would be nice if you could double-check that.  If
> > that's the case, nothing should change with my patch because
> > can_atomic_load_p would always return true if a CAS could be issued.
> > If that's not the case, arm/aarch64 could be in the same boat as x86_64
> > with cmpxchg16b; then, using Option 3b might be a useful (intermediary)
> > solution for ARM as well (OTOH, compatibility with existing code and
> > __sync builtins might perhaps not be as much of a factor as it might be
> > on x86_64).
> 
> On AArch64 IIRC only those instructions that are single copy atomic as 
> per the architecture are allowed to be atomic loads. Thus we do not 
> generate such loops anywhere.

I think the condition we'd have to check is rather that whenever an
atomic CAS is available, an atomic load is also available.  That is
not quite the same as having only truly atomic loads available,
because a CAS could still be provided even when no matching atomic
load is.
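
Expressed as code, the per-mode property to verify is roughly the
following (a sketch only; atomics_consistent_p is a made-up name, and
the CAS predicate is quoted from memory and may not match the sources
exactly, whereas can_atomic_load_p is the new function from the patch):

  /* A mode's atomics are only consistent for inlining purposes if a
     genuine atomic load exists whenever a CAS does, so that a plain
     __atomic_load never has to fall back to a store-performing
     sequence.  */
  static bool
  atomics_consistent_p (machine_mode mode)
  {
    return !can_compare_and_swap_p (mode, /* allow_libcall */ false)
           || can_atomic_load_p (mode);
  }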
