> On 27.09.2018 at 15:07, Ramana Radhakrishnan
> <ramana.radhakrish...@arm.com> wrote:
>
>> On 26/09/2018 06:03, rth7...@gmail.com wrote:
>> From: Richard Henderson <richard.hender...@linaro.org>
>> ARMv8.1 adds a (mandatory) Atomics extension, also known as the
>> Large System Extension. Deploying this extension at the OS level
>> has proved challenging.
>> The following is the result of a conversation between myself,
>> Alex Graf of SuSE, and Ramana Radhakrishnan of ARM, at last week's
>> Linaro Connect in Vancouver.
>> The current state of the world is that one can distribute two
>> different copies of a given shared library and place the LSE-enabled
>> version in /lib64/atomics/; ld.so will select it over the /lib64/
>> version when HWCAP_ATOMICS is present.
>> Alex's main concern with this is that (1) he doesn't want to
>> distribute two copies of every library, or determine what a
>> reasonable subset would be, and (2) this solution does not work
>> for executables, e.g. mysql.
>> Ramana's main concern was to avoid the overhead of an indirect jump,
>> especially in how that would affect the (non-)branch-prediction of
>> the smallest implementations.
>> Therefore, I've created small out-of-line helpers that are directly
>> linked into every library or executable that requires them. There
>> will be two direct branches, both of which will be well-predicted.
>> In the process, I discovered a number of places within the code
>> where the existing implementation could be improved. In particular:
>> - the LSE patterns didn't use predicates or constraints that
>> match the actual instructions, requiring unnecessary splitting.
>> - the non-LSE compare-and-swap can use an extending compare to
>> avoid requiring the input to have been previously extended.
>> - TImode compare-and-swap was missing entirely. This brings
>> aarch64 to parity with x86_64 wrt __sync_val_compare_and_swap.
>> There is a final patch that enables the new option by default.
>> I am not necessarily expecting this to be merged upstream, but
>> for the operating system to decide what the default should be.
>> It might be that this should be a configure option, so as to
>> make that OS choice easier, but I've just now thought of that. ;-)
>> I'm going to have to rely on Alex and/or Ramana to perform
>> testing on a system that supports LSE.
>
> Thanks for this patchset -
>
> I'll give this a whirl in the next couple of days but don't expect results
> until Monday or so.
>
> I do have an additional concern that I forgot to mention in Vancouver -
>
> Thanks Wilco for reminding me that this now replaces a bunch of inline
> instructions with what is effectively a library call, therefore clobbering
> a whole bunch of caller-saved registers.
>
> In which case I see two options:
>
> - maybe we should consider a private interface and restrict the registers
> that these files are compiled with, to minimise the number of caller-saved
> registers we trash.
>
> - Alternatively we should consider an option to inline these at O2 or O3, as
> we may just be trading the performance improvements we get from using the LSE
> atomics for additional stacking and unstacking of caller-saved registers in
> the main functions...

I talked to Will Deacon about LSE atomics today a bit. Apparently, a key
benefit you get from using them is guaranteed forward progress, compared
to an exclusives loop.

So IMHO even a tiny slowdown might be better than not progressing.

Another concern he brought up was that, due to the additional conditional
code, a cmpxchg loop may become bigger, so it converges more slowly (or
never) compared to a native implementation. I assume we can identify those
cases later and solve them with ifuncs in the target code, though.

Alex
>
> But anyway, while we discuss that, we'll have a look at testing and
> benchmarking this.
>
>
> regards
> Ramana
>
>> r~
>> Richard Henderson (11):
>> aarch64: Simplify LSE cas generation
>> aarch64: Improve cas generation
>> aarch64: Improve swp generation
>> aarch64: Improve atomic-op lse generation
>> aarch64: Emit LSE st<op> instructions
>> Add visibility to libfunc constructors
>> Link static libgcc after shared libgcc for -shared-libgcc
>> aarch64: Add out-of-line functions for LSE atomics
>> aarch64: Implement -matomic-ool
>> aarch64: Implement TImode compare-and-swap
>> Enable -matomic-ool by default
>> gcc/config/aarch64/aarch64-protos.h | 20 +-
>> gcc/optabs-libfuncs.h | 2 +
>> gcc/common/config/aarch64/aarch64-common.c | 6 +-
>> gcc/config/aarch64/aarch64.c | 480 ++++++--------
>> gcc/gcc.c | 9 +-
>> gcc/optabs-libfuncs.c | 26 +-
>> .../atomic-comp-swap-release-acquire.c | 2 +-
>> .../gcc.target/aarch64/atomic-inst-ldadd.c | 18 +-
>> .../gcc.target/aarch64/atomic-inst-ldlogic.c | 54 +-
>> .../gcc.target/aarch64/atomic-op-acq_rel.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-acquire.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-char.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-consume.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-imm.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-int.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-long.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-relaxed.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-release.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-seq_cst.c | 2 +-
>> .../gcc.target/aarch64/atomic-op-short.c | 2 +-
>> .../aarch64/atomic_cmp_exchange_zero_reg_1.c | 2 +-
>> .../atomic_cmp_exchange_zero_strong_1.c | 2 +-
>> .../gcc.target/aarch64/sync-comp-swap.c | 2 +-
>> .../gcc.target/aarch64/sync-op-acquire.c | 2 +-
>> .../gcc.target/aarch64/sync-op-full.c | 2 +-
>> libgcc/config/aarch64/lse.c | 280 ++++++++
>> gcc/config/aarch64/aarch64.opt | 4 +
>> gcc/config/aarch64/atomics.md | 608 ++++++++++--------
>> gcc/config/aarch64/iterators.md | 8 +-
>> gcc/config/aarch64/predicates.md | 12 +
>> gcc/doc/invoke.texi | 14 +-
>> libgcc/config.host | 4 +
>> libgcc/config/aarch64/t-lse | 48 ++
>> 33 files changed, 1050 insertions(+), 577 deletions(-)
>> create mode 100644 libgcc/config/aarch64/lse.c
>> create mode 100644 libgcc/config/aarch64/t-lse
>