> On 27.09.2018 at 15:07, Ramana Radhakrishnan 
> <ramana.radhakrish...@arm.com> wrote:
> 
>> On 26/09/2018 06:03, rth7...@gmail.com wrote:
>> From: Richard Henderson <richard.hender...@linaro.org>
>> ARMv8.1 adds a (mandatory) Atomics extension, also known as the
>> Large System Extension.  Deploying this extension at the OS level
>> has proved challenging.
>> The following is the result of a conversation between myself,
>> Alex Graf of SuSE, and Ramana Radhakrishnan of ARM, at last week's
>> Linaro Connect in Vancouver.
>> The current state of the world is that one could distribute two
>> different copies of a given shared library and place the LSE-enabled
>> version in /lib64/atomics/ and it will be selected over the /lib64/
>> version by ld.so when HWCAP_ATOMICS is present.
>> Alex's main concern with this is that (1) he doesn't want to
>> distribute two copies of every library, or determine what a
>> reasonable subset would be and (2) this solution does not work
>> for executables, e.g. mysql.
>> Ramana's main concern was to avoid the overhead of an indirect jump,
>> especially in how that would affect the (non-)branch-prediction of
>> the smallest implementations.
>> Therefore, I've created small out-of-line helpers that are directly
>> linked into every library or executable that requires them.  There
>> will be two direct branches, both of which will be well-predicted.
>> In the process, I discovered a number of places within the code
>> where the existing implementation could be improved.  In particular:
>>  - the LSE patterns didn't use predicates or constraints that
>>    match the actual instructions, requiring unnecessary splitting.
>>  - the non-LSE compare-and-swap can use an extending compare to
>>    avoid requiring the input to have been previously extended.
>>  - TImode compare-and-swap was missing entirely.  This brings
>>    aarch64 to parity with x86_64 wrt __sync_val_compare_and_swap.
>> There is a final patch that enables the new option by default.
>> I am not necessarily expecting this to be merged upstream, but
>> for the operating system to decide what the default should be.
>> It might be that this should be a configure option, so as to
>> make that OS choice easier, but I've just now thought of that.  ;-)
>> I'm going to have to rely on Alex and/or Ramana to perform
>> testing on a system that supports LSE.
> 
> Thanks for this patchset -
> 
> I'll give this a whirl in the next couple of days but don't expect results 
> until Monday or so.
> 
> I do have an additional concern that I forgot to mention in Vancouver -
> 
> Thanks to Wilco for reminding me that this now replaces a bunch of inline 
> instructions with what is effectively a library call, therefore clobbering 
> a whole bunch of caller-saved registers.
> 
> In which case I see 2 options.
> 
> -  maybe we should consider a private interface and restrict the registers 
> that these files are compiled with to minimise the number of caller saved 
> registers we trash.
> 
> - Alternatively we should consider an option to inline these at O2 or O3 as 
> we may just be trading the performance improvements we get with using the lse 
> atomics

I talked a bit with Will Deacon about LSE atomics today. Apparently, a key 
benefit that you get from using them is guaranteed forward progress when 
compared to an exclusives loop.

So IMHO even a tiny slowdown might be better than not progressing.
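
To illustrate the difference, here is a minimal sketch in portable C11. The
counter and function names are hypothetical and not part of the patch set. As
far as I understand, when LSE is enabled at compile time (for example with
-march=armv8.1-a) GCC can lower the fetch-add to a single ldadd-family
instruction, while without LSE it emits a load-exclusive/store-exclusive loop
that has to retry whenever the store-exclusive fails.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared counter, for illustration only. */
static _Atomic uint64_t hits;

/* Atomically increment the counter and return the value it held before.
 * How this is lowered (single LSE instruction vs. exclusives loop) depends
 * on the compiler flags, not on this source. */
static inline uint64_t count_hit(void)
{
    return atomic_fetch_add_explicit(&hits, 1, memory_order_relaxed);
}
```

The source is identical in both cases; only the generated code differs.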

Another concern he brought up was that, because of the additional conditional 
code, a cmpxchg loop may become bigger and so converge more slowly than a 
native implementation, or never. I assume we can identify those cases later 
and solve them with ifuncs in the target code, though.
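
For reference, the kind of cmpxchg loop in question could look like the sketch
below. The helper name fetch_max_u64 is made up for this example. Each
iteration performs one compare-and-swap, so any extra instructions on that
path (such as the call into an out-of-line helper and its runtime check)
widen the window in which another CPU can modify the location and force a
retry.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically raise *p to at least val and return the
 * value observed before any update. */
static inline uint64_t fetch_max_u64(_Atomic uint64_t *p, uint64_t val)
{
    uint64_t old = atomic_load_explicit(p, memory_order_relaxed);

    /* Retry until the stored value is already >= val or our CAS succeeds.
     * A failed CAS writes the current value of *p back into old. */
    while (old < val &&
           !atomic_compare_exchange_weak_explicit(p, &old, val,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed)) {
        /* old has been refreshed; loop and try again. */
    }
    return old;
}
```

Under contention the number of trips around this loop is unbounded in
principle, which is why its length matters.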


Alex

> for additional stacking and unstacking of caller saved registers in the main 
> functions...
> 
> But anyway while we discuss that we'll have a look at testing and 
> benchmarking this.
> 
> 
> regards
> Ramana
> 
>> r~
>> Richard Henderson (11):
>>   aarch64: Simplify LSE cas generation
>>   aarch64: Improve cas generation
>>   aarch64: Improve swp generation
>>   aarch64: Improve atomic-op lse generation
>>   aarch64: Emit LSE st<op> instructions
>>   Add visibility to libfunc constructors
>>   Link static libgcc after shared libgcc for -shared-libgcc
>>   aarch64: Add out-of-line functions for LSE atomics
>>   aarch64: Implement -matomic-ool
>>   aarch64: Implement TImode compare-and-swap
>>   Enable -matomic-ool by default
>>  gcc/config/aarch64/aarch64-protos.h           |  20 +-
>>  gcc/optabs-libfuncs.h                         |   2 +
>>  gcc/common/config/aarch64/aarch64-common.c    |   6 +-
>>  gcc/config/aarch64/aarch64.c                  | 480 ++++++--------
>>  gcc/gcc.c                                     |   9 +-
>>  gcc/optabs-libfuncs.c                         |  26 +-
>>  .../atomic-comp-swap-release-acquire.c        |   2 +-
>>  .../gcc.target/aarch64/atomic-inst-ldadd.c    |  18 +-
>>  .../gcc.target/aarch64/atomic-inst-ldlogic.c  |  54 +-
>>  .../gcc.target/aarch64/atomic-op-acq_rel.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-acquire.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-char.c       |   2 +-
>>  .../gcc.target/aarch64/atomic-op-consume.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-imm.c        |   2 +-
>>  .../gcc.target/aarch64/atomic-op-int.c        |   2 +-
>>  .../gcc.target/aarch64/atomic-op-long.c       |   2 +-
>>  .../gcc.target/aarch64/atomic-op-relaxed.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-release.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-seq_cst.c    |   2 +-
>>  .../gcc.target/aarch64/atomic-op-short.c      |   2 +-
>>  .../aarch64/atomic_cmp_exchange_zero_reg_1.c  |   2 +-
>>  .../atomic_cmp_exchange_zero_strong_1.c       |   2 +-
>>  .../gcc.target/aarch64/sync-comp-swap.c       |   2 +-
>>  .../gcc.target/aarch64/sync-op-acquire.c      |   2 +-
>>  .../gcc.target/aarch64/sync-op-full.c         |   2 +-
>>  libgcc/config/aarch64/lse.c                   | 280 ++++++++
>>  gcc/config/aarch64/aarch64.opt                |   4 +
>>  gcc/config/aarch64/atomics.md                 | 608 ++++++++++--------
>>  gcc/config/aarch64/iterators.md               |   8 +-
>>  gcc/config/aarch64/predicates.md              |  12 +
>>  gcc/doc/invoke.texi                           |  14 +-
>>  libgcc/config.host                            |   4 +
>>  libgcc/config/aarch64/t-lse                   |  48 ++
>>  33 files changed, 1050 insertions(+), 577 deletions(-)
>>  create mode 100644 libgcc/config/aarch64/lse.c
>>  create mode 100644 libgcc/config/aarch64/t-lse
> 
