Hi Richard,

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, November 20, 2023 12:16 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] aarch64: Add an early RA for strided registers
>
> [Yeah, I just missed the stage1 deadline, sorry.  But this is gated
> behind several other things, so it seems a bit academic whether it
> was posted yesterday or today.]
>
> This pass adds a simple register allocator for FP & SIMD registers.
> Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4
> instructions, which require a very specific grouping structure,
> and so would be difficult to exploit with general allocation.
>
> The allocator is very simple.  It gives up on anything that would
> require spilling, or that it might not handle well for other reasons.
>
> The allocator needs to track liveness at the level of individual FPRs.
> Doing that fixes a lot of the PRs relating to redundant moves caused by
> structure loads and stores.  That particular problem is now fixed more
> generally by Lehua's RA patches.
>
> However, the patch runs before scheduling, so it has a chance to bag
> a spill-free allocation of vector code before the scheduler moves
> things around.  It could therefore still be useful for non-SME code
> (e.g. for hand-scheduled ACLE code).
>
> The pass is controlled by a tristate switch:
>
> - -mearly-ra=all: run on all functions
> - -mearly-ra=strided: run on functions that have access to strided
>   registers
> - -mearly-ra=none: don't run on any function
>
> The patch makes -mearly-ra=strided the default at -O2 and above.
Can you please add a statement about that default to the invoke.texi
documentation?  (There's a rough sketch of what I mean further down,
after the ChangeLog.)

> However, I've tested it with -mearly-ra=all too.
>
> As said previously, the pass is very naive.  There's much more that we
> could do, such as handling invariants better.  The main focus is on not
> committing to a bad allocation, rather than on handling as much as
> possible.
>

At first glance it seemed a bit unconventional to me to add an
aarch64-specific pass for such an RA task.  But I tend to agree that the
specific allocation requirements for these instructions warrant some
target-specific logic.  And we're quite happy to have target-specific
passes that adjust all manner of instruction selection and scheduling
decisions, so I don't see the strided register allocation decisions as
qualitatively different.

> Tested on aarch64-linux-gnu.

I'm okay with the patch, but it has a small hunk in ira-costs.cc, so I
guess you'll want an authority on that file to okay it.  One further
small comment inline below.

Thanks,
Kyrill

>
> Richard
>
>
> gcc/
> 	* ira-costs.cc (scan_one_insn): Restrict handling of parameter
> 	loads to pseudo registers.
> 	* config.gcc: Add aarch64-early-ra.o for AArch64 targets.
> 	* config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
> 	* config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
> 	* config/aarch64/aarch64.opt (mearly_ra): New option.
> 	* doc/invoke.texi: Document it.
> 	* common/config/aarch64/aarch64-common.cc
> 	(aarch_option_optimization_table): Use -mearly-ra=strided by
> 	default for -O2 and above.
> 	* config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New
> 	pass.
> 	* config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
> 	(make_pass_aarch64_early_ra): Declare.
> 	* config/aarch64/aarch64-sme.md (@aarch64_sme_lut<LUTI_BITS><mode>):
> 	Add a stride_type attribute.
> 	(@aarch64_sme_lut<LUTI_BITS><mode>_strided2): New pattern.
> 	(@aarch64_sme_lut<LUTI_BITS><mode>_strided4): Likewise.
> 	* config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
> 	(svldnt1_impl::expand, svst1_impl::expand, svstnt1_impl::expand):
> 	Handle new way of defining multi-register loads and stores.
> 	* config/aarch64/aarch64-sve.md (@aarch64_ld1<SVE_FULLx24:mode>)
> 	(@aarch64_ldnt1<SVE_FULLx24:mode>, @aarch64_st1<SVE_FULLx24:mode>)
> 	(@aarch64_stnt1<SVE_FULLx24:mode>): Delete.
> 	* config/aarch64/aarch64-sve2.md (@aarch64_<LD1_COUNT:optab><mode>)
> 	(@aarch64_<LD1_COUNT:optab><mode>_strided2): New patterns.
> 	(@aarch64_<LD1_COUNT:optab><mode>_strided4): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>_strided2): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>_strided4): Likewise.
> 	* config/aarch64/aarch64.cc (aarch64_strided_registers_p): New
> 	function.
> 	* config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
> 	(UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
> 	(UNSPEC_STNT1_SVE_COUNT): Likewise.
> 	(stride_type): New attribute.
> 	* config/aarch64/constraints.md (Uwd, Uwt): New constraints.
> 	* config/aarch64/iterators.md (UNSPEC_LD1_COUNT, UNSPEC_LDNT1_COUNT)
> 	(UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
> 	(optab): Handle them.
> 	(LD1_COUNT, ST1_COUNT): New iterators.
> 	* config/aarch64/aarch64-early-ra.cc: New file.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
> 	output test.
> 	* gcc.target/aarch64/sme/strided_1.c: New test.
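(For the invoke.texi request above, all I'm after is a short sentence in
the new @option{-mearly-ra} entry, along the lines of the sketch below;
the wording is entirely up to you.)

  @option{-mearly-ra=strided} is the default at @option{-O2} and above.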
> ---
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index bd6c236f884..423f406a00d 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -525,6 +521,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
>  ;; may chose to hold the tracking state encoded in SP.
>  (define_attr "speculation_barrier" "true,false" (const_string "false"))
>
> +(define_attr "stride_type"
> +  "none,ld1_consecutive,ld1_strided,st1_consecutive,st1_strided,
> +   luti_consecutive,luti_strided"
> +  (const_string "none"))

I think it'd be good to have a comment on this attribute to let future
developers know that it is used by the new RA pass, and whether it is
mandatory for correctness or purely an optimization aid.
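Something along these lines is all I have in mind (just a sketch; please
reword as you see fit, and fill in the correctness/optimization part):

;; Describes how a multi-register load, store or LUTI2/4 instruction
;; expects its FP/SIMD registers to be grouped: as a consecutive block
;; or as a strided set.  The attribute is read by the early RA pass
;; (aarch64-early-ra.cc) when choosing registers for these instructions;
;; <state here whether that is needed for correctness or is purely an
;; optimization>.
(define_attr "stride_type"
  "none,ld1_consecutive,ld1_strided,st1_consecutive,st1_strided,
   luti_consecutive,luti_strided"
  (const_string "none"))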