Hi Richard,

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Monday, November 20, 2023 12:16 PM
> To: gcc-patches@gcc.gnu.org
> Subject: [PATCH] aarch64: Add an early RA for strided registers
>
> [Yeah, I just missed the stage1 deadline, sorry.  But this is gated
> behind several other things, so it seems a bit academic whether it
> was posted yesterday or today.]
>
> This pass adds a simple register allocator for FP & SIMD registers.
> Its main purpose is to make use of SME2's strided LD1, ST1 and LUTI2/4
> instructions, which require a very specific grouping structure,
> and so would be difficult to exploit with general allocation.
>
> The allocator is very simple.  It gives up on anything that would
> require spilling, or that it might not handle well for other reasons.
>
> The allocator needs to track liveness at the level of individual FPRs.
> Doing that fixes a lot of the PRs relating to redundant moves caused by
> structure loads and stores.  That particular problem is now fixed more
> generally by Lehua's RA patches.
>
> However, the patch runs before scheduling, so it has a chance to bag
> a spill-free allocation of vector code before the scheduler moves
> things around.  It could therefore still be useful for non-SME code
> (e.g. for hand-scheduled ACLE code).
>
> The pass is controlled by a tristate switch:
>
> - -mearly-ra=all: run on all functions
> - -mearly-ra=strided: run on functions that have access to strided
>   registers
> - -mearly-ra=none: don't run on any function
>
> The patch makes -mearly-ra=strided the default at -O2 and above.
Can you please add a statement about that default to the invoke.texi
documentation?  (There's a rough sketch of what I mean further down,
after the ChangeLog.)

> However, I've tested it with -mearly-ra=all too.
>
> As said previously, the pass is very naive.  There's much more that we
> could do, such as handling invariants better.  The main focus is on not
> committing to a bad allocation, rather than on handling as much as
> possible.
>

At first glance it seemed a bit unconventional to me to add an
aarch64-specific pass for such an RA task.  But I tend to agree that the
specific allocation requirements for these instructions warrant some
target-specific logic.  And we're quite happy to have target-specific
passes that adjust all manner of instruction selection and scheduling
decisions, so I don't see the strided register allocation decisions as
qualitatively different.

> Tested on aarch64-linux-gnu.

I'm okay with the patch, but it has a small hunk in ira-costs.cc, so I
guess you'll want an authority on that file to okay it.  One further
small comment inline below.

Thanks,
Kyrill

>
> Richard
>
>
> gcc/
> 	* ira-costs.cc (scan_one_insn): Restrict handling of parameter
> 	loads to pseudo registers.
> 	* config.gcc: Add aarch64-early-ra.o for AArch64 targets.
> 	* config/aarch64/t-aarch64 (aarch64-early-ra.o): New rule.
> 	* config/aarch64/aarch64-opts.h (aarch64_early_ra_scope): New enum.
> 	* config/aarch64/aarch64.opt (mearly_ra): New option.
> 	* doc/invoke.texi: Document it.
> 	* common/config/aarch64/aarch64-common.cc
> 	(aarch_option_optimization_table): Use -mearly-ra=strided by
> 	default for -O2 and above.
> 	* config/aarch64/aarch64-passes.def (pass_aarch64_early_ra): New
> 	pass.
> 	* config/aarch64/aarch64-protos.h (aarch64_strided_registers_p)
> 	(make_pass_aarch64_early_ra): Declare.
> 	* config/aarch64/aarch64-sme.md (@aarch64_sme_lut<LUTI_BITS><mode>):
> 	Add a stride_type attribute.
> 	(@aarch64_sme_lut<LUTI_BITS><mode>_strided2): New pattern.
> 	(@aarch64_sme_lut<LUTI_BITS><mode>_strided4): Likewise.
> 	* config/aarch64/aarch64-sve-builtins-base.cc (svld1_impl::expand)
> 	(svldnt1_impl::expand, svst1_impl::expand, svstnt1_impl::expand):
> 	Handle new way of defining multi-register loads and stores.
> 	* config/aarch64/aarch64-sve.md (@aarch64_ld1<SVE_FULLx24:mode>)
> 	(@aarch64_ldnt1<SVE_FULLx24:mode>, @aarch64_st1<SVE_FULLx24:mode>)
> 	(@aarch64_stnt1<SVE_FULLx24:mode>): Delete.
> 	* config/aarch64/aarch64-sve2.md (@aarch64_<LD1_COUNT:optab><mode>)
> 	(@aarch64_<LD1_COUNT:optab><mode>_strided2): New patterns.
> 	(@aarch64_<LD1_COUNT:optab><mode>_strided4): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>_strided2): Likewise.
> 	(@aarch64_<ST1_COUNT:optab><mode>_strided4): Likewise.
> 	* config/aarch64/aarch64.cc (aarch64_strided_registers_p): New
> 	function.
> 	* config/aarch64/aarch64.md (UNSPEC_LD1_SVE_COUNT): Delete.
> 	(UNSPEC_ST1_SVE_COUNT, UNSPEC_LDNT1_SVE_COUNT): Likewise.
> 	(UNSPEC_STNT1_SVE_COUNT): Likewise.
> 	(stride_type): New attribute.
> 	* config/aarch64/constraints.md (Uwd, Uwt): New constraints.
> 	* config/aarch64/iterators.md (UNSPEC_LD1_COUNT, UNSPEC_LDNT1_COUNT)
> 	(UNSPEC_ST1_COUNT, UNSPEC_STNT1_COUNT): New unspecs.
> 	(optab): Handle them.
> 	(LD1_COUNT, ST1_COUNT): New iterators.
> 	* config/aarch64/aarch64-early-ra.cc: New file.
>
> gcc/testsuite/
> 	* gcc.target/aarch64/ldp_stp_16.c (cons4_4_float): Tighten expected
> 	output test.
> 	* gcc.target/aarch64/sme/strided_1.c: New test.
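(For the invoke.texi request above, all I'm after is a short sentence in
the new @option{-mearly-ra} entry, along the lines of the sketch below;
the wording is entirely up to you.)

  @option{-mearly-ra=strided} is the default at @option{-O2} and above.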
> ---
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index bd6c236f884..423f406a00d 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -525,6 +521,11 @@ (define_attr "predicated" "yes,no" (const_string "no"))
>  ;; may chose to hold the tracking state encoded in SP.
>  (define_attr "speculation_barrier" "true,false" (const_string "false"))
>
> +(define_attr "stride_type"
> +  "none,ld1_consecutive,ld1_strided,st1_consecutive,st1_strided,
> +   luti_consecutive,luti_strided"
> +  (const_string "none"))

I think it'd be good to have a comment on this attribute to let future
developers know that it is used by the new RA pass, and whether it is
mandatory for correctness or purely an optimization aid.
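Something along these lines is all I have in mind (just a sketch; please
reword as you see fit, and fill in the correctness/optimization part):

;; Describes how a multi-register load, store or LUTI2/4 instruction
;; expects its FP/SIMD registers to be grouped: as a consecutive block
;; or as a strided set.  The attribute is read by the early RA pass
;; (aarch64-early-ra.cc) when choosing registers for these instructions;
;; <state here whether that is needed for correctness or is purely an
;; optimization>.
(define_attr "stride_type"
  "none,ld1_consecutive,ld1_strided,st1_consecutive,st1_strided,
   luti_consecutive,luti_strided"
  (const_string "none"))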