Alex Coplan <alex.cop...@arm.com> writes:
> This is a v6 of the aarch64 load/store pair fusion pass, which
> addresses the feedback from Richard's last review here:
>
> https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html
>
> In particular this version implements the suggested changes which
> greatly simplify the double list walk.
>
> Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
>
> Thanks,
> Alex
>
> -- >8 --
>
> This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> and store pairs (LDPs and STPs).
>
> As a motivating example for the kind of thing this improves, take the
> following testcase:
>
> extern double c[20];
>
> double f(double x)
> {
>   double y = x*x;
>   y += c[16];
>   y += c[17];
>   y += c[18];
>   y += c[19];
>   return y;
> }
>
> for which we currently generate (at -O2):
>
> f:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         ldr     d30, [x0, 144]
>         fmadd   d0, d0, d0, d31
>         ldr     d31, [x0, 152]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> but with the pass, we generate:
>
> f:
> .LFB0:
>         adrp    x0, c
>         add     x0, x0, :lo12:c
>         ldp     d31, d29, [x0, 128]
>         fmadd   d0, d0, d0, d31
>         ldp     d30, d31, [x0, 144]
>         fadd    d0, d0, d29
>         fadd    d0, d0, d30
>         fadd    d0, d0, d31
>         ret
>
> The pass is local (only considers a BB at a time).  In theory, it should
> be possible to extend it to run over EBBs, at least in the case of pure
> (MEM_READONLY_P) loads, but this is left for future work.
>
> The pass works by identifying two kinds of bases: tree decls obtained
> via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> If a candidate memory access has a MEM_EXPR base, then we track it via
> this base, and otherwise if it is of a simple reg + <imm> form, we track
> it via the RTL-SSA def_info for the register.
>
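[For illustration, the two-way base tracking described above could be sketched as below; the type and field names are hypothetical stand-ins, not the pass's actual RTL-SSA types:]

```cpp
#include <cassert>

// Hypothetical sketch of the base classification: a candidate access
// is tracked via its MEM_EXPR tree decl if it has one, otherwise via
// the RTL-SSA def_info of its base register when the address is a
// simple reg + <imm>.  Flags here abstract over the real RTL checks.
enum class base_kind { tree_decl, reg_def, untracked };

struct mem_access
{
  bool has_mem_expr;   // MEM_EXPR gave us a tree decl base
  bool is_reg_imm;     // address has the form (plus (reg) (const_int))
};

static base_kind
classify_base (const mem_access &a)
{
  if (a.has_mem_expr)
    return base_kind::tree_decl;   // prefer the MEM_EXPR base
  if (a.is_reg_imm)
    return base_kind::reg_def;     // fall back to the register def
  return base_kind::untracked;     // not a candidate for pairing
}
```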
> For each BB, for a given kind of base, we build up a hash table mapping
> the base to an access_group.  The access_group data structure holds a
> list of accesses at each offset relative to the same base.  It uses a
> splay tree to support efficient insertion (while walking the bb), and
> the nodes are chained using a linked list to support efficient
> iteration (while doing the transformation).
>
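[A rough model of the access_group shape, for readers unfamiliar with the patch: accesses are bucketed per base and keyed by offset, kept in sorted order so the transformation phase can walk them cheaply.  std::map stands in here for the splay tree (and its chained node list); this is only a sketch, not the pass's implementation:]

```cpp
#include <cassert>
#include <list>
#include <map>
#include <vector>

// Each access is identified here by an insn id; the real pass stores
// richer per-access information.
struct access { int insn_uid; };

struct access_group
{
  // offset relative to the common base -> accesses at that offset
  std::map<int, std::list<access>> by_offset;

  // Insertion while walking the bb (the splay tree's job in the pass).
  void add (int offset, access a) { by_offset[offset].push_back (a); }

  // In-order iteration for the transformation phase (the linked
  // list's job in the pass): distinct offsets, increasing.
  std::vector<int> offsets () const
  {
    std::vector<int> r;
    for (const auto &kv : by_offset)
      r.push_back (kv.first);
    return r;
  }
};
```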
> For each base, we then iterate over the access_group to identify
> adjacent accesses, and try to form load/store pairs for those insns that
> access adjacent memory.
>
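[The pairing walk can be pictured as a greedy scan over the sorted offsets of same-size accesses, pairing neighbours whose offsets differ by exactly the access size.  This is an illustrative simplification: the real pass also has to verify dataflow, aliasing, and ldp/stp operand constraints before fusing:]

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Given the sorted offsets of same-size accesses relative to one base,
// return the (lo, hi) offset pairs that are adjacent in memory, i.e.
// candidates for ldp/stp formation.
static std::vector<std::pair<int, int>>
find_adjacent_pairs (const std::vector<int> &sorted_offsets, int access_size)
{
  std::vector<std::pair<int, int>> pairs;
  for (size_t i = 0; i + 1 < sorted_offsets.size (); )
    if (sorted_offsets[i + 1] - sorted_offsets[i] == access_size)
      {
	pairs.emplace_back (sorted_offsets[i], sorted_offsets[i + 1]);
	i += 2;   // both accesses are consumed by the pair
      }
    else
      ++i;
  return pairs;
}
```

[On the motivating testcase, the four doubles c[16..19] sit at offsets 128, 136, 144, 152 from the anchor, so this scan yields the pairs (128, 136) and (144, 152), matching the two ldp instructions in the improved output.]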
> The pass is currently run twice, both before and after register
> allocation.  The first copy of the pass is run late in the pre-RA RTL
> pipeline, immediately after sched1, since it was found that sched1 was
> increasing register pressure when the pass was run before.  The second
> copy of the pass runs immediately before peephole2, so as to get any
> opportunities that the existing ldp/stp peepholes can handle.
>
> There are some cases that we punt on before RA, e.g.
> accesses relative to eliminable regs (such as the soft frame pointer).
> We do this since we can't know the elimination offset before RA, and we
> want to avoid the RA reloading the offset (due to being out of ldp/stp
> immediate range) as this can generate worse code.
>
> The post-RA copy of the pass is there to pick up the crumbs that were
> left behind / things we punted on in the pre-RA pass.  Among other
> things, it's needed to handle accesses relative to the stack pointer.
> It can also handle code that didn't exist at the time the pre-RA pass
> was run (spill code, prologue/epilogue code).
>
> This is an initial implementation, and there are (among other possible
> improvements) the following notable caveats / missing features that are
> left for future work, but could give further improvements:
>
>  - Moving accesses between BBs within an EBB, see above.
>  - Out-of-range opportunities: currently the pass refuses to form pairs
>    if there isn't a suitable base register with an immediate in range
>    for ldp/stp, but it can be profitable to emit anchor addresses in the
>    case that there are four or more out-of-range nearby accesses that can
>    be formed into pairs.  This is handled by the current ldp/stp
>    peepholes, so it would be good to support this in the future.
>  - Discovery: currently we prioritize MEM_EXPR bases over RTL bases, which
>    can lead to us missing opportunities in the case that two accesses have
>    distinct MEM_EXPR bases (i.e. different DECLs) but are still adjacent in
>    memory (e.g. adjacent variables on the stack).  I hope to address this
>    for GCC 15, hopefully getting to the point where we can remove the
>    ldp/stp peepholes and scheduling hooks.  Furthermore it would be nice to
>    make the pass aware of section anchors (adding these as a third kind of
>    base), allowing merging of accesses to adjacent variables within the
>    same section.
>
> gcc/ChangeLog:
>
>         * config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
>         * config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
>         before and after RA.
>         * config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
>         * config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
>         (-mlate-ldp-fusion): New.
>         (--param=aarch64-ldp-alias-check-limit): New.
>         (--param=aarch64-ldp-writeback): New.
>         * config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
>         * config/aarch64/aarch64-ldp-fusion.cc: New file.

Nice!  OK for trunk.  Thanks for your patience through the reviews,
and sorry for missing your reply to the initial review.

Richard
