On 15/12/2023 15:34, Richard Sandiford wrote:
> Alex Coplan <alex.cop...@arm.com> writes:
> > This is a v6 of the aarch64 load/store pair fusion pass, which
> > addresses the feedback from Richard's last review here:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html
> >
> > In particular this version implements the suggested changes which
> > greatly simplify the double list walk.
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > -- >8 --
> >
> > This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> > and store pairs (LDPs and STPs).
> >
> > As a motivating example for the kind of thing this improves, take the
> > following testcase:
> >
> > extern double c[20];
> >
> > double f(double x)
> > {
> >   double y = x*x;
> >   y += c[16];
> >   y += c[17];
> >   y += c[18];
> >   y += c[19];
> >   return y;
> > }
> >
> > for which we currently generate (at -O2):
> >
> > f:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         ldr     d30, [x0, 144]
> >         fmadd   d0, d0, d0, d31
> >         ldr     d31, [x0, 152]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > but with the pass, we generate:
> >
> > f:
> > .LFB0:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         fmadd   d0, d0, d0, d31
> >         ldp     d30, d31, [x0, 144]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > The pass is local (only considers a BB at a time).  In theory, it should
> > be possible to extend it to run over EBBs, at least in the case of pure
> > (MEM_READONLY_P) loads, but this is left for future work.
> >
> > The pass works by identifying two kinds of bases: tree decls obtained
> > via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> > If a candidate memory access has a MEM_EXPR base, then we track it via
> > that base; otherwise, if it is of a simple reg + <imm> form, we track
> > it via the RTL-SSA def_info for the register.
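> >
> > To illustrate the idea, here is a simplified, self-contained sketch of
> > that decision in plain C++.  The types and names (decl_base, reg_base,
> > candidate_access, classify_base) are made up for illustration and only
> > approximate the real RTL-SSA-based code in aarch64-ldp-fusion.cc:
> >
> > #include <cstdint>
> > #include <variant>
> >
> > struct decl_base { const void *decl; };           // stands in for a MEM_EXPR decl
> > struct reg_base { unsigned regno; int def_id; };  // stands in for an RTL-SSA def_info
> >
> > using base_key = std::variant<decl_base, reg_base>;
> >
> > struct candidate_access
> > {
> >   bool has_mem_expr;          // does the MEM have a usable MEM_EXPR base?
> >   const void *mem_expr_decl;
> >   bool is_reg_plus_imm;       // is the address a simple reg + <imm>?
> >   unsigned base_regno;
> >   int base_def_id;            // which definition of the base reg reaches here
> >   std::int64_t offset;        // offset relative to the chosen base
> > };
> >
> > // Prefer the MEM_EXPR (decl) base; fall back to the register definition
> > // for plain reg + <imm> addresses; otherwise give up on this access.
> > static bool
> > classify_base (const candidate_access &a, base_key &key)
> > {
> >   if (a.has_mem_expr)
> >     {
> >       key = decl_base { a.mem_expr_decl };
> >       return true;
> >     }
> >   if (a.is_reg_plus_imm)
> >     {
> >       key = reg_base { a.base_regno, a.base_def_id };
> >       return true;
> >     }
> >   return false;
> > }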
> >
> > For each BB, for a given kind of base, we build up a hash table mapping
> > the base to an access_group.  The access_group data structure holds a
> > list of accesses at each offset relative to the same base.  It uses a
> > splay tree to support efficient insertion (while walking the bb), and
> > the nodes are chained using a linked list to support efficient
> > iteration (while doing the transformation).
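> >
> > Roughly, the shape of the data structure is as below; this is only a
> > sketch in which standard containers stand in for the splay tree and the
> > intrusive list, and insn_id / base_map are invented placeholder names:
> >
> > #include <cstdint>
> > #include <map>
> > #include <unordered_map>
> > #include <vector>
> >
> > using insn_id = int;   // placeholder for an insn handle
> >
> > struct access_group
> > {
> >   // Keyed by offset from the common base; each offset can hold several
> >   // accesses.  The real pass uses a splay tree for insertion while
> >   // walking the BB, with the nodes also chained into a list so that the
> >   // later pairing walk is a cheap in-order traversal.
> >   std::map<std::int64_t, std::vector<insn_id>> by_offset;
> >
> >   void add (std::int64_t offset, insn_id insn)
> >   {
> >     by_offset[offset].push_back (insn);
> >   }
> > };
> >
> > // One table per kind of base (decl bases and register-def bases); an
> > // opaque integer key stands in for both here.
> > using base_map = std::unordered_map<std::uint64_t, access_group>;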
> >
> > For each base, we then iterate over the access_group to identify
> > adjacent accesses, and try to form load/store pairs for those insns that
> > access adjacent memory.
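> >
> > For illustration, a minimal sketch of the pairing walk over one group
> > (again plain C++ with hypothetical names, not the pass's own code; the
> > real pass must additionally check dataflow, alias and target-specific
> > constraints before fusing):
> >
> > #include <cstdint>
> > #include <iterator>
> > #include <map>
> > #include <utility>
> > #include <vector>
> >
> > using insn_id = int;   // placeholder for an insn handle
> >
> > // Walk one group's accesses in offset order and report candidate pairs
> > // whose offsets differ by exactly the access size (e.g. 8 bytes for the
> > // DFmode loads in the example above).
> > static std::vector<std::pair<insn_id, insn_id>>
> > find_adjacent_pairs (const std::map<std::int64_t, std::vector<insn_id>> &by_offset,
> >                      std::int64_t access_size)
> > {
> >   std::vector<std::pair<insn_id, insn_id>> pairs;
> >   for (auto it = by_offset.begin (); it != by_offset.end (); ++it)
> >     {
> >       auto next = std::next (it);
> >       if (next == by_offset.end ())
> >         break;
> >       if (next->first == it->first + access_size
> >           && !it->second.empty () && !next->second.empty ())
> >         {
> >           pairs.emplace_back (it->second.front (), next->second.front ());
> >           it = next;  // don't reuse the second access of this pair
> >         }
> >     }
> >   return pairs;
> > }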
> >
> > The pass is currently run twice, both before and after register
> > allocation.  The first copy of the pass is run late in the pre-RA RTL
> > pipeline, immediately after sched1, since sched1 was found to increase
> > register pressure when the pass was run before it.  The second copy of
> > the pass runs immediately before peephole2, so as to catch any
> > opportunities that the existing ldp/stp peepholes can handle.
> >
> > There are some cases that we punt on before RA, e.g.
> > accesses relative to eliminable regs (such as the soft frame pointer).
> > We do this since we can't know the elimination offset before RA, and we
> > want to avoid the RA reloading the offset (due to being out of ldp/stp
> > immediate range) as this can generate worse code.
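> >
> > For reference, the immediate-range constraint in question looks roughly
> > like the check below (an illustrative helper with a made-up name, not
> > the pass's own code): the ldp/stp offset field is a signed 7-bit value
> > scaled by the access size, so e.g. for 8-byte accesses the offset must
> > be a multiple of 8 in [-512, 504].
> >
> > #include <cstdint>
> >
> > // Return true if OFFSET is representable in the ldp/stp scaled
> > // immediate field for accesses of ACCESS_SIZE bytes.
> > static bool
> > ldp_offset_in_range_p (std::int64_t offset, std::int64_t access_size)
> > {
> >   if (offset % access_size != 0)
> >     return false;
> >   std::int64_t scaled = offset / access_size;
> >   return scaled >= -64 && scaled <= 63;
> > }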
> >
> > The post-RA copy of the pass is there to pick up the crumbs that were
> > left behind / things we punted on in the pre-RA pass.  Among other
> > things, it's needed to handle accesses relative to the stack pointer.
> > It can also handle code that didn't exist at the time the pre-RA pass
> > was run (spill code, prologue/epilogue code).
> >
> > This is an initial implementation.  Among other possible improvements,
> > the following notable caveats / missing features are left for future
> > work and could give further gains:
> >
> >  - Moving accesses between BBs within an EBB, see above.
> >  - Out-of-range opportunities: currently the pass refuses to form pairs
> >    if there isn't a suitable base register with an immediate in range
> >    for ldp/stp, but it can be profitable to emit anchor addresses in the
> >    case that there are four or more out-of-range nearby accesses that can
> >    be formed into pairs.  This is handled by the current ldp/stp
> >    peepholes, so it would be good to support this in the future.
> >  - Discovery: currently we prioritize MEM_EXPR bases over RTL bases,
> >    which can lead us to miss opportunities in the case that two accesses
> >    have distinct MEM_EXPR bases (i.e. different DECLs) but are still
> >    adjacent in memory (e.g. adjacent variables on the stack).  I hope to
> >    address this for GCC 15, hopefully getting to the point where we can
> >    remove the ldp/stp peepholes and scheduling hooks.  Furthermore, it
> >    would be nice to make the pass aware of section anchors (adding these
> >    as a third kind of base), allowing merging of accesses to adjacent
> >    variables within the same section.
> >
> > gcc/ChangeLog:
> >
> >         * config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
> >         * config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
> >         before and after RA.
> >         * config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
> >         * config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
> >         (-mlate-ldp-fusion): New.
> >         (--param=aarch64-ldp-alias-check-limit): New.
> >         (--param=aarch64-ldp-writeback): New.
> >         * config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
> >         * config/aarch64/aarch64-ldp-fusion.cc: New file.
> 
> Nice!  OK for trunk.  Thanks for your patience through the reviews,
> and sorry for missing your reply to the initial review.

Thanks a lot for the reviews, the code is definitely a lot better as a
result.  I've pushed the series to trunk up to and including the pass
(10/11).  I've left the memcpy patch for now as it needs more rebase
work.

Alex

> 
> Richard
