On 15/12/2023 15:34, Richard Sandiford wrote:
> Alex Coplan <alex.cop...@arm.com> writes:
> > This is a v6 of the aarch64 load/store pair fusion pass, which
> > addresses the feedback from Richard's last review here:
> >
> > https://gcc.gnu.org/pipermail/gcc-patches/2023-December/640539.html
> >
> > In particular this version implements the suggested changes which
> > greatly simplify the double list walk.
> >
> > Bootstrapped/regtested as a series on aarch64-linux-gnu, OK for trunk?
> >
> > Thanks,
> > Alex
> >
> > -- >8 --
> >
> > This adds a new aarch64-specific RTL-SSA pass dedicated to forming load
> > and store pairs (LDPs and STPs).
> >
> > As a motivating example for the kind of thing this improves, take the
> > following testcase:
> >
> > extern double c[20];
> >
> > double f(double x)
> > {
> >   double y = x*x;
> >   y += c[16];
> >   y += c[17];
> >   y += c[18];
> >   y += c[19];
> >   return y;
> > }
> >
> > for which we currently generate (at -O2):
> >
> > f:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         ldr     d30, [x0, 144]
> >         fmadd   d0, d0, d0, d31
> >         ldr     d31, [x0, 152]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > but with the pass, we generate:
> >
> > f:
> > .LFB0:
> >         adrp    x0, c
> >         add     x0, x0, :lo12:c
> >         ldp     d31, d29, [x0, 128]
> >         fmadd   d0, d0, d0, d31
> >         ldp     d30, d31, [x0, 144]
> >         fadd    d0, d0, d29
> >         fadd    d0, d0, d30
> >         fadd    d0, d0, d31
> >         ret
> >
> > The pass is local (only considers a BB at a time).  In theory, it should
> > be possible to extend it to run over EBBs, at least in the case of pure
> > (MEM_READONLY_P) loads, but this is left for future work.
> >
> > The pass works by identifying two kinds of bases: tree decls obtained
> > via MEM_EXPR, and RTL register bases in the form of RTL-SSA def_infos.
> > If a candidate memory access has a MEM_EXPR base, then we track it via
> > this base, and otherwise, if it is of a simple reg + <imm> form, we
> > track it via the RTL-SSA def_info for the register.
> >
> > For each BB, for a given kind of base, we build up a hash table mapping
> > the base to an access_group.  The access_group data structure holds a
> > list of accesses at each offset relative to the same base.  It uses a
> > splay tree to support efficient insertion (while walking the BB), and
> > the nodes are chained using a linked list to support efficient
> > iteration (while doing the transformation).
> >
> > For each base, we then iterate over the access_group to identify
> > adjacent accesses, and try to form load/store pairs for those insns
> > that access adjacent memory.
> >
> > The pass is currently run twice, both before and after register
> > allocation.  The first copy of the pass is run late in the pre-RA RTL
> > pipeline, immediately after sched1, since it was found that sched1 was
> > increasing register pressure when the pass was run before it.  The
> > second copy of the pass runs immediately before peephole2, so as to get
> > any opportunities that the existing ldp/stp peepholes can handle.
> >
> > There are some cases that we punt on before RA, e.g. accesses relative
> > to eliminable regs (such as the soft frame pointer).  We do this since
> > we can't know the elimination offset before RA, and we want to avoid
> > the RA reloading the offset (due to it being out of ldp/stp immediate
> > range), as this can generate worse code.
> >
> > The post-RA copy of the pass is there to pick up the crumbs that were
> > left behind / things we punted on in the pre-RA pass.  Among other
> > things, it's needed to handle accesses relative to the stack pointer.
> > It can also handle code that didn't exist at the time the pre-RA pass
> > was run (spill code, prologue/epilogue code).
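As a rough mental model of the grouping/pairing scheme described above, a hypothetical sketch follows.  This is not the pass's actual code: std::map stands in for GCC's splay tree plus chained list, the names (access_record, find_adjacent_pairs) and the greedy pairing loop are invented for illustration, and the real pass additionally checks aliasing, register constraints and ldp/stp immediate ranges.

```cpp
#include <cstdint>
#include <iterator>
#include <map>
#include <utility>
#include <vector>

// One memory access, relative to some common base (MEM_EXPR decl or
// RTL-SSA def_info in the real pass; elided here).
struct access_record
{
  int insn_uid;     // which insn performs the access
  int64_t offset;   // byte offset from the common base
};

// All accesses sharing one base, keyed by offset.  The ordered map plays
// the role of the splay tree (efficient insertion while walking the BB);
// in-order iteration plays the role of the chained linked list.
using access_group = std::map<int64_t, std::vector<access_record>>;

// Identify candidate pairs: two accesses whose offsets differ by exactly
// the access size, i.e. which touch adjacent memory.  Greedy: once an
// access is paired, it is not considered again.
std::vector<std::pair<int, int>>
find_adjacent_pairs (const access_group &group, int64_t access_size)
{
  std::vector<std::pair<int, int>> pairs;
  auto it = group.begin ();
  while (it != group.end ())
    {
      auto next = group.find (it->first + access_size);
      if (next != group.end ()
          && !it->second.empty ()
          && !next->second.empty ())
        {
          pairs.emplace_back (it->second.front ().insn_uid,
                              next->second.front ().insn_uid);
          it = std::next (next);  // neither access is reused
        }
      else
        ++it;
    }
  return pairs;
}
```

On the motivating example above (8-byte loads at offsets 128, 136, 144 and 152 from the same base), this would pair 128 with 136 and 144 with 152, mirroring the two LDPs in the improved code.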
> >
> > This is an initial implementation, and there are (among other possible
> > improvements) the following notable caveats / missing features that
> > are left for future work, but could give further improvements:
> >
> >  - Moving accesses between BBs within an EBB, see above.
> >  - Out-of-range opportunities: currently the pass refuses to form
> >    pairs if there isn't a suitable base register with an immediate in
> >    range for ldp/stp, but it can be profitable to emit anchor
> >    addresses in the case that there are four or more out-of-range
> >    nearby accesses that can be formed into pairs.  This is handled by
> >    the current ldp/stp peepholes, so it would be good to support it in
> >    the future.
> >  - Discovery: currently we prioritize MEM_EXPR bases over RTL bases,
> >    which can lead to us missing opportunities in the case that two
> >    accesses have distinct MEM_EXPR bases (i.e. different DECLs) but
> >    are still adjacent in memory (e.g. adjacent variables on the
> >    stack).  I hope to address this for GCC 15, hopefully getting to
> >    the point where we can remove the ldp/stp peepholes and scheduling
> >    hooks.  Furthermore, it would be nice to make the pass aware of
> >    section anchors (adding these as a third kind of base), allowing
> >    merging of accesses to adjacent variables within the same section.
> >
> > gcc/ChangeLog:
> >
> > 	* config.gcc: Add aarch64-ldp-fusion.o to extra_objs for aarch64.
> > 	* config/aarch64/aarch64-passes.def: Add copies of pass_ldp_fusion
> > 	before and after RA.
> > 	* config/aarch64/aarch64-protos.h (make_pass_ldp_fusion): Declare.
> > 	* config/aarch64/aarch64.opt (-mearly-ldp-fusion): New.
> > 	(-mlate-ldp-fusion): New.
> > 	(--param=aarch64-ldp-alias-check-limit): New.
> > 	(--param=aarch64-ldp-writeback): New.
> > 	* config/aarch64/t-aarch64: Add rule for aarch64-ldp-fusion.o.
> > 	* config/aarch64/aarch64-ldp-fusion.cc: New file.
>
> Nice!  OK for trunk.  Thanks for your patience through the reviews,
> and sorry for missing your reply to the initial review.
Thanks a lot for the reviews; the code is definitely a lot better as a
result.  I've pushed the series to trunk up to and including the pass
(10/11).  I've left the memcpy patch for now as it needs more rebase
work.

Alex

> Richard