https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991
--- Comment #4 from Alex Coplan <acoplan at gcc dot gnu.org> ---
So the following is enough to fix the missed ldp due to alias analysis:

diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 31d2c21c88f..ab49d955ccf 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -128,8 +128,12 @@ pair_fusion::run ()
   if (!track_loads_p () && !track_stores_p ())
     return;
 
+  init_alias_analysis ();
+
   for (auto bb : crtl->ssa->bbs ())
     process_block (bb);
+
+  end_alias_analysis ();
 }
 
 // State used by the pass for a given basic block.

that explains why sched1 was able to do the re-ordering but we weren't able to
do it in ldp_fusion1 (sched1 makes these calls). Essentially this enables a
mini-pass that establishes register equivalences and allows the calls to
canon_rtx inside the alias machinery to re-write the memcpy accesses in terms
of the sfp for alias disambiguation purposes.

For the testcase in #c1:

--- without-patch.s     2024-07-05 11:33:57.395927975 +0100
+++ with-patch.s        2024-07-05 11:33:32.164155523 +0100
@@ -17,9 +17,8 @@
        bl      g
        add     x0, sp, 32
        ldp     q31, q30, [x19]
-       ldr     q29, [x19, 32]
        str     q31, [sp, 32]
-       ldr     q31, [x19, 48]
+       ldp     q29, q31, [x19, 32]
        stp     q30, q29, [x0, 16]
        str     q31, [x0, 48]
        bl      h

we still miss the stp in this case since the stores have different RTL bases
(sfp vs memcpy pseudo) and no MEM_EXPR information. If we go ahead with the
above change then in theory we could also make use of this register equivalence
information during discovery (not just for alias analysis), allowing us to get
the remaining stp.

While the above patch seems to improve performance overall, there is one
workload with a significant compile-time regression which needs investigating.
There are also some codesize regressions which I think occur due to forming
more stack-based LDPs, but this scuppers the IRA REG_EQUIV optimization to
avoid spilling registers that were loaded from the stack.

So a bit more work needed before we can go ahead with this.
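
To illustrate why the register equivalences matter for disambiguation, here is a toy model in standalone C++. It is not GCC code: `Access`, `EquivMap`, `canonicalize` and `may_alias` are hypothetical names made up for this sketch, and the register name `r100` and its offset are invented. It only shows the general idea that two accesses with different bases have to be treated as possibly aliasing until one base is rewritten in terms of the other, at which point their offsets can be compared.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Toy model only: none of these names exist in GCC.
struct Access {
  std::string base;  // name of the base register
  int64_t offset;    // byte offset from the base
  int64_t size;      // access size in bytes
};

// Hypothetical equivalence table: register -> (equivalent base, constant addend).
using EquivMap = std::map<std::string, std::pair<std::string, int64_t>>;

// Rewrite an access in terms of its equivalent base, if one is known.
Access canonicalize(const Access &a, const EquivMap &equivs) {
  auto it = equivs.find(a.base);
  if (it == equivs.end())
    return a;
  return Access{it->second.first, a.offset + it->second.second, a.size};
}

// Conservative query: returns false only when the accesses provably
// do not overlap.
bool may_alias(const Access &x, const Access &y, const EquivMap &equivs) {
  assert(x.size > 0 && y.size > 0);
  Access cx = canonicalize(x, equivs);
  Access cy = canonicalize(y, equivs);
  if (cx.base != cy.base)
    return true;  // unrelated bases: nothing is known, assume they may alias
  // Same base: the byte ranges [offset, offset + size) must not overlap.
  return cx.offset < cy.offset + cy.size && cy.offset < cx.offset + cx.size;
}
```

With an empty table, an access off `sfp` and an access off a pseudo such as `r100` always report a possible alias. Once the table records that `r100` equals `sfp + 32`, accesses to disjoint byte ranges can be told apart. This is only an analogy for the effect described above.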