[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 --- Comment #9 from Alex Coplan --- (In reply to Richard Biener from comment #8) > Is this now fixed on trunk? No, not really. The codegen at -O2 on trunk is: f: stp x29, x30, [sp, -144]! mov x29, sp add x0, sp, 80 bl g ldp q28, q30, [sp, 80] add x0, sp, 16 ldp q29, q31, [sp, 112] str q28, [sp, 16] stp q30, q29, [sp, 32] str q31, [sp, 64] bl h ldp x29, x30, [sp], 144 ret Vlad's fix above helped reduce the frame size (thanks!). Immediately before that change (r15-7931), we have (again at -O2): f: stp x29, x30, [sp, -160]! mov x29, sp add x0, sp, 96 bl g ldp q28, q30, [sp, 96] add x0, sp, 32 ldp q29, q31, [sp, 128] str q28, [sp, 32] stp q30, q29, [x0, 16] str q31, [x0, 48] bl h ldp x29, x30, [sp], 160 ret so this is an improvement, but it looks like with these insns: str q28, [sp, 16] stp q30, q29, [sp, 32] str q31, [sp, 64] we're forming an stp in the middle, so we've still got work to do in ldp_fusion here. Also, as Andrew noted in #c5, the only reason we do better now is because of Wilco's change to turn the scheduler off on AArch64 (r15-6661-gc5db3f50bdf34ea96fd193a2a66d686401053bd2). Wilco also later re-enabled the scheduler at -O3 (r15-7871-gf870302515d5fcf7355f0108c3ead0038ff326fd), so e.g. taking the testcase in #c1 at -O3 on trunk, we get: f: stp x29, x30, [sp, -144]! mov x29, sp add x0, sp, 80 bl g ldp q31, q30, [sp, 80] add x0, sp, 16 ldr q29, [sp, 112] str q31, [sp, 16] ldr q31, [sp, 128] stp q30, q29, [sp, 32] str q31, [sp, 64] bl h ldp x29, x30, [sp], 144 ret i.e. we still have the same pathological interleaving caused by the scheduler (which ldp_fusion is currently unable to undo without the patch in #c4). So there is still work to do in ldp_fusion. The problem I ran into before is that the fix I proposed in #c4 isn't always beneficial. I believe this is because, although we form more pairs with that change, doing so before RA can scupper the RA's REG_EQUIV optimization and lead to spills. It needs more investigation to confirm this, but if so, I think we need some mechanism to allow the RA to either crack or look through paired loads/stores when it is beneficial to do so (e.g. to permit a REG_EQUIV optimization and avoid an additional spill). So there is more to do, but it is not at all straightforward to fix.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Richard Biener changed: What|Removed |Added Flags||needinfo?(acoplan at gcc dot gnu.o ||rg) Target Milestone|15.0|14.3 --- Comment #8 from Richard Biener --- Is this now fixed on trunk?
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 --- Comment #7 from GCC Commits --- The master branch has been updated by Vladimir Makarov : https://gcc.gnu.org/g:e355fe414aa3aaf215c7dd9dd789ce217a1b458c commit r15-7932-ge355fe414aa3aaf215c7dd9dd789ce217a1b458c Author: Vladimir N. Makarov Date: Mon Mar 10 16:26:59 2025 -0400 [PR114991][IRA]: Improve reg equiv invariant calculation In PR test case IRA preferred to allocate hard reg to a pseudo instead of its equivalence. This resulted in allocating caller-saved hard reg and generating save/restore insns in the function prologue/epilogue. The equivalence is an invariant (stack pointer plus offset) and the pseudo is used mostly as memory address. This happened as there was no simplification of insn after the invariant substitution. The patch adds the necessary code. gcc/ChangeLog: PR target/114991 * ira-costs.cc (equiv_can_be_consumed_p): Add new argument invariant_p. Add code for dealing with the invariant. (calculate_equiv_gains): Don't consider init insns. Pass the new argument to equiv_can_be_consumed_p. Don't treat invariant as memory. gcc/testsuite/ChangeLog: PR target/114991 * gcc.target/aarch64/pr114991.c: New test.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Andrew Pinski changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill ||a/show_bug.cgi?id=28831 --- Comment #6 from Andrew Pinski --- This is also related to PR 28831.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Jeffrey A. Law changed: What|Removed |Added Priority|P3 |P2
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 --- Comment #5 from Andrew Pinski --- Note the testcase in comment #0 needs -fschedule-insns now to fail.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 --- Comment #4 from Alex Coplan --- So the following is enough to fix the missed ldp due to alias analysis: diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc index 31d2c21c88f..ab49d955ccf 100644 --- a/gcc/pair-fusion.cc +++ b/gcc/pair-fusion.cc @@ -128,8 +128,12 @@ pair_fusion::run () if (!track_loads_p () && !track_stores_p ()) return; + init_alias_analysis (); + for (auto bb : crtl->ssa->bbs ()) process_block (bb); + + end_alias_analysis (); } // State used by the pass for a given basic block. that explains why sched1 was able to do the re-ordering but we weren't able to do it in ldp_fusion1 (sched1 makes these calls). Essentially this enables a mini-pass that establishes register equivalences and allows the calls to canon_rtx inside the alias machinery to re-write the memcpy accesses in terms of the sfp for alias disambiguation purposes. For the testcase in #c1: --- without-patch.s 2024-07-05 11:33:57.395927975 +0100 +++ with-patch.s2024-07-05 11:33:32.164155523 +0100 @@ -17,9 +17,8 @@ bl g add x0, sp, 32 ldp q31, q30, [x19] - ldr q29, [x19, 32] str q31, [sp, 32] - ldr q31, [x19, 48] + ldp q29, q31, [x19, 32] stp q30, q29, [x0, 16] str q31, [x0, 48] bl h we still miss the stp in this case since the stores have different RTL bases (sfp vs memcpy pseudo) and no MEM_EXPR information. If we go ahead with the above change then in theory we could also make use of this register equivalence information during discovery (not just for alias analysis), allowing us to get the remaining stp. While the above patch seems to improve performance overall, there is one workload with a significant compile-time regression which needs investigating. There are also some codesize regressions which I think occur due to forming more stack-based LDPs, but this scuppers the IRA REG_EQUIV optimization to avoid spilling registers that were loaded from the stack. So a bit more work needed before we can go ahead with this.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Alex Coplan changed: What|Removed |Added Status|NEW |ASSIGNED Assignee|unassigned at gcc dot gnu.org |acoplan at gcc dot gnu.org --- Comment #3 from Alex Coplan --- Mine for the aliasing issues/investigation, might be worth splitting off the RA problem to track that separately.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 --- Comment #2 from Alex Coplan --- Here is some analysis on why we miss some of these opportunities in ldp_fusion. So initially in 267r.vregs we have some very clean RTL: 6: r101:DI=sfp:DI-0x40 7: x0:DI=r101:DI 8: call [`g'] argc:0 REG_CALL_DECL `g' 9: r102:DI=sfp:DI-0x80 10: r103:DI=sfp:DI-0x40 11: r104:V4SI=[r103:DI] 13: r105:V4SI=[r103:DI+0x10] 15: r106:V4SI=[r103:DI+0x20] 17: r107:V4SI=[r103:DI+0x30] 12: [r102:DI]=r104:V4SI 14: [r102:DI+0x10]=r105:V4SI 16: [r102:DI+0x20]=r106:V4SI 18: [r102:DI+0x30]=r107:V4SI if were to run the ldp/stp pass on this it should form the pairs without a problem. Of course things go downhill from here. The first slightly strange thing is that fwprop propagates the sfp into the first of each group of accesses (i.e. with offset 0), but not the others: 9: r102:DI=sfp:DI-0x80 11: r104:V4SI=[sfp:DI-0x40] 13: r105:V4SI=[r101:DI+0x10] 15: r106:V4SI=[r101:DI+0x20] 17: r107:V4SI=[r101:DI+0x30] REG_DEAD r103:DI 12: [sfp:DI-0x80]=r104:V4SI 14: [r102:DI+0x10]=r105:V4SI REG_DEAD r105:V4SI 16: [r102:DI+0x20]=r106:V4SI REG_DEAD r106:V4SI 18: [r102:DI+0x30]=r107:V4SI the RTL then stays mostly unchanged until sched1, where things really start to go downhill: 11: r104:V4SI=[sfp:DI-0x40] 9: r102:DI=sfp:DI-0x80 13: r105:V4SI=[r101:DI+0x10] 20: x0:DI=r102:DI REG_DEAD r102:DI REG_EQUAL sfp:DI-0x80 15: r106:V4SI=[r101:DI+0x20] 12: [sfp:DI-0x80]=r104:V4SI REG_DEAD r104:V4SI 17: r107:V4SI=[r101:DI+0x30] REG_DEAD r101:DI 14: [r102:DI+0x10]=r105:V4SI REG_DEAD r105:V4SI 16: [r102:DI+0x20]=r106:V4SI REG_DEAD r106:V4SI 18: [r102:DI+0x30]=r107:V4SI here the first of the stores (i12) has been moved up between the last pair of loads (i15, i17). Now the interesting thing is how sched1 knows that it is safe to perform this transformation. In the ldp_fusion1 pass we miss this pair because we think that the loads may alias with i12: cannot form pair (15,17) due to alias conflicts (12,12) so it would be good to look into how our alias analysis differs from what sched1 is doing. It's worth further noting that while the loads have MEM_EXPR information (they point to the var_decl for s) the stores do not. Presumably this is because the copy of s mandated by the ABI doesn't necessarily have a tree decl representation that the MEM_EXPRs could point to. Separately to the aliasing issue, because: - there is no MEM_EXPR information for the stores, and - forwprop1 substituted the sfp in for the first store ldp_fusion fails to discover the (i12,i14) store pair opportunity. As a result we unfortunately end up forming an stp in the middle. Interestingly if I turn off fwprop1 then we still fail to form the (12,14) pair due to aliasing. So it seems the main thing to investigate is how sched1 does its alias analysis and how that differs from what we're doing in ldp_fusion. I have some WIP patches that should improve the pair discovery and could potentially be extended to help with the case of the (12,14) pair here. Another thing that could help with that is if we populated the MEM_EXPR for the stores of the structure copy.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Alex Coplan changed: What|Removed |Added Last reconfirmed||2024-05-09 Status|UNCONFIRMED |NEW CC||acoplan at gcc dot gnu.org, ||vmakarov at gcc dot gnu.org Ever confirmed|0 |1 Keywords||missed-optimization, ra --- Comment #1 from Alex Coplan --- Confirmed. There is a lot to unpack here. Of course, the include isn't needed in this testcase and the problem can be seen more clearly with a slightly smaller array size: typedef struct { int arr[16]; } S; void g (S *); void h (S); void f(int x) { S s; g (&s); h (s); } In this case sizeof(S) = 64 so we should be able to do the copy with 2 LDPs + 2 STPs. So just for clarity, the missed ldp/stp started when we turned off the early ldp/stp formation in memcpy expansion, i.e. with r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 . However, things already started to regress earlier for this testcase with r14-4944-gf55cdce3f8dd8503e080e35be59c5f5390f6d95e i.e. commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e Author: Vladimir N. Makarov Date: Thu Oct 26 14:50:40 2023 [RA]: Modfify cost calculation for dealing with equivalences before that RA change we get: f: stp x29, x30, [sp, -144]! mov x29, sp add x0, sp, 80 bl g ldp q29, q28, [sp, 80] add x0, sp, 16 ldp q31, q30, [sp, 112] stp q29, q28, [sp, 16] stp q31, q30, [sp, 48] bl h ldp x29, x30, [sp], 144 ret and afterwards we get: f: stp x29, x30, [sp, -160]! mov x29, sp str x19, [sp, 16] add x19, sp, 96 mov x0, x19 bl g add x0, sp, 32 ldp q29, q28, [x19] ldp q31, q30, [x19, 32] stp q29, q28, [x0] stp q31, q30, [x0, 32] bl h ldr x19, [sp, 16] ldp x29, x30, [sp], 160 ret which is really not great as now we have a save/restore of x19 and the accesses end up using different (non-sp) registers which I suspect doesn't help with the ldp/stp formation (on trunk). I will try to give a detailed analysis on what goes wrong with the ldp/stp formation at the RTL level shortly (there are a lot of different issues), but I think that RA change is a contributing factor.
[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991 Wilco changed: What|Removed |Added Target||aarch64-*-* Target Milestone|--- |15.0