[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-03-25 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #9 from Alex Coplan  ---
(In reply to Richard Biener from comment #8)
> Is this now fixed on trunk?

No, not really.  The codegen at -O2 on trunk is:

f:
stp x29, x30, [sp, -144]!
mov x29, sp
add x0, sp, 80
bl  g
ldp q28, q30, [sp, 80]
add x0, sp, 16
ldp q29, q31, [sp, 112]
str q28, [sp, 16]
stp q30, q29, [sp, 32]
str q31, [sp, 64]
bl  h
ldp x29, x30, [sp], 144
ret

Vlad's fix above helped reduce the frame size (thanks!).  Immediately before
that change (r15-7931), we have (again at -O2):

f:
stp x29, x30, [sp, -160]!
mov x29, sp
add x0, sp, 96
bl  g
ldp q28, q30, [sp, 96]
add x0, sp, 32
ldp q29, q31, [sp, 128]
str q28, [sp, 32]
stp q30, q29, [x0, 16]
str q31, [x0, 48]
bl  h
ldp x29, x30, [sp], 160
ret

so this is an improvement, but it looks like with these insns:

str q28, [sp, 16]
stp q30, q29, [sp, 32]
str q31, [sp, 64]

we're forming an stp in the middle, so we've still got work to do in ldp_fusion
here.  Also, as Andrew noted in #c5, the only reason we do better now is
because of Wilco's change to turn the scheduler off on AArch64
(r15-6661-gc5db3f50bdf34ea96fd193a2a66d686401053bd2).  Wilco also later
re-enabled the scheduler at -O3
(r15-7871-gf870302515d5fcf7355f0108c3ead0038ff326fd), so e.g. taking the
testcase in #c1 at -O3 on trunk, we get:

f:
stp x29, x30, [sp, -144]!
mov x29, sp
add x0, sp, 80
bl  g
ldp q31, q30, [sp, 80]
add x0, sp, 16
ldr q29, [sp, 112]
str q31, [sp, 16]
ldr q31, [sp, 128]
stp q30, q29, [sp, 32]
str q31, [sp, 64]
bl  h
ldp x29, x30, [sp], 144
ret

i.e. we still have the same pathological interleaving caused by the scheduler
(which ldp_fusion is currently unable to undo without the patch in #c4).

So there is still work to do in ldp_fusion.  The problem I ran into before is
that the fix I proposed in #c4 isn't always beneficial.  I believe this is
because, although we form more pairs with that change, doing so before RA can
scupper the RA's REG_EQUIV optimization and lead to spills.  It needs more
investigation to confirm this, but if so, I think we need some mechanism to
allow the RA to either crack or look through paired loads/stores when it is
beneficial to do so (e.g. to permit a REG_EQUIV optimization and avoid an
additional spill).

So there is more to do, but it is not at all straightforward to fix.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-03-25 Thread rguenth at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Richard Biener  changed:

   What|Removed |Added

  Flags||needinfo?(acoplan at gcc dot 
gnu.o
   ||rg)
   Target Milestone|15.0|14.3

--- Comment #8 from Richard Biener  ---
Is this now fixed on trunk?

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-03-10 Thread cvs-commit at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #7 from GCC Commits  ---
The master branch has been updated by Vladimir Makarov :

https://gcc.gnu.org/g:e355fe414aa3aaf215c7dd9dd789ce217a1b458c

commit r15-7932-ge355fe414aa3aaf215c7dd9dd789ce217a1b458c
Author: Vladimir N. Makarov 
Date:   Mon Mar 10 16:26:59 2025 -0400

[PR114991][IRA]: Improve reg equiv invariant calculation

In PR test case IRA preferred to allocate hard reg to a pseudo instead
of its equivalence.  This resulted in allocating caller-saved hard reg
and generating save/restore insns in the function prologue/epilogue.
The equivalence is an invariant (stack pointer plus offset) and the
pseudo is used mostly as memory address.  This happened as there was
no simplification of insn after the invariant substitution.  The patch
adds the necessary code.

gcc/ChangeLog:

PR target/114991
* ira-costs.cc (equiv_can_be_consumed_p): Add new argument
invariant_p.
Add code for dealing with the invariant.
(calculate_equiv_gains): Don't consider init insns.  Pass the new
argument to equiv_can_be_consumed_p.  Don't treat invariant as
memory.

gcc/testsuite/ChangeLog:

PR target/114991
* gcc.target/aarch64/pr114991.c: New test.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-02-21 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Andrew Pinski  changed:

   What|Removed |Added

   See Also||https://gcc.gnu.org/bugzill
   ||a/show_bug.cgi?id=28831

--- Comment #6 from Andrew Pinski  ---
This is also related to PR 28831.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-02-08 Thread law at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Jeffrey A. Law  changed:

   What|Removed |Added

   Priority|P3  |P2

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2025-01-27 Thread pinskia at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #5 from Andrew Pinski  ---
Note the testcase in comment #0 needs -fschedule-insns now to fail.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-07-05 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #4 from Alex Coplan  ---
So the following is enough to fix the missed ldp due to alias analysis:

diff --git a/gcc/pair-fusion.cc b/gcc/pair-fusion.cc
index 31d2c21c88f..ab49d955ccf 100644
--- a/gcc/pair-fusion.cc
+++ b/gcc/pair-fusion.cc
@@ -128,8 +128,12 @@ pair_fusion::run ()
   if (!track_loads_p () && !track_stores_p ())
 return;

+  init_alias_analysis ();
+
   for (auto bb : crtl->ssa->bbs ())
 process_block (bb);
+
+  end_alias_analysis ();
 }

 // State used by the pass for a given basic block.

that explains why sched1 was able to do the re-ordering but we weren't able to
do it in ldp_fusion1 (sched1 makes these calls).  Essentially this enables a
mini-pass that establishes register equivalences and allows the calls to
canon_rtx inside the alias machinery to re-write the memcpy accesses in terms
of the sfp for alias disambiguation purposes.  For the testcase in #c1:

--- without-patch.s 2024-07-05 11:33:57.395927975 +0100
+++ with-patch.s2024-07-05 11:33:32.164155523 +0100
@@ -17,9 +17,8 @@
bl  g
add x0, sp, 32
ldp q31, q30, [x19]
-   ldr q29, [x19, 32]
str q31, [sp, 32]
-   ldr q31, [x19, 48]
+   ldp q29, q31, [x19, 32]
stp q30, q29, [x0, 16]
str q31, [x0, 48]
bl  h

we still miss the stp in this case since the stores have different RTL bases
(sfp vs memcpy pseudo) and no MEM_EXPR information.  If we go ahead with the
above change then in theory we could also make use of this register equivalence
information during discovery (not just for alias analysis), allowing us to get
the remaining stp.

While the above patch seems to improve performance overall, there is one
workload with a significant compile-time regression which needs investigating.

There are also some codesize regressions which I think occur due to forming
more stack-based LDPs, but this scuppers the IRA REG_EQUIV optimization to
avoid spilling registers that were loaded from the stack.

So a bit more work needed before we can go ahead with this.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
   Assignee|unassigned at gcc dot gnu.org  |acoplan at gcc dot 
gnu.org

--- Comment #3 from Alex Coplan  ---
Mine for the aliasing issues/investigation, might be worth splitting off the RA
problem to track that separately.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

--- Comment #2 from Alex Coplan  ---
Here is some analysis on why we miss some of these opportunities in ldp_fusion.
So initially in 267r.vregs we have some very clean RTL:

6: r101:DI=sfp:DI-0x40
7: x0:DI=r101:DI
8: call [`g'] argc:0
  REG_CALL_DECL `g'
9: r102:DI=sfp:DI-0x80
   10: r103:DI=sfp:DI-0x40
   11: r104:V4SI=[r103:DI]
   13: r105:V4SI=[r103:DI+0x10]
   15: r106:V4SI=[r103:DI+0x20]
   17: r107:V4SI=[r103:DI+0x30]
   12: [r102:DI]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

if were to run the ldp/stp pass on this it should form the pairs without a
problem.  Of course things go downhill from here.  The first slightly strange
thing is that fwprop propagates the sfp into the first of each group of
accesses (i.e. with offset 0), but not the others:

9: r102:DI=sfp:DI-0x80
   11: r104:V4SI=[sfp:DI-0x40]
   13: r105:V4SI=[r101:DI+0x10]
   15: r106:V4SI=[r101:DI+0x20]
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r103:DI
   12: [sfp:DI-0x80]=r104:V4SI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

the RTL then stays mostly unchanged until sched1, where things really start to
go downhill:

   11: r104:V4SI=[sfp:DI-0x40]
9: r102:DI=sfp:DI-0x80
   13: r105:V4SI=[r101:DI+0x10]
   20: x0:DI=r102:DI
  REG_DEAD r102:DI
  REG_EQUAL sfp:DI-0x80
   15: r106:V4SI=[r101:DI+0x20]
   12: [sfp:DI-0x80]=r104:V4SI
  REG_DEAD r104:V4SI
   17: r107:V4SI=[r101:DI+0x30]
  REG_DEAD r101:DI
   14: [r102:DI+0x10]=r105:V4SI
  REG_DEAD r105:V4SI
   16: [r102:DI+0x20]=r106:V4SI
  REG_DEAD r106:V4SI
   18: [r102:DI+0x30]=r107:V4SI

here the first of the stores (i12) has been moved up between the last pair of
loads (i15, i17).  Now the interesting thing is how sched1 knows that it is
safe to perform this transformation.  In the ldp_fusion1 pass we miss this pair
because we think that the loads may alias with i12:

cannot form pair (15,17) due to alias conflicts (12,12)

so it would be good to look into how our alias analysis differs from what
sched1 is doing.  It's worth further noting that while the loads have MEM_EXPR
information (they point to the var_decl for s) the stores do not.  Presumably
this is because the copy of s mandated by the ABI doesn't necessarily have a
tree decl representation that the MEM_EXPRs could point to.

Separately to the aliasing issue, because:
 - there is no MEM_EXPR information for the stores, and
 - forwprop1 substituted the sfp in for the first store
ldp_fusion fails to discover the (i12,i14) store pair opportunity.  As a result
we unfortunately end up forming an stp in the middle.

Interestingly if I turn off fwprop1 then we still fail to form the
(12,14) pair due to aliasing.

So it seems the main thing to investigate is how sched1 does its alias
analysis and how that differs from what we're doing in ldp_fusion.

I have some WIP patches that should improve the pair discovery and could
potentially be extended to help with the case of the (12,14) pair here.
Another thing that could help with that is if we populated the MEM_EXPR for the
stores of the structure copy.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-09 Thread acoplan at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Alex Coplan  changed:

   What|Removed |Added

   Last reconfirmed||2024-05-09
 Status|UNCONFIRMED |NEW
 CC||acoplan at gcc dot gnu.org,
   ||vmakarov at gcc dot gnu.org
 Ever confirmed|0   |1
   Keywords||missed-optimization, ra

--- Comment #1 from Alex Coplan  ---
Confirmed.  There is a lot to unpack here.  Of course, the include isn't needed
in this testcase and the problem can be seen more clearly with a slightly
smaller array size:

typedef struct { int arr[16]; } S;

void g (S *);
void h (S);
void f(int x)
{
  S s;
  g (&s);
  h (s);
}

In this case sizeof(S) = 64 so we should be able to do the copy with 2 LDPs + 2
STPs.

So just for clarity, the missed ldp/stp started when we turned off the early
ldp/stp formation in memcpy expansion, i.e. with
r14-9373-g19b23bf3c32df3cbb96b3d898a1d7142f7bea4a0 .

However, things already started to regress earlier for this testcase with
r14-4944-gf55cdce3f8dd8503e080e35be59c5f5390f6d95e i.e.

commit f55cdce3f8dd8503e080e35be59c5f5390f6d95e
Author: Vladimir N. Makarov 
Date:   Thu Oct 26 14:50:40 2023

[RA]: Modfify cost calculation for dealing with equivalences

before that RA change we get:

f:
stp x29, x30, [sp, -144]!
mov x29, sp
add x0, sp, 80
bl  g
ldp q29, q28, [sp, 80]
add x0, sp, 16
ldp q31, q30, [sp, 112]
stp q29, q28, [sp, 16]
stp q31, q30, [sp, 48]
bl  h
ldp x29, x30, [sp], 144
ret

and afterwards we get:

f:
stp x29, x30, [sp, -160]!
mov x29, sp
str x19, [sp, 16]
add x19, sp, 96
mov x0, x19
bl  g
add x0, sp, 32
ldp q29, q28, [x19]
ldp q31, q30, [x19, 32]
stp q29, q28, [x0]
stp q31, q30, [x0, 32]
bl  h
ldr x19, [sp, 16]
ldp x29, x30, [sp], 160
ret

which is really not great as now we have a save/restore of x19 and the accesses
end up using different (non-sp) registers which I suspect doesn't help with the
ldp/stp formation (on trunk).

I will try to give a detailed analysis on what goes wrong with the ldp/stp
formation at the RTL level shortly (there are a lot of different issues), but I
think that RA change is a contributing factor.

[Bug target/114991] [14/15 Regression] AArch64: LDP pass does not handle some structure copies

2024-05-08 Thread wilco at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991

Wilco  changed:

   What|Removed |Added

 Target||aarch64-*-*
   Target Milestone|--- |15.0