https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114991
--- Comment #9 from Alex Coplan <acoplan at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #8)
> Is this now fixed on trunk?
No, not really. The codegen at -O2 on trunk is:
f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q28, q30, [sp, 80]
        add     x0, sp, 16
        ldp     q29, q31, [sp, 112]
        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret
Vlad's fix above helped reduce the frame size (thanks!). Immediately before
that change (r15-7931), we have (again at -O2):
f:
        stp     x29, x30, [sp, -160]!
        mov     x29, sp
        add     x0, sp, 96
        bl      g
        ldp     q28, q30, [sp, 96]
        add     x0, sp, 32
        ldp     q29, q31, [sp, 128]
        str     q28, [sp, 32]
        stp     q30, q29, [x0, 16]
        str     q31, [x0, 48]
        bl      h
        ldp     x29, x30, [sp], 160
        ret
so this is an improvement, but it looks like with these insns:
        str     q28, [sp, 16]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
we're forming an stp in the middle, so we've still got work to do in ldp_fusion
here. Also, as Andrew noted in #c5, the only reason we do better now is
Wilco's change to turn the scheduler off on AArch64
(r15-6661-gc5db3f50bdf34ea96fd193a2a66d686401053bd2). Wilco also later
re-enabled the scheduler at -O3
(r15-7871-gf870302515d5fcf7355f0108c3ead0038ff326fd), so e.g. taking the
testcase in #c1 at -O3 on trunk, we get:
f:
        stp     x29, x30, [sp, -144]!
        mov     x29, sp
        add     x0, sp, 80
        bl      g
        ldp     q31, q30, [sp, 80]
        add     x0, sp, 16
        ldr     q29, [sp, 112]
        str     q31, [sp, 16]
        ldr     q31, [sp, 128]
        stp     q30, q29, [sp, 32]
        str     q31, [sp, 64]
        bl      h
        ldp     x29, x30, [sp], 144
        ret
i.e. we still have the same pathological interleaving caused by the scheduler
(which ldp_fusion is currently unable to undo without the patch in #c4).
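To make the pairing problem concrete, here is a toy model. It is not how
ldp_fusion is implemented, and the name greedy_pairs and the tuple
representation are made up for illustration. It treats the four 16-byte stores
in the listings above as offsets from sp and compares the str/stp/str grouping
we get today with a simple left-to-right pairing of adjacent slots. It
deliberately ignores everything that makes the real problem hard (register
constraints, aliasing, offset ranges, interactions with the RA):

```python
ACCESS_SIZE = 16  # bytes written by one Q-register store


def greedy_pairs(offsets, size=ACCESS_SIZE):
    """Toy model only: pair adjacent accesses, scanning from the lowest offset.

    Returns a list of tuples; a 2-tuple stands for one stp, a 1-tuple for
    one str.
    """
    offsets = sorted(offsets)
    groups = []
    i = 0
    while i < len(offsets):
        # Two accesses can share one stp if they hit consecutive slots.
        if i + 1 < len(offsets) and offsets[i + 1] - offsets[i] == size:
            groups.append((offsets[i], offsets[i + 1]))
            i += 2
        else:
            groups.append((offsets[i],))
            i += 1
    return groups


# Grouping seen in the listings: str @16, stp @32/48, str @64.
observed = [(16,), (32, 48), (64,)]

# Grouping we would like: two stps covering all four slots.
wanted = greedy_pairs([16, 32, 48, 64])

print(len(observed), "store insns today:", observed)
print(len(wanted), "store insns wanted:", wanted)
```

For the -O2 listing, the wanted grouping would correspond to something like
stp q28, q30, [sp, 16] followed by stp q29, q31, [sp, 48], i.e. two store
insns instead of three.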
So there is still work to do in ldp_fusion. The problem I ran into before is
that the fix I proposed in #c4 isn't always beneficial. I believe this is
because, although we form more pairs with that change, doing so before RA can
scupper the RA's REG_EQUIV optimization and lead to spills. It needs more
investigation to confirm this, but if so, I think we need some mechanism to
allow the RA to either crack or look through paired loads/stores when it is
beneficial to do so (e.g. to permit a REG_EQUIV optimization and avoid an
additional spill).
So there is more to do, but it is not at all straightforward to fix.