https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118611
Bug ID: 118611
Summary: LRA inserts unneeded reload on FMA chain
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization, ra
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
The following example:
#include <arm_neon.h>

float32x4_t
bad (float32x4_t x, float32x4_t c0, float32x4_t c1, float32x4_t c3,
     float32x4_t c2)
{
  float32x4_t z2 = vmulq_f32 (x, x);
  float32x4_t p1 = vfmaq_laneq_f32 (c1, z2, c3, 0);
  float32x4_t p2 = vfmaq_laneq_f32 (c2, z2, c3, 2);
  // Mov is inserted to save P1. (Correct behaviour)
  float32x4_t p5 = vfmaq_f32 (p1, z2, p1);
  float32x4_t p6 = vfmaq_f32 (p1, z2, p2);
  // Mov is inserted to save P5, which is only used once. (Unneeded)
  float32x4_t y = vfmaq_f32 (p5, x, p6);
  return vfmaq_f32 (c0, x, y);
}

compiled with -O3 generates:
bad:
        fmul    v31.4s, v0.4s, v0.4s
        fmla    v2.4s, v31.4s, v3.s[0]
        fmla    v4.4s, v31.4s, v3.s[2]
        mov     v30.16b, v2.16b
        fmla    v30.4s, v31.4s, v2.4s
        fmla    v2.4s, v31.4s, v4.4s
        mov     v31.16b, v30.16b
        fmla    v31.4s, v0.4s, v2.4s
        fmla    v1.4s, v0.4s, v31.4s
        mov     v0.16b, v1.16b
        ret

where the second MOV (mov v31.16b, v30.16b) is unneeded because v30 isn't live after the FMA that follows it, so that FMA could accumulate into v30 directly.
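
To make the claim about the second MOV concrete, here is a minimal sketch that models one lane of each vector register as a plain float and runs the instruction sequence with and without that copy. It assumes the usual AArch64 argument assignment visible in the listing (x in v0, c0 in v1, c1 in v2, c3 in v3, c2 in v4). The helper names fmla, model_with_mov and model_without_mov are hypothetical and exist only for this illustration. The second function is what one would hope to get, not output from any compiler.

```c
/* Toy single-lane model of the register traffic in the listing above.
   Each variable stands for one lane of the same-named vector register;
   c3_0 and c3_2 stand for lanes 0 and 2 of v3.  All names are
   illustrative only.  */

/* fmla d, n, m computes d = d + n * m.  */
static float
fmla (float d, float n, float m)
{
  return d + n * m;
}

/* Sequence as currently generated, including the second MOV.  */
float
model_with_mov (float v0, float v1, float v2, float c3_0, float c3_2,
                float v4)
{
  float v31 = v0 * v0;          /* fmul v31, v0, v0 */
  v2 = fmla (v2, v31, c3_0);    /* fmla v2, v31, v3.s[0] */
  v4 = fmla (v4, v31, c3_2);    /* fmla v4, v31, v3.s[2] */
  float v30 = v2;               /* mov v30, v2 (needed: v2 is read again) */
  v30 = fmla (v30, v31, v2);    /* fmla v30, v31, v2 */
  v2 = fmla (v2, v31, v4);      /* fmla v2, v31, v4 */
  v31 = v30;                    /* mov v31, v30 (the copy in question) */
  v31 = fmla (v31, v0, v2);     /* fmla v31, v0, v2 */
  v1 = fmla (v1, v0, v31);      /* fmla v1, v0, v31 */
  return v1;                    /* mov v0, v1; ret */
}

/* Hypothetical sequence without the second MOV: the FMA accumulates
   into v30 directly, which is safe because v30 has no later reader.  */
float
model_without_mov (float v0, float v1, float v2, float c3_0, float c3_2,
                   float v4)
{
  float v31 = v0 * v0;
  v2 = fmla (v2, v31, c3_0);
  v4 = fmla (v4, v31, c3_2);
  float v30 = v2;
  v30 = fmla (v30, v31, v2);
  v2 = fmla (v2, v31, v4);
  v30 = fmla (v30, v0, v2);     /* was: mov v31, v30; fmla v31, v0, v2 */
  v1 = fmla (v1, v0, v30);      /* was: fmla v1, v0, v31 */
  return v1;
}
```

Both functions perform the same arithmetic in the same order, so they return identical values for any input; only the register holding the intermediate differs. With small integer inputs such as (2, 1, 3, 1, 2, 5) every intermediate is exactly representable and both return 307.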
It seems that we know the lifetime, since the accumulator r103 has a REG_DEAD note on insn 17:
(insn 17 16 18 2 (set (reg:V4SF 101 [ _8 ])
        (fma:V4SF (reg:V4SF 118 [ x ])
            (reg:V4SF 102 [ _9 ])
            (reg:V4SF 103 [ _10 ]))) "":11639:10 2407 {fmav4sf4}
     (expr_list:REG_DEAD (reg:V4SF 103 [ _10 ])
        (expr_list:REG_DEAD (reg:V4SF 102 [ _9 ])
            (nil))))
but LRA still creates a new pseudo for the tied operand and inserts reloads before and after the insn:
Choosing alt 0 in insn 17: (0) =w (1) w (2) w (3) 0 {fmav4sf4}
Creating newreg=125 from oldreg=103, assigning class FP_REGS to r125
17: r125:V4SF={r118:V4SF*r102:V4SF+r125:V4SF}
REG_DEAD r103:V4SF
REG_DEAD r102:V4SF
Inserting insn reload before:
33: r125:V4SF=r103:V4SF
Inserting insn reload after:
34: r101:V4SF=r125:V4SF