https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114515
Bug ID: 114515
Summary: [14 Regression] Failure to use aarch64 lane forms after PR101523
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---

The following test regressed on aarch64 after
g:839bc42772ba7af66af3bd16efed4a69511312ae (the fix for PR101523):

    typedef float v4sf __attribute__((vector_size(16)));
    void f (v4sf *ptr, float f)
    {
      ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
      ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
    }

Compiled with -O2, we previously generated:

        ldp     q1, q31, [x0]
        fmul    v1.4s, v1.4s, v0.s[0]
        fmul    v31.4s, v31.4s, v0.s[0]
        stp     q1, q31, [x0]
        ret

Now we generate:

        ldp     q1, q31, [x0]
        dup     v0.4s, v0.s[0]
        fmul    v1.4s, v1.4s, v0.4s
        fmul    v31.4s, v31.4s, v0.4s
        stp     q1, q31, [x0]
        ret

with the extra dup.

The patch is trying to avoid cases where i3 is canonicalised by
contextual information provided by i2.  But here we place a full copy
of i2 into i3 (creating an instruction that is no more expensive).
This is a benefit in its own right because the two instructions can
then execute in parallel rather than serially.  But it also means that,
as here, we might be able to remove i2 with later combinations.

Perhaps we could also check whether i3 still contains the destination
of i2?
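For anyone reproducing this locally, the testcase can be kept in a single translation unit together with a small sanity check, so that the same file serves both for inspecting the assembly and for confirming that the two code sequences compute the same values. This is only a sketch: `check_f` is a hypothetical helper that is not part of the report, and the input values are arbitrary.

```c
#include <string.h>

typedef float v4sf __attribute__((vector_size(16)));

/* Testcase from the report: both multiplies use the same splatted scalar.  */
void f (v4sf *ptr, float f)
{
  ptr[0] = ptr[0] * (v4sf) { f, f, f, f };
  ptr[1] = ptr[1] * (v4sf) { f, f, f, f };
}

/* Hypothetical helper (not in the report): run f on eight known floats
   and return how many lanes match the scalar reference product.  */
int check_f (float scale)
{
  v4sf buf[2] = { { 1.0f, 2.0f, 3.0f, 4.0f }, { 5.0f, 6.0f, 7.0f, 8.0f } };
  float out[8];
  int ok = 0;

  f (buf, scale);
  memcpy (out, buf, sizeof out);
  for (int i = 0; i < 8; i++)
    ok += (out[i] == (float) (i + 1) * scale);
  return ok;
}
```

Compiling this with an aarch64 GCC at -O2 and -S and looking at the body of f should show whether the dup is present; check_f only confirms the arithmetic and says nothing about which instruction form was selected.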