https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109072
            Bug ID: 109072
           Summary: [12/13 Regression] SLP costs for vec duplicate too high
                    since g:4963079769c99c4073adfd799885410ad484cbbe
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following example

---
#include <arm_neon.h>

float32x4_t f (float32x4_t v, float res)
{
  float data[4];
  data[0] = res;
  data[1] = res;
  data[2] = res;
  data[3] = res;
  return vld1q_f32 (&data[0]);
}
---

compiled with -Ofast fails to SLP starting with GCC 12.

This used to generate:

f:
        dup     v0.4s, v1.s[0]
        ret

and now generates:

f:
        fmov    w5, s1
        fmov    w1, s1
        fmov    w4, s1
        fmov    w0, s1
        mov     x2, 0
        mov     x3, 0
        bfi     x2, x5, 0, 32
        bfi     x3, x1, 0, 32
        bfi     x2, x4, 32, 32
        bfi     x3, x0, 32, 32
        fmov    d0, x2
        ins     v0.d[1], x3
        ret

The SLP costs went from:

  Vector cost: 2
  Scalar cost: 4

to:

  Vector cost: 12
  Scalar cost: 4

it looks like it's no longer costing it as a duplicate but instead 4 vec
inserts.

bisected to:

commit g:4963079769c99c4073adfd799885410ad484cbbe
Author: Richard Sandiford <richard.sandif...@arm.com>
Date:   Tue Feb 15 18:09:33 2022 +0000

    vect+aarch64: Fix ldp_stp_* regressions

    ldp_stp_1.c, ldp_stp_4.c and ldp_stp_5.c have been failing since
    vectorisation was enabled at -O2.  In all three cases SLP is
    generating vector code when scalar code would be better.

    ....
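For reference, the four stores in the testcase above are semantically a broadcast of `res` into every lane, which is why a single `dup` is the expected code. The following is only a minimal sketch of that equivalence, written with the GCC/Clang generic vector extension so it does not depend on `arm_neon.h`. The type name `v4sf` and the function names `build_by_stores` and `build_by_broadcast` are hypothetical and do not appear in the report, and it has not been checked whether this variant goes through the same SLP costing path as the original.

```c
#include <string.h>

/* Hypothetical stand-in for float32x4_t, using the generic vector
   extension (an assumption of this sketch, not part of the report).  */
typedef float v4sf __attribute__ ((vector_size (16)));

/* Lane-by-lane construction, mirroring the four scalar stores followed
   by a vector load in the reported testcase.  */
v4sf
build_by_stores (float res)
{
  float data[4];
  data[0] = res;
  data[1] = res;
  data[2] = res;
  data[3] = res;

  v4sf out;
  memcpy (&out, data, sizeof out);  /* plays the role of vld1q_f32 */
  return out;
}

/* The same value written directly as a broadcast of one scalar.  */
v4sf
build_by_broadcast (float res)
{
  return (v4sf) { res, res, res, res };
}
```

Both functions return a vector whose four lanes all equal `res`; the report is about the first form no longer being costed as the second.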