https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106
Bug ID: 106106
Summary: SRA scalarizes structure copies
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*

The following example

    #include <arm_neon.h>

    float32x2x2_t f2(const float *p1, const float *p2)
    {
      float32x2x2_t v = vld2_f32(p1);
      return vld2_lane_f32(p2, v, 1);
    }

uses a type `float32x2x2_t`, which is a structure holding an array of two
`float32x2_t` vectors. This type fits within the maximum object size for SRA,
so SRA tries to scalarize it.

However, in doing so it introduces some useless copies:

    D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
    v = D.22939;
    __b = v;
    D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_3(D), __b, 1); [tail call]

becomes

    D.22939 = __builtin_aarch64_ld2v2sf (p1_2(D));
    v$val$0_3 = D.22939.val[0];
    v$val$1_5 = D.22939.val[1];
    __b.val[0] = v$val$0_3;
    __b.val[1] = v$val$1_5;
    D.22937 = __builtin_aarch64_ld2_lanev2sf (p2_4(D), __b, 1); [tail call]

Breaking the structures up causes problems for register allocation: these
types require sequential register allocation, and reload is unable to
consolidate all the copies, resulting in superfluous register moves:

    f2:
            ld2     {v2.2s - v3.2s}, [x0]
            mov     v0.8b, v2.8b
            mov     v1.8b, v3.8b
            ld2     {v0.s - v1.s}[1], [x1]
            ret

The following snippet from a real library using intrinsics shows the resulting
carnage: https://godbolt.org/z/xnre3Pe34

Perhaps SRA should not scalarize a type if it's just being used in a copy? Or
should there be a way to prevent scalarization of certain types?
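
For readers without an AArch64 toolchain to hand, the effect of the transformation can be sketched at source level in portable C. This is only an illustration: `vec2f`, `vec2fx2`, `copy_whole`, `copy_scalarized` and `copies_match` are hypothetical names standing in for `float32x2_t`, `float32x2x2_t` and the two GIMPLE forms quoted above, and the second function is a hand-written approximation of what SRA produces, not actual compiler output.

```c
/* Hypothetical stand-ins for the NEON types; not the real arm_neon.h
   definitions.  vec2f models float32x2_t, vec2fx2 models float32x2x2_t. */
typedef struct { float lane[2]; } vec2f;
typedef struct { vec2f val[2]; } vec2fx2;

/* The copy as written in the source: a single aggregate assignment,
   corresponding to "v = D.22939; __b = v;" before SRA. */
vec2fx2 copy_whole(const vec2fx2 *src)
{
    vec2fx2 v = *src;
    vec2fx2 b = v;
    return b;
}

/* Hand-written approximation of the scalarized form: each member is pulled
   into its own temporary and the aggregate is rebuilt piecewise, mirroring
   "v$val$0 = D.val[0]; ... __b.val[0] = v$val$0; ...". */
vec2fx2 copy_scalarized(const vec2fx2 *src)
{
    vec2f v_val_0 = src->val[0];
    vec2f v_val_1 = src->val[1];
    vec2fx2 b;
    b.val[0] = v_val_0;
    b.val[1] = v_val_1;
    return b;
}

/* Returns 1 when both forms produce the same lanes, 0 otherwise. */
int copies_match(const vec2fx2 *src)
{
    vec2fx2 a = copy_whole(src);
    vec2fx2 b = copy_scalarized(src);
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            if (a.val[i].lane[j] != b.val[i].lane[j])
                return 0;
    return 1;
}
```

Both functions compute the same result; the concern in this report is only that the piecewise form hides from later passes the fact that the two halves must stay in consecutive registers.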