https://gcc.gnu.org/bugzilla/show_bug.cgi?id=109391
            Bug ID: 109391
           Summary: Inefficient codegen on AArch64 when structure types are
                    returned
           Product: gcc
           Version: 13.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ra
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
                CC: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

This example https://godbolt.org/z/Pe3f3ozGf

---
#include <arm_neon.h>

int16x8x3_t bsl(const uint16x8x3_t *check, const int16x8x3_t *in1,
                const int16x8x3_t *in2) {
  int16x8x3_t out;
  for (uint32_t j = 0; j < 3; j++) {
    out.val[j] = vbslq_s16(check->val[j], in1->val[j], in2->val[j]);
  }
  return out;
}
---

Generates:

bsl:
        ldp     q6, q16, [x1]
        ldp     q0, q4, [x2]
        ldp     q5, q7, [x0]
        bsl     v5.16b, v6.16b, v0.16b
        ldr     q0, [x2, 32]
        bsl     v7.16b, v16.16b, v4.16b
        ldr     q6, [x1, 32]
        mov     v1.16b, v5.16b
        ldr     q5, [x0, 32]
        bsl     v5.16b, v6.16b, v0.16b
        mov     v0.16b, v1.16b
        mov     v1.16b, v7.16b
        mov     v2.16b, v5.16b
        ret

with 3 superfluous moves.  It looks like reload is having trouble dealing with
the new compound types as return arguments.
So in RTL we have:

(insn 17 20 22 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 16)
        (xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 115 [ in2_11(D)->val[1] ])
                    (reg:V8HI 114 [ in1_10(D)->val[1] ]))
                (reg:V8HI 113 [ check_9(D)->val[1] ]))
            (reg:V8HI 115 [ in2_11(D)->val[1] ]))) "/app/example.c":7:16 discrim 1 2558 {aarch64_simd_bslv8hi_internal}
     (expr_list:REG_DEAD (reg:V8HI 115 [ in2_11(D)->val[1] ])
        (expr_list:REG_DEAD (reg:V8HI 114 [ in1_10(D)->val[1] ])
            (expr_list:REG_DEAD (reg:V8HI 113 [ check_9(D)->val[1] ])
                (nil)))))
(insn 22 17 29 2 (set (subreg:V8HI (reg/v:V3x8HI 105 [ out ]) 32)
        (xor:V8HI (and:V8HI (xor:V8HI (reg:V8HI 118 [ in2_11(D)->val[2] ])
                    (reg:V8HI 117 [ in1_10(D)->val[2] ]))
                (reg:V8HI 116 [ check_9(D)->val[2] ]))
            (reg:V8HI 118 [ in2_11(D)->val[2] ]))) "/app/example.c":7:16 discrim 1 2558 {aarch64_simd_bslv8hi_internal}
     (expr_list:REG_DEAD (reg:V8HI 118 [ in2_11(D)->val[2] ])
        (expr_list:REG_DEAD (reg:V8HI 117 [ in1_10(D)->val[2] ])
            (expr_list:REG_DEAD (reg:V8HI 116 [ check_9(D)->val[2] ])
                (nil)))))
(insn 29 22 30 2 (set (reg/i:V3x8HI 32 v0)
        (reg/v:V3x8HI 105 [ out ])) "/app/example.c":10:1 3964 {*aarch64_movv3x8hi}
     (expr_list:REG_DEAD (reg/v:V3x8HI 105 [ out ])
        (nil)))
(insn 30 29 37 2 (use (reg/i:V3x8HI 32 v0)) "/app/example.c":10:1 -1
     (nil))

Reload then decides to insert a bunch of reloads:

         Choosing alt 0 in insn 17:  (0) =w  (1) 0  (2) w  (3) w {aarch64_simd_bslv8hi_internal}
      Creating newreg=126 from oldreg=113, assigning class FP_REGS to r126
   17: r126:V8HI=r115:V8HI^r114:V8HI&r126:V8HI^r115:V8HI
      REG_DEAD r115:V8HI
      REG_DEAD r114:V8HI
      REG_DEAD r113:V8HI
    Inserting insn reload before:
   43: r126:V8HI=r113:V8HI
    Inserting insn reload after:
   44: r105:V3x8HI#16=r126:V8HI

which introduces these moves.

The problem existed with the previous structure types as well (OImode etc), so
it's not new, but it costs us a lot of performance.

I don't think I can fix this with the same pass as
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106106, can I?  It looks like in
this case the RTL looks fine.
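
For anyone decoding the RTL above: the xor/and/xor expression in insns 17 and 22 is the bitwise-select operation that the vbslq_s16 calls expand to. Below is a minimal scalar sketch in plain C, on one 16-bit lane rather than a full vector, suggesting that this form agrees with the more familiar (mask & a) | (~mask & b) definition. The function names are hypothetical and are not part of GCC or arm_neon.h.

```c
#include <assert.h>
#include <stdint.h>

/* Familiar definition of bitwise select: take bits from a where the
   mask bit is 1, and bits from b where the mask bit is 0.  */
uint16_t bsl_select_form(uint16_t mask, uint16_t a, uint16_t b)
{
    return (uint16_t)((mask & a) | (~mask & b));
}

/* Shape that appears in the RTL dump: ((b ^ a) & mask) ^ b, with
   mask = check, a = in1, b = in2.  */
uint16_t bsl_xor_form(uint16_t mask, uint16_t a, uint16_t b)
{
    return (uint16_t)(((b ^ a) & mask) ^ b);
}

/* Hypothetical helper: count disagreements between the two forms over
   a small set of sample lane values (not an exhaustive proof).  */
unsigned count_bsl_mismatches(void)
{
    static const uint16_t samples[] = {
        0x0000, 0xFFFF, 0xFF00, 0x00FF, 0x1234, 0xABCD, 0x8001, 0x5555
    };
    const unsigned n = sizeof samples / sizeof samples[0];
    unsigned mismatches = 0;

    for (unsigned m = 0; m < n; m++)
        for (unsigned i = 0; i < n; i++)
            for (unsigned j = 0; j < n; j++)
                if (bsl_select_form(samples[m], samples[i], samples[j])
                    != bsl_xor_form(samples[m], samples[i], samples[j]))
                    mismatches++;

    return mismatches;
}

/* Hypothetical self-check wrapper around the helper above.  */
void check_bsl_forms(void)
{
    assert(count_bsl_mismatches() == 0);
}
```

In the xor form the in2 operand appears twice, which matches the two uses of r115 (and r118) in the insns above. The tied operand that reload copies into r126 is the mask (r113), per the "(1) 0" constraint in the chosen alternative.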