https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049
Bug ID: 104049
Summary: [12 Regression] vec_select to subreg lowering causes superfluous moves
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64-*

Consider:

#include <stdint.h>

int test (uint8_t *p, uint32_t t[1][1], int n)
{
  int sum = 0;
  uint32_t a0;

  for (int i = 0; i < 4; i++, p++)
    t[i][0] = p[0];

  for (int i = 0; i < 4; i++)
    {
      {
        int t0 = t[0][i] + t[0][i];
        a0 = t0;
      };
      sum += a0;
    }

  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

which, after the reduction gets SLP'd, used to generate at -O3:

        addv    s0, v0.4s
        fmov    w0, s0
        lsr     w1, w0, 16
        add     w0, w1, w0, uxth
        lsr     w0, w0, 1

which was pretty good.  However, in GCC 12 we now generate worse code:

        addv    s0, v0.4s
        fmov    w0, s0
        fmov    w1, s0
        and     w0, w0, 65535
        add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1

Notice the double transfer of the same value.

This is because at the RTL level the original mov becomes a vec_select:

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 117)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

which previously stayed as a vec_select, and the RA would use this pattern for the w -> r move.  Now, however, this vec_select gets transformed into a subreg 0, which causes combine to push the subreg into each instruction using reg 102:

(insn 21 18 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:SI 120)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))

Because these operations don't exist on the w side, reload is forced to materialize many duplicate moves from w -> r.  So every operation that gets the subreg pushed into it, and for which we don't have an equivalent operation on the w side, gets an extra move.

Aside from that, we seem to lose the fact that the & can be folded into the subreg by simply truncating the subreg from SI to HI and zero-extending that back out.

A different reproducer is:

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here, which is better than GCC 11.
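As a rough illustration of the missed fold mentioned above (this is a hand-written sketch, not an actual dump, and it assumes little-endian so that lane 0 of the vector starts at subreg byte offset 0), insn 21 could in principle be rewritten as:

;; Hypothetical folded form of insn 21: the AND with 0xffff is expressed as a
;; zero-extension of a HImode subreg of the vector, instead of an AND on an
;; SImode subreg, so no separate "and" instruction is needed on the r side.
(set (reg:SI 120)
     (zero_extend:SI (subreg:HI (reg:V4SI 117) 0)))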