https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104049

            Bug ID: 104049
           Summary: [12 Regression] vec_select to subreg lowering causes
                    superfluous moves
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64-*

Consider:

#include <stdint.h>

int test (uint8_t *p, uint32_t t[1][1], int n) {
  int sum = 0;
  uint32_t a0;
  for (int i = 0; i < 4; i++, p++)
    t[i][0] = p[0];

  for (int i = 0; i < 4; i++) {
    {
      int t0 = t[0][i] + t[0][i];
      a0 = t0;
    };
    sum += a0;
  }
  return (((uint16_t)sum) + ((uint32_t)sum >> 16)) >> 1;
}

which, after the reduction gets SLP'd, used to generate at -O3:

        addv    s0, v0.4s
        fmov    w0, s0
        lsr     w1, w0, 16
        add     w0, w1, w0, uxth
        lsr     w0, w0, 1

which was pretty good.  However, in GCC 12 we now generate worse code:

        addv    s0, v0.4s
        fmov    w0, s0
        fmov    w1, s0
        and     w0, w0, 65535
        add     w0, w0, w1, lsr 16
        lsr     w0, w0, 1

Notice the double transfer of the same value.

This is because at the RTL level the original mov becomes a vec_select:

(insn 19 18 20 2 (set (reg:SI 102 [ _43 ])
        (vec_select:SI (reg:V4SI 117)
            (parallel [
                    (const_int 0 [0])
                ]))) -1
     (nil))

which previously stayed a vec_select, and the RA would use this pattern for
the w -> r move.
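
For reference, the backend's lane-extract pattern (aarch64_get_lane<mode> in
aarch64-simd.md) looks roughly like the sketch below (simplified, the real
definition has more detail); its "?r" alternative is what gave us the
single-instruction w -> r move when the vec_select survived to RA:

;; Simplified sketch of the lane-extract pattern, not the exact source.
(define_insn "aarch64_get_lane<mode>"
  [(set (match_operand:<VEL> 0 "nonimmediate_operand" "=?r, w, Utv")
        (vec_select:<VEL>
          (match_operand:VALL_F16 1 "register_operand" "w, w, w")
          (parallel [(match_operand:SI 2 "immediate_operand" "i, i, i")])))]
  "TARGET_SIMD"
  "@
   umov\t%w0, %1.<Vetype>[%2]
   dup\t%<Vetype>0, %1.<Vetype>[%2]
   st1\t{%1.<Vetype>}[%2], %0"
)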

Now, however, this vec_select gets transformed into a subreg 0, which causes
combine to push the subreg into each instruction that uses reg 102:

(insn 21 18 22 2 (set (reg:SI 120)
        (and:SI (subreg:SI (reg:V4SI 117) 0)
            (const_int 65535 [0xffff]))) "/app/example.c":30:27 492 {andsi3}
     (nil))
(insn 22 21 28 2 (set (reg:SI 121)
        (plus:SI (lshiftrt:SI (subreg:SI (reg:V4SI 117) 0)
                (const_int 16 [0x10]))
            (reg:SI 120))) "/app/example.c":30:27 211 {*add_lsr_si}
     (expr_list:REG_DEAD (reg:SI 120)
        (expr_list:REG_DEAD (reg:V4SI 117)
            (nil))))

and because these operations don't exist on the w side, reload is forced to
materialize many duplicate moves from w -> r.  So every operation that gets
the subreg pushed into it, and for which we don't have an equivalent
operation on the w side, gets an extra move.

Aside from that, we seem to lose the fact that the & can be folded into the
subreg by simply truncating the subreg from SI to HI and zero-extending the
result.
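
In RTL terms the missed fold would be along these lines (a sketch, reusing
reg 117 from the dump above; lane 0 on little-endian):

(and:SI (subreg:SI (reg:V4SI 117) 0)
        (const_int 65535 [0xffff]))

  ==>  (zero_extend:SI (subreg:HI (reg:V4SI 117) 0))

On AArch64 the zero-extended HI lane extract can then be a single
"umov w0, v0.h[0]" (UMOV from a .h lane zero-extends into the w register),
removing both the and and one of the transfers.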

A different reproducer is:

#include <arm_neon.h>

typedef int v4si __attribute__ ((vector_size (16)));

int bar (v4si x)
{
  unsigned int sum = vaddvq_s32 (x);
  return (((uint16_t)(sum & 0xffff)) + ((uint32_t)sum >> 16));
}

Note that using -frename-registers does get us to an optimal sequence here,
which is better than the GCC 11 code.
