https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114192

            Bug ID: 114192
           Summary: scalar code left around following early break
                    vectorization of reduction
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

For the following testcase:

int a[1024];
int f4(int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
    {
        sum += a[i];
        if (a[i] == 42)
            break;
    }
    return sum;
}

at -O3 on aarch64 we vectorize it and get the following vector loop:

.L4:
        cmp     x7, x2
        beq     .L23
.L6:
        ubfiz   x3, x2, 4, 32
        ldr     w6, [x4, x2, lsl 2]    // scalar load
        mov     v27.16b, v30.16b
        mov     w0, w5
        add     v30.4s, v30.4s, v25.4s
        add     w5, w5, w6             // scalar add
        ldr     q29, [x4, x3]
        add     x2, x2, 1
        cmeq    v31.4s, v29.4s, v26.4s
        add     v28.4s, v28.4s, v29.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L4

but here the old scalar code (the scalar load and add marked above) has been
left in the vector loop.  If we remove the early exit from the loop, the
vectorizer still leaves the scalar code around, but it gets optimized away
immediately by the following DCE pass.
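
For reference, a minimal sketch of the testcase with the early exit removed
might look like the following.  The name f4_noexit is hypothetical and does not
appear in the original testcase; the unused pointer parameter is kept only so
that the signature matches f4.

```c
int a[1024];

/* Hypothetical variant of f4 with the early break removed (the name
   f4_noexit is an assumption for illustration).  The parameter x is
   unused, as in the original testcase.  */
int f4_noexit(int *x, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```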

Without the early exit, in the vectorizer dump we have:

  <bb 3> [local count: 860067200]:
  # sum_10 = PHI <sum_6(6), 0(9)>
  # i_12 = PHI <i_7(6), 0(9)>
  # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
  # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
  # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
  vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
  _1 = a[i_12]; // scalar load
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  sum_6 = _1 + sum_10;
  i_7 = i_12 + 1;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
    goto <bb 6>; [89.00%]
  else
    goto <bb 11>; [11.00%]

i.e. the scalar load and add are left around, but they seem to get cleaned up
by the (immediately following) DCE pass:

  <bb 3> [local count: 860067200]:
  # vect_sum_10.8_25 = PHI <vect_sum_6.12_29(6), { 0, 0, 0, 0 }(9)>
  # vectp_a.9_26 = PHI <vectp_a.9_27(6), &a(9)>
  # ivtmp_32 = PHI <ivtmp_33(6), 0(9)>
  vect__1.11_28 = MEM <vector(4) int> [(int *)vectp_a.9_26];
  vect_sum_6.12_29 = vect__1.11_28 + vect_sum_10.8_25;
  vectp_a.9_27 = vectp_a.9_26 + 16;
  ivtmp_33 = ivtmp_32 + 1;
  if (ivtmp_33 < bnd.5_22)
    goto <bb 6>; [89.00%]
  else
    goto <bb 11>; [11.00%]

Perhaps DCE needs improving so that it also cleans up the dead scalar code in
the early exit case.
