https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124162

            Bug ID: 124162
           Summary: [16 Regression] wrong first iteration mask in SVE loop
                    with partial vectors
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: wrong-code
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*

The following loop

char b = 41;
int main() {
  signed char a[31];
#pragma GCC novector
  for (int c = 0; c < 31; ++c)
    a[c] = c * c + c % 5;
  {
    signed char *d = a;
#pragma GCC novector
    for (int c = 0; c < 31; ++c, b += -16)
      d[c] += b;
  }
  for (int c = 0; c < 31; ++c) {
    signed char e = c * c + c % 5 + 41 + c * -16;
    if (a[c] != e)
      __builtin_abort();
  }
}

compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve

generates

        ptrue   p6.b, vl32
        add     x2, x2, :lo12:.LC0
        add     w5, w5, 16
        ld1rw   z25.s, p6/z, [x2]
        strb    w5, [x6, #:lo12:.LANCHOR0]
        mov     w0, 0
        mov     p7.b, p6.b
        mov     w2, 31
        index   z30.s, #0, #1
        mov     z26.s, #5
        mov     z27.b, #41
.L6:
        mov     z29.d, z30.d
        movprfx z28, z30
        add     z28.b, z28.b, #240
        mad     z29.b, p6/m, z28.b, z27.b
        mov     w3, w0
        movprfx z31, z30
        smulh   z31.s, p6/m, z31.s, z25.s
        add     w0, w0, 8
        asr     z31.s, z31.s, #1
        msb     z31.s, p6/m, z26.s, z30.s
        add     z31.b, z31.b, z29.b
        ld1b    z29.s, p7/z, [x1]
        cmpne   p7.b, p7/z, z31.b, z29.b
        b.any   .L15
        add     x1, x1, 8
        add     z30.s, z30.s, #8
        whilelo p7.s, w0, w2
        b.any   .L6

The first iteration uses a predicate in which all bits are 1, i.e. all byte
lanes are active.
This causes the cmpne to set the wrong CC flags.  The second iteration uses

        whilelo p7.s, w0, w2

which gives the correct mask layout going forwards.
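
To make the two mask layouts concrete, here is a small C model of the predicate
bits for a 256-bit vector (one predicate bit per vector byte).  It is only a
sketch: the function names are made up for illustration, and the whilelo model
ignores wrap-around of the counter.

```c
#include <stdint.h>

/* One predicate bit per vector byte: 32 bits for -msve-vector-bits=256.  */
enum { VL_BYTES = 32 };

/* Hypothetical model of "ptrue pN.b, vl32": every byte lane is active.  */
static uint32_t model_ptrue_b_vl32(void)
{
  return 0xFFFFFFFFu;
}

/* Hypothetical model of "whilelo pN.s, lo, hi": one bit per 32-bit
   element (the lowest bit of each group of four), set while lo + lane < hi.
   Counter overflow is not modelled.  */
static uint32_t model_whilelo_s(uint32_t lo, uint32_t hi)
{
  uint32_t mask = 0;
  for (uint32_t lane = 0; lane < VL_BYTES / 4; lane++)
    if (lo + lane < hi)
      mask |= 1u << (lane * 4);
  return mask;
}
```

Under this model the copied ptrue gives 0xFFFFFFFF for the first iteration,
whereas model_whilelo_s (0, 31) gives 0x11111111 and the final iteration,
model_whilelo_s (24, 31), gives 0x01111111.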

This is due to the CSE'ing code that tries to share predicates as much as
possible.

It creates the sparse predicate VNx4QI from a VNx16QI and then has a truncate
operation.
The truncate operation results in a simple copy:

        mov     p7.b, p6.b

because in the data model for partial vectors the upper lanes are *don't care*,
so computations using this vector are fine.  However, for comparisons, or any
other operation that sets flags, the predicate value does matter; otherwise we
get the wrong flags, as above.
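
The effect on the flags can be sketched with a simplified C model of a
byte-wise cmpne followed by b.any.  This is an illustration, not the
architectural pseudocode: the helper names are hypothetical, 0xAA is an
arbitrary stand-in for the don't-care padding bytes of the computed vector, and
the loaded vector has zero padding because ld1b zero-extends each byte into its
32-bit container.

```c
#include <stdbool.h>
#include <stdint.h>

enum { VL_BYTES = 32, CONTAINER_BYTES = 4 };

/* Simplified model of "cmpne pD.b, pG/z, zA.b, zB.b" followed by b.any:
   true if any byte lane whose governing predicate bit is set differs.  */
static bool cmpne_b_any(uint32_t pg, const uint8_t a[VL_BYTES],
                        const uint8_t b[VL_BYTES])
{
  for (int i = 0; i < VL_BYTES; i++)
    if (((pg >> i) & 1u) && a[i] != b[i])
      return true;
  return false;
}

/* Build two vectors that agree in every payload byte (the low byte of each
   32-bit container) and differ only in the padding bytes, then compare
   them under the governing predicate PG.  */
static bool spurious_mismatch(uint32_t pg)
{
  uint8_t computed[VL_BYTES], loaded[VL_BYTES];
  for (int i = 0; i < VL_BYTES; i++) {
    bool payload = (i % CONTAINER_BYTES) == 0;
    uint8_t value = (uint8_t)(41 + i / CONTAINER_BYTES);
    loaded[i] = payload ? value : 0x00;    /* zero-extended load  */
    computed[i] = payload ? value : 0xAA;  /* assumed don't-care bytes  */
  }
  return cmpne_b_any(pg, computed, loaded);
}
```

With the all-true predicate, spurious_mismatch (0xFFFFFFFF) reports a
difference even though every payload byte matches, which corresponds to the
branch to .L15 being taken.  With the sparse layout, spurious_mismatch
(0x11111111) reports none.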

There are two ways to solve this:

1. Restore the ptest for partial vector compares.  This would slow down the
loop, though, and introduce a second ptrue .s, VL8 predicate.

2. Disable the sharing of partial vector predicates.  This allows us to keep
the ptest removed.  Since the ptest would introduce a second predicate here
anyway, I'm leaning towards disabling sharing between partial and full
predicates.

Testing a patch for both.

Filed just for a PR number.
