https://gcc.gnu.org/bugzilla/show_bug.cgi?id=124162
Bug ID: 124162
Summary: [16 Regression] wrong first iteration mask in SVE loop with partial vectors
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: wrong-code
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*
The following loop
char b = 41;
int main() {
signed char a[31];
#pragma GCC novector
for (int c = 0; c < 31; ++c)
a[c] = c * c + c % 5;
{
signed char *d = a;
#pragma GCC novector
for (int c = 0; c < 31; ++c, b += -16)
d[c] += b;
}
for (int c = 0; c < 31; ++c) {
signed char e = c * c + c % 5 + 41 + c * -16;
if (a[c] != e)
__builtin_abort();
}
}
compiled with -O2 -ftree-vectorize -msve-vector-bits=256 -march=armv8.2-a+sve
generates
ptrue p6.b, vl32
add x2, x2, :lo12:.LC0
add w5, w5, 16
ld1rw z25.s, p6/z, [x2]
strb w5, [x6, #:lo12:.LANCHOR0]
mov w0, 0
mov p7.b, p6.b
mov w2, 31
index z30.s, #0, #1
mov z26.s, #5
mov z27.b, #41
.L6:
mov z29.d, z30.d
movprfx z28, z30
add z28.b, z28.b, #240
mad z29.b, p6/m, z28.b, z27.b
mov w3, w0
movprfx z31, z30
smulh z31.s, p6/m, z31.s, z25.s
add w0, w0, 8
asr z31.s, z31.s, #1
msb z31.s, p6/m, z26.s, z30.s
add z31.b, z31.b, z29.b
ld1b z29.s, p7/z, [x1]
cmpne p7.b, p7/z, z31.b, z29.b
b.any .L15
add x1, x1, 8
add z30.s, z30.s, #8
whilelo p7.s, w0, w2
b.any .L6
The first iteration uses a predicate in which all bits are 1, i.e. all byte
lanes are active (p7 is a plain copy of the ptrue in p6).
This causes the result of the cmpne to set the wrong CC flags. The second
iteration uses
whilelo p7.s, w0, w2
which gives the correct mask layout going forwards.
This is due to the CSE'ing code that tries to share predicates as much as
possible.
It creates the sparse predicate VNx4QI from a VNx16QI and then has a truncate
operation.
The truncate operation results in a simple copy:
mov p7.b, p6.b
because in the data model for partial vectors the upper lanes are *don't care*,
so computations using this vector are fine. However, for comparisons, or any
other operation that sets flags, the predicate value does matter; otherwise we
get the wrong flags, as above.
There are two ways to solve this:
1. restore the ptest for partial vector compares. This would slow down the
loop though and introduce a second ptrue .s, VL8 predicate.
2. disable the sharing of partial vector predicates. This allows us to remove
the ptest. Since the ptest would introduce a second predicate here anyway I'm
leaning towards disabling sharing between partial and full predicates.
Testing a patch for both.
Filed just for a PR number.