https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98119
            Bug ID: 98119
           Summary: SVE: Wrong code with -O1 -ftree-vectorize
                    -msve-vector-bits=512 -mtune=thunderx
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

AArch64 GCC miscompiles the following testcase:

_Bool a[34];
int main() {
  for (long b = 0; b < 2; ++b)
    for (long c = 0; c < 17; ++c)
      a[b * 2 + c] = 1;
  for (long c = 0; c < 7; ++c)
    if (!a[2 + c])
      __builtin_abort();
}

at -O1 -ftree-vectorize -march=armv8.2-a+sve -msve-vector-bits=512
-mtune=thunderx. Removing any one of these flags, the issue goes away.

Obviously, this is not a sensible choice of -mtune given that we're asking for
SVE, but it seems that the scheduling should not result in a miscompile.

Looking at a snippet of the broken code:

main:
.LFB0:
        .cfi_startproc
        adrp    x2, .LANCHOR0
        add     x2, x2, :lo12:.LANCHOR0
        and     w3, w2, 63
        and     x0, x2, -64     // align x2 down
        add     w1, w3, 17
        whilelo p0.d, wzr, w1
        whilelo p1.d, wzr, w3
        not     p0.b, p0/z, p1.b
        mov     z0.b, #1
        st1b    z0.d, p0, [x0]  // no-op (p0 all 0s)
        mov     w3, 8
        whilelo p0.d, w3, w1
        b.none  .L2
        add     x4, x0, 8
        st1b    z0.d, p0, [x4]  // stores out-of-bounds
        add     x0, x0, 16
        mov     w3, 16
        whilelo p0.d, w3, w1
        b.none  .L2
        st1b    z0.d, p0, [x0]

We initially compute the address of our array (a) in x2, and then align this
down to the nearest 64-byte-aligned address, storing the result in x0. We then
add 8 to this, and store a vector to this address. But this address can be
out-of-bounds (suppose a is only 16-byte aligned). So things have already
started to go downhill by this point.