https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125204
Bug ID: 125204
Summary: [15/16 Regression] SLP vectorization of loop with
early exit lost at -O3 -fvect-cost-model=unlimited
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
GCC 15.2.0 was able to vectorize a loop with a strided iteration pattern (`i -=
3`), an early exit condition (`if (a[i].m0 != 0) break`), and a load
permutation (`{ 6 0 }`), using NEON SIMD instructions. GCC trunk no longer
vectorizes this loop and reports "unsupported SLP instances", even though
`-fvect-cost-model=unlimited` is explicitly specified.
**Test case:**
```c
#include <stdint.h>
#include <stddef.h>
typedef struct {
    long m0;
    short m1;
    int m2;
} element_t_0;
long foo(
    const element_t_0 * __restrict__ a,
    long * __restrict__ out,
    int n
) {
    for (int i = n - 1; i >= 0; i -= 3)
    {
        out[i] = (((long)a[(i + 14)].m0));
        if ((a[i].m0 != 0)) {
            break;
        }
    }
    return 0;
}
```
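For anyone who wants to run the loop rather than only inspect the dumps, a minimal driver could look like the sketch below. It is not part of the original reproducer: `run_case`, the `-1` sentinel, and the `1000 + k` fill values are hypothetical choices made for illustration. The only constraint taken from the test case is that `a` needs at least `n + 14` elements, because the loop reads `a[i + 14]`.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    long m0;
    short m1;
    int m2;
} element_t_0;

/* The loop from the test case, kept out of line so it is compiled as a unit. */
__attribute__((noinline))
long foo(const element_t_0 *__restrict__ a, long *__restrict__ out, int n)
{
    for (int i = n - 1; i >= 0; i -= 3) {
        out[i] = (long)a[i + 14].m0;
        if (a[i].m0 != 0)
            break;
    }
    return 0;
}

/* Hypothetical driver: runs foo() on n elements and returns how many slots
   of out[] were written, or -1 if allocation fails. If 0 <= stop < n, then
   a[stop].m0 is made nonzero so the early exit triggers when i reaches stop. */
int run_case(int n, int stop, long *out)
{
    assert(n >= 0);

    /* foo() reads a[i + 14], so the array needs n + 14 elements. */
    element_t_0 *a = calloc((size_t)n + 14, sizeof *a);
    if (a == NULL)
        return -1;

    /* Recognizable values in the tail that is only ever copied, never tested. */
    for (int k = n; k < n + 14; k++)
        a[k].m0 = 1000 + k;
    if (stop >= 0 && stop < n)
        a[stop].m0 = 1;

    /* Sentinel so that untouched slots can be told apart from stores. */
    for (int k = 0; k < n; k++)
        out[k] = -1;

    foo(a, out, n);

    int written = 0;
    for (int k = 0; k < n; k++)
        written += (out[k] != -1);

    free(a);
    return written;
}
```

Calling `run_case` from a small `main` under both compilers and comparing the returned counts and the contents of `out` is one way to confirm that the scalar and vectorized builds agree.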
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260413 (experimental) [trunk]
```
**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
-fvect-cost-model=unlimited
```
**GCC trunk output:**
```
<source>:16:27: missed: couldn't vectorize loop
<source>:16:27: missed: unsupported SLP instances
<source>:11:6: note: vectorized 0 loops in function.
<source>:19:12: note: ***** Analysis failed with vector mode VNx2DI
<source>:19:12: note: ***** Skipping vector mode VNx16QI, which would repeat
the analysis for VNx2DI
```
Generated assembly (fully scalar, no SIMD instructions used; the label
`test_auto_case_174` is presumably the name `foo` has in the Godbolt source):
```assembly
test_auto_case_174:
subs w3, w2, #1
bmi .L2
sxtw x3, w3
ubfiz x4, x2, 3, 32
add x2, x0, w2, uxtw 4
sub x4, x4, x3, lsl 3
sub x4, x4, #8
add x1, x1, x4
b .L3
.L6:
tbnz w3, #31, .L2
.L3:
ldr x0, [x2, -16]
sub x2, x2, #48
ldr x4, [x2, 256]
str x4, [x1, x3, lsl 3]
sub x3, x3, #3
cbz x0, .L6
.L2:
mov x0, 0
ret
```
Also reproducible on Godbolt: https://godbolt.org/z/d5jjPaP3r.
**GCC 15.2.0 (for comparison):**
```
<source>:16:27: optimized: loop vectorized using 16 byte vectors
<source>:16:27: optimized: loop versioned for vectorization to enhance
alignment
<source>:11:6: note: vectorized 1 loops in function.
...
<source>:22:25: note: ***** Analysis succeeded with vector mode V16QI
<source>:22:25: note: SLPing BB part
<source>:22:25: note: Basic block will be vectorized using SLP
<source>:22:25: note: load permutation { 6 0 }
```
Key vectorized portion (NEON SIMD, using zip1/uzp1 for load permutation,
vectorized early exit check via cmtst/umaxp):
```assembly
.L6:
ldr q27, [x2]
ldr q31, [x2, 48]
sub x2, x2, #192
ldr d28, [x2, 288]
ldr d30, [x2, 336]
zip1 v31.2d, v31.2d, v27.2d
uzp1 v30.2d, v30.2d, v28.2d
shl v30.2d, v30.2d, 2
shl v31.2d, v31.2d, 2
add v30.2d, v30.2d, v26.2d
add v31.2d, v31.2d, v26.2d
str d30, [x5], -24
st1 {v30.d}[1], [x5]
str d31, [x3, 48]
st1 {v31.d}[1], [x6]
cmtst v31.2d, v31.2d, v31.2d
cmtst v30.2d, v30.2d, v30.2d
orr v31.16b, v31.16b, v30.16b
umaxp v31.4s, v31.4s, v31.4s
fmov x10, d31
cbz x10, .L4
```
Godbolt link for the GCC 15.2.0 output: https://godbolt.org/z/eEhnhrEbP.
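One reading of the `cmtst`/`orr`/`umaxp`/`fmov`/`cbz` tail of the GCC 15.2.0 loop is an "is any lane nonzero" reduction over the two 2x64-bit vectors. The scalar C model below sketches that reading. The function names are hypothetical, and the model only mirrors the instruction semantics as understood here; it is not compiler output.

```c
#include <stdint.h>

/* cmtst vN.2d, vN.2d, vN.2d on one lane: all ones if the lane is nonzero. */
static uint64_t cmtst_self(uint64_t lane)
{
    return lane != 0 ? UINT64_MAX : 0;
}

static uint32_t max_u32(uint32_t x, uint32_t y)
{
    return x > y ? x : y;
}

/* Returns 1 if any of the four 64-bit lanes held in v30/v31 is nonzero. */
int exit_check(const uint64_t v30[2], const uint64_t v31[2])
{
    /* cmtst on both vectors, then orr v31.16b, v31.16b, v30.16b */
    uint64_t m0 = cmtst_self(v31[0]) | cmtst_self(v30[0]);
    uint64_t m1 = cmtst_self(v31[1]) | cmtst_self(v30[1]);

    /* umaxp v31.4s: pairwise maximum of adjacent 32-bit elements, so the
       low 64 bits of the result summarize all 128 bits of the mask. */
    uint32_t r0 = max_u32((uint32_t)m0, (uint32_t)(m0 >> 32));
    uint32_t r1 = max_u32((uint32_t)m1, (uint32_t)(m1 >> 32));

    /* fmov x10, d31 */
    uint64_t x10 = (uint64_t)r0 | ((uint64_t)r1 << 32);

    /* cbz x10, .L4 branches when x10 is zero, i.e. when no lane was set. */
    return x10 != 0;
}
```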
**Additional notes:**
- The loop walks a struct array (`sizeof(element_t_0)` is 16 bytes) with a
stride of -3 elements and contains an early exit condition
(`if (a[i].m0 != 0) break`), which makes vectorization more complex but not
impossible.
- GCC 15.2.0 demonstrated that this loop can be vectorized via SLP, using a
load permutation `{ 6 0 }` to handle the strided access pattern and cmtst/umaxp
to check the early exit condition in parallel.
- GCC trunk does not vectorize this loop at all: it reports "unsupported SLP
instances" and falls back to fully scalar code.
- The failure appears to occur during SLP analysis: the load permutation
`{ 6 0 }` that GCC 15 handled seems to be rejected as unsupported now, although
the dump above does not name the failing instance.
- This does not look like a costing issue: `-fvect-cost-model=unlimited` does
not help.
- The regression affects loops that combine strided memory accesses over struct
arrays with early exit conditions.
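The `{ 6 0 }` permutation is consistent with simple layout arithmetic: with 16-byte elements and a stride of 3 elements, the `m0` fields read by consecutive iterations are 48 bytes, or six 64-bit lanes, apart. That matches the `ldr q31, [x2, 48]` offset in the GCC 15.2.0 assembly. The sketch below only spells out that arithmetic. It assumes the permutation indices count 64-bit elements within the load group, which the dump does not state, and it uses `int64_t` in place of `long` so the numbers are the same on any host. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* int64_t stands in for `long`, which is 8 bytes on LP64 AArch64. */
typedef struct {
    int64_t m0;
    short m1;
    int m2;
} element_t_0;

/* Byte distance between the m0 fields read by two consecutive iterations. */
size_t m0_step_bytes(int stride_elems)
{
    assert(stride_elems > 0);
    return (size_t)stride_elems * sizeof(element_t_0);
}

/* The same distance counted in 64-bit lanes. */
size_t m0_step_lanes(int stride_elems)
{
    return m0_step_bytes(stride_elems) / sizeof(int64_t);
}
```

For the loop in this report, `m0_step_bytes(3)` is 48 and `m0_step_lanes(3)` is 6.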