https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125204

            Bug ID: 125204
           Summary: [15/16 Regression] SLP vectorization of loop with
                    early exit lost at -O3 -fvect-cost-model=unlimited
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**
GCC 15.2.0 vectorizes a loop with a strided iteration pattern (`i -= 3`), an
early exit condition (`if (a[i].m0 != 0) break`), and a load permutation
(`{ 6 0 }`), using NEON SIMD instructions. GCC trunk no longer vectorizes this
loop and reports "unsupported SLP instances", even though
`-fvect-cost-model=unlimited` is specified explicitly.

**Test case:**
```c
#include <stdint.h>
#include <stddef.h>
typedef struct {
    long m0;
    short m1;
    int m2;
} element_t_0;
long foo(
    const element_t_0 * __restrict__ a,
    long * __restrict__ out,
    int n
) {
    for (int i = n - 1; i >= 0; i -= 3)
    {
        out[i] = (((long)a[(i + 14)].m0));
        if ((a[i].m0 != 0)) {
            break;
        }
    }
    return 0;
}

```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260413 (experimental) [trunk]
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
-fvect-cost-model=unlimited
```

**GCC trunk output:**
```
<source>:16:27: missed: couldn't vectorize loop
<source>:16:27: missed: unsupported SLP instances
<source>:11:6: note: vectorized 0 loops in function.
<source>:19:12: note: ***** Analysis failed with vector mode VNx2DI
<source>:19:12: note: ***** Skipping vector mode VNx16QI, which would repeat the analysis for VNx2DI
```

Generated assembly (fully scalar, no SIMD instructions used):
```assembly
test_auto_case_174:
        subs    w3, w2, #1
        bmi     .L2
        sxtw    x3, w3
        ubfiz   x4, x2, 3, 32
        add     x2, x0, w2, uxtw 4
        sub     x4, x4, x3, lsl 3
        sub     x4, x4, #8
        add     x1, x1, x4
        b       .L3
.L6:
        tbnz    w3, #31, .L2
.L3:
        ldr     x0, [x2, -16]
        sub     x2, x2, #48
        ldr     x4, [x2, 256]
        str     x4, [x1, x3, lsl 3]
        sub     x3, x3, #3
        cbz     x0, .L6
.L2:
        mov     x0, 0
        ret
```

Also reproducible on Godbolt: https://godbolt.org/z/d5jjPaP3r.

**GCC 15.2.0 (for comparison):**
```
<source>:16:27: optimized: loop vectorized using 16 byte vectors
<source>:16:27: optimized:  loop versioned for vectorization to enhance alignment
<source>:11:6: note: vectorized 1 loops in function.
...
<source>:22:25: note: ***** Analysis succeeded with vector mode V16QI
<source>:22:25: note: SLPing BB part
<source>:22:25: note: Basic block will be vectorized using SLP
<source>:22:25: note: load permutation { 6 0 }
```

Key vectorized portion (NEON SIMD, using zip1/uzp1 for load permutation,
vectorized early exit check via cmtst/umaxp):
```assembly
.L6:
        ldr     q27, [x2]
        ldr     q31, [x2, 48]
        sub     x2, x2, #192
        ldr     d28, [x2, 288]
        ldr     d30, [x2, 336]
        zip1    v31.2d, v31.2d, v27.2d
        uzp1    v30.2d, v30.2d, v28.2d
        shl     v30.2d, v30.2d, 2
        shl     v31.2d, v31.2d, 2
        add     v30.2d, v30.2d, v26.2d
        add     v31.2d, v31.2d, v26.2d
        str     d30, [x5], -24
        st1     {v30.d}[1], [x5]
        str     d31, [x3, 48]
        st1     {v31.d}[1], [x6]
        cmtst   v31.2d, v31.2d, v31.2d
        cmtst   v30.2d, v30.2d, v30.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x10, d31
        cbz     x10, .L4
```

Also reproducible on Godbolt: https://godbolt.org/z/eEhnhrEbP.
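
For readers less familiar with the `cmtst`/`orr`/`umaxp`/`fmov`/`cbz` tail of the GCC 15 loop, the following is a rough scalar model of what that sequence computes. It is only a sketch: the helper names `cmtst_lane` and `any_lane_nonzero` are made up for illustration, the vectors are modelled as two 64-bit lanes each, and little-endian lane order is assumed.

```c
#include <stdint.h>

/* Model of "cmtst vD.2d, vN.2d, vN.2d" for one lane: the lane becomes
   all ones when (x & x) != 0, i.e. when x is nonzero, and zero otherwise. */
uint64_t cmtst_lane(uint64_t x)
{
    return (x & x) != 0 ? UINT64_MAX : 0;
}

/* Hypothetical model of the exit test: returns the value that ends up in
   x10. It is zero only if all four 64-bit lanes of the two inputs are zero. */
uint64_t any_lane_nonzero(const uint64_t v31[2], const uint64_t v30[2])
{
    uint64_t m[2];

    /* cmtst on both vectors, then "orr v31.16b, v31.16b, v30.16b" */
    for (int lane = 0; lane < 2; lane++)
        m[lane] = cmtst_lane(v31[lane]) | cmtst_lane(v30[lane]);

    /* "umaxp v31.4s, v31.4s, v31.4s": pairwise unsigned maximum of
       adjacent 32-bit elements; only the low 64 bits are read afterwards. */
    uint32_t s0 = (uint32_t)m[0], s1 = (uint32_t)(m[0] >> 32);
    uint32_t s2 = (uint32_t)m[1], s3 = (uint32_t)(m[1] >> 32);
    uint32_t r0 = s0 > s1 ? s0 : s1;
    uint32_t r1 = s2 > s3 ? s2 : s3;

    /* "fmov x10, d31": move the low 64 bits to a general register;
       "cbz x10, .L4" then branches when this value is zero. */
    return (uint64_t)r0 | ((uint64_t)r1 << 32);
}
```

In this model the loop keeps running only while the returned value is zero, which corresponds to no lane having triggered the `break` condition.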

**Additional notes:**
- The loop uses a stride of -3 over a struct array (sizeof = 16 bytes) and
contains an early exit condition (`if (a[i].m0 != 0) break`), which
makes vectorization more complex but not impossible.
- GCC 15.2.0 demonstrated that this loop can be vectorized via SLP, using a
load permutation `{ 6 0 }` to handle the strided access pattern and cmtst/umaxp
to check the early exit condition in parallel.
- GCC trunk completely fails to vectorize this loop, reporting "unsupported SLP
instances", and falls back to fully scalar code.
- The failure appears to occur during SLP analysis: the load permutation
`{ 6 0 }` that GCC 15 accepted seems to be rejected as unsupported on trunk.
- This is not a costing issue: `-fvect-cost-model=unlimited` does not help.
- This is a significant regression for workloads with strided memory access
patterns over struct arrays combined with early exit conditions.
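
When comparing the two compilers, it may help to confirm that both builds still compute the same result. The sketch below is one possible runtime check, not part of the original report: `run_case`, `MAX_N`, `PAD` and `SENTINEL` are hypothetical names, and the input values are arbitrary. It fills the 14 elements that the loop tests with zeros, optionally plants a nonzero `m0` at index `stop`, and verifies that only the expected `out` slots were written.

```c
#include <string.h>

typedef struct {
    long m0;
    short m1;
    int m2;
} element_t_0;

/* Loop from the test case, unchanged apart from formatting. */
long foo(const element_t_0 *__restrict__ a, long *__restrict__ out, int n)
{
    for (int i = n - 1; i >= 0; i -= 3) {
        out[i] = (long)a[i + 14].m0;
        if (a[i].m0 != 0)
            break;
    }
    return 0;
}

/* Assumed harness limits: n may not exceed MAX_N, so every tested element
   a[i] lies in the zeroed region and every loaded element a[i + 14] lies
   in the populated region. */
enum { MAX_N = 14, PAD = 14, SENTINEL = -1 };

/* Hypothetical helper: returns the number of out[] slots that differ from
   the expected value, or -1 if n is too large for the harness.
   Pass a negative stop to run without an early exit. */
int run_case(int n, int stop)
{
    element_t_0 a[MAX_N + PAD];
    long out[MAX_N];
    int mismatches = 0;
    int live = 1; /* 1 while the modelled loop has not hit the break */

    if (n > MAX_N)
        return -1;

    memset(a, 0, sizeof a);
    for (int k = PAD; k < MAX_N + PAD; k++)
        a[k].m0 = 100 + k;          /* values that get copied to out[] */
    if (stop >= 0 && stop < MAX_N)
        a[stop].m0 = 1;             /* triggers the break if visited */
    for (int k = 0; k < MAX_N; k++)
        out[k] = SENTINEL;

    foo(a, out, n);

    /* Walk the indices in the order the loop visits them. The store at
       the stop index still happens, because it precedes the exit test. */
    for (int k = MAX_N - 1; k >= 0; k--) {
        int on_stride = k < n && (n - 1 - k) % 3 == 0;
        long want = (live && on_stride) ? 100 + k + PAD : SENTINEL;
        if (out[k] != want)
            mismatches++;
        if (on_stride && k == stop)
            live = 0;
    }
    return mismatches;
}
```

A correct build should return 0 for calls such as `run_case(12, 5)` (early exit at `i = 5`) and `run_case(14, -1)` (no early exit). To exercise the code generation shown in this report, `foo` would probably need to be compiled in its own translation unit so that it is not inlined and specialized into the harness.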
