https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125750
Tamar Christina <tnfchris at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Last reconfirmed| |2026-06-12
Keywords| |missed-optimization
Status|UNCONFIRMED |NEW
Ever confirmed|0 |1
--- Comment #2 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
There are quite a few things going on in those examples.
For instance,
movi d27, #0
mov z26.d, z27.d
mov z25.d, z27.d
mov z24.d, z27.d
mov z23.d, z27.d
mov z22.d, z27.d
mov z21.d, z27.d
mov z20.d, z27.d
mov z19.d, z27.d
mov z18.d, z27.d
mov z17.d, z27.d
mov z16.d, z27.d
in the outer loop of compute_region_directions is blatantly dumb. I have a
patch that fixes this.
I'll break these out into separate tickets next week, but a quick couple:
our cost model is rejecting BB vectorization of compute_region_means (it works
with -mmax-vectorization and shows good codegen).
As for the main reported problem, the unrolling: this is the SLP build failing.
https://godbolt.org/z/873Ene4eW
focuses on this.
Note that LLVM vectorized this using Adv. SIMD.
In GCC the multi-lane SLP build is failing:
missed: SLP induction not supported for variable-length vectors.
and we fall back to single lane SLP.
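To make the failure mode concrete, here is a minimal sketch of the kind of loop
where several lanes of one SLP group depend on the induction variable. It is a
hypothetical reduction, not the code from the report or the godbolt link, and
fill_pairs is a made-up name. I have not checked that GCC prints exactly the
diagnostic above for it. The idea is only that a multi-lane SLP induction needs
a start vector like {0, 1, 1, 2, ...}, which is awkward to build when the
number of lanes is not known at compile time.

```c
#include <stddef.h>

/* Hypothetical example: two interleaved output streams, both derived
   from the induction variable i.  Treated as one two-lane SLP group,
   the vectorizer needs a vector induction covering both lanes.  */
void fill_pairs(int *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[2 * i]     = (int)i;       /* lane 0: i     */
        out[2 * i + 1] = (int)i + 1;   /* lane 1: i + 1 */
    }
}
```

With single-lane SLP each of the two stores would instead be vectorized as its
own stream.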
For single-lane SLP to succeed, each stream becomes a LOAD_LANES, i.e. we load
and permute.
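As a rough illustration of what "load and permute" means here, the following is
a scalar model of a 4-way load-lanes access in the style of LD4. The function
name load_lanes4 and the choice of four int32_t streams are assumptions made
for the sketch; it is not what the vectorizer emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 4-way load-lanes access: read 4 * n interleaved
   elements from src and de-interleave them into four streams.  */
void load_lanes4(const int32_t *src, size_t n,
                 int32_t *s0, int32_t *s1, int32_t *s2, int32_t *s3)
{
    for (size_t i = 0; i < n; i++) {
        s0[i] = src[4 * i + 0];
        s1[i] = src[4 * i + 1];
        s2[i] = src[4 * i + 2];
        s3[i] = src[4 * i + 3];
    }
}
```

Note that all four streams are loaded even if a consumer only needs some of
them, which is where the cost of a load with gaps comes in.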
So we didn't unroll, we just vectorized every stream individually. LLVM has
done the same thing; however, they do so using smaller scalar loads and creating
vectors from them:
ldp d30, d31, [x11]
movprfx z28, z21
add z28.d, z28.d, #4
ushll v22.8h, v22.8b, #0
ldr d29, [x11, #880]
ldr d8, [x11, #888]
add x9, x9, #4
cmgt v25.2s, v3.2s, v25.2s
zip1 v9.2s, v30.2s, v29.2s
zip2 v29.2s, v30.2s, v29.2s
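For readers less familiar with the zip1/zip2 pair at the end of that excerpt,
here is a small scalar model of the two instructions on .2s vectors. The v2s
type and the function names are invented for the sketch. On two-element
vectors, zip1 takes element 0 of each input and zip2 takes element 1, so
together they amount to a 2x2 transpose of two 64-bit loads.

```c
#include <stdint.h>

/* Model of a 64-bit Adv. SIMD register viewed as two 32-bit lanes.  */
typedef struct { int32_t lane[2]; } v2s;

/* zip1 vd.2s, va.2s, vb.2s: interleave the low elements.  */
v2s zip1_2s(v2s a, v2s b)
{
    v2s r = { { a.lane[0], b.lane[0] } };
    return r;
}

/* zip2 vd.2s, va.2s, vb.2s: interleave the high elements.  */
v2s zip2_2s(v2s a, v2s b)
{
    v2s r = { { a.lane[1], b.lane[1] } };
    return r;
}
```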
Using -mautovec-preference=asimd-only gives us much better code as well (though
still suboptimal).
We pick SVE because the cost model thinks that a load with gaps using LD4 is
beneficial. Part of it is the broken load-lanes costing that I was arguing
with Richard about.
So for the loop above, from a quick look:
1. fix the costing; need to revive the patches
2. see if we can support SLP inductions with VLA
3. see why we didn't optimize the permutes when using Adv. SIMD
I'll try to break these down into smaller examples and subtasks next week.