https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122277

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2026-01-09
     Ever confirmed|0                           |1

--- Comment #8 from Richard Biener <rguenth at gcc dot gnu.org> ---
But of course SLP is difficult for VLA uarchs unless we can turn the data load
at the root into a single larger ld1 and then pun the result to the smaller
element type, giving SLP-lanes elements per group.  IIRC, when not doing that
we essentially force VLS operation?
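
To illustrate the kind of punning meant above, here is a purely illustrative
SVE-intrinsics sketch (not compiler-generated code; the function name and the
group size of two are made up for the example): load a group of two 32-bit
elements as one wider 64-bit ld1 and reinterpret the result back to 32-bit
lanes, so each group stays contiguous in the VLA register:

#include <arm_sve.h>
#include <stdint.h>

/* Illustrative sketch only: load pairs of 32-bit elements with one
   wider 64-bit ld1 so each two-lane SLP group stays contiguous, then
   pun the result to the smaller 32-bit element type.  The pointer
   cast glosses over strict-aliasing concerns.  */
svuint32_t
load_pair_group (const uint32_t *p, svbool_t pg64)
{
  svuint64_t wide = svld1_u64 (pg64, (const uint64_t *) p);
  return svreinterpret_u32_u64 (wide);
}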

It has been noted in another PR that we currently have no heuristic for when
to prefer single-lane vs. reduction-chain SLP, similar to the heuristic we
have that discards SLP data loads/stores in favor of load/store-lanes
(and thus single-lane operation).  We also do not cost one against the other.
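
For concreteness, a case where that choice shows up could look like the
following (a made-up example, not taken from this PR): two interleaved
accumulators that the vectorizer could build either as a 2-lane SLP
reduction chain or as two independent single-lane reductions.

/* Hypothetical test case, not from this PR: two interleaved
   accumulators that could become a 2-lane SLP reduction chain or two
   single-lane reductions.  */
int
sum_pairs (const int *a, int n)
{
  int s0 = 0, s1 = 0;
  for (int i = 0; i < n; i += 2)
    {
      s0 += a[i];
      s1 += a[i + 1];
    }
  return s0 + s1;
}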

In the future I'd like to start analysis with a single-lane SLP build
and implement SLP discovery by merging nodes.  If we split
vect_analyze_loop_2 further we can still somehow share the early work
before SLP analysis starts (and where we re-start for single-lane SLP) and
possibly cost both variants against each other.  Both things could be
done independently.  Note that single-lane vs. not isn't binary; it's
really a per-instance decision, so combinatorial explosion is easily
possible if we want to go the full way.  So having good heuristics helps,
as does handling power-of-two SLP cases "better" for VLA.
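
To sketch the merging idea in the abstract (a toy model only, not vectorizer
code; the opcode strings and lane chains are invented): start from per-lane
chains of scalar operations and greedily merge adjacent lanes whose chains
are isomorphic into wider SLP-like groups.

#include <stdio.h>
#include <string.h>

#define NLANES 4
#define CHAINLEN 3

int
main (void)
{
  /* Hypothetical opcode chains for four scalar lanes.  */
  const char *chain[NLANES][CHAINLEN] = {
    { "load", "mul", "add" },
    { "load", "mul", "add" },
    { "load", "shl", "add" },
    { "load", "shl", "add" },
  };

  /* Greedily merge adjacent single-lane chains that are isomorphic
     (same opcode sequence) with the first lane of the current group.  */
  int group_start = 0;
  for (int i = 1; i <= NLANES; i++)
    {
      int isomorphic = i < NLANES;
      for (int j = 0; isomorphic && j < CHAINLEN; j++)
        isomorphic = !strcmp (chain[group_start][j], chain[i][j]);
      if (!isomorphic)
        {
          printf ("SLP group: lanes %d..%d (%d lanes)\n",
                  group_start, i - 1, i - group_start);
          group_start = i;
        }
    }
  return 0;
}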
