On Thu, 23 Oct 2025, Richard Biener wrote:
> On Wed, 22 Oct 2025, Richard Sandiford wrote:
>
> > Richard Biener <[email protected]> writes:
> > > The following tries to address one shortcoming of the SLP permute
> > > optimization phase, which is that it refuses layouts that
> > > cannot revert to layout 0 on an edge even if layout 0 isn't actually
> > > valid.  This hurts in particular on VLA vector targets, where many
> > > permutes cannot be code generated.  While for fixed-length targets
> > > costs eventually sort things out in a way to choose a "no permutation"
> > > setting, the above check prevents VLA targets from arriving there.
> >
> > I'm not sure this is the right way to go. The subpass is supposed
> > to be an optimisation pass rather than a legitimisation/"make something
> > vectorisable" pass. The cost function is purely aimed at reducing
> > permutations rather than at turning unvectorisable graphs into
> > vectorisable ones.  If the pass happens to make something vectorisable,
> > that happens by chance rather than by design.
>
> Note that while the change turns something not vectorizable into
> something vectorizable, it does so by enabling permute optimization
> to see that a load permutation can be fully elided.  This currently
> does not work because of the imposed "the reverse permute has to be
> code generatable" check(s).
>
> The existing handling of [in_layout_i == 0 &&] out_layout_i == 0 for
> a permute that is not code generatable made me think that the fact
> that layout 0 isn't code generatable shouldn't prevent us from
> arriving at the (purely cost-wise) optimal layout configuration.
>
> To this end, do you agree with the proposed change to
> internal_node_cost to anticipate redundant permute eliding?
> One could argue that vect_transform_slp_perm_load_1 should
> not fail for those - in this case it fails for
> group_size == dr_group_size == 16 but nunits [4,4], so
> a !repeating_p permute.
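>
> For reference, the gate in vect_transform_slp_perm_load_1 is roughly
> the following (paraphrased from the source, not a literal quote):
>
>   /* A load permute is "repeating", and thus expressible for
>      variable-length vectors, only when each vector covers whole
>      dataref groups.  */
>   bool repeating_p = (group_size == dr_group_size
>                       && multiple_p (nunits, group_size));
>
> With group_size == dr_group_size == 16 but nunits == [4,4],
> multiple_p (nunits, group_size) is false, so we take the
> !repeating_p path, which cannot handle variable-length vectors.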
>
> > I can see that a similar type of pass might be useful for pushing
> > permutations around until the graph is vectorisable. But I think
> > that should be a separate subpass that runs first, with a different goal
> > function. Perhaps it would share enough code with the optimisation pass
> > that they could both be subclasses of a common base class (not sure).
>
> I'll note that at least for permutes on loads we're in the difficult
> situation that permute lowering only handles interleaving patterns
> right now; other permutes are not lowered but left to
> vectorizable_load.  Trying to legitimize those earlier (anticipating
> what vectorizable_load would do) would be nice, but is also
> necessarily dependent on the VF (permute lowering also looks at the
> VF, and at a point where it's not yet final ...)
>
> But as you say, legitimizing should be separate from optimizing; the
> issue at hand is that optimizing on a graph that has not been
> legitimized ties our hands because of the reverse permute
> requirement ...
>
> > > The patch fixes up internal_node_cost to anticipate redundant permute
> > > eliding and short-cuts the above requirement on edges to partitions
> > > that do not have a preferred layout, with the idea that we shouldn't
> > > need to revert to layout zero for those.
> > >
> > > The downside is that we can now arrive at invalid layout
> > > configurations, which we deal with by simply giving up.
> >
> > Right. I think that for other testcases that are currently successfully
> > optimised, and that are vectorisable with and without the optimisation,
> > we might lose optimisations by doing this, because the pass could get
> > stuck in a dead end. The risk is probably higher for VLA testcases,
> > due to the same property that motivates this patch.
> >
> > The change therefore feels like a bit of a hack to me.
>
> It probably is, esp. singling out that -1 layout. The question is
> what the alternative is?
>
> What's the purpose of rejecting the layout at
>
>             /* Reject the layout if it would make layout 0 impossible
>                for later partitions.  This amounts to testing that the
>                target supports reversing the layout change on edges
>                to later partitions.
>
>                In principle, it might be possible to push a layout
>                change all the way down a graph, so that it never
>                needs to be reversed and so that the target doesn't
>                need to support the reverse operation.  But it would
>                be awkward to bail out if we hit a partition that
>                does not support the new layout, especially since
>                we are not dealing with a lattice.  */
>             is_possible &= edge_layout_cost (ud, other_node_i, 0,
>                                              layout_i).is_possible ();
>
> ?  It seems we want to enable the backward pass to undo everything the
> forward pass did, because that might have gone "too far"?  But we
> do not check that we can "undo", aka go back to the initially chosen
> layout (which is not always 0 but matches the reverse load permutation?).
> My alternative idea was to not require the ability to go back if
> layout 0 is "impossible" in terms of internal_node_cost?
>
> > I've seen the emails about the PR go past but haven't had a chance to
> > look at them properly.  Wanted to reply to this first in case I missed
> > the boat.
>
> For the case in question that I looked at WRT hitting the assert in
> the backward pass, the backward pass computes that the layout chosen
> by the forward pass is invalid (the forward pass pushed the invalid
> load permute down a bit, eventually hitting a fixed-layout store).
> But the minimum-cost layout the backward pass would have chosen
> actually matched that of the forward pass (zero).  So I suppose a
> first obvious thing would be to relax the assert to
>
> gcc_assert (min_layout_i == 0 || min_layout_cost.is_possible ());
>
> or, alternatively or in addition, to min_layout_i == partition.layout.
> The combined
>
> gcc_assert ((min_layout_i == 0 && partition.layout == min_layout_i)
> || min_layout_cost.is_possible ());
>
> still runs into ICEs.  On x86-64, for min_layout_i == partition.layout
> == 0 we have gcc.dg/vect/pr51000.c.  For gcc.dg/vect/vect-pr122370.c we
> have a case where the forward pass assigns layout 1.  I suspect that,
> with the way the code is written, a
>
> gcc_assert (min_layout_i == 0
> || min_layout_cost.is_possible ());
>
> assert would always work out, since we start trying with layout == 0
> (instead of, for example, the layout chosen by the forward pass).
>
> If we were not special-casing layout 0 we'd of course run into the
> very same validation issue, and this should get us to choose layout
> 0 everywhere as a fallback, similar to what I do with bailing out.
> Note the patch doesn't keep the layout changes but instead uses
> layout 0 everywhere in such a case, which _should_ be what we'd
> arrive at anyway without the proposed change.
So trying with just adjusting costs, anticipating permute eliding
for the backward pass via
diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
index f2ffd1f54e8..a5dbabccac3 100644
--- a/gcc/tree-vect-slp.cc
+++ b/gcc/tree-vect-slp.cc
@@ -6686,6 +6686,16 @@ vect_optimize_slp_pass::change_layout_cost (slp_tree node,
   if (from_layout_i == to_layout_i)
     return 0;
 
+  if (from_layout_i == 0
+      && !SLP_TREE_LOAD_PERMUTATION (node).is_empty ())
+    {
+      auto_load_permutation_t tem (SLP_TREE_LANES (node));
+      tem.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
+      vect_slp_permute (m_perms[to_layout_i], tem, true);
+      if (is_redundant_permutation (node, tem))
+        return 0;
+    }
+
   auto_vec<slp_tree, 1> children (1);
   children.quick_push (node);
   auto_lane_permutation_t perm (SLP_TREE_LANES (node));
doesn't fix this either.  The forward pass arrives at using layout
zero everywhere, but while the backward pass computes both layout
zero and layout one as valid for partition 1, they have the same
cost and thus we stay at zero, making partition 0 use layout 0
since layout 1 is then computed as not possible.  We are also
re-using layout costs from the forward pass, but will accumulate
onto those costs based on a current layout?
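
For illustration, something like the following tie-break in
backward_pass would prefer the forward pass choice over the lowest
layout index on a cost tie (untested sketch; partition.layout,
min_layout_i and min_layout_cost as in the existing code, while
layout_cost stands for whatever the combined cost of the current
layout_i is, and the "neither is strictly better" test is mine):

      /* Hypothetical tie-break: when neither layout's cost is
         strictly better, prefer the layout the forward pass
         recorded in partition.layout.  */
      if (min_layout_cost.is_possible ()
          && layout_cost.is_possible ()
          && !layout_cost.is_better_than (min_layout_cost, m_optimize_size)
          && !min_layout_cost.is_better_than (layout_cost, m_optimize_size)
          && (int) layout_i == partition.layout)
        {
          min_layout_i = layout_i;
          min_layout_cost = layout_cost;
        }

Though as argued below, the forward pass may be the better place to
break the tie.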
I think for the case in question (a permuted load feeding an
associatable SLP reduction chain) the forward pass is the one
that should pick layout 1 for the load, since the backward pass,
for the reduction SCC, has an arbitrary choice with no cost
difference (its costs for choosing layout 0 do not take into account
the "impossible" cost for partition 0 - that would be factored
in only when processing partition 0).
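
To make the shape concrete, the kind of loop in question looks
roughly like this (a simplified sketch in the spirit of
gcc.dg/vect/vect-reduc-chain-4.c, not the literal testcase):

  int
  f (signed char *a, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      {
        /* The four lanes of the grouped load are consumed in a
           permuted order, but since they all feed the same
           associatable reduction the load permutation is redundant
           and can be elided.  */
        sum += a[4 * i + 2];
        sum += a[4 * i + 0];
        sum += a[4 * i + 3];
        sum += a[4 * i + 1];
      }
    return sum;
  }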
Richard.
> Richard.
>
>
> > Thanks,
> > Richard
> >
> > > A testcase for this is gcc.dg/vect/vect-reduc-chain-4.c.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.  I verified that
> > > with RVV we can now use VLA vectors for the testcase.  The patch
> > > will run through the riscv CI; some aarch64 testing is appreciated.
> > >
> > > Comments on the patch are welcome as well; I did try to understand
> > > the whole operation of SLP permute optimization but likely failed to
> > > capture some essential part ;)
> > >
> > > Thanks,
> > > Richard.
> > >
> > > PR tree-optimization/120687
> > > * tree-vect-slp.cc (vect_optimize_slp_pass::is_redundant_permutation):
> > > New function, split out from ...
> > > (vect_optimize_slp_pass::remove_redundant_permutations): ... here.
> > > Use it.
> > > (vect_optimize_slp_pass::backward_pass): Return whether
> > > all layouts are possible.
> > > (vect_optimize_slp_pass::internal_node_cost): When the effective
> > > load permutation is redundant assume zero cost.
> > > (vect_optimize_slp_pass::forward_pass): For a forward edge
> > > to a partition without a preferred layout do not require layout
> > > zero to be valid.
> > > (vect_optimize_slp_pass::run): When not all layouts are possible
> > > after the backward_pass, do nothing.
> > > ---
> > > gcc/tree-vect-slp.cc | 131 +++++++++++++++++++++++++------------------
> > > 1 file changed, 75 insertions(+), 56 deletions(-)
> > >
> > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > > index 9698709f567..4f276771d57 100644
> > > --- a/gcc/tree-vect-slp.cc
> > > +++ b/gcc/tree-vect-slp.cc
> > > @@ -6267,13 +6267,14 @@ private:
> > > slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> > > slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> > > void forward_pass ();
> > > - void backward_pass ();
> > > + bool backward_pass ();
> > >
> > > /* Rematerialization. */
> > > slp_tree get_result_with_layout (slp_tree, unsigned int);
> > > void materialize ();
> > >
> > > /* Clean-up. */
> > > + bool is_redundant_permutation (slp_tree, const load_permutation_t&);
> > > void remove_redundant_permutations ();
> > >
> > > /* Masked load lanes discovery. */
> > > @@ -6748,6 +6749,46 @@ change_vec_perm_layout (slp_tree node, lane_permutation_t &perm,
> > > vect_slp_permute (m_perms[out_layout_i], perm, true);
> > > }
> > >
> > > +/* Return whether PERM is a redundant load permutation for NODE. */
> > > +
> > > +bool
> > > +vect_optimize_slp_pass::is_redundant_permutation (slp_tree node,
> > > + const load_permutation_t&
> > > + perm)
> > > +{
> > > + if (!is_a <loop_vec_info> (m_vinfo))
> > > + return false;
> > > + loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > > + bool this_load_permuted = false;
> > > + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > > + if (perm[j] != j)
> > > + {
> > > + this_load_permuted = true;
> > > + break;
> > > + }
> > > + /* When this isn't a grouped access we know it's single element
> > > + and contiguous. */
> > > + if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > > + {
> > > + if (!this_load_permuted
> > > + && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > + || SLP_TREE_LANES (node) == 1))
> > > + return true;
> > > + return false;
> > > + }
> > > + stmt_vec_info first_stmt_info
> > > + = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > > + if (!this_load_permuted
> > > + /* The load requires permutation when unrolling exposes
> > > + a gap either because the group is larger than the SLP
> > > + group-size or because there is a gap between the groups. */
> > > + && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > + || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > > + && DR_GROUP_GAP (first_stmt_info) == 0)))
> > > + return true;
> > > + return false;
> > > +}
> > > +
> > > /* Check whether the target allows NODE to be rearranged so that the node's
> > > output has layout OUT_LAYOUT_I. Return the cost of the change if so,
> > > in the same arbitrary units as for change_layout_cost. Return -1 otherwise.
> > > @@ -6831,9 +6872,11 @@ vect_optimize_slp_pass::internal_node_cost (slp_tree node, int in_layout_i,
> > > poly_uint64 vf = 1;
> > > if (auto loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > > vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > > - unsigned int n_perms;
> > > - if (!vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > > - nullptr, vf, true, false, &n_perms))
> > > + unsigned int n_perms = 0;
> > > + if (!is_redundant_permutation (node, tmp_perm)
> > > + && !vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > > + nullptr, vf, true, false,
> > > + &n_perms))
> > > {
> > > auto rep = SLP_TREE_REPRESENTATIVE (node);
> > > if (out_layout_i == 0)
> > > @@ -7338,19 +7381,22 @@ vect_optimize_slp_pass::forward_pass ()
> > > else
> > > layout_costs.in_cost.add_parallel_cost (cost);
> > > }
> > > - else
> > > - /* Reject the layout if it would make layout 0 impossible
> > > - for later partitions. This amounts to testing that the
> > > - target supports reversing the layout change on edges
> > > - to later partitions.
> > > -
> > > - In principle, it might be possible to push a layout
> > > - change all the way down a graph, so that it never
> > > - needs to be reversed and so that the target doesn't
> > > - need to support the reverse operation. But it would
> > > - be awkward to bail out if we hit a partition that
> > > - does not support the new layout, especially since
> > > - we are not dealing with a lattice. */
> > > + /* Reject the layout if it would make layout 0 impossible
> > > + for later partitions. This amounts to testing that the
> > > + target supports reversing the layout change on edges
> > > + to later partitions.
> > > +
> > > + In principle, it might be possible to push a layout
> > > + change all the way down a graph, so that it never
> > > + needs to be reversed and so that the target doesn't
> > > + need to support the reverse operation. But it would
> > > + be awkward to bail out if we hit a partition that
> > > + does not support the new layout, especially since
> > > + we are not dealing with a lattice.
> > > +
> > > + At least when the other partition doesn't have a
> > > + preferred layout do not require the reverse permute. */
> > > + else if (m_partitions[other_vertex.partition].layout != -1)
> > > is_possible &= edge_layout_cost (ud, other_node_i, 0,
> > > layout_i).is_possible ();
> > > };
> > > @@ -7426,7 +7472,7 @@ vect_optimize_slp_pass::forward_pass ()
> > > /* Make a backward pass through the partitions, accumulating output
> > > costs.
> > > Make a final choice of layout for each partition. */
> > >
> > > -void
> > > +bool
> > > vect_optimize_slp_pass::backward_pass ()
> > > {
> > > for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> > > @@ -7498,9 +7544,11 @@ vect_optimize_slp_pass::backward_pass ()
> > > }
> > > }
> > >
> > > - gcc_assert (min_layout_cost.is_possible ());
> > > + if (!min_layout_cost.is_possible ())
> > > + return false;
> > > partition.layout = min_layout_i;
> > > }
> > > + return true;
> > > }
> > >
> > > /* Return a node that applies layout TO_LAYOUT_I to the original form of
> > > NODE.
> > > @@ -7760,39 +7808,8 @@ vect_optimize_slp_pass::remove_redundant_permutations ()
> > > }
> > > else
> > > {
> > > - loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > > - stmt_vec_info load_info;
> > > - bool this_load_permuted = false;
> > > - unsigned j;
> > > - FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
> > > - if (SLP_TREE_LOAD_PERMUTATION (node)[j] != j)
> > > - {
> > > - this_load_permuted = true;
> > > - break;
> > > - }
> > > - /* When this isn't a grouped access we know it's single element
> > > - and contiguous. */
> > > - if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > > - {
> > > - if (!this_load_permuted
> > > - && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > - || SLP_TREE_LANES (node) == 1))
> > > - SLP_TREE_LOAD_PERMUTATION (node).release ();
> > > - continue;
> > > - }
> > > - stmt_vec_info first_stmt_info
> > > - = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > > - if (!this_load_permuted
> > > - /* The load requires permutation when unrolling exposes
> > > - a gap either because the group is larger than the SLP
> > > - group-size or because there is a gap between the groups. */
> > > - && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > > - || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > > - && DR_GROUP_GAP (first_stmt_info) == 0)))
> > > - {
> > > - SLP_TREE_LOAD_PERMUTATION (node).release ();
> > > - continue;
> > > - }
> > > + if (is_redundant_permutation (node, SLP_TREE_LOAD_PERMUTATION (node)))
> > > + SLP_TREE_LOAD_PERMUTATION (node).release ();
> > > }
> > > }
> > > }
> > > @@ -7995,10 +8012,12 @@ vect_optimize_slp_pass::run ()
> > > if (m_perms.length () > 1)
> > > {
> > > forward_pass ();
> > > - backward_pass ();
> > > - if (dump_enabled_p ())
> > > - dump ();
> > > - materialize ();
> > > + if (backward_pass ())
> > > + {
> > > + if (dump_enabled_p ())
> > > + dump ();
> > > + materialize ();
> > > + }
> > > while (!m_perms.is_empty ())
> > > m_perms.pop ().release ();
> > > }
> >
>
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)