On Wed, 22 Oct 2025, Richard Sandiford wrote:
> Richard Biener <[email protected]> writes:
> > The following tries to address one shortcoming of the SLP permute
> > optimization phase, which is that it refuses layouts that cannot
> > revert to layout 0 on an edge even if layout 0 isn't actually
> > valid. This hurts in particular on VLA vector targets where many
> > permutes cannot be code generated. While for fixed-length targets
> > costs eventually sort things out in a way to choose a "no permutation"
> > setting, the above check prevents VLA targets from arriving there.
>
> I'm not sure this is the right way to go. The subpass is supposed
> to be an optimisation pass rather than a legitimisation/"make something
> vectorisable" pass. The cost function is purely aimed at reducing
> permutations rather than at turning unvectorisable graphs into
> vectorisable ones. If the pass happens to make something vectorisable,
> that happens by chance rather than by design.
Note that while the change turns something not vectorizable into something
vectorizable, it does so by enabling permute optimization to see that
a load permutation can be fully elided. This currently does not work
because of the imposed "the reverse permute has to be code generatable"
check(s).
The existing handling of [in_layout_i == 0 &&] out_layout_i == 0 for
a permute that is not code generatable made me think that the fact that
layout 0 isn't code generatable shouldn't prevent us from arriving
at the layout configuration that is optimal purely from a cost perspective.
To that extent, do you agree with the proposed change to
internal_node_cost to anticipate redundant permute eliding?
One could argue that vect_transform_slp_perm_load_1 should
not fail for those - in this case it fails for
group_size == dr_group_size == 16 but nunits [4,4], so
a !repeating_p permute.
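To make this concrete, here is a sketch of the kind of loop I mean
(illustrative only, not the actual vect-reduc-chain-4.c testcase, which
has a larger group): a reduction chain whose grouped loads get an
identity load permutation that should be elidable, also with VLA vectors.

  /* Hypothetical reduction-chain kernel; the grouped loads from a[]
     and b[] end up with an identity load permutation, so permute
     optimization should be able to elide it entirely.  */
  int
  foo (int *restrict a, int *restrict b, int n)
  {
    int sum = 0;
    for (int i = 0; i < n; ++i)
      {
        sum += a[4 * i + 0] * b[4 * i + 0];
        sum += a[4 * i + 1] * b[4 * i + 1];
        sum += a[4 * i + 2] * b[4 * i + 2];
        sum += a[4 * i + 3] * b[4 * i + 3];
      }
    return sum;
  }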
> I can see that a similar type of pass might be useful for pushing
> permutations around until the graph is vectorisable. But I think
> that should be a separate subpass that runs first, with a different goal
> function. Perhaps it would share enough code with the optimisation pass
> that they could both be subclasses of a common base class (not sure).
I'll note that at least for permutes on loads we're in the difficult
situation that permute lowering only handles interleaving patterns
right now; other permutes are not lowered but left to vectorizable_load.
Trying to legitimize those earlier (anticipating what vectorizable_load
would do) would be nice, but it is also necessarily dependent on the VF
(permute lowering also looks at the VF, and at a point where it's not
yet final ...)

But as you say, legitimizing should be separate from optimizing; the
issue at hand is that optimizing on a graph that has not been legitimized
ties our hands because of the reverse permute requirement ...
> > The patch fixes up internal_node_cost to anticipate redundant permute
> > eliding and short-cuts the above requirement on edges to partitions
> > that do not have a preferred layout, with the idea that we shouldn't
> > need to revert to layout zero for those.
> >
> > The downside is that we can now arrive at invalid layout configurations,
> > which we deal with by simply giving up.
>
> Right. I think that for other testcases that are currently successfully
> optimised, and that are vectorisable with and without the optimisation,
> we might lose optimisations by doing this, because the pass could get
> stuck in a dead end. The risk is probably higher for VLA testcases,
> due to the same property that motivates this patch.
>
> The change therefore feels like a bit of a hack to me.
It probably is, especially the singling out of that -1 layout. The
question is what the alternative would be.
What's the purpose of rejecting the layout at

	    /* Reject the layout if it would make layout 0 impossible
	       for later partitions.  This amounts to testing that the
	       target supports reversing the layout change on edges
	       to later partitions.

	       In principle, it might be possible to push a layout
	       change all the way down a graph, so that it never
	       needs to be reversed and so that the target doesn't
	       need to support the reverse operation.  But it would
	       be awkward to bail out if we hit a partition that
	       does not support the new layout, especially since
	       we are not dealing with a lattice.  */
	    is_possible &= edge_layout_cost (ud, other_node_i, 0,
					     layout_i).is_possible ();

? It seems we want to enable the backward pass to undo everything the
forward pass did, because that might have gone "too far"? But we
do not check that we can "undo", aka go back to the initially chosen
layout (which is not always 0 but matches the reverse load permutation?).
My alternative idea was to not require the ability to go back if
layout 0 is "impossible" in terms of internal_node_cost?
> I've seen the emails about the PR go past but haven't had chance to look
> at them properly. Wanted to reply to this first in case I missed the boat.
For the case in question that I looked at WRT hitting the assert in the
backward pass, the problem is that the backward pass computes that the
layout chosen by the forward pass is invalid (the forward pass pushed the
invalid load permute down a bit, eventually hitting a fixed-layout store).
But the minimum-cost layout the backward pass would have chosen actually
matched that of the forward pass (zero). So I suppose a first obvious
thing would be to relax the assert to

  gcc_assert (min_layout_i == 0 || min_layout_cost.is_possible ());

or alternatively, or in addition, check min_layout_i == partition.layout.
The combined

  gcc_assert ((min_layout_i == 0 && partition.layout == min_layout_i)
	      || min_layout_cost.is_possible ());

still runs into ICEs. On x86-64, for min_layout_i == partition.layout == 0
we have gcc.dg/vect/pr51000.c. For gcc.dg/vect/vect-pr122370.c we
have a case where the forward pass assigns layout 1. I suspect that,
with the way the code is written, a

  gcc_assert (min_layout_i == 0
	      || min_layout_cost.is_possible ());

assert would always work out, since we start trying with layout == 0
(instead of, for example, the layout chosen by the forward pass).
If we were not special-casing layout 0 we'd of course run into the very
same validation issue, and this should get us to choose layout 0
everywhere as a fallback, similar to what I do by bailing out.
Note the patch doesn't keep the layout changes but instead uses
layout 0 everywhere in such a case, which _should_ be what we'd arrive
at anyway without the proposed change.
Richard.
> Thanks,
> Richard
>
> > A testcase for this is gcc.dg/vect/vect-reduc-chain-4.c.
> >
> > Bootstrapped and tested on x86_64-unknown-linux-gnu. I verified that
> > with RVV we can now use VLA vectors for the testcase. The patch
> > will run through the riscv CI; some aarch64 testing is appreciated.
> >
> > Comments on the patch are welcome as well; I did try to understand the
> > whole operation of SLP permute optimization but likely failed to capture
> > some essential part ;)
> >
> > Thanks,
> > Richard.
> >
> > PR tree-optimization/120687
> > * tree-vect-slp.cc (vect_optimize_slp_pass::is_redundant_permutation):
> > New function, split out from ...
> > (vect_optimize_slp_pass::remove_redundant_permutations): ... here.
> > Use it.
> > (vect_optimize_slp_pass::backward_pass): Return whether
> > all layouts are possible.
> > (vect_optimize_slp_pass::internal_node_cost): When the effective
> > load permutation is redundant assume zero cost.
> > (vect_optimize_slp_pass::forward_pass): For a forward edge
> > to a partition without a preferred layout do not require layout
> > zero to be valid.
> > (vect_optimize_slp_pass::run): When not all layouts are possible
> > after the backward_pass, do nothing.
> > ---
> > gcc/tree-vect-slp.cc | 131 +++++++++++++++++++++++++------------------
> > 1 file changed, 75 insertions(+), 56 deletions(-)
> >
> > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> > index 9698709f567..4f276771d57 100644
> > --- a/gcc/tree-vect-slp.cc
> > +++ b/gcc/tree-vect-slp.cc
> > @@ -6267,13 +6267,14 @@ private:
> > slpg_layout_cost forward_cost (graph_edge *, unsigned int, unsigned int);
> > slpg_layout_cost backward_cost (graph_edge *, unsigned int, unsigned int);
> > void forward_pass ();
> > - void backward_pass ();
> > + bool backward_pass ();
> >
> > /* Rematerialization. */
> > slp_tree get_result_with_layout (slp_tree, unsigned int);
> > void materialize ();
> >
> > /* Clean-up. */
> > + bool is_redundant_permutation (slp_tree, const load_permutation_t&);
> > void remove_redundant_permutations ();
> >
> > /* Masked load lanes discovery. */
> > @@ -6748,6 +6749,46 @@ change_vec_perm_layout (slp_tree node, lane_permutation_t &perm,
> > vect_slp_permute (m_perms[out_layout_i], perm, true);
> > }
> >
> > +/* Return whether PERM is a redundant load permutation for NODE. */
> > +
> > +bool
> > +vect_optimize_slp_pass::is_redundant_permutation (slp_tree node,
> > + const load_permutation_t&
> > + perm)
> > +{
> > + if (!is_a <loop_vec_info> (m_vinfo))
> > + return false;
> > + loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > + bool this_load_permuted = false;
> > + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> > + if (perm[j] != j)
> > + {
> > + this_load_permuted = true;
> > + break;
> > + }
> > + /* When this isn't a grouped access we know it's single element
> > + and contiguous. */
> > + if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > + {
> > + if (!this_load_permuted
> > + && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > + || SLP_TREE_LANES (node) == 1))
> > + return true;
> > + return false;
> > + }
> > + stmt_vec_info first_stmt_info
> > + = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > + if (!this_load_permuted
> > + /* The load requires permutation when unrolling exposes
> > + a gap either because the group is larger than the SLP
> > + group-size or because there is a gap between the groups. */
> > + && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > + || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > + && DR_GROUP_GAP (first_stmt_info) == 0)))
> > + return true;
> > + return false;
> > +}
> > +
> > /* Check whether the target allows NODE to be rearranged so that the node's
> > output has layout OUT_LAYOUT_I. Return the cost of the change if so,
> > in the same arbitrary units as for change_layout_cost. Return -1
> > otherwise.
> > @@ -6831,9 +6872,11 @@ vect_optimize_slp_pass::internal_node_cost (slp_tree node, int in_layout_i,
> > poly_uint64 vf = 1;
> > if (auto loop_vinfo = dyn_cast<loop_vec_info> (m_vinfo))
> > vf = LOOP_VINFO_VECT_FACTOR (loop_vinfo);
> > - unsigned int n_perms;
> > - if (!vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > - nullptr, vf, true, false, &n_perms))
> > + unsigned int n_perms = 0;
> > + if (!is_redundant_permutation (node, tmp_perm)
> > + && !vect_transform_slp_perm_load_1 (m_vinfo, node, tmp_perm, vNULL,
> > + nullptr, vf, true, false,
> > + &n_perms))
> > {
> > auto rep = SLP_TREE_REPRESENTATIVE (node);
> > if (out_layout_i == 0)
> > @@ -7338,19 +7381,22 @@ vect_optimize_slp_pass::forward_pass ()
> > else
> > layout_costs.in_cost.add_parallel_cost (cost);
> > }
> > - else
> > - /* Reject the layout if it would make layout 0 impossible
> > - for later partitions. This amounts to testing that the
> > - target supports reversing the layout change on edges
> > - to later partitions.
> > -
> > - In principle, it might be possible to push a layout
> > - change all the way down a graph, so that it never
> > - needs to be reversed and so that the target doesn't
> > - need to support the reverse operation. But it would
> > - be awkward to bail out if we hit a partition that
> > - does not support the new layout, especially since
> > - we are not dealing with a lattice. */
> > + /* Reject the layout if it would make layout 0 impossible
> > + for later partitions. This amounts to testing that the
> > + target supports reversing the layout change on edges
> > + to later partitions.
> > +
> > + In principle, it might be possible to push a layout
> > + change all the way down a graph, so that it never
> > + needs to be reversed and so that the target doesn't
> > + need to support the reverse operation. But it would
> > + be awkward to bail out if we hit a partition that
> > + does not support the new layout, especially since
> > + we are not dealing with a lattice.
> > +
> > + At least when the other partition doesn't have a
> > + preferred layout do not require the reverse permute. */
> > + else if (m_partitions[other_vertex.partition].layout != -1)
> > is_possible &= edge_layout_cost (ud, other_node_i, 0,
> > layout_i).is_possible ();
> > };
> > @@ -7426,7 +7472,7 @@ vect_optimize_slp_pass::forward_pass ()
> > /* Make a backward pass through the partitions, accumulating output costs.
> > Make a final choice of layout for each partition. */
> >
> > -void
> > +bool
> > vect_optimize_slp_pass::backward_pass ()
> > {
> > for (unsigned int partition_i = m_partitions.length (); partition_i-- > 0;)
> > @@ -7498,9 +7544,11 @@ vect_optimize_slp_pass::backward_pass ()
> > }
> > }
> >
> > - gcc_assert (min_layout_cost.is_possible ());
> > + if (!min_layout_cost.is_possible ())
> > + return false;
> > partition.layout = min_layout_i;
> > }
> > + return true;
> > }
> >
> > /* Return a node that applies layout TO_LAYOUT_I to the original form of
> > NODE.
> > @@ -7760,39 +7808,8 @@ vect_optimize_slp_pass::remove_redundant_permutations ()
> > }
> > else
> > {
> > - loop_vec_info loop_vinfo = as_a<loop_vec_info> (m_vinfo);
> > - stmt_vec_info load_info;
> > - bool this_load_permuted = false;
> > - unsigned j;
> > - FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
> > - if (SLP_TREE_LOAD_PERMUTATION (node)[j] != j)
> > - {
> > - this_load_permuted = true;
> > - break;
> > - }
> > - /* When this isn't a grouped access we know it's single element
> > - and contiguous. */
> > - if (!STMT_VINFO_GROUPED_ACCESS (SLP_TREE_SCALAR_STMTS (node)[0]))
> > - {
> > - if (!this_load_permuted
> > - && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > - || SLP_TREE_LANES (node) == 1))
> > - SLP_TREE_LOAD_PERMUTATION (node).release ();
> > - continue;
> > - }
> > - stmt_vec_info first_stmt_info
> > - = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
> > - if (!this_load_permuted
> > - /* The load requires permutation when unrolling exposes
> > - a gap either because the group is larger than the SLP
> > - group-size or because there is a gap between the groups. */
> > - && (known_eq (LOOP_VINFO_VECT_FACTOR (loop_vinfo), 1U)
> > - || ((SLP_TREE_LANES (node) == DR_GROUP_SIZE (first_stmt_info))
> > - && DR_GROUP_GAP (first_stmt_info) == 0)))
> > - {
> > - SLP_TREE_LOAD_PERMUTATION (node).release ();
> > - continue;
> > - }
> > + if (is_redundant_permutation (node, SLP_TREE_LOAD_PERMUTATION (node)))
> > + SLP_TREE_LOAD_PERMUTATION (node).release ();
> > }
> > }
> > }
> > @@ -7995,10 +8012,12 @@ vect_optimize_slp_pass::run ()
> > if (m_perms.length () > 1)
> > {
> > forward_pass ();
> > - backward_pass ();
> > - if (dump_enabled_p ())
> > - dump ();
> > - materialize ();
> > + if (backward_pass ())
> > + {
> > + if (dump_enabled_p ())
> > + dump ();
> > + materialize ();
> > + }
> > while (!m_perms.is_empty ())
> > m_perms.pop ().release ();
> > }
>
--
Richard Biener <[email protected]>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)