On Tue, 22 Jun 2021, Richard Biener wrote:

> On Mon, 21 Jun 2021, Tamar Christina wrote:
> 
> > Hi Richi,
> > 
[...]
> > since we are removing the TWO_OPERANDS node we need to drop one of the 
> > multiply
> > and so we need to give it the original 2 vectors as a parameter.  The 
> > current
> > implementation emits a permute operation to combine the loads again and 
> > later
> > pushes the permute down.  This is problematic as it again requires us to do 
> > df
> > analysis early.
> > 
> > To counter this, in the patch I have changed loads to no longer come out of
> > build_slp with LOAD_PERMUTES but instead to have a VEC_PERM above each load.
> 
> Yep.  I've been there before (not sure if I ever sent you my 
> work-in-progress here).  There's some wrongs in your patch but I guess
> doing this exactly for the permutes optimize_slp handles would be fine.
> 
> We should see to do this independently of the stuff above, I can handle
> this and will prepare a patch for this later.

So it's of course difficult and the current optimize_slp tied closely
to what the original vect_attempt_slp_rearrange_stmts did.

If you for example consider

double x[2], y[2], z[4];
void foo ()
{
  z[0] = x[0] + y[0];
  z[1] = x[1] + y[1];
  z[2] = x[1] + y[0];
  z[3] = x[0] + y[1];
}

then the x[0], x[1] loads look unform enough to be handled
but of course we end up with a group_size of 4 here and
a { 0, 1, 1, 0 } load permutation which optimize_slp wouldn't
handle either.  Of course in the end we should have split
the SLP at vector size boundary but the question is how
we should ensure this (if at all ...) or if we should
eventually even create a SLP tree like

      { x[0], x[1] }
         |       \
         |     VEC_PERM { 0[1], 0[0] }
         \      /
      VEC_PERM { 0[0], 0[1], 1[0], 1[1] }
             |

for the load.  Note all lane splitting/duplicating has
effects on vectorization factor compute - one complication
I ran into when originally trying to split out load permutations
from loads.

I'm not sure whether your example motivating the load SLP
changes is good enough - if you consider that the loaded
values get modified, like as { x[0]/a + y[1]/a, x[1]/a - y[0]/a }
splitting the load permutation from the load will not get you
the division CSEd at SLP build and if we divide by a different value
there's no CSE opportunity.  What would work and what should work
right now is that you get a lane permute down but you'll not know
whether values are "related"?  If you need that info and that
directly on the child you can look at the representatives
DR group leader, if any, as a heuristic.

I've pasted below what I was playing with, it shows CSE for
cases like

double x[2], z[4];
void foo ()
{
  z[0] = x[0] + 2 * x[1];
  z[1] = x[1] + 2 * x[0];
}

but it breaks various vect.exp tests that look for permutes
being merged with reductions (thus the optimize_slp propagation
somehow doesn't work - as I said it's a bit fragile).

Richard.

diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c
index 6a26ccdd290..187bbfb70db 100644
--- a/gcc/tree-vect-slp.c
+++ b/gcc/tree-vect-slp.c
@@ -343,12 +343,14 @@ vect_slp_tree_uniform_p (slp_tree node)
 }
 
 /* Find the place of the data-ref in STMT_INFO in the interleaving chain
-   that starts from FIRST_STMT_INFO.  Return -1 if the data-ref is not a part
-   of the chain.  */
+   that starts from FIRST_STMT_INFO.  If ADD_GAPS is true then if there's
+   a gap between elements account for that as well.
+   Return -1 if the data-ref is not a part of the chain.  */
 
 int
 vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info,
-                                     stmt_vec_info first_stmt_info)
+                                     stmt_vec_info first_stmt_info,
+                                     bool add_gaps)
 {
   stmt_vec_info next_stmt_info = first_stmt_info;
   int result = 0;
@@ -362,7 +364,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info 
stmt_info,
        return result;
       next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info);
       if (next_stmt_info)
-       result += DR_GROUP_GAP (next_stmt_info);
+       result += add_gaps ? DR_GROUP_GAP (next_stmt_info) : 1;
     }
   while (next_stmt_info);
 
@@ -1769,24 +1771,65 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
        {
          *max_nunits = this_max_nunits;
          (*tree_size)++;
-         node = vect_create_new_slp_node (node, stmts, 0);
-         SLP_TREE_VECTYPE (node) = vectype;
          /* And compute the load permutation.  Whether it is actually
             a permutation depends on the unrolling factor which is
             decided later.  */
-         vec<unsigned> load_permutation;
          int j;
          stmt_vec_info load_info;
+         stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmts[0]);
+         vec<unsigned> load_permutation;
          load_permutation.create (group_size);
-         stmt_vec_info first_stmt_info
-           = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]);
-         FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info)
+         bool any_non_contiguous = false;
+         FOR_EACH_VEC_ELT (stmts, j, load_info)
            {
              int load_place = vect_get_place_in_interleaving_chain
-                 (load_info, first_stmt_info);
+                 (load_info, first_stmt_info, true);
              gcc_assert (load_place != -1);
+             if (load_place != j)
+               any_non_contiguous = true;
              load_permutation.safe_push (load_place);
            }
+         if (any_non_contiguous
+             && DR_GROUP_SIZE (first_stmt_info) == group_size)
+           {
+             vec<stmt_vec_info> alt_stmts;
+             alt_stmts.create (group_size);
+             for (stmt_vec_info cur = first_stmt_info; cur;
+                  cur = DR_GROUP_NEXT_ELEMENT (cur))
+               alt_stmts.quick_push (cur);
+             if (alt_stmts.length () == group_size)
+               {
+                 load_permutation.release ();
+                 slp_tree load = vect_build_slp_tree (vinfo, alt_stmts,
+                                                      group_size,
+                                                      &this_max_nunits,
+                                                      matches, limit,
+                                                      tree_size, bst_map);
+                 if (!load)
+                   {
+                     alt_stmts.release ();
+                     matches[0] = false;
+                     return NULL;
+                   }
+                 lane_permutation_t lane_permute;
+                 lane_permute.create (group_size);
+                 FOR_EACH_VEC_ELT (stmts, j, load_info)
+                   {
+                     int load_place = vect_get_place_in_interleaving_chain
+                                        (load_info, first_stmt_info, false);
+                     gcc_assert (load_place != -1);
+                     lane_permute.quick_push (std::make_pair (0, load_place));
+                   }
+                 node = vect_create_new_slp_node (node, stmts, 1);
+                 SLP_TREE_CODE (node) = VEC_PERM_EXPR;
+                 SLP_TREE_LANE_PERMUTATION (node) = lane_permute;
+                 SLP_TREE_VECTYPE (node) = vectype;
+                 SLP_TREE_CHILDREN (node).quick_push (load);
+                 return node;
+               }
+           }
+         node = vect_create_new_slp_node (node, stmts, 0);
+         SLP_TREE_VECTYPE (node) = vectype;
          SLP_TREE_LOAD_PERMUTATION (node) = load_permutation;
          return node;
        }
@@ -3628,7 +3671,24 @@ vect_optimize_slp (vec_info *vinfo)
         as entries to the reverse graph.  */
       if (!slpg->vertices[idx].succ)
        bitmap_set_bit (n_visited, idx);
-      /* Loads are the only thing generating permutes.  */
+
+      /* Simple permute generating loads have been split into
+         non-permuted loads and a separate lane permutation.  */
+      if (SLP_TREE_CODE (node) == VEC_PERM_EXPR
+         && SLP_TREE_CHILDREN (node).length () == 1)
+       {
+         gcc_assert (SLP_TREE_LANES (node)
+                     == SLP_TREE_LANES (SLP_TREE_CHILDREN (node)[0]));
+         vec<unsigned> perm = vNULL;
+         perm.safe_grow (SLP_TREE_LANES (node), true);
+         for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
+           perm[j] = SLP_TREE_LANE_PERMUTATION (node)[j].second;
+         perms.safe_push (perm);
+         n_perm[idx] = perms.length () - 1;
+         continue;
+       }
+
+      /* Loads are the only remaining things generating permutes.  */
       if (!SLP_TREE_LOAD_PERMUTATION (node).exists ())
        continue;
 
diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c
index d95e359daae..2fb876dcfd7 100644
--- a/gcc/tree-vect-stmts.c
+++ b/gcc/tree-vect-stmts.c
@@ -8805,8 +8805,8 @@ vectorizable_load (vec_info *vinfo,
          if (grouped_load)
            cst_offset
              = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype)))
-                * vect_get_place_in_interleaving_chain (stmt_info,
-                                                        first_stmt_info));
+                * vect_get_place_in_interleaving_chain
+                    (stmt_info, first_stmt_info, true));
          group_size = 1;
          ref_type = reference_alias_ptr_type (DR_REF (dr_info->dr));
        }
diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h
index fa28336d429..82e86855ca1 100644
--- a/gcc/tree-vectorizer.h
+++ b/gcc/tree-vectorizer.h
@@ -2033,7 +2033,8 @@ extern bool can_duplicate_and_interleave_p (vec_info *, 
unsigned int, tree,
                                            tree * = NULL, tree * = NULL);
 extern void duplicate_and_interleave (vec_info *, gimple_seq *, tree,
                                      vec<tree>, unsigned int, vec<tree> &);
-extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info);
+extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info,
+                                                bool);
 extern bool vect_update_shared_vectype (stmt_vec_info, tree);
 extern slp_tree vect_create_new_slp_node (unsigned, tree_code);
 extern void vect_free_slp_tree (slp_tree);

Reply via email to