On Tue, 22 Jun 2021, Richard Biener wrote: > On Mon, 21 Jun 2021, Tamar Christina wrote: > > > Hi Richi, > > [...] > > since we are removing the TWO_OPERANDS node we need to drop one of the > > multiply > > and so we need to give it the original 2 vectors as a parameter. The > > current > > implementation emits a permute operation to combine the loads again and > > later > > pushes the permute down. This is problematic as it again requires us to do > > df > > analysis early. > > > > To counter this, in the patch I have changed loads to no longer come out of > > build_slp with LOAD_PERMUTES but instead to have a VEC_PERM above each load. > > Yep. I've been there before (not sure if I ever sent you my > work-in-progress here). There's some wrongs in your patch but I guess > doing this exactly for the permutes optimize_slp handles would be fine. > > We should see to do this independently of the stuff above, I can handle > this and will prepare a patch for this later.
So it's of course difficult and the current optimize_slp tied closely to what the original vect_attempt_slp_rearrange_stmts did. If you for example consider double x[2], y[2], z[4]; void foo () { z[0] = x[0] + y[0]; z[1] = x[1] + y[1]; z[2] = x[1] + y[0]; z[3] = x[0] + y[1]; } then the x[0], x[1] loads look unform enough to be handled but of course we end up with a group_size of 4 here and a { 0, 1, 1, 0 } load permutation which optimize_slp wouldn't handle either. Of course in the end we should have split the SLP at vector size boundary but the question is how we should ensure this (if at all ...) or if we should eventually even create a SLP tree like { x[0], x[1] } | \ | VEC_PERM { 0[1], 0[0] } \ / VEC_PERM { 0[0], 0[1], 1[0], 1[1] } | for the load. Note all lane splitting/duplicating has effects on vectorization factor compute - one complication I ran into when originally trying to split out load permutations from loads. I'm not sure whether your example motivating the load SLP changes is good enough - if you consider that the loaded values get modified, like as { x[0]/a + y[1]/a, x[1]/a - y[0]/a } splitting the load permutation from the load will not get you the division CSEd at SLP build and if we divide by a different value there's no CSE opportunity. What would work and what should work right now is that you get a lane permute down but you'll not know whether values are "related"? If you need that info and that directly on the child you can look at the representatives DR group leader, if any, as a heuristic. I've pasted below what I was playing with, it shows CSE for cases like double x[2], z[4]; void foo () { z[0] = x[0] + 2 * x[1]; z[1] = x[1] + 2 * x[0]; } but it breaks various vect.exp tests that look for permutes being merged with reductions (thus the optimize_slp propagation somehow doesn't work - as I said it's a bit fragile). Richard. diff --git a/gcc/tree-vect-slp.c b/gcc/tree-vect-slp.c index 6a26ccdd290..187bbfb70db 100644 --- a/gcc/tree-vect-slp.c +++ b/gcc/tree-vect-slp.c @@ -343,12 +343,14 @@ vect_slp_tree_uniform_p (slp_tree node) } /* Find the place of the data-ref in STMT_INFO in the interleaving chain - that starts from FIRST_STMT_INFO. Return -1 if the data-ref is not a part - of the chain. */ + that starts from FIRST_STMT_INFO. If ADD_GAPS is true then if there's + a gap between elements account for that as well. + Return -1 if the data-ref is not a part of the chain. */ int vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info, - stmt_vec_info first_stmt_info) + stmt_vec_info first_stmt_info, + bool add_gaps) { stmt_vec_info next_stmt_info = first_stmt_info; int result = 0; @@ -362,7 +364,7 @@ vect_get_place_in_interleaving_chain (stmt_vec_info stmt_info, return result; next_stmt_info = DR_GROUP_NEXT_ELEMENT (next_stmt_info); if (next_stmt_info) - result += DR_GROUP_GAP (next_stmt_info); + result += add_gaps ? DR_GROUP_GAP (next_stmt_info) : 1; } while (next_stmt_info); @@ -1769,24 +1771,65 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node, { *max_nunits = this_max_nunits; (*tree_size)++; - node = vect_create_new_slp_node (node, stmts, 0); - SLP_TREE_VECTYPE (node) = vectype; /* And compute the load permutation. Whether it is actually a permutation depends on the unrolling factor which is decided later. */ - vec<unsigned> load_permutation; int j; stmt_vec_info load_info; + stmt_vec_info first_stmt_info = DR_GROUP_FIRST_ELEMENT (stmts[0]); + vec<unsigned> load_permutation; load_permutation.create (group_size); - stmt_vec_info first_stmt_info - = DR_GROUP_FIRST_ELEMENT (SLP_TREE_SCALAR_STMTS (node)[0]); - FOR_EACH_VEC_ELT (SLP_TREE_SCALAR_STMTS (node), j, load_info) + bool any_non_contiguous = false; + FOR_EACH_VEC_ELT (stmts, j, load_info) { int load_place = vect_get_place_in_interleaving_chain - (load_info, first_stmt_info); + (load_info, first_stmt_info, true); gcc_assert (load_place != -1); + if (load_place != j) + any_non_contiguous = true; load_permutation.safe_push (load_place); } + if (any_non_contiguous + && DR_GROUP_SIZE (first_stmt_info) == group_size) + { + vec<stmt_vec_info> alt_stmts; + alt_stmts.create (group_size); + for (stmt_vec_info cur = first_stmt_info; cur; + cur = DR_GROUP_NEXT_ELEMENT (cur)) + alt_stmts.quick_push (cur); + if (alt_stmts.length () == group_size) + { + load_permutation.release (); + slp_tree load = vect_build_slp_tree (vinfo, alt_stmts, + group_size, + &this_max_nunits, + matches, limit, + tree_size, bst_map); + if (!load) + { + alt_stmts.release (); + matches[0] = false; + return NULL; + } + lane_permutation_t lane_permute; + lane_permute.create (group_size); + FOR_EACH_VEC_ELT (stmts, j, load_info) + { + int load_place = vect_get_place_in_interleaving_chain + (load_info, first_stmt_info, false); + gcc_assert (load_place != -1); + lane_permute.quick_push (std::make_pair (0, load_place)); + } + node = vect_create_new_slp_node (node, stmts, 1); + SLP_TREE_CODE (node) = VEC_PERM_EXPR; + SLP_TREE_LANE_PERMUTATION (node) = lane_permute; + SLP_TREE_VECTYPE (node) = vectype; + SLP_TREE_CHILDREN (node).quick_push (load); + return node; + } + } + node = vect_create_new_slp_node (node, stmts, 0); + SLP_TREE_VECTYPE (node) = vectype; SLP_TREE_LOAD_PERMUTATION (node) = load_permutation; return node; } @@ -3628,7 +3671,24 @@ vect_optimize_slp (vec_info *vinfo) as entries to the reverse graph. */ if (!slpg->vertices[idx].succ) bitmap_set_bit (n_visited, idx); - /* Loads are the only thing generating permutes. */ + + /* Simple permute generating loads have been split into + non-permuted loads and a separate lane permutation. */ + if (SLP_TREE_CODE (node) == VEC_PERM_EXPR + && SLP_TREE_CHILDREN (node).length () == 1) + { + gcc_assert (SLP_TREE_LANES (node) + == SLP_TREE_LANES (SLP_TREE_CHILDREN (node)[0])); + vec<unsigned> perm = vNULL; + perm.safe_grow (SLP_TREE_LANES (node), true); + for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) + perm[j] = SLP_TREE_LANE_PERMUTATION (node)[j].second; + perms.safe_push (perm); + n_perm[idx] = perms.length () - 1; + continue; + } + + /* Loads are the only remaining things generating permutes. */ if (!SLP_TREE_LOAD_PERMUTATION (node).exists ()) continue; diff --git a/gcc/tree-vect-stmts.c b/gcc/tree-vect-stmts.c index d95e359daae..2fb876dcfd7 100644 --- a/gcc/tree-vect-stmts.c +++ b/gcc/tree-vect-stmts.c @@ -8805,8 +8805,8 @@ vectorizable_load (vec_info *vinfo, if (grouped_load) cst_offset = (tree_to_uhwi (TYPE_SIZE_UNIT (TREE_TYPE (vectype))) - * vect_get_place_in_interleaving_chain (stmt_info, - first_stmt_info)); + * vect_get_place_in_interleaving_chain + (stmt_info, first_stmt_info, true)); group_size = 1; ref_type = reference_alias_ptr_type (DR_REF (dr_info->dr)); } diff --git a/gcc/tree-vectorizer.h b/gcc/tree-vectorizer.h index fa28336d429..82e86855ca1 100644 --- a/gcc/tree-vectorizer.h +++ b/gcc/tree-vectorizer.h @@ -2033,7 +2033,8 @@ extern bool can_duplicate_and_interleave_p (vec_info *, unsigned int, tree, tree * = NULL, tree * = NULL); extern void duplicate_and_interleave (vec_info *, gimple_seq *, tree, vec<tree>, unsigned int, vec<tree> &); -extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info); +extern int vect_get_place_in_interleaving_chain (stmt_vec_info, stmt_vec_info, + bool); extern bool vect_update_shared_vectype (stmt_vec_info, tree); extern slp_tree vect_create_new_slp_node (unsigned, tree_code); extern void vect_free_slp_tree (slp_tree);