On Mon, 10 Jun 2024, Manolis Tsamis wrote: > On Wed, Jun 5, 2024 at 11:07 AM Richard Biener <rguent...@suse.de> wrote: > > > > On Tue, 4 Jun 2024, Manolis Tsamis wrote: > > > > > This change adds a function that checks for SLP nodes with multiple > > > occurrences > > > of the same statement (e.g. {A, B, A, B, ...}) and tries to rearrange the > > > node > > > so that there are no duplicates. A vec_perm is then introduced to > > > recreate the > > > original ordering. These duplicates can appear due to how two_operators > > > nodes > > > are handled, and they prevent vectorization in some cases. > > > > So the trick is that when we have two operands we elide duplicate lanes > > so we can do discovery for a single combined operand instead which we > > then decompose into the required two again. That's a nice one. > > > > But as implemented this will fail SLP discovery if the combined operand > > fails discovery possibly because of divergence in downstream defs. That > > is, it doesn't fall back to separate discovery. I suspect the situation > > of duplicate lanes isn't common but then I would also suspect that > > divergence _is_ common. > > > That's a good point; I checked out and at least for the x264 testcase > provided SLP discovery succeeds in both cases but in one case > vectorization fails later on due to the unsupported load permutations > among others. > I think that's what Tamar also mentioned and it makes it hard to > decide whether to apply the pattern based on if discovery fails. > > > The discovery code is already quite complex with the way it possibly > > swaps operands of lanes, fitting in this as another variant to try (first) > > is likely going to be a bit awkward. A way out might be to split the > > function or to make the re-try in the caller which could indicate whether > > to apply this pattern trick or not. That said - can you try to get > > data on how often the trick applies and discovery succeeds and how > > often discovery fails but discovery would suceed without applying the > > pattern (say, on SPEC)? > > > I checked out SPEC and this pattern only triggers on x264 and in that > case discovery succeeds. So we don't have any data on the pattern > applying but discovery failing. > > > I also suppose instead of hardcoding three patterns for a fixed > > size it should be possible to see there's > > only (at most) half unique lanes in both operands (and one less in one > > operand if the number of lanes is odd) and compute the un-swizzling lane > > permutes during this discovery, removing the need of the explicit enum > > and open-coding each case? > > > Yes, that's a fair point. I will change that in the next iteration. > > > Another general note is that trying (and then undo on fail) such ticks > > eats at the discovery limit we have in place to avoid exponential run-off > > in exactly this degenerate cases. > > > > So, most importantly, the points you and Tamar mentioned got me > thinking about the transformation again, why it is useful and when it > applies. > In this initial implementation I tried to make this independant from > the two_operators logic and apply it when possible, which brings up > all these issues about discovery and usefulness of the pattern in > general. > E.g. If we had just [a, b, a, b] + [c, d, c, d] without two_operators > I sort of doubt it would be worth it to apply the transformation in > most cases (except of course if it enables vectorization, but as I > understand it it is hard to tell when that happens). > On the other hand, if we know that we're dealing with two_operators > nodes then the argument changes, as we know that we'll duplicate these > nodes. > > In turn, it may be best to try to see this as a 'two_operators > lowering strategy' improvement instead of a generic rearrangement > pattern. > Specifically for x264, we're given code like > > int t0 = s0 + s1; > int t1 = s0 - s1; > int t2 = s2 + s3; > int t3 = s2 - s3; > > and currently we lower that to VEC_PERM<(A + B), (A - B)>(...) with A > = [s0, s0, s2, s2], B = [s1, s1, s3, s3] which doesn't work very well > (due to element duplication). > With this patch we do VEC_PERM<(A + B), (A - B)>(...) with A = > VEC_PERM<C, C>(...), B = VEC_PERM<C, C>(...), C = [s0, s1, s2, s3] > instead which works good.
So I'm not sure I buy the argument that [a, b, a, b] + [c, d, c, d] is much different from the two-operators version. [a, b, a, b] + [c, d, c, d] is one of the variants (the plus) we discover. Isn't this really about us stupidly trying to force the use of a larger vector rather than doing the two-operator handling by doing VEC_PERM <{s0, s2} + {s1, s3}, {s0, s2} - {s1, s3}, { 0, 2, 1, 3 }> and thus recursing with smaller group size and then "interleaving" the result? Which might only work well in exactly the case where we have the same number of + and -. Of course in both forms 'A' and 'B' (or the smaller vectors) should be SLP nodes re-used. > But it is obvious that there are other strategies to lower this too > and they may be even better (by taking advantage of the fact that we > know we're dealing with a two_operators node *and* have duplicate > elements). > For example doing VEC_PERM<(A + B), (A - B)>(...) with A = [s0, s1, > s2, s3] and B = VEC_PERM<A, A>(1, 0, 3, 2) looks interesting too and > is only possible because we combine two_operators and rearrangement. > > Do you believe that narrowing this to a "two_operators lowering > improvement" makes more sense and addresses at least some of the > issues mentioned? > I'm currently testing to see the code that we generate with other > strategies and will reach out once I have new results. I think doing this in the if (two_operators) code in vect_build_slp_tree_2 only where we can take advantage of knowing that SLP build for the original A and B children succeeded would be good. It should side-step the discovery cost issue. If you do your inter-mixing scheme you'll still need to re-discover but FAILing should be possible easily by falling back to the already built children nodes? Richard. > Thanks, > Manolis > > > Thanks, > > Richard. > > > > > This targets the vectorization of the SPEC2017 x264 pixel_satd functions. > > > In some processors a larger than 10% improvement on x264 has been > > > observed. > > > > > > See also: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138 > > > > > > gcc/ChangeLog: > > > > > > * tree-vect-slp.cc (enum slp_oprnd_pattern): new enum for > > > rearrangement > > > patterns. > > > (try_rearrange_oprnd_info): Detect if a node corresponds to one of > > > the > > > patterns. > > > > > > gcc/testsuite/ChangeLog: > > > > > > * gcc.target/aarch64/vect-slp-two-operator.c: New test. > > > > > > Signed-off-by: Manolis Tsamis <manolis.tsa...@vrull.eu> > > > --- > > > > > > .../aarch64/vect-slp-two-operator.c | 42 ++++ > > > gcc/tree-vect-slp.cc | 234 ++++++++++++++++++ > > > 2 files changed, 276 insertions(+) > > > create mode 100644 > > > gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > > > > diff --git a/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > new file mode 100644 > > > index 00000000000..2db066a0b6e > > > --- /dev/null > > > +++ b/gcc/testsuite/gcc.target/aarch64/vect-slp-two-operator.c > > > @@ -0,0 +1,42 @@ > > > +/* { dg-do compile } */ > > > +/* { dg-options "-O2 -ftree-vectorize -fdump-tree-vect > > > -fdump-tree-vect-details" } */ > > > + > > > +typedef unsigned char uint8_t; > > > +typedef unsigned int uint32_t; > > > + > > > +#define HADAMARD4(d0, d1, d2, d3, s0, s1, s2, s3) {\ > > > + int t0 = s0 + s1;\ > > > + int t1 = s0 - s1;\ > > > + int t2 = s2 + s3;\ > > > + int t3 = s2 - s3;\ > > > + d0 = t0 + t2;\ > > > + d1 = t1 + t3;\ > > > + d2 = t0 - t2;\ > > > + d3 = t1 - t3;\ > > > +} > > > + > > > +static uint32_t abs2( uint32_t a ) > > > +{ > > > + uint32_t s = ((a>>15)&0x10001)*0xffff; > > > + return (a+s)^s; > > > +} > > > + > > > +void sink(uint32_t tmp[4][4]); > > > + > > > +int x264_pixel_satd_8x4( uint8_t *pix1, int i_pix1, uint8_t *pix2, int > > > i_pix2 ) > > > +{ > > > + uint32_t tmp[4][4]; > > > + int sum = 0; > > > + for( int i = 0; i < 4; i++, pix1 += i_pix1, pix2 += i_pix2 ) > > > + { > > > + uint32_t a0 = (pix1[0] - pix2[0]) + ((pix1[4] - pix2[4]) << 16); > > > + uint32_t a1 = (pix1[1] - pix2[1]) + ((pix1[5] - pix2[5]) << 16); > > > + uint32_t a2 = (pix1[2] - pix2[2]) + ((pix1[6] - pix2[6]) << 16); > > > + uint32_t a3 = (pix1[3] - pix2[3]) + ((pix1[7] - pix2[7]) << 16); > > > + HADAMARD4( tmp[i][0], tmp[i][1], tmp[i][2], tmp[i][3], > > > a0,a1,a2,a3 ); > > > + } > > > + sink(tmp); > > > +} > > > + > > > +/* { dg-final { scan-tree-dump "vectorizing stmts using SLP" "vect" } } > > > */ > > > +/* { dg-final { scan-tree-dump-times "vectorized 1 loops" 1 "vect" } } */ > > > diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc > > > index bf1f467f53f..e395db0e185 100644 > > > --- a/gcc/tree-vect-slp.cc > > > +++ b/gcc/tree-vect-slp.cc > > > @@ -40,6 +40,7 @@ along with GCC; see the file COPYING3. If not see > > > #include "tree-vectorizer.h" > > > #include "langhooks.h" > > > #include "gimple-walk.h" > > > +#include "gimple-pretty-print.h" > > > #include "dbgcnt.h" > > > #include "tree-vector-builder.h" > > > #include "vec-perm-indices.h" > > > @@ -1829,6 +1830,141 @@ vect_slp_build_two_operator_nodes (slp_tree perm, > > > tree vectype, > > > SLP_TREE_CHILDREN (perm).quick_push (child2); > > > } > > > > > > +enum slp_oprnd_pattern > > > +{ > > > + SLP_OPRND_PATTERN_NONE, > > > + SLP_OPRND_PATTERN_ABAB, > > > + SLP_OPRND_PATTERN_AABB, > > > + SLP_OPRND_PATTERN_ABBA > > > +}; > > > + > > > +/* Check if OPRNDS_INFO has duplicated nodes that correspond to a > > > predefined > > > + pattern described by SLP_OPRND_PATTERN and return it. */ > > > + > > > +static int > > > +try_rearrange_oprnd_info (vec<slp_oprnd_info> &oprnds_info, unsigned > > > group_size) > > > +{ > > > + unsigned i; > > > + slp_oprnd_info info; > > > + > > > + if (oprnds_info.length () != 2 || group_size % 4 != 0) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (!oprnds_info[0]->def_stmts[0] > > > + || !is_a<gassign *> (oprnds_info[0]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + enum tree_code code > > > + = gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt); > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 1) > > > + { > > > + if (!info->def_stmts[j] > > > + || !is_a<gassign *> (info->def_stmts[j]->stmt) > > > + || STMT_VINFO_DATA_REF (info->def_stmts[j])) > > > + return SLP_OPRND_PATTERN_NONE; > > > + /* Don't mix different operations. */ > > > + if (gimple_assign_rhs_code (info->def_stmts[j]->stmt) != code) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + if (gimple_assign_rhs_code (oprnds_info[0]->def_stmts[0]->stmt) > > > + != gimple_assign_rhs_code (oprnds_info[1]->def_stmts[0]->stmt)) > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + int pattern = SLP_OPRND_PATTERN_NONE; > > > + FOR_EACH_VEC_ELT (oprnds_info, i, info) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + int cur_pattern = SLP_OPRND_PATTERN_NONE; > > > + /* Check for an ABAB... pattern. */ > > > + if ((info->def_stmts[j] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABAB; > > > + /* Check for an AABB... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 1]) > > > + && (info->def_stmts[j + 2] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 2])) > > > + cur_pattern = SLP_OPRND_PATTERN_AABB; > > > + /* Check for an ABBA... pattern. */ > > > + else if ((info->def_stmts[j] == info->def_stmts[j + 3]) > > > + && (info->def_stmts[j + 1] == info->def_stmts[j + 2]) > > > + && (info->def_stmts[j] != info->def_stmts[j + 1])) > > > + cur_pattern = SLP_OPRND_PATTERN_ABBA; > > > + /* Unrecognised pattern. */ > > > + else > > > + return SLP_OPRND_PATTERN_NONE; > > > + > > > + if (pattern == SLP_OPRND_PATTERN_NONE) > > > + pattern = cur_pattern; > > > + /* Multiple patterns detected. */ > > > + else if (cur_pattern != pattern) > > > + return SLP_OPRND_PATTERN_NONE; > > > + } > > > + > > > + gcc_checking_assert (pattern != SLP_OPRND_PATTERN_NONE); > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + if (pattern == SLP_OPRND_PATTERN_ABAB) > > > + dump_printf (MSG_NOTE, "ABAB"); > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + dump_printf (MSG_NOTE, "AABB"); > > > + else if (pattern == SLP_OPRND_PATTERN_ABBA) > > > + dump_printf (MSG_NOTE, "ABBA"); > > > + dump_printf (MSG_NOTE, " pattern detected.\n"); > > > + } > > > + > > > + if (pattern == SLP_OPRND_PATTERN_ABAB || pattern == > > > SLP_OPRND_PATTERN_ABBA) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Given oprnd[1] -> C1, D1, C1, D1, C2, D2, C2, D2, ... > > > + Create a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... */ > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+1]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+1]; > > > + } > > > + else if (pattern == SLP_OPRND_PATTERN_AABB) > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + /* Given oprnd[0] -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Given oprnd[1] -> C1, C1, D1, D1, C2, C2, D2, D2, ... > > > + Create a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... */ > > > + > > > + /* The ordering here is at least to some extent arbitrary. > > > + A generilized version needs to use some explicit ordering. */ > > > + oprnds_info[0]->def_stmts[j+1] = oprnds_info[1]->def_stmts[j]; > > > + oprnds_info[0]->ops[j+1] = oprnds_info[1]->ops[j]; > > > + oprnds_info[0]->def_stmts[j+2] = oprnds_info[0]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+2] = oprnds_info[0]->ops[j+2]; > > > + oprnds_info[0]->def_stmts[j+3] = oprnds_info[1]->def_stmts[j+2]; > > > + oprnds_info[0]->ops[j+3] = oprnds_info[1]->ops[j+2]; > > > + } > > > + > > > + if (dump_enabled_p ()) > > > + { > > > + dump_printf (MSG_NOTE, "Recurse with:\n"); > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + dump_printf (MSG_NOTE, " "); > > > + print_gimple_stmt (dump_file, oprnds_info[0]->def_stmts[j]->stmt, > > > 0); > > > + } > > > + } > > > + > > > + /* Since we've merged the two nodes in one, make the second one a copy > > > of > > > + the first. */ > > > + for (unsigned int j = 0; j < group_size; j++) > > > + { > > > + oprnds_info[1]->def_stmts[j] = oprnds_info[0]->def_stmts[j]; > > > + oprnds_info[1]->ops[j] = oprnds_info[0]->ops[j]; > > > + } > > > + > > > + return pattern; > > > +} > > > + > > > /* Recursively build an SLP tree starting from NODE. > > > Fail (and return a value not equal to zero) if def-stmts are not > > > isomorphic, require data permutation or are of unsupported types of > > > @@ -2409,6 +2545,10 @@ out: > > > > > > stmt_info = stmts[0]; > > > > > > + int rearrange_pattern = SLP_OPRND_PATTERN_NONE; > > > + if (is_a<gassign *> (stmt_info->stmt)) > > > + rearrange_pattern = try_rearrange_oprnd_info (oprnds_info, > > > group_size); > > > + > > > /* Create SLP_TREE nodes for the definition node/s. */ > > > FOR_EACH_VEC_ELT (oprnds_info, i, oprnd_info) > > > { > > > @@ -2669,6 +2809,100 @@ fail: > > > *tree_size += this_tree_size + 1; > > > *max_nunits = this_max_nunits; > > > > > > + /* If we applied any rearrangmenets then we need to reconstruct the > > > original > > > + elements with an additional permutation layer. */ > > > + if (rearrange_pattern != SLP_OPRND_PATTERN_NONE) > > > + { > > > + slp_tree one = new _slp_tree; > > > + slp_tree two = new _slp_tree; > > > + SLP_TREE_DEF_TYPE (one) = vect_internal_def; > > > + SLP_TREE_DEF_TYPE (two) = vect_internal_def; > > > + SLP_TREE_VECTYPE (one) = vectype; > > > + SLP_TREE_VECTYPE (two) = vectype; > > > + SLP_TREE_CHILDREN (one).safe_splice (children); > > > + SLP_TREE_CHILDREN (two).safe_splice (children); > > > + > > > + SLP_TREE_CODE (one) = VEC_PERM_EXPR; > > > + SLP_TREE_CODE (two) = VEC_PERM_EXPR; > > > + SLP_TREE_REPRESENTATIVE (one) = stmts[0]; > > > + SLP_TREE_REPRESENTATIVE (two) = stmts[2]; > > > + lane_permutation_t &perm_one = SLP_TREE_LANE_PERMUTATION (one); > > > + lane_permutation_t &perm_two = SLP_TREE_LANE_PERMUTATION (two); > > > + > > > + if (rearrange_pattern == SLP_OPRND_PATTERN_ABAB) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, A1, B1, A2, B2, A2, B2, ... > > > + Create node "two" -> C1, D1, C1, D1, C2, D2, C2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_AABB) > > > + { > > > + /* Given a single node -> A1, C1, B1, D1, A2, C2, B2, D2, ... > > > + Create node "one" -> A1, A1, B1, B1, A2, A2, B2, B2, ... > > > + Create node "two" -> C1, C1, D1, D1, C2, C2, D2, D2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + perm_one.safe_push (std::make_pair (0, j + 2)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 1)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + } > > > + } > > > + else if (rearrange_pattern == SLP_OPRND_PATTERN_ABBA) > > > + { > > > + /* Given a single node -> A1, B1, C1, D1, A2, B2, C2, D2, ... > > > + Create node "one" -> A1, B1, B1, A1, A2, B2, B2, A2, ... > > > + Create node "two" -> C1, D1, D1, C1, C2, D2, D2, C2, ... */ > > > + > > > + for (unsigned int j = 0; j < group_size; j += 4) > > > + { > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 1)); > > > + perm_one.safe_push (std::make_pair (0, j + 0)); > > > + > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 3)); > > > + perm_two.safe_push (std::make_pair (0, j + 2)); > > > + } > > > + } > > > + > > > + slp_tree child; > > > + FOR_EACH_VEC_ELT (SLP_TREE_CHILDREN (two), i, child) > > > + SLP_TREE_REF_COUNT (child)++; > > > + > > > + node = vect_create_new_slp_node (node, stmts, 2); > > > + SLP_TREE_VECTYPE (node) = vectype; > > > + SLP_TREE_CHILDREN (node).quick_push (one); > > > + SLP_TREE_CHILDREN (node).quick_push (two); > > > + > > > + SLP_TREE_LANES (one) = stmts.length (); > > > + SLP_TREE_LANES (two) = stmts.length (); > > > + > > > + children.truncate (0); > > > + children.safe_splice (SLP_TREE_CHILDREN (node)); > > > + } > > > + > > > if (two_operators) > > > { > > > /* ??? We'd likely want to either cache in bst_map sth like > > > > > > > -- > > Richard Biener <rguent...@suse.de> > > SUSE Software Solutions Germany GmbH, > > Frankenstrasse 146, 90461 Nuernberg, Germany; > > GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg) > -- Richard Biener <rguent...@suse.de> SUSE Software Solutions Germany GmbH, Frankenstrasse 146, 90461 Nuernberg, Germany; GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)