Richard Biener <rguent...@suse.de> writes: > On Thu, 23 Jun 2022, Richard Sandiford wrote: >> In a reduction pair like: >> >> typedef float T; >> >> void >> f1 (T *x) >> { >> T res1 = 0; >> T res2 = 0; >> for (int i = 0; i < 100; ++i) >> { >> res1 += x[i * 2]; >> res2 += x[i * 2 + 1]; >> } >> x[0] = res1; >> x[1] = res2; >> } >> >> it isn't easy to predict whether the initial reduction group will >> be { res1, res2 } or { res2, res1 }. If the initial group ends up >> being { res2, res1 }, vect_optimize_slp will try to swap it around >> in order to remove an unncessary permute on the load. This gives: >> >> f1: >> .LFB0: >> .cfi_startproc >> movi v0.4s, 0 >> mov x1, x0 >> add x2, x0, 800 >> .p2align 3,,7 >> .L2: >> ldr q1, [x1], 16 >> fadd v0.4s, v0.4s, v1.4s >> cmp x2, x1 >> bne .L2 >> dup d1, v0.d[1] >> fadd v0.2s, v0.2s, v1.2s >> str d0, [x0] >> ret >> >> But there are no specific ordering constraints for non-consecutive >> loads. For example, in: >> >> void >> f2 (T *x) >> { >> T res1 = 0; >> T res2 = 0; >> for (int i = 0; i < 100; ++i) >> { >> res1 += x[i * 4]; >> res2 += x[i * 4 + 2]; >> } >> x[0] = res1; >> x[1] = res2; >> } >> >> the reduction group happens to be { res2, res1 }, and so we get >> a load permutation of { 2, 0, 6, 4 }. On aarch64 this gives: >> >> f2: >> .LFB1: >> .cfi_startproc >> adrp x2, .LC0 >> mov x1, x0 >> movi v2.4s, 0 >> ldr q3, [x2, #:lo12:.LC0] >> add x2, x0, 1568 >> .p2align 3,,7 >> .L6: >> ldp q0, q1, [x1] >> add x1, x1, 32 >> tbl v0.16b, {v0.16b - v1.16b}, v3.16b >> fadd v2.4s, v2.4s, v0.4s >> cmp x1, x2 >> bne .L6 >> ...untidy code, 18 insns... >> ret >> >> where we have to load the permute selector from memory and use >> the general TBL instruction. If the order happened to be { res1, res2 } >> instead we would generate the much tidier: >> >> f2: >> .LFB1: >> .cfi_startproc >> movi v1.4s, 0 >> mov x1, x0 >> add x2, x0, 1568 >> .p2align 3,,7 >> .L6: >> ldp q0, q2, [x1] >> add x1, x1, 32 >> uzp1 v0.4s, v0.4s, v2.4s >> fadd v1.4s, v1.4s, v0.4s >> cmp x1, x2 >> bne .L6 >> ldr d0, [x0, 1568] >> ldr d5, [x0, 1576] >> ldr s2, [x0, 1584] >> dup d3, v1.d[1] >> ldr s4, [x0, 1592] >> zip1 v0.2s, v0.2s, v5.2s >> ins v2.s[1], v4.s[0] >> fadd v0.2s, v0.2s, v2.2s >> fadd v0.2s, v0.2s, v1.2s >> fadd v0.2s, v0.2s, v3.2s >> str d0, [x0] >> ret >> >> This WIP patch makes vect_optimize_slp try to put load permutations >> into index order. However, this new transform might be a loss if it >> forces permutations elsewhere. For example, given: >> >> void >> f3 (T *restrict x, T *restrict y, T *restrict z) >> { >> for (int i = 0; i < 100; ++i) >> { >> x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2]; >> x[i * 2 + 1] = y[i * 4] + z[i * 4]; >> } >> } >> >> we would generate: >> >> .L9: >> ldr q0, [x1, x3] >> ldr q3, [x6, x3] >> ldr q1, [x2, x3] >> ldr q2, [x5, x3] >> add x3, x3, 32 >> uzp1 v0.4s, v0.4s, v3.4s >> uzp1 v1.4s, v1.4s, v2.4s >> fadd v0.4s, v0.4s, v1.4s >> rev64 v0.4s, v0.4s >> str q0, [x4], 16 >> cmp x3, 1568 >> bne .L9 >> >> (3 permutes) rather than: >> >> .L9: >> ldr q0, [x1, x3] >> ldr q1, [x6, x3] >> ldr q2, [x2, x3] >> ldr q3, [x5, x3] >> tbl v0.16b, {v0.16b - v1.16b}, v4.16b >> add x3, x3, 32 >> tbl v2.16b, {v2.16b - v3.16b}, v4.16b >> fadd v0.4s, v0.4s, v2.4s >> str q0, [x4], 16 >> cmp x3, 1568 >> bne .L9 >> >> (2 permutes). >> >> One simple(ish) fix would be to conditionalise the new case on >> whether all "roots" of the load are reduction groups. Alternatively, >> we could treat the new case as a soft preference and back out if it >> would require any new VEC_PERM_EXPR nodes to be created. This would >> require a propagation back from uses to definitions. >> >> WDYT? Does this look like a feasible direction? If so, any thoughts >> on when the new case should be enabled? >> >> The patch keeps the bijection requirement, since relaxing that seems >> like separate work. >> >> Tested on aarch64-linux-gnu, no regressions. >> >> Thanks, >> Richard >> >> --- >> gcc/tree-vect-slp.cc | 41 ++++++++++++++--------------------------- >> 1 file changed, 14 insertions(+), 27 deletions(-) >> >> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc >> index dab5daddcc5..fde35d8c954 100644 >> --- a/gcc/tree-vect-slp.cc >> +++ b/gcc/tree-vect-slp.cc >> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3. If not see >> <http://www.gnu.org/licenses/>. */ >> >> #include "config.h" >> +#define INCLUDE_ALGORITHM >> #include "system.h" >> #include "coretypes.h" >> #include "backend.h" >> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo) >> if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt)) >> continue; >> dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt); >> - unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0; >> - bool any_permute = false; >> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) >> - { >> - unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j]; >> - imin = MIN (imin, idx); >> - imax = MAX (imax, idx); >> - if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j) >> - any_permute = true; >> - } >> - /* If there's no permute no need to split one out. */ >> - if (!any_permute) >> - continue; >> - /* If the span doesn't match we'd disrupt VF computation, avoid >> - that for now. */ >> - if (imax - imin + 1 != SLP_TREE_LANES (node)) >> + >> + auto_vec<unsigned> sorted; >> + sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node)); >> + std::sort (sorted.begin (), sorted.end ()); >> + if (std::equal (sorted.begin (), sorted.end (), >> + SLP_TREE_LOAD_PERMUTATION (node).begin ())) >> continue; >> >> /* For now only handle true permutes, like >> vect_attempt_slp_rearrange_stmts did. This allows us to be lazy >> when permuting constants and invariants keeping the permute >> bijective. */ >> - auto_sbitmap load_index (SLP_TREE_LANES (node)); >> - bitmap_clear (load_index); >> - for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) >> - bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin); >> - unsigned j; >> - for (j = 0; j < SLP_TREE_LANES (node); ++j) >> - if (!bitmap_bit_p (load_index, j)) >> - break; >> - if (j != SLP_TREE_LANES (node)) >> + if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end >> ()) >> continue; > > So without the std:: rewrite this is the only change, where > previously we rejected { 0, 2, ... } because '1' was missing > but now we only reject duplicates?
All three changes work together really. I thought the std:: stuff made things simpler, but the things that use std:: would still need to be rewritten if the patch didn't use std::. For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La) to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }. So the “SLP_TREE_LOAD_PERMUTATION (node)[j] - imin” below is no longer enough when computing La, since subtracting the minimum doesn't have the necessary compressive effect. We instead need to set La[i] to the index of Lo[i] in Lo'. And the easiest way of doing that seemed to be to compute Lo' (by sorting) and then doing a binary search for each element. Having the sorted array also makes it easy to see whether the permute is a no-op and whether the same source element appears twice. > As the comment says this > is what vect_attempt_slp_rearrange_stmts did so we would need > to adjust the comment as well. I thought the comment meant the that permute entered into the perm array is a bijection on [0, SLP_TREE_LANES (node) - 1]. That's still true with the patch, and so for example we still don't need to worry about: /* ??? But for constants once we want to handle non-bijective permutes we have to verify the permute, when unifying lanes, will not unify different constants. For example see gcc.dg/vect/bb-slp-14.c for a case that would break. */ (I hope). > (I'm unsure about the std:: stuff but mostly because I'm not > familiar with it and its semantics) > > The current "propagation" isn't really designed to find > optimal solutions, it rather relies on the initial > permute "anticipation" we set up and simply propagates > those up as far as possible. I think at some point we need > to come up with something more elaborate and rewrite this all. > > It might also be that a target handles { 2, 0 } but not > { 0, 2 } while we know it handles the unpermuted case always. OK. I guess we should reject new permutes if vect_transform_slp_perm_load is false for them and true for the original. > Can you think of a way to express this as dataflow problem > capturing finding the optimal solution as well? I think it's inevitably going to be a two-way thing: propagate what's best for producers as far as possible, then propagate back what's best for consumers as far as possible, or the other way around. Thanks, Richard > >> vec<unsigned> perm = vNULL; >> perm.safe_grow (SLP_TREE_LANES (node), true); >> for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j) >> - perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin; >> + { >> + auto it = std::lower_bound (sorted.begin (), sorted.end (), >> + SLP_TREE_LOAD_PERMUTATION (node)[j]); >> + perm[j] = it - sorted.begin (); >> + } >> perms.safe_push (perm); >> vertices[idx].perm_in = perms.length () - 1; >> vertices[idx].perm_out = perms.length () - 1; >> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo, >> { >> stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0]; >> int vec_index = 0; >> - tree vectype = STMT_VINFO_VECTYPE (stmt_info); >> + tree vectype = SLP_TREE_VECTYPE (node); > > that looks like an obvious fix. > > Richard. > >> unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length (); >> unsigned int mask_element; >> machine_mode mode; >>