Richard Biener <rguent...@suse.de> writes:
> On Thu, 23 Jun 2022, Richard Sandiford wrote:
>> In a reduction pair like:
>> 
>>   typedef float T;
>> 
>>   void
>>   f1 (T *x)
>>   {
>>     T res1 = 0;
>>     T res2 = 0;
>>     for (int i = 0; i < 100; ++i)
>>       {
>>         res1 += x[i * 2];
>>         res2 += x[i * 2 + 1];
>>       }
>>     x[0] = res1;
>>     x[1] = res2;
>>   }
>> 
>> it isn't easy to predict whether the initial reduction group will
>> be { res1, res2 } or { res2, res1 }.  If the initial group ends up
>> being { res2, res1 }, vect_optimize_slp will try to swap it around
>> in order to remove an unnecessary permute on the load.  This gives:
>> 
>> f1:
>> .LFB0:
>>         .cfi_startproc
>>         movi    v0.4s, 0
>>         mov     x1, x0
>>         add     x2, x0, 800
>>         .p2align 3,,7
>> .L2:
>>         ldr     q1, [x1], 16
>>         fadd    v0.4s, v0.4s, v1.4s
>>         cmp     x2, x1
>>         bne     .L2
>>         dup     d1, v0.d[1]
>>         fadd    v0.2s, v0.2s, v1.2s
>>         str     d0, [x0]
>>         ret
>> 
>> But there are no specific ordering constraints for non-consecutive
>> loads.  For example, in:
>> 
>>   void
>>   f2 (T *x)
>>   {
>>     T res1 = 0;
>>     T res2 = 0;
>>     for (int i = 0; i < 100; ++i)
>>       {
>>         res1 += x[i * 4];
>>         res2 += x[i * 4 + 2];
>>       }
>>     x[0] = res1;
>>     x[1] = res2;
>>   }
>> 
>> the reduction group happens to be { res2, res1 }, and so we get
>> a load permutation of { 2, 0, 6, 4 }.  On aarch64 this gives:
>> 
>> f2:
>> .LFB1:
>>         .cfi_startproc
>>         adrp    x2, .LC0
>>         mov     x1, x0
>>         movi    v2.4s, 0
>>         ldr     q3, [x2, #:lo12:.LC0]
>>         add     x2, x0, 1568
>>         .p2align 3,,7
>> .L6:
>>         ldp     q0, q1, [x1]
>>         add     x1, x1, 32
>>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
>>         fadd    v2.4s, v2.4s, v0.4s
>>         cmp     x1, x2
>>         bne     .L6
>>      ...untidy code, 18 insns...
>>         ret
>> 
>> where we have to load the permute selector from memory and use
>> the general TBL instruction.  If the order happened to be { res1, res2 }
>> instead we would generate the much tidier:
>> 
>> f2:
>> .LFB1:
>>         .cfi_startproc
>>         movi    v1.4s, 0
>>         mov     x1, x0
>>         add     x2, x0, 1568
>>         .p2align 3,,7
>> .L6:
>>         ldp     q0, q2, [x1]
>>         add     x1, x1, 32
>>         uzp1    v0.4s, v0.4s, v2.4s
>>         fadd    v1.4s, v1.4s, v0.4s
>>         cmp     x1, x2
>>         bne     .L6
>>         ldr     d0, [x0, 1568]
>>         ldr     d5, [x0, 1576]
>>         ldr     s2, [x0, 1584]
>>         dup     d3, v1.d[1]
>>         ldr     s4, [x0, 1592]
>>         zip1    v0.2s, v0.2s, v5.2s
>>         ins     v2.s[1], v4.s[0]
>>         fadd    v0.2s, v0.2s, v2.2s
>>         fadd    v0.2s, v0.2s, v1.2s
>>         fadd    v0.2s, v0.2s, v3.2s
>>         str     d0, [x0]
>>         ret
>> 
>> This WIP patch makes vect_optimize_slp try to put load permutations
>> into index order.  However, this new transform might be a loss if it
>> forces permutations elsewhere.  For example, given:
>> 
>> void
>> f3 (T *restrict x, T *restrict y, T *restrict z)
>> {
>>   for (int i = 0; i < 100; ++i)
>>     {
>>       x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
>>       x[i * 2 + 1] = y[i * 4] + z[i * 4];
>>     }
>> }
>> 
>> we would generate:
>> 
>> .L9:
>>         ldr     q0, [x1, x3]
>>         ldr     q3, [x6, x3]
>>         ldr     q1, [x2, x3]
>>         ldr     q2, [x5, x3]
>>         add     x3, x3, 32
>>         uzp1    v0.4s, v0.4s, v3.4s
>>         uzp1    v1.4s, v1.4s, v2.4s
>>         fadd    v0.4s, v0.4s, v1.4s
>>         rev64   v0.4s, v0.4s
>>         str     q0, [x4], 16
>>         cmp     x3, 1568
>>         bne     .L9
>> 
>> (3 permutes) rather than:
>> 
>> .L9:
>>         ldr     q0, [x1, x3]
>>         ldr     q1, [x6, x3]
>>         ldr     q2, [x2, x3]
>>         ldr     q3, [x5, x3]
>>         tbl     v0.16b, {v0.16b - v1.16b}, v4.16b
>>         add     x3, x3, 32
>>         tbl     v2.16b, {v2.16b - v3.16b}, v4.16b
>>         fadd    v0.4s, v0.4s, v2.4s
>>         str     q0, [x4], 16
>>         cmp     x3, 1568
>>         bne     .L9
>> 
>> (2 permutes).
>> 
>> One simple(ish) fix would be to conditionalise the new case on
>> whether all "roots" of the load are reduction groups.  Alternatively,
>> we could treat the new case as a soft preference and back out if it
>> would require any new VEC_PERM_EXPR nodes to be created.  This would
>> require a propagation back from uses to definitions.
>> 
>> WDYT?  Does this look like a feasible direction?  If so, any thoughts
>> on when the new case should be enabled?
>> 
>> The patch keeps the bijection requirement, since relaxing that seems
>> like separate work.
>> 
>> Tested on aarch64-linux-gnu, no regressions.
>> 
>> Thanks,
>> Richard
>> 
>> ---
>>  gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
>>  1 file changed, 14 insertions(+), 27 deletions(-)
>> 
>> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
>> index dab5daddcc5..fde35d8c954 100644
>> --- a/gcc/tree-vect-slp.cc
>> +++ b/gcc/tree-vect-slp.cc
>> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
>>  <http://www.gnu.org/licenses/>.  */
>>  
>>  #include "config.h"
>> +#define INCLUDE_ALGORITHM
>>  #include "system.h"
>>  #include "coretypes.h"
>>  #include "backend.h"
>> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
>>        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>>      continue;
>>        dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
>> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
>> -      bool any_permute = false;
>> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> -    {
>> -      unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
>> -      imin = MIN (imin, idx);
>> -      imax = MAX (imax, idx);
>> -      if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
>> -        any_permute = true;
>> -    }
>> -      /* If there's no permute no need to split one out.  */
>> -      if (!any_permute)
>> -    continue;
>> -      /* If the span doesn't match we'd disrupt VF computation, avoid
>> -     that for now.  */
>> -      if (imax - imin + 1 != SLP_TREE_LANES (node))
>> +
>> +      auto_vec<unsigned> sorted;
>> +      sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
>> +      std::sort (sorted.begin (), sorted.end ());
>> +      if (std::equal (sorted.begin (), sorted.end (),
>> +                  SLP_TREE_LOAD_PERMUTATION (node).begin ()))
>>      continue;
>>  
>>        /* For now only handle true permutes, like
>>       vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
>>       when permuting constants and invariants keeping the permute
>>       bijective.  */
>> -      auto_sbitmap load_index (SLP_TREE_LANES (node));
>> -      bitmap_clear (load_index);
>> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> -    bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
>> -      unsigned j;
>> -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
>> -    if (!bitmap_bit_p (load_index, j))
>> -      break;
>> -      if (j != SLP_TREE_LANES (node))
>> +      if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
>>      continue;
>
> So without the std:: rewrite this is the only change, where
> previously we rejected { 0, 2, ... } because '1' was missing
> but now we only reject duplicates?

All three changes really work together.  I thought the std:: routines
made things simpler, but the code that uses them would still need to be
rewritten in some form if the patch avoided std::.

For a load permute (Lo) like { 2, 0, 6, 4 } we want the lane permute (La)
to be { 1, 0, 3, 2 }, so that the new load permute (Lo') is { 0, 2, 4, 6 }.

So the “SLP_TREE_LOAD_PERMUTATION (node)[j] - imin” below is no longer
enough when computing La: subtracting the minimum doesn't compress the
indices into the dense [0, nlanes - 1] range that La needs.  We instead
need to set La[i] to the index of Lo[i] in Lo'.  The easiest way of
doing that seemed to be to compute Lo' (by sorting) and then do a
binary search for each element.  Having the sorted array also makes it
easy to see whether the permute is a no-op and whether the same source
element appears twice.
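
Standalone, using plain std:: containers rather than GCC's vec/auto_vec
(so this is only an illustration of the computation, not the patch's
exact code):

  #include <algorithm>
  #include <cstdio>
  #include <vector>

  int
  main ()
  {
    /* The original load permute Lo.  */
    std::vector<unsigned> lo = { 2, 0, 6, 4 };

    /* Lo' is Lo sorted into index order.  */
    std::vector<unsigned> sorted = lo;
    std::sort (sorted.begin (), sorted.end ());

    /* La[i] is the position of Lo[i] in Lo', found by binary search.  */
    std::vector<unsigned> la (lo.size ());
    for (unsigned i = 0; i < lo.size (); ++i)
      la[i] = (std::lower_bound (sorted.begin (), sorted.end (), lo[i])
               - sorted.begin ());

    /* Prints La = { 1 0 3 2 }.  */
    printf ("La = {");
    for (unsigned idx : la)
      printf (" %u", idx);
    printf (" }\n");
  }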

> As the comment says this
> is what vect_attempt_slp_rearrange_stmts did so we would need
> to adjust the comment as well.

I thought the comment meant that the permute entered into the perm
array is a bijection on [0, SLP_TREE_LANES (node) - 1].  That's still
true with the patch, and so for example we still don't need to worry
about:

                  /* ???  But for constants once we want to handle
                     non-bijective permutes we have to verify the permute,
                     when unifying lanes, will not unify different constants.
                     For example see gcc.dg/vect/bb-slp-14.c for a case
                     that would break.  */

(I hope).

> (I'm unsure about the std:: stuff but mostly because I'm not
> familiar with it and its semantics)
>
> The current "propagation" isn't really designed to find
> optimal solutions, it rather relies on the initial
> permute "anticipation" we set up and simply propagates
> those up as far as possible.  I think at some point we need
> to come up with something more elaborate and rewrite this all.
>
> It might also be that a target handles { 2, 0 } but not
> { 0, 2 } while we know it handles the unpermuted case always.

OK.  I guess we should reject a new permute if vect_transform_slp_perm_load
returns false for it but true for the original.
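
In the patch that would be an extra bail-out at the point where we
install the sorted permute, something like the fragment below
(check_perm_supported_p is a hypothetical wrapper that would run
vect_transform_slp_perm_load in analysis-only mode; I haven't checked
the exact interface):

      /* Hypothetical guard: keep the original permute if the target
         can handle it but not the sorted one.  */
      if (!check_perm_supported_p (vinfo, node, sorted)
          && check_perm_supported_p (vinfo, node,
                                     SLP_TREE_LOAD_PERMUTATION (node)))
        continue;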

> Can you think of a way to express this as dataflow problem
> capturing finding the optimal solution as well?

I think it's inevitably going to be a two-way thing: propagate what's
best for producers as far as possible, then propagate back what's
best for consumers as far as possible, or the other way around.
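
As a very rough sketch of the shape I have in mind (every type and
helper below is invented for illustration; it also ignores cycles and
the cost model):

  #include <vector>

  /* Toy model of an SLP node: CHILDREN are the operand (definition)
     nodes, and each pass records the lane order it would prefer.  */
  struct slp_node_model
  {
    std::vector<unsigned> producer_pref;   /* filled in by pass 1 */
    std::vector<unsigned> consumer_pref;   /* filled in by pass 2 */
    std::vector<slp_node_model *> children;
  };

  /* Hypothetical merge rules, declarations only.  */
  std::vector<unsigned> merge_producer_prefs (slp_node_model *);
  std::vector<unsigned> merge_consumer_prefs (slp_node_model *);

  /* NODES is ordered so that definitions come before uses.  */
  void
  propagate_lane_prefs (std::vector<slp_node_model *> &nodes)
  {
    /* Pass 1: push each load's preferred (index-order) permute
       towards the roots, as far as every user can absorb it.  */
    for (slp_node_model *node : nodes)
      node->producer_pref = merge_producer_prefs (node);

    /* Pass 2: push each root's required lane order back towards
       the loads.  A node where the two passes disagree is where a
       VEC_PERM_EXPR would have to be materialized, so the final
       choice would try to minimize the number of such nodes.  */
    for (auto it = nodes.rbegin (); it != nodes.rend (); ++it)
      (*it)->consumer_pref = merge_consumer_prefs (*it);
  }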

Thanks,
Richard

>
>>        vec<unsigned> perm = vNULL;
>>        perm.safe_grow (SLP_TREE_LANES (node), true);
>>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
>> -    perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
>> +    {
>> +      auto it = std::lower_bound (sorted.begin (), sorted.end (),
>> +                                  SLP_TREE_LOAD_PERMUTATION (node)[j]);
>> +      perm[j] = it - sorted.begin ();
>> +    }
>>        perms.safe_push (perm);
>>        vertices[idx].perm_in = perms.length () - 1;
>>        vertices[idx].perm_out = perms.length () - 1;
>> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
>>  {
>>    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>>    int vec_index = 0;
>> -  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>> +  tree vectype = SLP_TREE_VECTYPE (node);
>
> that looks like an obvious fix.
>
> Richard.
>
>>    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>>    unsigned int mask_element;
>>    machine_mode mode;
>> 
