On Thu, 23 Jun 2022, Richard Sandiford wrote:

> In a reduction pair like:
>
> typedef float T;
>
> void
> f1 (T *x)
> {
>   T res1 = 0;
>   T res2 = 0;
>   for (int i = 0; i < 100; ++i)
>     {
>       res1 += x[i * 2];
>       res2 += x[i * 2 + 1];
>     }
>   x[0] = res1;
>   x[1] = res2;
> }
>
> it isn't easy to predict whether the initial reduction group will
> be { res1, res2 } or { res2, res1 }.  If the initial group ends up
> being { res2, res1 }, vect_optimize_slp will try to swap it around
> in order to remove an unnecessary permute on the load.  This gives:
>
> f1:
> .LFB0:
>         .cfi_startproc
>         movi    v0.4s, 0
>         mov     x1, x0
>         add     x2, x0, 800
>         .p2align 3,,7
> .L2:
>         ldr     q1, [x1], 16
>         fadd    v0.4s, v0.4s, v1.4s
>         cmp     x2, x1
>         bne     .L2
>         dup     d1, v0.d[1]
>         fadd    v0.2s, v0.2s, v1.2s
>         str     d0, [x0]
>         ret
>
> But there are no specific ordering constraints for non-consecutive
> loads.  For example, in:
>
> void
> f2 (T *x)
> {
>   T res1 = 0;
>   T res2 = 0;
>   for (int i = 0; i < 100; ++i)
>     {
>       res1 += x[i * 4];
>       res2 += x[i * 4 + 2];
>     }
>   x[0] = res1;
>   x[1] = res2;
> }
>
> the reduction group happens to be { res2, res1 }, and so we get
> a load permutation of { 2, 0, 6, 4 }.  On aarch64 this gives:
>
> f2:
> .LFB1:
>         .cfi_startproc
>         adrp    x2, .LC0
>         mov     x1, x0
>         movi    v2.4s, 0
>         ldr     q3, [x2, #:lo12:.LC0]
>         add     x2, x0, 1568
>         .p2align 3,,7
> .L6:
>         ldp     q0, q1, [x1]
>         add     x1, x1, 32
>         tbl     v0.16b, {v0.16b - v1.16b}, v3.16b
>         fadd    v2.4s, v2.4s, v0.4s
>         cmp     x1, x2
>         bne     .L6
>         ...untidy code, 18 insns...
>         ret
>
> where we have to load the permute selector from memory and use
> the general TBL instruction.  If the order happened to be { res1, res2 }
> instead we would generate the much tidier:
>
> f2:
> .LFB1:
>         .cfi_startproc
>         movi    v1.4s, 0
>         mov     x1, x0
>         add     x2, x0, 1568
>         .p2align 3,,7
> .L6:
>         ldp     q0, q2, [x1]
>         add     x1, x1, 32
>         uzp1    v0.4s, v0.4s, v2.4s
>         fadd    v1.4s, v1.4s, v0.4s
>         cmp     x1, x2
>         bne     .L6
>         ldr     d0, [x0, 1568]
>         ldr     d5, [x0, 1576]
>         ldr     s2, [x0, 1584]
>         dup     d3, v1.d[1]
>         ldr     s4, [x0, 1592]
>         zip1    v0.2s, v0.2s, v5.2s
>         ins     v2.s[1], v4.s[0]
>         fadd    v0.2s, v0.2s, v2.2s
>         fadd    v0.2s, v0.2s, v1.2s
>         fadd    v0.2s, v0.2s, v3.2s
>         str     d0, [x0]
>         ret
>
> This WIP patch makes vect_optimize_slp try to put load permutations
> into index order.  However, this new transform might be a loss if it
> forces permutations elsewhere.  For example, given:
>
> void
> f3 (T *restrict x, T *restrict y, T *restrict z)
> {
>   for (int i = 0; i < 100; ++i)
>     {
>       x[i * 2] = y[i * 4 + 2] + z[i * 4 + 2];
>       x[i * 2 + 1] = y[i * 4] + z[i * 4];
>     }
> }
>
> we would generate:
>
> .L9:
>         ldr     q0, [x1, x3]
>         ldr     q3, [x6, x3]
>         ldr     q1, [x2, x3]
>         ldr     q2, [x5, x3]
>         add     x3, x3, 32
>         uzp1    v0.4s, v0.4s, v3.4s
>         uzp1    v1.4s, v1.4s, v2.4s
>         fadd    v0.4s, v0.4s, v1.4s
>         rev64   v0.4s, v0.4s
>         str     q0, [x4], 16
>         cmp     x3, 1568
>         bne     .L9
>
> (3 permutes) rather than:
>
> .L9:
>         ldr     q0, [x1, x3]
>         ldr     q1, [x6, x3]
>         ldr     q2, [x2, x3]
>         ldr     q3, [x5, x3]
>         tbl     v0.16b, {v0.16b - v1.16b}, v4.16b
>         add     x3, x3, 32
>         tbl     v2.16b, {v2.16b - v3.16b}, v4.16b
>         fadd    v0.4s, v0.4s, v2.4s
>         str     q0, [x4], 16
>         cmp     x3, 1568
>         bne     .L9
>
> (2 permutes).
>
> One simple(ish) fix would be to conditionalise the new case on
> whether all "roots" of the load are reduction groups.  Alternatively,
> we could treat the new case as a soft preference and back out if it
> would require any new VEC_PERM_EXPR nodes to be created.  This would
> require a propagation back from uses to definitions.
>
> WDYT?  Does this look like a feasible direction?  If so, any thoughts
> on when the new case should be enabled?
>
> The patch keeps the bijection requirement, since relaxing that seems
> like separate work.
>
> Tested on aarch64-linux-gnu, no regressions.
>
> Thanks,
> Richard
>
> ---
>  gcc/tree-vect-slp.cc | 41 ++++++++++++++---------------------------
>  1 file changed, 14 insertions(+), 27 deletions(-)
>
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index dab5daddcc5..fde35d8c954 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -20,6 +20,7 @@ along with GCC; see the file COPYING3.  If not see
>  <http://www.gnu.org/licenses/>.  */
>
>  #include "config.h"
> +#define INCLUDE_ALGORITHM
>  #include "system.h"
>  #include "coretypes.h"
>  #include "backend.h"
> @@ -3698,43 +3699,29 @@ vect_optimize_slp (vec_info *vinfo)
>        if (!STMT_VINFO_GROUPED_ACCESS (dr_stmt))
>          continue;
>        dr_stmt = DR_GROUP_FIRST_ELEMENT (dr_stmt);
> -      unsigned imin = DR_GROUP_SIZE (dr_stmt) + 1, imax = 0;
> -      bool any_permute = false;
> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -        {
> -          unsigned idx = SLP_TREE_LOAD_PERMUTATION (node)[j];
> -          imin = MIN (imin, idx);
> -          imax = MAX (imax, idx);
> -          if (idx - SLP_TREE_LOAD_PERMUTATION (node)[0] != j)
> -            any_permute = true;
> -        }
> -      /* If there's no permute no need to split one out.  */
> -      if (!any_permute)
> -        continue;
> -      /* If the span doesn't match we'd disrupt VF computation, avoid
> -         that for now.  */
> -      if (imax - imin + 1 != SLP_TREE_LANES (node))
> +
> +      auto_vec<unsigned> sorted;
> +      sorted.safe_splice (SLP_TREE_LOAD_PERMUTATION (node));
> +      std::sort (sorted.begin (), sorted.end ());
> +      if (std::equal (sorted.begin (), sorted.end (),
> +                      SLP_TREE_LOAD_PERMUTATION (node).begin ()))
>          continue;
>
>        /* For now only handle true permutes, like
>           vect_attempt_slp_rearrange_stmts did.  This allows us to be lazy
>           when permuting constants and invariants keeping the permute
>           bijective.  */
> -      auto_sbitmap load_index (SLP_TREE_LANES (node));
> -      bitmap_clear (load_index);
> -      for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -        bitmap_set_bit (load_index, SLP_TREE_LOAD_PERMUTATION (node)[j] - imin);
> -      unsigned j;
> -      for (j = 0; j < SLP_TREE_LANES (node); ++j)
> -        if (!bitmap_bit_p (load_index, j))
> -          break;
> -      if (j != SLP_TREE_LANES (node))
> +      if (std::adjacent_find (sorted.begin (), sorted.end ()) != sorted.end ())
>          continue;
So without the std:: rewrite this is the only change, where previously
we rejected { 0, 2, ... } because '1' was missing but now we only
reject duplicates?  As the comment says, this is what
vect_attempt_slp_rearrange_stmts did, so we would need to adjust the
comment as well.

(I'm unsure about the std:: stuff, but mostly because I'm not familiar
with it and its semantics.)

The current "propagation" isn't really designed to find optimal
solutions; it rather relies on the initial permute "anticipation" we
set up and simply propagates those up as far as possible.  I think at
some point we need to come up with something more elaborate and
rewrite this all.  It might also be that a target handles { 2, 0 } but
not { 0, 2 }, while we know it always handles the unpermuted case.
Can you think of a way to express this as a dataflow problem that also
captures finding the optimal solution?

>        vec<unsigned> perm = vNULL;
>        perm.safe_grow (SLP_TREE_LANES (node), true);
>        for (unsigned j = 0; j < SLP_TREE_LANES (node); ++j)
> -        perm[j] = SLP_TREE_LOAD_PERMUTATION (node)[j] - imin;
> +        {
> +          auto it = std::lower_bound (sorted.begin (), sorted.end (),
> +                                      SLP_TREE_LOAD_PERMUTATION (node)[j]);
> +          perm[j] = it - sorted.begin ();
> +        }
>        perms.safe_push (perm);
>        vertices[idx].perm_in = perms.length () - 1;
>        vertices[idx].perm_out = perms.length () - 1;
> @@ -6647,7 +6634,7 @@ vect_transform_slp_perm_load (vec_info *vinfo,
>  {
>    stmt_vec_info stmt_info = SLP_TREE_SCALAR_STMTS (node)[0];
>    int vec_index = 0;
> -  tree vectype = STMT_VINFO_VECTYPE (stmt_info);
> +  tree vectype = SLP_TREE_VECTYPE (node);

that looks like an obvious fix.

Richard.

>    unsigned int group_size = SLP_TREE_SCALAR_STMTS (node).length ();
>    unsigned int mask_element;
>    machine_mode mode;

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH, Frankenstra
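For reference, a minimal self-contained sketch of what the std::sort /
std::equal / std::adjacent_find / std::lower_bound sequence in the patch
computes for a load permutation like { 2, 0, 6, 4 }.  The small driver and
variable names below are made up for illustration and are not vectorizer
code:

  /* Illustration only: mirror the patch's std:: logic on a plain vector.  */
  #include <algorithm>
  #include <cstdio>
  #include <vector>

  int
  main ()
  {
    std::vector<unsigned> load_perm = { 2, 0, 6, 4 };

    /* Sorted copy of the permutation.  */
    std::vector<unsigned> sorted = load_perm;
    std::sort (sorted.begin (), sorted.end ());

    /* std::equal: true iff the permutation is already in index order,
       in which case the patch skips the node.  */
    bool ordered = std::equal (sorted.begin (), sorted.end (),
                               load_perm.begin ());

    /* std::adjacent_find: returns an iterator to the first pair of equal
       neighbours in the sorted copy, i.e. it detects duplicate indices;
       a true permutation (bijection) has none.  */
    bool dups = (std::adjacent_find (sorted.begin (), sorted.end ())
                 != sorted.end ());

    /* std::lower_bound: the position of each original index in the sorted
       copy is its rank, giving the compacted permutation { 1, 0, 3, 2 }.  */
    std::vector<unsigned> rank (load_perm.size ());
    for (unsigned j = 0; j < load_perm.size (); ++j)
      rank[j] = (std::lower_bound (sorted.begin (), sorted.end (),
                                   load_perm[j])
                 - sorted.begin ());

    printf ("ordered=%d dups=%d ranks=", ordered, dups);
    for (unsigned r : rank)
      printf ("%u ", r);
    printf ("\n");   /* prints: ordered=0 dups=0 ranks=1 0 3 2  */
  }

The lower_bound ranks reduce to the old "index - imin" computation when the
group indices are contiguous, but they also handle gappy groups such as
{ 2, 0, 6, 4 }.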