RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

Tamar Christina via Gcc-patches Wed, 16 Nov 2022 07:25:41 -0800

Hi,

This patch is causing several ICEs because it changes the permutes from a 
single register
permute to a multi register due to the lowering of the expressions to different 
SSA names.


See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107717

I have a prototype fix which adds a new rule to CSE this back to a single 
register permute,
but would this be the right solution? It seems hard to later on during expand 
realize that
the two operands are the same.

It's probably also ok to just block this from happening after vec_lower, 
however I'm worried that
If it wasn't CSE'd before vec_lower it'll lower it so something much less 
efficient.

Thanks,
Tamar

> -----Original Message-----
> From: Gcc-patches <gcc-patches-
> [email protected]> On Behalf Of Richard
> Biener via Gcc-patches
> Sent: Monday, November 14, 2022 2:53 PM
> To: Hongyu Wang <[email protected]>
> Cc: Prathamesh Kulkarni <[email protected]>; Richard
> Sandiford <[email protected]>; Hongyu Wang
> <[email protected]>; [email protected]; gcc-
> [email protected]
> Subject: Re: [PATCH] Optimize VEC_PERM_EXPR with same permutation
> index and operation [PR98167]
> 
> On Thu, Nov 10, 2022 at 3:27 PM Hongyu Wang
> <[email protected]> wrote:
> >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> >
> > Thanks for the good example! We also tried using wide_int as a bitmask
> > but your code looks more simple and reasonable.
> >
> > Updated the patch accordingly.
> 
> OK.
> 
> Thanks,
> Richard.
> 
> > Richard Biener <[email protected]> 于2022年11月10日周四
> 16:56写道：
> >
> >
> > >
> > > On Thu, Nov 10, 2022 at 3:27 AM Hongyu Wang
> <[email protected]> wrote:
> > > >
> > > > Hi Prathamesh and Richard,
> > > >
> > > > Thanks for the review and nice suggestions!
> > > >
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > >
> > > > Removed VECTOR_CST for integer ops.
> > > >
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > >
> > > > Added.
> > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > >
> > > > The .qsort () approach requires an extra cmp_func that IMO would
> > > > not be feasible to be implemented in match.pd (I suppose lambda
> > > > function would not be a good idea either).
> > > > Another solution would be using hash_set but it does not work here
> > > > for int64_t or poly_int64 type.
> > > > So I kept current O(n^2) simple code here, and I suppose usually
> > > > the permutation indices would be a small number even for O(n^2)
> > > > complexity.
> > >
> > > Well, with AVX512 v64qi that's 64*64 == 4096 cases to check.  I
> > > think a lambda function is fine to use.  The alternative (used by
> > > the vectorizer in some places) is to use sth like
> > >
> > >  auto_sbitmap seen (nelts);
> > >  for (i = 0; i < nelts; i++)
> > >    {
> > >      if (!bitmap_set_bit (seen, i))
> > >        break;
> > >      count++;
> > >    }
> > >  full_perm_p = count == nelts;
> > >
> > > I'll note that you should still check .encoding
> > > ().encoded_full_vector_p () and only bother to check that case, that's a
> very simple check.
> > >
> > > >
> > > > Attached updated patch.
> > > >
> > > > Richard Biener via Gcc-patches <[email protected]>
> > > > 于2022年11月8日周二 22:38写道：
> > > >
> > > >
> > > > >
> > > > > On Fri, Nov 4, 2022 at 7:44 AM Prathamesh Kulkarni via
> > > > > Gcc-patches <[email protected]> wrote:
> > > > > >
> > > > > > On Fri, 4 Nov 2022 at 05:36, Hongyu Wang via Gcc-patches
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > This is a follow-up patch for PR98167
> > > > > > >
> > > > > > > The sequence
> > > > > > >      c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > >      c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > >      c3 = c1 op c2
> > > > > > > can be optimized to
> > > > > > >      c = a op b
> > > > > > >      c3 = VEC_PERM_EXPR (c, c, mask) for all integer vector
> > > > > > > operation, and float operation with full permutation.
> > > > > > >
> > > > > > > Bootstrapped & regrtested on x86_64-pc-linux-gnu.
> > > > > > >
> > > > > > > Ok for trunk?
> > > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * match.pd: New perm + vector op patterns for int and fp
> vector.
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > >         PR target/98167
> > > > > > >         * gcc.target/i386/pr98167.c: New test.
> > > > > > > ---
> > > > > > >  gcc/match.pd                            | 49 
> > > > > > > +++++++++++++++++++++++++
> > > > > > >  gcc/testsuite/gcc.target/i386/pr98167.c | 44
> > > > > > > ++++++++++++++++++++++
> > > > > > >  2 files changed, 93 insertions(+)  create mode 100644
> > > > > > > gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > >
> > > > > > > diff --git a/gcc/match.pd b/gcc/match.pd index
> > > > > > > 194ba8f5188..b85ad34f609 100644
> > > > > > > --- a/gcc/match.pd
> > > > > > > +++ b/gcc/match.pd
> > > > > > > @@ -8189,3 +8189,52 @@ and,
> > > > > > >   (bit_and (negate @0) integer_onep@1)
> > > > > > >   (if (!TYPE_OVERFLOW_SANITIZED (type))
> > > > > > >    (bit_and @0 @1)))
> > > > > > > +
> > > > > > > +/* Optimize
> > > > > > > +   c1 = VEC_PERM_EXPR (a, a, mask)
> > > > > > > +   c2 = VEC_PERM_EXPR (b, b, mask)
> > > > > > > +   c3 = c1 op c2
> > > > > > > +   -->
> > > > > > > +   c = a op b
> > > > > > > +   c3 = VEC_PERM_EXPR (c, c, mask)
> > > > > > > +   For all integer non-div operations.  */ (for op (plus
> > > > > > > +minus mult bit_and bit_ior bit_xor
> > > > > > > +        lshift rshift)
> > > > > > > + (simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_INTEGER_TYPE_P (type))
> > > > > > > +     (vec_perm (op @0 @1) (op @0 @1) @2))))
> > > > > > Just wondering, why should mask be CST here ?
> > > > > > I guess the transform should work as long as mask is same for
> > > > > > both vectors even if it's not constant ?
> > > > >
> > > > > Yes, please change accordingly (and maybe push separately).
> > > > >
> > > > > > > +
> > > > > > > +/* Similar for float arithmetic when permutation constant covers
> > > > > > > +   all vector elements.  */ (for op (plus minus mult)
> > > > > > > +(simplify
> > > > > > > +  (op (vec_perm @0 @0 VECTOR_CST@2) (vec_perm @1 @1
> VECTOR_CST@2))
> > > > > > > +    (if (VECTOR_FLOAT_TYPE_P (type))
> > > > > > > +     (with
> > > > > > > +      {
> > > > > > > +       tree perm_cst = @2;
> > > > > > > +       vec_perm_builder builder;
> > > > > > > +       bool full_perm_p = false;
> > > > > > > +       if (tree_to_vec_perm_builder (&builder, perm_cst))
> > > > > > > +         {
> > > > > > > +           /* Create a vec_perm_indices for the integer vector.  
> > > > > > > */
> > > > > > > +           int nelts = TYPE_VECTOR_SUBPARTS
> > > > > > > +(type).to_constant ();
> > > > > > If this transform is meant only for VLS vectors, I guess you
> > > > > > should bail out if TYPE_VECTOR_SUBPARTS is not constant,
> > > > > > otherwise it will crash for VLA vectors.
> > > > >
> > > > > I suppose it's difficult to create a VLA permute that covers all
> > > > > elements and that is not trivial though.  But indeed add
> > > > > ().is_constant to the VECTOR_FLOAT_TYPE_P guard.
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Prathamesh
> > > > > > > +           vec_perm_indices sel (builder, 1, nelts);
> > > > > > > +
> > > > > > > +           /* Check if perm indices covers all vector elements.  
> > > > > > > */
> > > > > > > +           int count = 0, i, j;
> > > > > > > +           for (i = 0; i < nelts; i++)
> > > > > > > +             for (j = 0; j < nelts; j++)
> > > > >
> > > > > Meh, that's quadratic!  I suggest to check .encoding
> > > > > ().encoded_full_vector_p () (as said I can't think of a non-full
> > > > > encoding that isn't trivial but covers all elements) and then
> > > > > simply .qsort () the vector_builder (it derives from vec<>) so
> > > > > the scan is O(n log n).
> > > > >
> > > > > Maybe Richard has a better idea here though.
> > > > >
> > > > > Otherwise looks OK, though with these kind of (* (op ..) (op
> > > > > ..)) patterns it's always that they explode the match decision
> > > > > tree, we'd ideally have a way to match those with (op ..) (op
> > > > > ..) first to be able to share more of the matching code.  That
> > > > > said, match.pd is a less than ideal place for these (but mostly
> > > > > because of the way we code generate *-match.cc)
> > > > >
> > > > > Richard.
> > > > >
> > > > > > > +               {
> > > > > > > +                 if (sel[j].to_constant () == i)
> > > > > > > +                   {
> > > > > > > +                     count++;
> > > > > > > +                     break;
> > > > > > > +                   }
> > > > > > > +               }
> > > > > > > +           full_perm_p = count == nelts;
> > > > > > > +         }
> > > > > > > +       }
> > > > > > > +       (if (full_perm_p)
> > > > > > > +       (vec_perm (op @0 @1) (op @0 @1) @2))))))
> > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > new file mode 100644
> > > > > > > index 00000000000..40e0ac11332
> > > > > > > --- /dev/null
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pr98167.c
> > > > > > > @@ -0,0 +1,44 @@
> > > > > > > +/* PR target/98167 */
> > > > > > > +/* { dg-do compile } */
> > > > > > > +/* { dg-options "-O2 -mavx2" } */
> > > > > > > +
> > > > > > > +/* { dg-final { scan-assembler-times "vpshufd\t" 8 } } */
> > > > > > > +/* { dg-final { scan-assembler-times "vpermilps\t" 3 } } */
> > > > > > > +
> > > > > > > +#define VEC_PERM_4 \
> > > > > > > +  2, 3, 1, 0
> > > > > > > +#define VEC_PERM_8 \
> > > > > > > +  4, 5, 6, 7, 3, 2, 1, 0
> > > > > > > +#define VEC_PERM_16 \
> > > > > > > +  8, 9, 10, 11, 12, 13, 14, 15, 7, 6, 5, 4, 3, 2, 1, 0
> > > > > > > +
> > > > > > > +#define TYPE_PERM_OP(type, size, op, name) \
> > > > > > > +  typedef type v##size##s##type __attribute__
> > > > > > > +((vector_size(4*size))); \
> > > > > > > +  v##size##s##type type##foo##size##i_##name
> (v##size##s##type a, \
> > > > > > > +
> > > > > > > +v##size##s##type b) \
> > > > > > > +  { \
> > > > > > > +    v##size##s##type a1 = __builtin_shufflevector (a, a, \
> > > > > > > +                                                  
> > > > > > > VEC_PERM_##size); \
> > > > > > > +    v##size##s##type b1 = __builtin_shufflevector (b, b, \
> > > > > > > +                                                  
> > > > > > > VEC_PERM_##size); \
> > > > > > > +    return a1 op b1; \
> > > > > > > +  }
> > > > > > > +
> > > > > > > +#define INT_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (int, 4, op, name) \
> > > > > > > +
> > > > > > > +#define FP_PERMS(op, name) \
> > > > > > > +  TYPE_PERM_OP (float, 4, op, name) \
> > > > > > > +
> > > > > > > +INT_PERMS (+, add)
> > > > > > > +INT_PERMS (-, sub)
> > > > > > > +INT_PERMS (*, mul)
> > > > > > > +INT_PERMS (|, ior)
> > > > > > > +INT_PERMS (^, xor)
> > > > > > > +INT_PERMS (&, and)
> > > > > > > +INT_PERMS (<<, shl)
> > > > > > > +INT_PERMS (>>, shr)
> > > > > > > +FP_PERMS (+, add)
> > > > > > > +FP_PERMS (-, sub)
> > > > > > > +FP_PERMS (*, mul)
> > > > > > > +
> > > > > > > --
> > > > > > > 2.18.1
> > > > > > >

RE: [PATCH] Optimize VEC_PERM_EXPR with same permutation index and operation [PR98167]

Reply via email to