Re: V6 [PATCH] Optimize vector constructor

Richard Biener Wed, 08 May 2019 05:05:02 -0700

On Fri, May 3, 2019 at 6:54 PM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Thu, May 2, 2019 at 10:53 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Thu, May 2, 2019 at 7:55 AM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Thu, May 2, 2019 at 4:54 PM Richard Biener
> > > <richard.guent...@gmail.com> wrote:
> > > >
> > > > On Mon, Mar 11, 2019 at 8:03 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > >
> > > > > On Fri, Mar 8, 2019 at 7:03 PM Richard Biener
> > > > > <richard.guent...@gmail.com> wrote:
> > > > > >
> > > > > > On Fri, Mar 8, 2019 at 9:49 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Mar 7, 2019 at 9:51 AM H.J. Lu <hjl.to...@gmail.com> 
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > On Wed, Mar 6, 2019 at 8:33 PM Richard Biener
> > > > > > > > <richard.guent...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu <hjl.to...@gmail.com> 
> > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu 
> > > > > > > > > > <hjl.to...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu 
> > > > > > > > > > > > <hjl.to...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew 
> > > > > > > > > > > > > Pinski wrote:
> > > > > > > > > > > > > > )
> > > > > > > > > > > > > > ,On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu 
> > > > > > > > > > > > > > <hjl.to...@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > For vector init constructor:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > > typedef float __v4sf __attribute__ 
> > > > > > > > > > > > > > > ((__vector_size__ (16)));
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > __v4sf
> > > > > > > > > > > > > > > foo (__v4sf x, float f)
> > > > > > > > > > > > > > > {
> > > > > > > > > > > > > > >   __v4sf y = { f, x[1], x[2], x[3] };
> > > > > > > > > > > > > > >   return y;
> > > > > > > > > > > > > > > }
> > > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > we can optimize vector init constructor with 
> > > > > > > > > > > > > > > vector copy or permute
> > > > > > > > > > > > > > > followed by a single scalar insert:
> > > > > > > > > > >
> > > > > > > > > > > > and you want to advance to the _1 = BIT_INSERT_EXPR 
> > > > > > > > > > > > here.  The easiest way
> > > > > > > > > > > > is to emit a new stmt for _2 = copy ...; and do the 
> > > > > > > > > > > > set_rhs with the
> > > > > > > > > > > > BIT_INSERT_EXPR.
> > > > > > > > > > >
> > > > > > > > > > > Thanks for BIT_INSERT_EXPR suggestion.  I am testing this 
> > > > > > > > > > > patch.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > H.J.
> > > > > > > > > > > ---
> > > > > > > > > > > We can optimize vector constructor with vector copy or 
> > > > > > > > > > > permute followed
> > > > > > > > > > > by a single scalar insert:
> > > > > > > > > > >
> > > > > > > > > > >   __v4sf y;
> > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > >   float _1;
> > > > > > > > > > >   float _2;
> > > > > > > > > > >   float _3;
> > > > > > > > > > >
> > > > > > > > > > >   <bb 2> :
> > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > >   y_6 = {f_5(D), _3, _2, _1};
> > > > > > > > > > >   return y_6;
> > > > > > > > > > >
> > > > > > > > > > > with
> > > > > > > > > > >
> > > > > > > > > > >  __v4sf y;
> > > > > > > > > > >   __v4sf D.1930;
> > > > > > > > > > >   float _1;
> > > > > > > > > > >   float _2;
> > > > > > > > > > >   float _3;
> > > > > > > > > > >   vector(4) float _8;
> > > > > > > > > > >
> > > > > > > > > > >   <bb 2> :
> > > > > > > > > > >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> > > > > > > > > > >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> > > > > > > > > > >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> > > > > > > > > > >   _8 = x_9(D);
> > > > > > > > > > >   y_6 = BIT_INSERT_EXPR <x_9(D), f_5(D), 0 (32 bits)>;
> > > > > > > > > > >   return y_6;
> > > > > > > > > > >
> > > > > > > > > > > gcc/
> > > > > > > > > > >
> > > > > > > > > > >         PR tree-optimization/88828
> > > > > > > > > > >         * tree-ssa-forwprop.c 
> > > > > > > > > > > (simplify_vector_constructor): Optimize
> > > > > > > > > > >         vector init constructor with vector copy or 
> > > > > > > > > > > permute followed
> > > > > > > > > > >         by a single scalar insert.
> > > > > > > > > > >
> > > > > > > > > > > gcc/testsuite/
> > > > > > > > > > >
> > > > > > > > > > >         PR tree-optimization/88828
> > > > > > > > > > >         * gcc.target/i386/pr88828-1a.c: New test.
> > > > > > > > > > >         * gcc.target/i386/pr88828-2b.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-2.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-3a.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-3b.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-3c.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-3d.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-4a.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-4b.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-5a.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-5b.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-6a.c: Likewise.
> > > > > > > > > > >         * gcc.target/i386/pr88828-6b.c: Likewise.
> > > > > > > > > >
> > > > > > > > > > Here is the updated patch with run-time tests.
> > > > > > > > >
> > > > > > > > > -      if (TREE_CODE (elt->value) != SSA_NAME)
> > > > > > > > > +      if (TREE_CODE (ce->value) != SSA_NAME)
> > > > > > > > >         return false;
> > > > > > > > >
> > > > > > > > > hmm, so it doesn't allow { 0, v[1], v[2], v[3] }?  I think 
> > > > > > > > > the single
> > > > > > > > > scalar value can be a constant as well.
> > > > > > > >
> > > > > > > > Fixed.
> > > > > > > >
> > > > > > > > >        if (!def_stmt)
> > > > > > > > > -       return false;
> > > > > > > > > +       {
> > > > > > > > > +         if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value)))
> > > > > > > > >
> > > > > > > > > if (SSA_NAME_IS_DEFAULT_DEF (ce->value))
> > > > > > > > >
> > > > > > > > > +           {
> > > > > > > > >
> > > > > > > > > also you seem to disallow
> > > > > > > > >
> > > > > > > > >   { i + 1, v[1], v[2], v[3] }
> > > > > > > >
> > > > > > > > Fixed by
> > > > > > > >
> > > > > > > >      if (code != BIT_FIELD_REF)
> > > > > > > >         {
> > > > > > > >           /* Only allow one scalar insert.  */
> > > > > > > >           if (nscalars != 0)
> > > > > > > >             return false;
> > > > > > > >
> > > > > > > >           nscalars = 1;
> > > > > > > >           insert = true;
> > > > > > > >           scalar_idx = i;
> > > > > > > >           sel.quick_push (i);
> > > > > > > >           scalar_element = ce->value;
> > > > > > > >           continue;
> > > > > > > >         }
> > > > > > > >
> > > > > > > > > because get_prop_source_stmt will return the definition 
> > > > > > > > > computing
> > > > > > > > > i + 1 in this case and your code will be skipped?
> > > > > > > > >
> > > > > > > > > I think you can simplify the code by treating scalar_element 
> > > > > > > > > != NULL
> > > > > > > > > as nscalars == 1 and eliding nscalars.
> > > > > > > >
> > > > > > > > It works only if
> > > > > > > >
> > > > > > > > TYPE_VECTOR_SUBPARTS (type).to_constant ()  == (nscalars + 
> > > > > > > > nvectors)
> > > > > > > >
> > > > > > > > We need to check both nscalars and nvectors.  Elide nscalar
> > > > > > > > check doesn't help much here.
> > > > > > > >
> > > > > > > > > -      if (conv_code == ERROR_MARK)
> > > > > > > > > +      if (conv_code == ERROR_MARK && !insert)
> > > > > > > > >         gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, 
> > > > > > > > > orig[0],
> > > > > > > > >                                         orig[1], op2);
> > > > > > > > >        else
> > > > > > > > > @@ -2148,10 +2198,25 @@ simplify_vector_constructor 
> > > > > > > > > (gimple_stmt_iterator *gsi)
> > > > > > > > >                                    VEC_PERM_EXPR, orig[0], 
> > > > > > > > > orig[1], op2);
> > > > > > > > >           orig[0] = gimple_assign_lhs (perm);
> > > > > > > > >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> > > > > > > > > -         gimple_assign_set_rhs_with_ops (gsi, conv_code, 
> > > > > > > > > orig[0],
> > > > > > > > > +         gimple_assign_set_rhs_with_ops (gsi,
> > > > > > > > > +                                         (conv_code != 
> > > > > > > > > ERROR_MARK
> > > > > > > > > +                                          ? conv_code
> > > > > > > > > +                                          : NOP_EXPR),
> > > > > > > > > +                                         orig[0],
> > > > > > > > >                                           NULL_TREE, 
> > > > > > > > > NULL_TREE);
> > > > > > > > >
> > > > > > > > > I believe you should elide the last stmt for conv_code == 
> > > > > > > > > ERROR_MARK,
> > > > > > > > > that is, why did you need to add the && !insert check in the 
> > > > > > > > > guarding condition
> > > > > > > >
> > > > > > > > When conv_code == ERROR_MARK, we still need
> > > > > > > >
> > > > > > > >        gimple *perm
> > > > > > > >             = gimple_build_assign (make_ssa_name (TREE_TYPE 
> > > > > > > > (orig[0])),
> > > > > > > >                                    VEC_PERM_EXPR, orig[0], 
> > > > > > > > orig[1], op2);
> > > > > > > >           orig[0] = gimple_assign_lhs (perm);
> > > > > > > >           gsi_insert_before (gsi, perm, GSI_SAME_STMT);
> > > > > > > >           gimple_assign_set_rhs_with_ops (gsi,  NOP_EXPR,
> > > > > > > >                                           orig[0],
> > > > > > > >                                           NULL_TREE, NULL_TREE);
> > > > > > > >
> > > > > > > > Otherwise, scalar insert won't work.
> > > > > > > >
> > > > > > > > > (this path should already do the correct thing?).  Note that 
> > > > > > > > > in all
> > > > > > > > > cases it looks
> > > > > > > > > that with conv_code != ERROR_MARK you may end up doing a 
> > > > > > > > > float->int
> > > > > > > > > or int->float conversion on a value it wasn't done on before 
> > > > > > > > > which might
> > > > > > > > > raise exceptions?  That is, do we need to make sure we 
> > > > > > > > > permute a
> > > > > > > > > value we already do convert into the place we're going to 
> > > > > > > > > insert to?
> > > > > > > >
> > > > > > > > This couldn't happen:
> > > > > > > >
> > > > > > > >       if (type == TREE_TYPE (ref))
> > > > > > > >          {
> > > > > > > >            /* The RHS vector has the same type as LHS.  */
> > > > > > > >            if (rhs_vector == NULL)
> > > > > > > >              rhs_vector = ref;
> > > > > > > >            /* Check if all RHS vector elements come fome the 
> > > > > > > > same
> > > > > > > >               vector.  */
> > > > > > > >            if (rhs_vector == ref)
> > > > > > > >              nvectors++;
> > > > > > > >          }
> > > > > > > > ...
> > > > > > > >   if (insert
> > > > > > > >       && (nvectors == 0
> > > > > > > >           || (TYPE_VECTOR_SUBPARTS (type).to_constant ()
> > > > > > > >               != (nscalars + nvectors))))
> > > > > > > >     return false;
> > > > > >
> > > > > > I see - that looks like a missed case then?
> > > > > >
> > > > > >  { 1., (float)v[1], (float)v[2], (float)v[3] }
> > > > > >
> > > > > > with integer vector v?
> > > > >
> > > > > True.
> > > > >
> > > > > > I'll have a look at the full patch next week (it's GCC 10 material 
> > > > > > in any case).
> > > > > >
> > > >
> > > > Now looking again.  I still don't like the new "structure" of the loop
> > > > very much.
> > > > A refactoring like the attached should make it easier to clearly 
> > > > separate the
> > > > cases where we reach a vector def and where not.
> > >
> > > Now attached.
> > >
> > > > Do you want me to take over the patch?
> > > >
> >
>
> Here is the updated patch on top of your patch plus my fix.


Thanks - when doing the constant vector I was thinking of the following patch
to handle your cases.  It doesn't use insertion but eventually leaves that
to a separate transform.  Instead it handles non-constants similar to constants
by permuting from a uniform vector.  Thus

__attribute__((noinline, noclone))
__v4sf
foo2 (__v4sf x, float f)
{
  __v4sf y = { f, x[1], x[2], x[3] };
  return y;
}

becomes

  _4 = {f_2(D), f_2(D), f_2(D), f_2(D)};
  y_3 = VEC_PERM_EXPR <x_5(D), _4, { 4, 1, 2, 3 }>;
  return y_3;

this allows us to handle an arbitrary number of inserts of this
single value.  It also ensures we can actually perform the
permutation while for the insertion we currently do not have
a convenient way to query whether the target can perform
it efficiently (IIRC x86 needs AVX to insert to arbitrary lanes
with a single instruction?).  Similarly if the user writes the above
in source using __builtin_shuffle we'd want to optimize it as well.

The patch as attached only passes some of your testcases,
the following FAIL:

FAIL: gcc.target/i386/pr88828-2a.c scan-assembler movss
FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movaps
FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not movlhps
FAIL: gcc.target/i386/pr88828-2a.c scan-assembler-not unpcklps
FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vpermilps 1
FAIL: gcc.target/i386/pr88828-2b.c scan-assembler-times vmovss 1
FAIL: gcc.target/i386/pr88828-2c.c scan-assembler movss
FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movaps
FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not movlhps
FAIL: gcc.target/i386/pr88828-2c.c scan-assembler-not unpcklps
FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vpermilps 1
FAIL: gcc.target/i386/pr88828-2d.c scan-assembler-times vmovss 1
FAIL: gcc.target/i386/pr88828-3a.c scan-assembler movss
FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times shufps 2
FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-times movaps 1
FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not movlhps
FAIL: gcc.target/i386/pr88828-3a.c scan-assembler-not unpcklps
FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vpermilps 1
FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-times vinsertps 1
FAIL: gcc.target/i386/pr88828-3b.c scan-assembler-not vshufps
FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vpermilps 1
FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-times vinsertps 1
FAIL: gcc.target/i386/pr88828-3c.c scan-assembler-not vshufps

Making the patch emit inserts for single insert locations is of course
still possible but you get to arrive at heuristics like your choice
of permuting the original lane into the later overwritten lane which
might be a choice making the permute impossible or more expensive?

The original purpose of simplify_vector_constructor was to simplify
the IL, not so much optimal code-generation in the end but I wonder
if we can rely on RTL expansion or later RTL optimization to do
the optimal choices here?

I guess simplify_permutation could perform a VEC_PERM
into an insert if the remaining permutation would be a no-op
but RTL optimization handles this case well already.

Whether code-generation for a one vector permute plus insert or
a two-vector permute is better in the end I don't know - at least
the permute expansion has a chance to see the combined
instruction.

Do you think the remaining cases above can be handled in the
backend?

Comments?

Thanks,
Richard.

2019-05-08  Richard Biener  <rguent...@suse.de>

        PR tree-optimization/88828
        * tree-ssa-forwprop.c (simplify_vector_constructor): Handle
        permuting in a single non-constant element not extracted
        from a vector.


> --
> H.J.

fix-pr88828-2
Description: Binary data

Re: V6 [PATCH] Optimize vector constructor

Reply via email to