On Wed, Mar 6, 2019 at 8:46 AM H.J. Lu <hjl.to...@gmail.com> wrote:
>
> On Tue, Mar 5, 2019 at 1:46 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> >
> > On Mon, Mar 04, 2019 at 12:55:04PM +0100, Richard Biener wrote:
> > > On Sun, Mar 3, 2019 at 10:13 PM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > >
> > > > On Sun, Mar 03, 2019 at 06:40:09AM -0800, Andrew Pinski wrote:
> > > > > On Sun, Mar 3, 2019 at 6:32 AM H.J. Lu <hjl.to...@gmail.com> wrote:
> > > > > >
> > > > > > For vector init constructor:
> > > > > >
> > > > > > ---
> > > > > > typedef float __v4sf __attribute__ ((__vector_size__ (16)));
> > > > > >
> > > > > > __v4sf
> > > > > > foo (__v4sf x, float f)
> > > > > > {
> > > > > >   __v4sf y = { f, x[1], x[2], x[3] };
> > > > > >   return y;
> > > > > > }
> > > > > > ---
> > > > > >
> > > > > > we can optimize vector init constructor with vector copy or permute
> > > > > > followed by a single scalar insert:
> >
> > > and you want to advance to the _1 = BIT_INSERT_EXPR here.  The easiest way
> > > is to emit a new stmt for _2 = copy ...; and do the set_rhs with the
> > > BIT_INSERT_EXPR.
> >
> > Thanks for BIT_INSERT_EXPR suggestion.  I am testing this patch.
> >
> >
> > H.J.
> > ---
> > We can optimize vector constructor with vector copy or permute followed
> > by a single scalar insert:
> >
> >   __v4sf y;
> >   __v4sf D.1930;
> >   float _1;
> >   float _2;
> >   float _3;
> >
> >   <bb 2> :
> >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> >   y_6 = {f_5(D), _3, _2, _1};
> >   return y_6;
> >
> > with
> >
> >   __v4sf y;
> >   __v4sf D.1930;
> >   float _1;
> >   float _2;
> >   float _3;
> >   vector(4) float _8;
> >
> >   <bb 2> :
> >   _1 = BIT_FIELD_REF <x_9(D), 32, 96>;
> >   _2 = BIT_FIELD_REF <x_9(D), 32, 64>;
> >   _3 = BIT_FIELD_REF <x_9(D), 32, 32>;
> >   _8 = x_9(D);
> >   y_6 = BIT_INSERT_EXPR <_8, f_5(D), 0 (32 bits)>;
> >   return y_6;
> >
> > gcc/
> >
> >         PR tree-optimization/88828
> >         * tree-ssa-forwprop.c (simplify_vector_constructor): Optimize
> >         vector init constructor with vector copy or permute followed
> >         by a single scalar insert.
> >
> > gcc/testsuite/
> >
> >         PR tree-optimization/88828
> >         * gcc.target/i386/pr88828-1a.c: New test.
> >         * gcc.target/i386/pr88828-2b.c: Likewise.
> >         * gcc.target/i386/pr88828-2.c: Likewise.
> >         * gcc.target/i386/pr88828-3a.c: Likewise.
> >         * gcc.target/i386/pr88828-3b.c: Likewise.
> >         * gcc.target/i386/pr88828-3c.c: Likewise.
> >         * gcc.target/i386/pr88828-3d.c: Likewise.
> >         * gcc.target/i386/pr88828-4a.c: Likewise.
> >         * gcc.target/i386/pr88828-4b.c: Likewise.
> >         * gcc.target/i386/pr88828-5a.c: Likewise.
> >         * gcc.target/i386/pr88828-5b.c: Likewise.
> >         * gcc.target/i386/pr88828-6a.c: Likewise.
> >         * gcc.target/i386/pr88828-6b.c: Likewise.
>
> Here is the updated patch with run-time tests.

-      if (TREE_CODE (elt->value) != SSA_NAME)
+      if (TREE_CODE (ce->value) != SSA_NAME)
        return false;

hmm, so it doesn't allow { 0, v[1], v[2], v[3] }?  I think the single
scalar value can be a constant as well.

       if (!def_stmt)
-       return false;
+       {
+         if (gimple_nop_p (SSA_NAME_DEF_STMT (ce->value)))

if (SSA_NAME_IS_DEFAULT_DEF (ce->value))

+           {

also you seem to disallow

  { i + 1, v[1], v[2], v[3] }

because get_prop_source_stmt will return the definition computing
i + 1 in this case and your code will be skipped?

I think you can simplify the code by treating scalar_element != NULL
as nscalars == 1 and eliding nscalars.

-      if (conv_code == ERROR_MARK)
+      if (conv_code == ERROR_MARK && !insert)
        gimple_assign_set_rhs_with_ops (gsi, VEC_PERM_EXPR, orig[0],
                                        orig[1], op2);
       else
@@ -2148,10 +2198,25 @@ simplify_vector_constructor (gimple_stmt_iterator *gsi)
                                   VEC_PERM_EXPR, orig[0], orig[1], op2);
          orig[0] = gimple_assign_lhs (perm);
          gsi_insert_before (gsi, perm, GSI_SAME_STMT);
-         gimple_assign_set_rhs_with_ops (gsi, conv_code, orig[0],
+         gimple_assign_set_rhs_with_ops (gsi,
+                                         (conv_code != ERROR_MARK
+                                          ? conv_code
+                                          : NOP_EXPR),
+                                         orig[0],
                                          NULL_TREE, NULL_TREE);

I believe you should elide the last stmt for conv_code == ERROR_MARK,
that is, why did you need to add the && !insert check in the guarding condition
(this path should already do the correct thing?).  Note that in all
cases it looks like with conv_code != ERROR_MARK you may end up doing
a float->int
or int->float conversion on a value it wasn't done on before which might
raise exceptions?  That is, do we need to make sure we permute a
value we already do convert into the place we're going to insert to?

+  if (insert)
+    {
+      /* Generate a single scalar insert.  */
+      tree var = make_ssa_name (type);
+      tree val = gimple_assign_rhs1 (stmt);
+      gimple *copy = gimple_build_assign (var, val);

I believe this doesn't properly copy the stmt in case it is a permute.
You can use (note the use of gsi_stmt - gimple_assign_set_rhs_with_ops
can re-allocate the stmt)

        gimple *copy = gimple_copy (gsi_stmt (*gsi));
        gimple_assign_set_lhs (copy, var);

+      gsi_insert_before (gsi, copy, GSI_SAME_STMT);
+      tree bitpos = bitsize_int (scalar_idx * elem_size);
+      gimple_assign_set_rhs_with_ops (gsi, BIT_INSERT_EXPR, var,
+                                     scalar_element, bitpos);
+    }

Otherwise looks OK to me.

As separate followup patch it might be interesting to support

 { 0, a[1], a[2], 3 }

kinds as well, thus combining a VECTOR_CST (which is
reasonably cheap to create) with another vector.  That should
be maybe done as a first patch given this is just a two-vector
permute which the code already handles apart from not
recognizing the implicit constant vector participating.

Similar

 { 0, a[1], b[2], 3 }

where the combination of a and b is blended with another
constant vector.  I'm not sure if handling an arbitrary number
of scalar elements should be done in a similar way, that is,
implementing

 { s1, a[1], a[2], s2, s3, b[0], b[1], b[2] }

as

  tem = VEC_PERM <a, b, { ... }>
  tem2 = { s1, 0, 0, s2, s3, 0, 0, 0 }
  res = VEC_PERM <tem, tem2, { blend-mask }>

where constructing tem2 should take at most
N-1 inserts (the first element to insert into tem2
can use a splat or if element zero a zero-extending move).

Doing this effectively lifts the restriction of only
handling two vectors - we'd incrementally do
two-vector permute plus blend of the rest which has
its constructor re-processed.

But as said - the code is already a bit awkward so changing
this in multiple revisions is preferred and the single-element
case is certainly sth to do via a BIT_INSERT_EXPR.

Thanks,
Richard.

> --
> H.J.
