Re: VEC_DUPLICATE_EXPR options (was Re: [PATCH, RFC, rs6000] enable GIMPLE folding of vec_splat)

Richard Biener Thu, 23 Aug 2018 06:51:37 -0700

On Sat, Aug 18, 2018 at 7:01 PM Richard Sandiford
<rdsandif...@googlemail.com> wrote:
>
> Richard Biener <richard.guent...@gmail.com> writes:
> > On Tue, Aug 7, 2018 at 9:25 PM Will Schmidt <will_schm...@vnet.ibm.com> 
> > wrote:
> >>
> >> Hi
> >> Enable GIMPLE folding of the vec_splat() intrinsic.
> >>
> >> For review.. feedback is expected. :-)
> >>
> >> I came up with the following after spending some time poking around
> >> at the tree_vec_extract() and vector_element() functions as seen
> >> in tree-vect-generic.c looking for insights.  Some of this seems a
> >> bit clunky to me yet, but this is functional as far as make-check
> >> can tell, and is as far as I can get without additional input.
> >>
> >> This uses the tree_vec_extract() function out of tree-vect-generic.c
> >> to retrieve the splat value, which is a BIT_FIELD_REF.   That function is
> >> made non-static as part of this change.
> >>
> >> In review of the .gimple output, this folding takes a sample testcase
> >> of
> >>         vector bool int testb_0  (vector bool int x)
> >>         {
> >>           return vec_splat (x, 0b00000);
> >>         }
> >> from:
> >> testb_0 (__vector __bool int x)
> >> {
> >>   __vector __bool intD.1486 D.2855;
> >>
> >>   _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
> >>   _2 = __builtin_altivec_vspltwD.1919 (_1, 0);
> >>   D.2855 = VIEW_CONVERT_EXPR<__vector __bool intD.1486>(_2);
> >>   return D.2855;
> >> }
> >> to:
> >>  testb_0 (__vector __bool int x)
> >> {
> >>   __vector __bool intD.1486 D.2855;
> >>
> >>   _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
> >>   D.2856 = BIT_FIELD_REF <_1, 32, 0>;
> >>   _2 = {D.2856, D.2856, D.2856, D.2856};
> >>   D.2855 = VIEW_CONVERT_EXPR<__vector __bool intD.1486>(_2);
> >>   return D.2855;
> >> }
> >>
> >>
> >> Testcases are being posted as a separate patch.
> >>
> >> OK for trunk?
> >
> > It certainly works, nowadays we have VEC_DUPLICATE_EXPR though
> > so you could simply do
> >
> >  _1 = VIEW_CONVERT_EXPR<__vector signed intD.1468>(xD.2778);
> >  D.2856 = BIT_FIELD_REF <_1, 32, 0>;
> >  D.2855 = VEC_DUPLICATE_EXPR <__vector __bool intD.1486> (D.2856);
> >
> > not sure what variant is better for followup optimizations and whether
> > we already fold that { D.2856, D.2856, D.2856, D.2856 }; to 
> > VEC_DUPLICATE_EXPR.
> >
> > Richard may know more.
>
> We only use VEC_DUPLICATE_EXPR for variable-length vectors.  I still owe
> you a GCC 9 update for:
>
>   https://gcc.gnu.org/ml/gcc-patches/2017-12/msg01039.html
>
> (hadn't forgotten!)
>
> It's a diversion from the patch, sorry, but I think the main alternatives are:
>
> (1) Allow either operand of a VEC_PERM_EXPR to be a scalar
>     ------------------------------------------------------
>
>     Handles VEC_DUPLICATE_EXPR <a> as:
>         VEC_PERM_EXPR <a, a, { 0, ... }>
>
>     Handles IFN_VEC_SHL_INSERT <avec, b> as:
>         VEC_PERM_EXPR <b, avec, { 0, 1, 2, 3, ... }>
>     or: VEC_PERM_EXPR <avec, b, { n, 0, 1, 2, ... }>
>
>     Can also insert a single "a" at the beginning of a vector of "b":
>         VEC_PERM_EXPR <a, b, { 0, 1, 1, 1, ... }>
>
>     Disdvantages:
>
>     (a) It's still a bit limited.  E.g. it can't insert two scalars at
>         the beginning of a vector of "b", which might be useful for SLP.
>
>     (b) Either all targets would need to handle this case explicitly
>         or something would need to massage the tree permute mask before
>         the targets see it.  E.g.:
>
>            VEC_PERM_EXPR <a, b, { 0, 1, 1, ... }>
>         -> permute of { 0, n, n, ... } on two vector broadcast rtxes
>
>     (c) There'd still be an inconsistency.  Either:
>         (i) we'd only use VEC_PERM_EXPRs for variable-length vectors or
>         (ii) we'd have to check before creating a CONSTRUCTOR whether
>              it should be represented as a VEC_PERM_EXPR instead.
>
> (2) Allow only the second operand of a VEC_PERM_EXPR to be a scalar
>     ---------------------------------------------------------------
>
>     Handles VEC_DUPLICATE_EXPR <a> as:
>         VEC_PERM_EXPR <{ 0, ... }, a, { n, ... }>
>
>     Handles IFN_VEC_SHL_INSERT <avec, b> as:
>         VEC_PERM_EXPR <avec, b, { n, 0, 1, 2, ... }>;
>
>     Advantages:
>
>     - Avoids the need for target changes: the mask is the same if the scalar
>       is replaced by a vector.


Interesting idea ;)

>     Disadvantages:
>
>     - (a) still holds (and now we can't do a two-scalar permute).
>
>     - (c) still holds.
>
> (3) Keep VEC_DUPLICATE_EXPR but use it for everything
>     -------------------------------------------------
>
>     Advantages:
>
>     - The handling of duplicates becomes consistent across vector types.

I think this is definitely important - I don't like too much the idea of
having some stuff only used for the variable-size case.

>     - There are no changes to VEC_PERM_EXPR or targets.
>
>     - We can still replace IFN_VEC_SHL_INSERT with a VEC_PERM_EXPR,
>       as a separate change.
>
>     Disadvantages:
>
>     - (c) still holds.
>
>     (d) Vector duplication stays a bit of a special case (which was the
>         original problem).
>
> (4) Use the VECTOR_CST encoding for CONSTRUCTORs too
>     ------------------------------------------------
>
>     Advantages:
>
>     - Reduces memory overhead of CONSTRUCTORs for vector broadcasts
>       in constant-length vectors.
>
>     - There's only one tree code to check.
>
>     - Many more vectors have the same representation for constant- and
>       variable-length vectors.
>
>     - There's more consistency between constant element values and
>       variable element values.
>
>     - We can still replace IFN_VEC_SHL_INSERT with a VEC_PERM_EXPR,
>       as a separate change.
>
>     Disadvantages:
>
>     (e) Completely following the VECTOR_CST encoding would mean having
>         to encode the stepped forms based on two values in the series.
>         E.g. a 3-pattern:
>
>           { a : b : c : ... }
>
>         (ad-hoc syntax) would be:
>
>           { a, b, c, c * 2 - b, c * 3 - b, ... }
>
>         so the CONSTRUCTOR would include hidden arithmetic.  This means
>         that the value of any given element (rather than just a directly-
>         encoded one) would in general not be a gimple value.  E.g. in
>         the above case, asking for element 3 would give a GENERIC tree.

Eh.  The "advantage" would be it handles the VEC_SERIES_EXPR case as well?

>
>     (f) If vector CONSTRUCTORs seem like a mistake then this will too.

I still think they are ;)

> (5) Like (4), but don't allow the stepped case
>     ------------------------------------------
>
>     Advantages:
>
>     - As for (4), and avoids the hidden arithmetic in (e).
>
>     Disadvantages:
>
>     - (f) still holds.
>
>     (g) Stepped vectors need to be handled separately.
>
> For (f), we could (and might need to) replace CONSTRUCTOR with a new
> tree code.  It would still have to be GIMPLE_SINGLE_RHS though, if we
> keep the current "gassigns have 4 operands" assumption.
>
> I'm not sure (g) is really a disadvantage.  We could replace
> VEC_SERIES_EXPR <a, b> with:
>
>     basev = { a, ... }
>     stepv = { b, ... }
>     tmp = stepv * { 0, 1, 2, ... };
>     series = basev + tmp
>
> and match it as an IFN_VEC_SERIES for targets that have it.
> So code constructing vectors would never have to know about
> VEC_SERIES_EXPR/IFN_VEC_SERIES or IFN_VEC_SHL_INSERT.
>
> So I'm leaning towards (5) at the moment.  What do you think?

Generally I dislike CONSTRUCTOR because it hides the implementation
complexity.  That is, for most vector operations we only allow things
in the IL that matches what the target is willing to expand (which doesn't
necessarily mean it expands to one target instruction ...).  At least after
vector lowering.

So this would advocate for exposing all HW instructions as IFN_* but of
course that's too messy to handle in optimizers.  Instead what we do
for VEC_PERM_EXPR via can_vec_perm_const_p looks like a good
way forward to me.

That means (2) looks appealling to me given a two-scalar generic permute
isn't likely to be present in HW.  Still VEC_PERM_EXPR <scalar,
scalar, { 0, 1 }>
as way to concat two scalars to a vector makes sense to me.

We currently have

/* Vector permutation expression.  A = VEC_PERM_EXPR<v0, v1, mask> means

   N = length(mask)
   foreach i in N:
     M = mask[i] % (2*N)
     A = M < N ? v0[M] : v1[M-N]

   V0 and V1 are vectors of the same type.  MASK is an integer-typed
   vector.  The number of MASK elements must be the same with the
   number of elements in V0 and V1.  The size of the inner type
   of the MASK and of the V0 and V1 must be the same.
*/
DEFTREECODE (VEC_PERM_EXPR, "vec_perm_expr", tcc_expression, 3)

and GIMPLE checking enforces same types for the permutation result
and the two permuted vectors.

I'd relax that a bit to allow scalar / vector V0 / V1 and not require
the result to be the same as either vector type but just have the
same element type and element count as the mask vector.  That
would allow concats.

The goal would be that all primitive permute/concat (and extract?)
operations ISAs have can be expressed as a single VEC_PERM_EXPR.

And vector CONSTRUCTORs could be lowered using some automated
generator (as described by the "rewrite vectorizer" talks at the last
Cauldrons) that is seeded by some kind of machine description for
permutation operations.

That said, this would leave VEC_SERIES_EXPR alone unless we'd want
to split it up as you say and it wouldn't complicate CONSTRUCTORs
but walk in the direction of exposing some more target details
at the expense of target complication in their vec_perm_const_ok hook
(until that automated generator thing materializes).

Richard.

>
> Thanks,
> Richard

Re: VEC_DUPLICATE_EXPR options (was Re: [PATCH, RFC, rs6000] enable GIMPLE folding of vec_splat)

Reply via email to