Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

Prathamesh Kulkarni via Gcc-patches Thu, 10 Aug 2023 07:34:08 -0700

On Tue, 8 Aug 2023 at 15:27, Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> > On Fri, 4 Aug 2023 at 20:36, Richard Sandiford
> > <richard.sandif...@arm.com> wrote:
> >>
> >> Full review this time, sorry for the skipping the tests earlier.
> > Thanks for the detailed review! Please find my responses inline below.
> >>
> >> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> >> > diff --git a/gcc/fold-const.cc b/gcc/fold-const.cc
> >> > index 7e5494dfd39..680d0e54fd4 100644
> >> > --- a/gcc/fold-const.cc
> >> > +++ b/gcc/fold-const.cc
> >> > @@ -85,6 +85,10 @@ along with GCC; see the file COPYING3.  If not see
> >> >  #include "vec-perm-indices.h"
> >> >  #include "asan.h"
> >> >  #include "gimple-range.h"
> >> > +#include <algorithm>
> >>
> >> This should be included by defining INCLUDE_ALGORITHM instead.
> > Done. Just curious, why do we use this macro instead of directly
> > including <algorithm> ?
>
> AIUI, one of the reasons for having every file start with includes
> of config.h and (b)system.h, in that order, is to ensure that a small
> and predictable amount of GCC-specific stuff happens before including
> the system header files.  That helps to avoid OS-specific clashes between
> GCC code and system headers.
>
> But another major reason is that system.h ends by poisoning a lot of
> stuff that system headers would be entitled to use.
Ah OK, thanks for the clarification!
>
> >> > +  tree_vector_builder builder (vectype, npatterns, nelts_per_pattern);
> >> > +
> >> > +  // Fill a0 for each pattern
> >> > +  for (unsigned i = 0; i < npatterns; i++)
> >> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> >> > +
> >> > +  if (nelts_per_pattern == 1)
> >> > +    return builder.build ();
> >> > +
> >> > +  // Fill a1 for each pattern
> >> > +  for (unsigned i = 0; i < npatterns; i++)
> >> > +    builder.quick_push (build_int_cst (inner_type, rand () % 100));
> >> > +
> >> > +  if (nelts_per_pattern == 2)
> >> > +    return builder.build ();
> >> > +
> >> > +  for (unsigned i = npatterns * 2; i < npatterns * nelts_per_pattern; 
> >> > i++)
> >> > +    {
> >> > +      tree prev_elem = builder[i - npatterns];
> >> > +      int prev_elem_val = TREE_INT_CST_LOW (prev_elem);
> >> > +      int val = prev_elem_val + S;
> >> > +      builder.quick_push (build_int_cst (inner_type, val));
> >> > +    }
> >> > +
> >> > +  return builder.build ();
> >> > +}
> >> > +
> >> > +static void
> >> > +validate_res (unsigned npatterns, unsigned nelts_per_pattern,
> >> > +           tree res, tree *expected_res)
> >> > +{
> >> > +  ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) == npatterns);
> >> > +  ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) == nelts_per_pattern);
> >>
> >> I don't think this is safe when the inputs are randomised.  E.g. we
> >> could by chance end up with a vector of all zeros, which would have
> >> a single pattern and a single element per pattern, regardless of the
> >> shapes of the inputs.
> >>
> >> Given the way that vector_builder<T, Shape, Derived>::finalize
> >> canonicalises the encoding, it should be safe to use:
> >>
> >> * VECTOR_CST_NPATTERNS (res) <= npatterns
> >> * vector_cst_encoded_nelts (res) <= npatterns * nelts_per_pattern
> >>
> >> If we do that then...
> >>
> >> > +
> >> > +  for (unsigned i = 0; i < vector_cst_encoded_nelts (res); i++)
> >>
> >> ...this loop bound should be npatterns * nelts_per_pattern instead.
> > Ah indeed. Fixed, thanks.
>
> The patch instead does:
>
>   ASSERT_TRUE (VECTOR_CST_NPATTERNS (res) <= npatterns);
>   ASSERT_TRUE (VECTOR_CST_NELTS_PER_PATTERN (res) <= nelts_per_pattern);
>
> I think the version I suggested is safer.  It's not the goal of the
> canonicalisation algorithm to reduce both npattners and nelts_per_pattern
> individually.  The algorithm can increase nelts_per_pattern in order
> to decrease npatterns.
Oops, sorry I misread, will fix in the next patch.
>
> >> > +  {
> >> > +    tree arg0 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> >> > +    tree arg1 = build_vec_cst_rand (integer_type_node, 1, 3, 2);
> >> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +
> >> > +    vec_perm_builder builder (arg0_len, 1, 3);
> >> > +    builder.quick_push (arg0_len);
> >> > +    builder.quick_push (arg0_len + 1);
> >> > +    builder.quick_push (arg0_len + 2);
> >> > +
> >> > +    vec_perm_indices sel (builder, 2, arg0_len);
> >> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, 
> >> > NULL, true);
> >> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt 
> >> > (arg1, 1),
> >> > +                         vector_cst_elt (arg1, 2) };
> >> > +    validate_res (1, 3, res, expected_res);
> >> > +  }
> >> > +
> >> > +  /* Case 3: Leading element of arg1, stepped sequence: pattern 0 of 
> >> > arg0.
> >> > +     sel = {len, 0, 0, 0, 2, 0, ...}
> >> > +     npatterns = 2, nelts_per_pattern = 3.
> >> > +     Use extra pattern {0, ...} to lower number of elements per 
> >> > pattern.  */
> >> > +  {
> >> > +    tree arg0 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> >> > +    tree arg1 = build_vec_cst_rand (char_type_node, 1, 3, 2);
> >> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> >> > +
> >> > +    vec_perm_builder builder (arg0_len, 2, 3);
> >> > +    builder.quick_push (arg0_len);
> >> > +    int mask_elems[] = { 0, 0, 0, 2, 0 };
> >> > +    for (int i = 0; i < 5; i++)
> >> > +      builder.quick_push (mask_elems[i]);
> >>
> >> This leaves one of the elements unspecified.
> > Sorry, I didn't understand.
> > It first pushes len in:
> > builder.quick_push (arg0_len)
> > and then pushes the remaining indices in the loop:
> > for (int i = 0; i < 5; i++)
> >   builder.quick_push (mask_elems[i])
> > So overall, builder will have 6 elements: {len, 0, 0, 0, 2, 0}
>
> Ah, right.  But in that case I think it would be clearer to put arg0_len
> in mask_elems.
Right of course, sorry. For some reason I thought earlier I couldn't
initialize array of poly_uint64 :/
>
> > +/* Try to fold permutation of ARG0 and ARG1 with SEL selector when
> > +   the input vectors are VECTOR_CST. Return NULL_TREE otherwise.
> > +   REASON has same purpose as described in
> > +   valid_mask_for_fold_vec_perm_cst_p.  */
> > +
> > +
>
> Nit: too much vertical space.
Will fix in next patch, thanks.
>
> > +/* Invoke tests for fold_vec_perm_cst.  */
> > +
> > +static void
> > +test ()
> > +{
> > +  /* Conditionally execute fold_vec_perm_cst tests, if target supports
> > +     VLA vectors. Use a compile time check so we avoid instantiating
> > +     poly_uint64 with N > 1 on targets that do not support VLA vectors.  */
> > +  if constexpr (poly_int_traits<poly_uint64>::num_coeffs > 1)
>
> That's a C++17 feature, so we can't use it.
>
> Instead, how about:
>
> static bool
> is_simple_vla_size (poly_uint64 size)
> {
>   if (size.is_constant ())
>     return false;
>   for (int i = 1; i < ARRAY_SIZE (size.coeffs); ++i)
>     if (size[i] != (i <= 1 ? size[0] : 0))
Just wondering is this should be (i == 1 ? size[0] : 0) since i is
initialized to 1 ?
IIUC, is_simple_vla_size should return true for polynomials of first
degree and having same coeff like 4 + 4x ?
>       return false;
>   return true;
> }
>
>
>   FOR_EACH_MODE_IN_CLASS (mode, MODE_VECTOR_INT)
>     {
>       auto nunits = GET_MODE_NUNITS (mode);
>       if (!is_simple_vla_size (nunits))
>         continue;
>       if (nunits[0] ...)
>         test_... (mode);
>       ...
>
>     }
>
> test_vnx4si_v4si and test_v4si_vnx4si look good.  But with the
> loop structure above, I think we can apply the test_vnx4si and
> test_vnx16qi to more cases.  So the classification isn't the
> exact number of elements, but instead a limit.
>
> I think the nunits[0] conditions for test_vnx4si are as follows
> (inspection only, so could be wrong):
>
> > +/* Test cases where result and input vectors are VNx4SI  */
> > +
> > +static void
> > +test_vnx4si (machine_mode vmode)
> > +{
> > +  /* Case 1: mask = {0, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (0);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (res, 0) };
This should be { vector_cst_elt (arg0, 0) }; will fix in next patch.
> > +    validate_res (1, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 2 (could be all nunits if the inputs had nelts_per_pattern==1,
> which I think would be better)
IIUC, the vectors that can be used for a particular test should have
nunits[0] >= res_npatterns,
where res_npatterns is as computed in fold_vec_perm_cst without the
canonicalization ?
For above test -- res_npatterns = max(2, max (2, 1)) == 2, so we
require nunits[0] >= 2 ?
Which implies we can use above test for vectors with length 2 + 2x, 4 + 4x, etc.


Sorry if this sounds like a silly question -- Won't nunits[0] >= 2
cover all nunits,
since a vector, at a minimum, will contain 2 elements ?

I will send a patch shortly addressing above suggestions.

Thanks,
Prathamesh
>
> > +
> > +  /* Case 2: mask = {len, ...} */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 1);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0) };
> > +    validate_res (1, 1, res, expected_res);
> > +  }
>
> same
>
> > +
> > +  /* Case 3: mask = { 0, len, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt 
> > (arg1, 0) };
> > +    validate_res (2, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
> > +
> > +  /* Case 4: mask = { 0, len, 1, len+1, ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt 
> > (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
> > +
> > +  /* Case 5: mask = { 0, len, 1, len+1, .... }
> > +     npatterns = 4, nelts_per_pattern = 1 */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 4, 1);
> > +    builder.quick_push (0);
> > +    builder.quick_push (len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (len + 1);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt 
> > (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (4, 1, res, expected_res);
> > +  }
>
> nunits[0] >= 4
>
> > +
> > +  /* Case 6: mask = {0, 4, ...}
> > +     npatterns = 1, nelts_per_pattern = 2.
> > +     This should return NULL_TREE because the index 4 may choose
> > +     from either arg0 or arg1 depending on vector length.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (len, 1, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (4);
> > +    vec_perm_indices sel (builder, 2, len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    ASSERT_TRUE (res == NULL_TREE);
> > +  }
>
> nunits[0] == 2 or == 4 (could be <= 4 if the inputs had nelts_per_pattern==1)
>
> > +
> > +  /* Case 7: npatterns(arg0) = 4 > npatterns(sel) = 2
> > +     mask = {0, len, 1, len + 1, ...}
> > +     sel_npatterns = 2, sel_nelts_per_pattern = 2.  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    tree arg1 = build_vec_cst_rand (vmode, 2, 3, 1);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 2, 2);
> > +    builder.quick_push (0);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (1);
> > +    builder.quick_push (arg0_len + 1);
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt 
> > (arg1, 0),
> > +                         vector_cst_elt (arg0, 1), vector_cst_elt (arg1, 1)
> > +                       };
> > +    validate_res (2, 2, res, expected_res);
> > +  }
>
> nunits[0] >= 2
>
>
> > +
> > +  /* Case 8: sel = {0, 1, 2, ...}
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     expected res: { arg0[0], arg0[1], arg0[2], ... } */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (0);
> > +    builder.quick_push (1);
> > +    builder.quick_push (2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel);
> > +    tree expected_res[] = { vector_cst_elt (arg0, 0), vector_cst_elt 
> > (arg0, 1),
> > +                         vector_cst_elt (arg0, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
>
> all nunits[0]
>
> > +
> > +  /* Case 9: sel = {len, len + 1, len + 2, ... }
> > +     npatterns = 1, nelts_per_pattern = 3
> > +     expected res: { op1[0], op1[1], op1[2], ... }  */
> > +  {
> > +    tree arg0 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    tree arg1 = build_vec_cst_rand (vmode, 1, 3, 2);
> > +    poly_uint64 arg0_len = TYPE_VECTOR_SUBPARTS (TREE_TYPE (arg0));
> > +
> > +    vec_perm_builder builder (arg0_len, 1, 3);
> > +    builder.quick_push (arg0_len);
> > +    builder.quick_push (arg0_len + 1);
> > +    builder.quick_push (arg0_len + 2);
> > +
> > +    vec_perm_indices sel (builder, 2, arg0_len);
> > +    tree res = fold_vec_perm_cst (TREE_TYPE (arg0), arg0, arg1, sel, NULL);
> > +    tree expected_res[] = { vector_cst_elt (arg1, 0), vector_cst_elt 
> > (arg1, 1),
> > +                         vector_cst_elt (arg1, 2) };
> > +    validate_res (1, 3, res, expected_res);
> > +  }
>
> all nunits[0]
>
> Same idea for the others.
>
> Thanks,
> Richard

Re: [RFC] [v2] Extend fold_vec_perm to handle VLA vectors

Reply via email to