On Wed, Dec 11, 2019 at 5:55 PM Wilco Dijkstra <wilco.dijks...@arm.com> wrote:
>
> Hi Richard,
>
> >> +(match (ctz_table_index @1 @2 @3)
> >> +  (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
> >
> > You need a :c on the bit_and
>
> Fixed.
>
> > +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> > +  unsigned shiftval = tree_to_uhwi (tshift);
> > +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
>
> > In the even that a __int128_t IFN_CTZ is supported the above might ICE with
> > too large constants so please wither use wide-int ops or above verify
> > tree_fits_{u,s}hwi () before doing the conversions (the conversion from
> > TYPE_SIZE should always succeed though).
>
> I've moved the initialization of val much later so we have done all the 
> checks and
> know for sure the mulc will fit in a HWint.
>
> > Hmm.  So this verifies that for a subset of all possible inputs the table
> > computes the correct value.
> >
> > a) how do we know this verification is exhaustive?
> > b) we do this for every array access matching the pattern
>
> It checks all the values that matter, which is the number of bits plus the 
> special
> handling of ctz(0). An array may contain entries which can never be referenced
> (see ctz2() in the testcase), so we don't care what the value is in those 
> cases.
> Very few accesses can match the pattern given it is very specific and there 
> are
> many checks before it tries to check the contents of the array.
>
> > I suggest you do
> >  tree ctor = ctor_for_folding (array);
> >  if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR)
> >    return false;
> >
> > and then perform the verification on the constructor elements directly.
> > That's a lot cheaper.  Watch out for array_ref_low_bound which you
> > don't get passed in here - thus pass in the ARRAY_REF, not the array.
> >
> > I believe it's also wrong in your code above (probably miscompiles
> > a fortran equivalent of your testcase or fails verification/transform).
> >
> > When you do the verification on the ctor_for_folding then you
> > can in theory lookup the varpool node for 'array' and cache
> > the verification result there.
>
> I've updated it to use the ctor, but it meant adding another code path to
> handle string literals. It's not clear how the array_ref_low_bound affects the
> initializer, but I now reject it if it is non-zero.
>
> >> +      tree lhs = gimple_assign_lhs (stmt);
> >> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val);
> >
> > since we're using the optab entry shouldn't you check for == 2 here?
>
> Yes, that looks more correct (it's not clear what 1 means exactly).
>
> > Please check this before building the call.
>
> I've reordered the checks so it returns before it builds any gimple if it 
> cannot do
> the transformation.
>
> > For all of the above using gimple_build () style stmt building and
> > a final gsi_replace_with_seq would be more straight-forward.
>
> I've changed that, but it meant always inserting the nop convert, otherwise
> it does not make the code easier to follow.

OK.

Thanks,
Richard.

> Cheers,
> Wilco
>
>
> [PATCH v3] PR90838: Support ctz idioms
>
> v3: Directly walk over the array initializer and other tweaks based on review.
> v2: Use fwprop pass rather than match.pd
>
> Support common idioms for count trailing zeroes using an array lookup.
> The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> constant which when multiplied by a power of 2 contains a unique value
> in the top 5 or 6 bits.  This is then indexed into a table which maps it
> to the number of trailing zeroes.  When the table is valid, we emit a
> sequence using the target defined value for ctz (0):
>
> int ctz1 (unsigned x)
> {
>   static const char table[32] =
>     {
>       0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
>       31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
>     };
>
>   return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> }
>
> Is optimized to:
>
>         rbit    w0, w0
>         clz     w0, w0
>         and     w0, w0, 31
>         ret
>
> Bootstrapped on AArch64. OK for commit?
>
> ChangeLog:
>
> 2019-12-11  Wilco Dijkstra  <wdijk...@arm.com>
>
>         PR tree-optimization/90838
>         * tree-ssa-forwprop.c (check_ctz_array): Add new function.
>         (check_ctz_string): Likewise.
>         (optimize_count_trailing_zeroes): Likewise.
>         (simplify_count_trailing_zeroes): Likewise.
>         (pass_forwprop::execute): Try ctz simplification.
>         * match.pd: Add matching for ctz idioms.
>         * testsuite/gcc.target/aarch64/pr90838.c: New test.
>
> --
> diff --git a/gcc/match.pd b/gcc/match.pd
> index 
> 3b7a5ce4e9a4de4f983ccdc696ad406a7932c08c..410cd6eaae0cdc9de7e01d5496de0595b7ea15ba
>  100644
> --- a/gcc/match.pd
> +++ b/gcc/match.pd
> @@ -6116,3 +6116,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT)
>  (simplify
>   (vec_perm vec_same_elem_p@0 @0 @1)
>   @0)
> +
> +/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  */
> +(match (ctz_table_index @1 @2 @3)
> +  (rshift (mult (bit_and:c (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3))
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c 
> b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> new file mode 100644
> index 
> 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c
> @@ -0,0 +1,64 @@
> +/* { dg-do compile } */
> +/* { dg-options "-O2" } */
> +
> +int ctz1 (unsigned x)
> +{
> +  static const char table[32] =
> +    {
> +      0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
> +      31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
> +    };
> +
> +  return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27];
> +}
> +
> +int ctz2 (unsigned x)
> +{
> +  const int u = 0;
> +  static short table[64] =
> +    {
> +      32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14,
> +      10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15,
> +      31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26,
> +      30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u
> +    };
> +
> +  x = (x & -x) * 0x0450FBAF;
> +  return table[x >> 26];
> +}
> +
> +int ctz3 (unsigned x)
> +{
> +  static int table[32] =
> +    {
> +      0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26,
> +      31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27
> +    };
> +
> +  if (x == 0) return 32;
> +  x = (x & -x) * 0x04D7651F;
> +  return table[x >> 27];
> +}
> +
> +static const unsigned long long magic = 0x03f08c5392f756cdULL;
> +
> +static const char table[64] = {
> +     0,  1, 12,  2, 13, 22, 17,  3,
> +    14, 33, 23, 36, 18, 58, 28,  4,
> +    62, 15, 34, 26, 24, 48, 50, 37,
> +    19, 55, 59, 52, 29, 44, 39,  5,
> +    63, 11, 21, 16, 32, 35, 57, 27,
> +    61, 25, 47, 49, 54, 51, 43, 38,
> +    10, 20, 31, 56, 60, 46, 53, 42,
> +     9, 30, 45, 41,  8, 40,  7,  6,
> +};
> +
> +int ctz4 (unsigned long x)
> +{
> +  unsigned long lsb = x & -x;
> +  return table[(lsb * magic) >> 58];
> +}
> +
> +/* { dg-final { scan-assembler-times "clz\t" 4 } } */
> +/* { dg-final { scan-assembler-times "and\t" 2 } } */
> +/* { dg-final { scan-assembler-not "cmp\t.*0" } } */
> diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c
> index 
> 4a831242c0e1418d89523aaee0eef900d8fcc3bb..219fa158fa7bc373a63ede3853781c35c36948bc
>  100644
> --- a/gcc/tree-ssa-forwprop.c
> +++ b/gcc/tree-ssa-forwprop.c
> @@ -48,6 +48,8 @@ along with GCC; see the file COPYING3.  If not see
>  #include "optabs-tree.h"
>  #include "tree-vector-builder.h"
>  #include "vec-perm-indices.h"
> +#include "internal-fn.h"
> +#include "cgraph.h"
>
>  /* This pass propagates the RHS of assignment statements into use
>     sites of the LHS of the assignment.  It's basically a specialized
> @@ -1778,6 +1780,184 @@ simplify_rotate (gimple_stmt_iterator *gsi)
>    return true;
>  }
>
> +
> +/* Check whether an array contains a valid ctz table.  */
> +static bool
> +check_ctz_array (tree ctor, unsigned HOST_WIDE_INT mulc,
> +                HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
> +{
> +  tree elt, idx;
> +  unsigned HOST_WIDE_INT i;
> +  unsigned mask = (1 << (bits - shift)) - 1;
> +  unsigned matched = 0;
> +
> +  zero_val = 0;
> +
> +  FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (ctor), i, idx, elt)
> +    {
> +      if (TREE_CODE (idx) != INTEGER_CST || TREE_CODE (elt) != INTEGER_CST)
> +       return false;
> +      if (i > bits * 2)
> +       return false;
> +
> +      unsigned HOST_WIDE_INT index = tree_to_shwi (idx);
> +      HOST_WIDE_INT val = tree_to_shwi (elt);
> +
> +      if (index == 0)
> +       {
> +         zero_val = val;
> +         matched++;
> +       }
> +
> +      if (val >= 0 && val < bits && (((mulc << val) >> shift) & mask) == 
> index)
> +       matched++;
> +
> +      if (matched > bits)
> +       return true;
> +    }
> +
> +  return false;
> +}
> +
> +/* Check whether a string contains a valid ctz table.  */
> +static bool
> +check_ctz_string (tree string, unsigned HOST_WIDE_INT mulc,
> +                 HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits)
> +{
> +  unsigned HOST_WIDE_INT len = TREE_STRING_LENGTH (string);
> +  unsigned mask = (1 << (bits - shift)) - 1;
> +  unsigned matched = 0;
> +  const unsigned char *p = (const unsigned char *) TREE_STRING_POINTER 
> (string);
> +
> +  if (len < bits || len > bits * 2)
> +    return false;
> +
> +  zero_val = p[0];
> +
> +  for (unsigned i = 0; i < len; i++)
> +    if (p[i] < bits && (((mulc << p[i]) >> shift) & mask) == i)
> +      matched++;
> +
> +  return matched == bits;
> +}
> +
> +/* Recognize count trailing zeroes idiom.
> +   The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic
> +   constant which when multiplied by a power of 2 contains a unique value
> +   in the top 5 or 6 bits.  This is then indexed into a table which maps it
> +   to the number of trailing zeroes.  Array[0] is returned so the caller can
> +   emit an appropriate sequence depending on whether ctz (0) is defined on
> +   the target.  */
> +static bool
> +optimize_count_trailing_zeroes (tree array_ref, tree x, tree mulc,
> +                               tree tshift, HOST_WIDE_INT &zero_val)
> +{
> +  tree type = TREE_TYPE (array_ref);
> +  tree array = TREE_OPERAND (array_ref, 0);
> +
> +  gcc_assert (TREE_CODE (mulc) == INTEGER_CST);
> +  gcc_assert (TREE_CODE (tshift) == INTEGER_CST);
> +
> +  tree input_type = TREE_TYPE (x);
> +  unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type));
> +
> +  /* Check the array is not wider than integer type and the input is a 32-bit
> +     or 64-bit type.  */
> +  if (TYPE_PRECISION (type) > 32)
> +    return false;
> +  if (input_bits != 32 && input_bits != 64)
> +    return false;
> +
> +  if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, 
> OPTIMIZE_FOR_BOTH))
> +    return false;
> +
> +  /* Check the lower bound of the array is zero.  */
> +  tree low = array_ref_low_bound (array_ref);
> +  if (!low || !integer_zerop (low))
> +    return false;
> +
> +  unsigned shiftval = tree_to_uhwi (tshift);
> +
> +  /* Check the shift extracts the top 5..7 bits.  */
> +  if (shiftval < input_bits - 7 || shiftval > input_bits - 5)
> +    return false;
> +
> +  tree ctor = ctor_for_folding (array);
> +  if (!ctor)
> +    return false;
> +
> +  unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc);
> +
> +  if (TREE_CODE (ctor) == CONSTRUCTOR)
> +    return check_ctz_array (ctor, val, zero_val, shiftval, input_bits);
> +  else if (TREE_CODE (ctor) == STRING_CST)
> +    return check_ctz_string (ctor, val, zero_val, shiftval, input_bits);
> +
> +  return false;
> +}
> +
> +/* Match.pd function to match the ctz expression.  */
> +extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree));
> +
> +static bool
> +simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi)
> +{
> +  gimple *stmt = gsi_stmt (*gsi);
> +  tree array_ref = gimple_assign_rhs1 (stmt);
> +  tree res_ops[3];
> +  HOST_WIDE_INT zero_val;
> +
> +  gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF);
> +
> +  if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], 
> NULL))
> +    return false;
> +
> +  if (optimize_count_trailing_zeroes (array_ref, res_ops[0],
> +                                     res_ops[1], res_ops[2], zero_val))
> +    {
> +      tree type = TREE_TYPE (res_ops[0]);
> +      HOST_WIDE_INT ctzval;
> +      HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type));
> +      bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == 
> 2;
> +
> +      /* Skip if there is no value defined at zero, or if we can't easily
> +        return the correct value for zero.  */
> +      if (!zero_ok)
> +       return false;
> +      if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size))
> +       return false;
> +
> +      gimple_seq seq = NULL;
> +      gimple *g;
> +      gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]);
> +      gimple_set_location (call, gimple_location (stmt));
> +      gimple_set_lhs (call, make_ssa_name (integer_type_node));
> +      gimple_seq_add_stmt (&seq, call);
> +
> +      tree prev_lhs = gimple_call_lhs (call);
> +
> +      /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0.  */
> +      if (zero_val == 0 && ctzval == type_size)
> +       {
> +         g = gimple_build_assign (make_ssa_name (integer_type_node),
> +                                  BIT_AND_EXPR, prev_lhs,
> +                                  build_int_cst (integer_type_node,
> +                                                 type_size - 1));
> +         gimple_set_location (g, gimple_location (stmt));
> +         gimple_seq_add_stmt (&seq, g);
> +         prev_lhs = gimple_assign_lhs (g);
> +       }
> +
> +      g = gimple_build_assign (gimple_assign_lhs (stmt), NOP_EXPR, prev_lhs);
> +      gimple_seq_add_stmt (&seq, g);
> +      gsi_replace_with_seq (gsi, seq, true);
> +      return true;
> +    }
> +
> +  return false;
> +}
> +
> +
>  /* Combine an element access with a shuffle.  Returns true if there were
>     any changes made, else it returns false.  */
>
> @@ -2867,6 +3047,8 @@ pass_forwprop::execute (function *fun)
>                     else if (code == CONSTRUCTOR
>                              && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE)
>                       changed = simplify_vector_constructor (&gsi);
> +                   else if (code == ARRAY_REF)
> +                     changed = simplify_count_trailing_zeroes (&gsi);
>                     break;
>                   }
>

Reply via email to