On Wed, Dec 11, 2019 at 5:55 PM Wilco Dijkstra <wilco.dijks...@arm.com> wrote: > > Hi Richard, > > >> +(match (ctz_table_index @1 @2 @3) > >> + (rshift (mult (bit_and (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3)) > > > > You need a :c on the bit_and > > Fixed. > > > + unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc); > > + unsigned shiftval = tree_to_uhwi (tshift); > > + unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type)); > > > In the even that a __int128_t IFN_CTZ is supported the above might ICE with > > too large constants so please wither use wide-int ops or above verify > > tree_fits_{u,s}hwi () before doing the conversions (the conversion from > > TYPE_SIZE should always succeed though). > > I've moved the initialization of val much later so we have done all the > checks and > know for sure the mulc will fit in a HWint. > > > Hmm. So this verifies that for a subset of all possible inputs the table > > computes the correct value. > > > > a) how do we know this verification is exhaustive? > > b) we do this for every array access matching the pattern > > It checks all the values that matter, which is the number of bits plus the > special > handling of ctz(0). An array may contain entries which can never be referenced > (see ctz2() in the testcase), so we don't care what the value is in those > cases. > Very few accesses can match the pattern given it is very specific and there > are > many checks before it tries to check the contents of the array. > > > I suggest you do > > tree ctor = ctor_for_folding (array); > > if (!ctor || TREE_CODE (ctor) != CONSTRUCTOR) > > return false; > > > > and then perform the verification on the constructor elements directly. > > That's a lot cheaper. Watch out for array_ref_low_bound which you > > don't get passed in here - thus pass in the ARRAY_REF, not the array. > > > > I believe it's also wrong in your code above (probably miscompiles > > a fortran equivalent of your testcase or fails verification/transform). > > > > When you do the verification on the ctor_for_folding then you > > can in theory lookup the varpool node for 'array' and cache > > the verification result there. > > I've updated it to use the ctor, but it meant adding another code path to > handle string literals. It's not clear how the array_ref_low_bound affects the > initializer, but I now reject it if it is non-zero. > > >> + tree lhs = gimple_assign_lhs (stmt); > >> + bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), val); > > > > since we're using the optab entry shouldn't you check for == 2 here? > > Yes, that looks more correct (it's not clear what 1 means exactly). > > > Please check this before building the call. > > I've reordered the checks so it returns before it builds any gimple if it > cannot do > the transformation. > > > For all of the above using gimple_build () style stmt building and > > a final gsi_replace_with_seq would be more straight-forward. > > I've changed that, but it meant always inserting the nop convert, otherwise > it does not make the code easier to follow.
OK. Thanks, Richard. > Cheers, > Wilco > > > [PATCH v3] PR90838: Support ctz idioms > > v3: Directly walk over the array initializer and other tweaks based on review. > v2: Use fwprop pass rather than match.pd > > Support common idioms for count trailing zeroes using an array lookup. > The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic > constant which when multiplied by a power of 2 contains a unique value > in the top 5 or 6 bits. This is then indexed into a table which maps it > to the number of trailing zeroes. When the table is valid, we emit a > sequence using the target defined value for ctz (0): > > int ctz1 (unsigned x) > { > static const char table[32] = > { > 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, > 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 > }; > > return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27]; > } > > Is optimized to: > > rbit w0, w0 > clz w0, w0 > and w0, w0, 31 > ret > > Bootstrapped on AArch64. OK for commit? > > ChangeLog: > > 2019-12-11 Wilco Dijkstra <wdijk...@arm.com> > > PR tree-optimization/90838 > * tree-ssa-forwprop.c (check_ctz_array): Add new function. > (check_ctz_string): Likewise. > (optimize_count_trailing_zeroes): Likewise. > (simplify_count_trailing_zeroes): Likewise. > (pass_forwprop::execute): Try ctz simplification. > * match.pd: Add matching for ctz idioms. > * testsuite/gcc.target/aarch64/pr90838.c: New test. > > -- > diff --git a/gcc/match.pd b/gcc/match.pd > index > 3b7a5ce4e9a4de4f983ccdc696ad406a7932c08c..410cd6eaae0cdc9de7e01d5496de0595b7ea15ba > 100644 > --- a/gcc/match.pd > +++ b/gcc/match.pd > @@ -6116,3 +6116,11 @@ DEFINE_INT_AND_FLOAT_ROUND_FN (RINT) > (simplify > (vec_perm vec_same_elem_p@0 @0 @1) > @0) > + > +/* Match count trailing zeroes for simplify_count_trailing_zeroes in fwprop. > + The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic > + constant which when multiplied by a power of 2 contains a unique value > + in the top 5 or 6 bits. This is then indexed into a table which maps it > + to the number of trailing zeroes. */ > +(match (ctz_table_index @1 @2 @3) > + (rshift (mult (bit_and:c (negate @1) @1) INTEGER_CST@2) INTEGER_CST@3)) > diff --git a/gcc/testsuite/gcc.target/aarch64/pr90838.c > b/gcc/testsuite/gcc.target/aarch64/pr90838.c > new file mode 100644 > index > 0000000000000000000000000000000000000000..bff3144c0d1b3984016e5a404e986eae785c73ed > --- /dev/null > +++ b/gcc/testsuite/gcc.target/aarch64/pr90838.c > @@ -0,0 +1,64 @@ > +/* { dg-do compile } */ > +/* { dg-options "-O2" } */ > + > +int ctz1 (unsigned x) > +{ > + static const char table[32] = > + { > + 0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8, > + 31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9 > + }; > + > + return table[((unsigned)((x & -x) * 0x077CB531U)) >> 27]; > +} > + > +int ctz2 (unsigned x) > +{ > + const int u = 0; > + static short table[64] = > + { > + 32, 0, 1,12, 2, 6, u,13, 3, u, 7, u, u, u, u,14, > + 10, 4, u, u, 8, u, u,25, u, u, u, u, u,21,27,15, > + 31,11, 5, u, u, u, u, u, 9, u, u,24, u, u,20,26, > + 30, u, u, u, u,23, u,19,29, u,22,18,28,17,16, u > + }; > + > + x = (x & -x) * 0x0450FBAF; > + return table[x >> 26]; > +} > + > +int ctz3 (unsigned x) > +{ > + static int table[32] = > + { > + 0, 1, 2,24, 3,19, 6,25, 22, 4,20,10,16, 7,12,26, > + 31,23,18, 5,21, 9,15,11,30,17, 8,14,29,13,28,27 > + }; > + > + if (x == 0) return 32; > + x = (x & -x) * 0x04D7651F; > + return table[x >> 27]; > +} > + > +static const unsigned long long magic = 0x03f08c5392f756cdULL; > + > +static const char table[64] = { > + 0, 1, 12, 2, 13, 22, 17, 3, > + 14, 33, 23, 36, 18, 58, 28, 4, > + 62, 15, 34, 26, 24, 48, 50, 37, > + 19, 55, 59, 52, 29, 44, 39, 5, > + 63, 11, 21, 16, 32, 35, 57, 27, > + 61, 25, 47, 49, 54, 51, 43, 38, > + 10, 20, 31, 56, 60, 46, 53, 42, > + 9, 30, 45, 41, 8, 40, 7, 6, > +}; > + > +int ctz4 (unsigned long x) > +{ > + unsigned long lsb = x & -x; > + return table[(lsb * magic) >> 58]; > +} > + > +/* { dg-final { scan-assembler-times "clz\t" 4 } } */ > +/* { dg-final { scan-assembler-times "and\t" 2 } } */ > +/* { dg-final { scan-assembler-not "cmp\t.*0" } } */ > diff --git a/gcc/tree-ssa-forwprop.c b/gcc/tree-ssa-forwprop.c > index > 4a831242c0e1418d89523aaee0eef900d8fcc3bb..219fa158fa7bc373a63ede3853781c35c36948bc > 100644 > --- a/gcc/tree-ssa-forwprop.c > +++ b/gcc/tree-ssa-forwprop.c > @@ -48,6 +48,8 @@ along with GCC; see the file COPYING3. If not see > #include "optabs-tree.h" > #include "tree-vector-builder.h" > #include "vec-perm-indices.h" > +#include "internal-fn.h" > +#include "cgraph.h" > > /* This pass propagates the RHS of assignment statements into use > sites of the LHS of the assignment. It's basically a specialized > @@ -1778,6 +1780,184 @@ simplify_rotate (gimple_stmt_iterator *gsi) > return true; > } > > + > +/* Check whether an array contains a valid ctz table. */ > +static bool > +check_ctz_array (tree ctor, unsigned HOST_WIDE_INT mulc, > + HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits) > +{ > + tree elt, idx; > + unsigned HOST_WIDE_INT i; > + unsigned mask = (1 << (bits - shift)) - 1; > + unsigned matched = 0; > + > + zero_val = 0; > + > + FOR_EACH_CONSTRUCTOR_ELT (CONSTRUCTOR_ELTS (ctor), i, idx, elt) > + { > + if (TREE_CODE (idx) != INTEGER_CST || TREE_CODE (elt) != INTEGER_CST) > + return false; > + if (i > bits * 2) > + return false; > + > + unsigned HOST_WIDE_INT index = tree_to_shwi (idx); > + HOST_WIDE_INT val = tree_to_shwi (elt); > + > + if (index == 0) > + { > + zero_val = val; > + matched++; > + } > + > + if (val >= 0 && val < bits && (((mulc << val) >> shift) & mask) == > index) > + matched++; > + > + if (matched > bits) > + return true; > + } > + > + return false; > +} > + > +/* Check whether a string contains a valid ctz table. */ > +static bool > +check_ctz_string (tree string, unsigned HOST_WIDE_INT mulc, > + HOST_WIDE_INT &zero_val, unsigned shift, unsigned bits) > +{ > + unsigned HOST_WIDE_INT len = TREE_STRING_LENGTH (string); > + unsigned mask = (1 << (bits - shift)) - 1; > + unsigned matched = 0; > + const unsigned char *p = (const unsigned char *) TREE_STRING_POINTER > (string); > + > + if (len < bits || len > bits * 2) > + return false; > + > + zero_val = p[0]; > + > + for (unsigned i = 0; i < len; i++) > + if (p[i] < bits && (((mulc << p[i]) >> shift) & mask) == i) > + matched++; > + > + return matched == bits; > +} > + > +/* Recognize count trailing zeroes idiom. > + The canonical form is array[((x & -x) * C) >> SHIFT] where C is a magic > + constant which when multiplied by a power of 2 contains a unique value > + in the top 5 or 6 bits. This is then indexed into a table which maps it > + to the number of trailing zeroes. Array[0] is returned so the caller can > + emit an appropriate sequence depending on whether ctz (0) is defined on > + the target. */ > +static bool > +optimize_count_trailing_zeroes (tree array_ref, tree x, tree mulc, > + tree tshift, HOST_WIDE_INT &zero_val) > +{ > + tree type = TREE_TYPE (array_ref); > + tree array = TREE_OPERAND (array_ref, 0); > + > + gcc_assert (TREE_CODE (mulc) == INTEGER_CST); > + gcc_assert (TREE_CODE (tshift) == INTEGER_CST); > + > + tree input_type = TREE_TYPE (x); > + unsigned input_bits = tree_to_shwi (TYPE_SIZE (input_type)); > + > + /* Check the array is not wider than integer type and the input is a 32-bit > + or 64-bit type. */ > + if (TYPE_PRECISION (type) > 32) > + return false; > + if (input_bits != 32 && input_bits != 64) > + return false; > + > + if (!direct_internal_fn_supported_p (IFN_CTZ, input_type, > OPTIMIZE_FOR_BOTH)) > + return false; > + > + /* Check the lower bound of the array is zero. */ > + tree low = array_ref_low_bound (array_ref); > + if (!low || !integer_zerop (low)) > + return false; > + > + unsigned shiftval = tree_to_uhwi (tshift); > + > + /* Check the shift extracts the top 5..7 bits. */ > + if (shiftval < input_bits - 7 || shiftval > input_bits - 5) > + return false; > + > + tree ctor = ctor_for_folding (array); > + if (!ctor) > + return false; > + > + unsigned HOST_WIDE_INT val = tree_to_uhwi (mulc); > + > + if (TREE_CODE (ctor) == CONSTRUCTOR) > + return check_ctz_array (ctor, val, zero_val, shiftval, input_bits); > + else if (TREE_CODE (ctor) == STRING_CST) > + return check_ctz_string (ctor, val, zero_val, shiftval, input_bits); > + > + return false; > +} > + > +/* Match.pd function to match the ctz expression. */ > +extern bool gimple_ctz_table_index (tree, tree *, tree (*)(tree)); > + > +static bool > +simplify_count_trailing_zeroes (gimple_stmt_iterator *gsi) > +{ > + gimple *stmt = gsi_stmt (*gsi); > + tree array_ref = gimple_assign_rhs1 (stmt); > + tree res_ops[3]; > + HOST_WIDE_INT zero_val; > + > + gcc_checking_assert (TREE_CODE (array_ref) == ARRAY_REF); > + > + if (!gimple_ctz_table_index (TREE_OPERAND (array_ref, 1), &res_ops[0], > NULL)) > + return false; > + > + if (optimize_count_trailing_zeroes (array_ref, res_ops[0], > + res_ops[1], res_ops[2], zero_val)) > + { > + tree type = TREE_TYPE (res_ops[0]); > + HOST_WIDE_INT ctzval; > + HOST_WIDE_INT type_size = tree_to_shwi (TYPE_SIZE (type)); > + bool zero_ok = CTZ_DEFINED_VALUE_AT_ZERO (TYPE_MODE (type), ctzval) == > 2; > + > + /* Skip if there is no value defined at zero, or if we can't easily > + return the correct value for zero. */ > + if (!zero_ok) > + return false; > + if (zero_val != ctzval && !(zero_val == 0 && ctzval == type_size)) > + return false; > + > + gimple_seq seq = NULL; > + gimple *g; > + gcall *call = gimple_build_call_internal (IFN_CTZ, 1, res_ops[0]); > + gimple_set_location (call, gimple_location (stmt)); > + gimple_set_lhs (call, make_ssa_name (integer_type_node)); > + gimple_seq_add_stmt (&seq, call); > + > + tree prev_lhs = gimple_call_lhs (call); > + > + /* Emit ctz (x) & 31 if ctz (0) is 32 but we need to return 0. */ > + if (zero_val == 0 && ctzval == type_size) > + { > + g = gimple_build_assign (make_ssa_name (integer_type_node), > + BIT_AND_EXPR, prev_lhs, > + build_int_cst (integer_type_node, > + type_size - 1)); > + gimple_set_location (g, gimple_location (stmt)); > + gimple_seq_add_stmt (&seq, g); > + prev_lhs = gimple_assign_lhs (g); > + } > + > + g = gimple_build_assign (gimple_assign_lhs (stmt), NOP_EXPR, prev_lhs); > + gimple_seq_add_stmt (&seq, g); > + gsi_replace_with_seq (gsi, seq, true); > + return true; > + } > + > + return false; > +} > + > + > /* Combine an element access with a shuffle. Returns true if there were > any changes made, else it returns false. */ > > @@ -2867,6 +3047,8 @@ pass_forwprop::execute (function *fun) > else if (code == CONSTRUCTOR > && TREE_CODE (TREE_TYPE (rhs1)) == VECTOR_TYPE) > changed = simplify_vector_constructor (&gsi); > + else if (code == ARRAY_REF) > + changed = simplify_count_trailing_zeroes (&gsi); > break; > } >