[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #17 from Jakub Jelinek ---
Or try to deal with this in combine.  We have:

(insn 91 90 92 18 (parallel [
            (set (reg:DI 149 [ D.1943 ])
                (and:DI (reg:DI 147 [ D.1943 ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 379 {*anddi_1}
     (expr_list:REG_DEAD (reg:DI 147 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 92 91 94 18 (parallel [
            (set (reg:DI 150 [ D.1943 ])
                (xor:DI (reg:DI 149 [ D.1943 ])
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 401 {*xordi_1}
     (expr_list:REG_DEAD (reg:DI 149 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 94 92 95 18 (parallel [
            (set (reg:QI 93 [ D.1944 ])
                (and:QI (subreg:QI (reg:DI 150 [ D.1943 ]) 0)
                    (const_int 1 [0x1])))
            (clobber (reg:CC 17 flags))
        ]) vrp66-1.c:32 383 {*andqi_1}
     (expr_list:REG_DEAD (reg:DI 150 [ D.1943 ])
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))

and I believe nonzero_bits is smart enough to say that nonzero_bits of
(subreg:QI (reg:DI 150 [ D.1943 ]) 0) is 1.  But the problem is that
simplify-rtl attempts to match all kinds of weird patterns, like:

(parallel [
        (set (reg:QI 93 [ D.1944 ])
            (eq:QI (zero_extract:DI (reg:DI 147 [ D.1943 ])
                    (const_int 1 [0x1])
                    (const_int 0 [0]))
                (const_int 0 [0])))
        (clobber (reg:CC 17 flags))
    ])

etc.  Perhaps, if everything else fails, for an AND with a constant it could
just call nonzero_bits on the non-constant operand, and, if only bits within
the constant mask can be set, try to match the statement as a plain
assignment?
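
To make that suggestion concrete, here is a rough sketch of such a fallback
check (illustrative only, not an actual patch: GET_CODE, CONST_INT_P, UINTVAL,
XEXP, GET_MODE and nonzero_bits are the real RTL helpers, but where exactly
this would hook into combine is an assumption):

/* Illustrative sketch only.  If an (and X (const_int MASK)) failed to match
   any pattern, check whether the AND cannot change any bit of X; if so,
   retry recognition with the AND stripped, i.e. as a plain assignment.  */
if (GET_CODE (src) == AND && CONST_INT_P (XEXP (src, 1)))
  {
    unsigned HOST_WIDE_INT mask = UINTVAL (XEXP (src, 1));
    machine_mode mode = GET_MODE (src);

    if ((nonzero_bits (XEXP (src, 0), mode) & ~mask) == 0)
      /* Only bits already covered by MASK can be nonzero, so the AND is
         a no-op; try matching the plain copy instead of the parallel.  */
      src = XEXP (src, 0);
  }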
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #16 from Jakub Jelinek ---
So, I had a brief look at the #c14 issues.

--- gcc/match.pd.jj	2015-01-05 13:07:14.000000000 +0100
+++ gcc/match.pd	2015-01-16 16:27:08.053817209 +0100
@@ -614,6 +614,16 @@ (define_operator_list inverted_tcc_compa
  (cmp (bit_and (lshift integer_onep @0) integer_onep) integer_zerop)
  (icmp @0 { build_zero_cst (TREE_TYPE (@0)); })))
 
+/* ((type) ((A & 1) ^ 1)) == 0 -> (A & 1) != 0
+   ((type) ((A & 1) ^ 1)) != 0 -> (A & 1) == 0  */
+(for cmp (ne eq)
+     icmp (eq ne)
+ (simplify
+  (cmp (convert?@0 (bit_xor (bit_and@1 @2 integer_onep) integer_onep))
+       integer_zerop)
+  (if (INTEGRAL_TYPE_P (TREE_TYPE (@1)) && INTEGRAL_TYPE_P (TREE_TYPE (@0)))
+   (icmp @1 { build_zero_cst (TREE_TYPE (@1)); }))))
+
 /* Simplifications of conversions.  */
 /* Basic strip-useless-type-conversions / strip_nops.  */

worked (tried vrp66.c only) and managed to simplify the *.optimized dump,
though there isn't a DCE pass after forwprop4, so the dead ^ 1 and casts
still stayed.  But apparently the RTL optimization passes managed to fix that
up and we get the same assembly, so I'm not sure the above is worth pushing
in.

Then I wanted to do at least something about the useless
andl $1, %eax; xorl $1, %eax; andl $1, %eax sequence, where the second andl
is definitely completely unneeded.  I see two issues there.  One is that the
^ 1 does not have range info on it, as it has been created by the VRP2 pass.
For that I have the

--- gcc/tree-vrp.c.jj	2015-01-15 20:25:11.000000000 +0100
+++ gcc/tree-vrp.c	2015-01-16 17:33:21.041941783 +0100
@@ -8993,11 +8993,21 @@ simplify_truth_ops_using_ranges (gimple_
   /* For A != B we substitute A ^ B.  Either with conversion.  */
   else if (need_conversion)
     {
-      tree tem = make_ssa_name (TREE_TYPE (op0));
+      tree type = TREE_TYPE (op0);
+      tree tem = make_ssa_name (type);
       gassign *newop = gimple_build_assign (tem, BIT_XOR_EXPR, op0, op1);
       gsi_insert_before (gsi, newop, GSI_SAME_STMT);
       gimple_assign_set_rhs_with_ops (gsi, NOP_EXPR, tem);
+
+      /* Record value range info for TEM.  */
+      value_range_t vr = VR_INITIALIZER;
+      extract_range_from_binary_expr (&vr, BIT_XOR_EXPR, type, op0, op1);
+      if (INTEGRAL_TYPE_P (type)
+          && (vr.type == VR_RANGE || vr.type == VR_ANTI_RANGE)
+          && TREE_CODE (vr.min) == INTEGER_CST
+          && TREE_CODE (vr.max) == INTEGER_CST)
+        set_range_info (tem, vr.type, vr.min, vr.max);
     }
   /* Or without.  */
   else

patch (also tested just on vrp66.c).  And the second issue is that nothing
actually uses the range info during expansion.  In particular, the masking
comes from reduce_to_bit_field_precision, but unfortunately we don't pass the
original tree to that function.  If we did, say by adding an (optionally
NULL?) extra tree argument to the REDUCE_BIT_FIELD macro, then that function
could check whether it is an SSA_NAME for which we do have range info, and if
the range info suggests we don't have to mask (for unsigned types) or shift
left/right (for signed types), it could avoid that step.  Richard, what do
you think about that?  Of course, for rtxes that don't have any corresponding
tree we would just pass NULL or some other value to make it clear that we
don't know the range info.
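
At the source level, the new match.pd pattern covers rewrites of this shape
(hand-written illustration; both functions should end up testing the same
bit):

/* What the pattern matches: ((type) ((A & 1) ^ 1)) == 0 ...  */
int before (int a)
{
  _Bool t = (_Bool) ((a & 1) ^ 1);
  return t == 0;
}

/* ... and what it is simplified to: (A & 1) != 0.  */
int after (int a)
{
  return (a & 1) != 0;
}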
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #15 from Uroš Bizjak ---
(In reply to Jakub Jelinek from comment #14)
> So, supposedly there is something we want to match-and-simplify, perhaps
> also something we want to simplify at the RTL level, and check if
> bt+set{,n}c might not be beneficial here compared to the shift + and + xor.

Generation of "bt" is quite fragile and heavily depends on the combiner doing
its job correctly.  IIRC, there was some talk of introducing a bit-test tree
code (and a corresponding RTX?) to enable further optimizations, something
like how bswap is now handled throughout the compilation.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

Jakub Jelinek changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |uros at gcc dot gnu.org

--- Comment #14 from Jakub Jelinek ---
Patch committed.  There is still an issue that remains, e.g. on the vrp66.c
testcase we often get code like:

  _120 = 3314649325744685057 >> _119;
  _121 = _120 & 1;
  _25 = _121 ^ 1;
  _122 = (_Bool) _25;
  if (_122 != 0)

when actually

  _120 = 3314649325744685057 >> _119;
  _121 = _120 & 1;
  if (_121 == 0)

would be enough and shorter.  And it isn't just a GIMPLE preference, it
actually shows up in the generated code.  On vrp66.c on x86_64 at -O2, I'm
seeing just 24 btq instructions (before the optimization 0) and 22 shrq
instructions (again, before 0), where the assembly typically looks like:

        movabsq $-9223372032543031257, %rax
        movl    %edi, %ecx
        shrq    %cl, %rax
        andl    $1, %eax
        xorq    $1, %rax
        andl    $1, %eax

i.e. pretty much the same nonsense: at least the last andl $1 is redundant,
because the first andl ensures %eax is 0 or 1, the xorq turns that into 1 or
0, and so the second andl is useless.  It might be nicer if the code used
btq/setc instead.

So, supposedly there is something we want to match-and-simplify, perhaps also
something we want to simplify at the RTL level, and check if bt+set{,n}c
might not be beneficial here compared to the shift + and + xor.
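
A hand-written C-level picture of that redundancy (illustration only, not
compiler output; the constant is the one from the GIMPLE dump above, and the
shift count is assumed to be below 64 as it is in the testcase): after the
first "& 1" the value is already 0 or 1, so the "^ 1" merely flips it and the
trailing "& 1" can never change anything, which is why testing the original
bit for zero is equivalent.

/* shift + and + xor + and, matching the emitted sequence */
int current (unsigned shift)
{
  return (int) ((((3314649325744685057ULL >> shift) & 1) ^ 1) & 1);
}

/* equivalent and shorter: just test the bit for zero */
int shorter (unsigned shift)
{
  return ((3314649325744685057ULL >> shift) & 1) == 0;
}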
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #13 from Jakub Jelinek ---
Author: jakub
Date: Fri Oct 17 10:54:54 2014
New Revision: 216393

URL: https://gcc.gnu.org/viewcvs?rev=216393&root=gcc&view=rev
Log:
        PR tree-optimization/63464
        * gimple.h (gimple_seq_discard): New prototype.
        * gimple.c: Include stringpool.h and tree-ssanames.h.
        (gimple_seq_discard): New function.
        * optabs.h (lshift_cheap_p): New prototype.
        * optabs.c (lshift_cheap_p): New function, moved from...
        * tree-switch-conversion.c (lshift_cheap_p): ... here.
        * tree-ssa-reassoc.c: Include gimplify.h and optabs.h.
        (reassoc_branch_fixups): New variable.
        (update_range_test): Add otherrangep and seq arguments.  Unshare
        exp.  If otherrange is NULL, use for other ranges array of
        pointers pointed by otherrangep instead.  Emit seq before
        gimplified statements for tem.
        (optimize_range_tests_diff): Adjust update_range_test caller.
        (optimize_range_tests_xor): Likewise.  Fix up comment.
        (extract_bit_test_mask, optimize_range_tests_to_bit_test): New
        functions.
        (optimize_range_tests): Adjust update_range_test caller.  Call
        optimize_range_tests_to_bit_test.
        (branch_fixup): New function.
        (execute_reassoc): Call branch_fixup.

        * gcc.dg/torture/pr63464.c: New test.
        * gcc.dg/tree-ssa/reassoc-37.c: New test.
        * gcc.dg/tree-ssa/reassoc-38.c: New test.

Added:
    trunk/gcc/testsuite/gcc.dg/torture/pr63464.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-37.c
    trunk/gcc/testsuite/gcc.dg/tree-ssa/reassoc-38.c
Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/gimple.c
    trunk/gcc/gimple.h
    trunk/gcc/optabs.c
    trunk/gcc/optabs.h
    trunk/gcc/testsuite/ChangeLog
    trunk/gcc/tree-ssa-reassoc.c
    trunk/gcc/tree-switch-conversion.c
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #12 from Jakub Jelinek ---
Created attachment 33724
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33724&action=edit
gcc5-pr63464.patch

Untested patch.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #11 from rguenther at suse dot de ---
On Mon, 13 Oct 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464
>
> --- Comment #9 from Jakub Jelinek ---
> Created attachment 33697
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33697&action=edit
> gcc5-pr63464.patch
>
> WIP patch.  What is missing:
> 1) the optimize_range_tests_to_bit_test call should be guarded by
>    lshift_cheap_p ().  Richard, any preference on where to declare that
>    function (tree-switch-conversion.h, and include that in
>    tree-ssa-reassoc.c, or somewhere else)?

In optabs.[ch]?

> 2) much more importantly, it right now doesn't actually fix up the IL, so
>    instead of the desirable jump around the shift we have there just a
>    BIT_IOR_EXPR (and the shift actually happens to be done first, before
>    the range test).  Richard, any preference how to represent it in the IL
>    in between the optimization and the fixup once all bbs are reassociated?
>    I've been thinking about e.g. some pass-through internal call on the
>    TRUTH_ORIF_EXPR operand.  Another option might be, if we'd adjust
>    update_range_test so that it would not only emit a gimplification of the
>    range test but also an optional gimple_seq before that, to push all the
>    SSA_NAMEs that would otherwise be passed to the internal call into some
>    vector, and just split the bb after the def stmt of those SSA_NAMEs and
>    also before the (single, hopefully) user of that SSA_NAME, adding an
>    edge around the middle of the bb and a PHI.

Can't you "queue" the BIT_IOR_EXPR stmt somewhere and, after reassoc has
finished, split the BBs as desired?

Richard.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #10 from Jakub Jelinek ---
Created attachment 33698
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33698&action=edit
bittest.c

Testcase I've been playing with.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #9 from Jakub Jelinek ---
Created attachment 33697
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33697&action=edit
gcc5-pr63464.patch

WIP patch.  What is missing:
1) the optimize_range_tests_to_bit_test call should be guarded by
   lshift_cheap_p ().  Richard, any preference on where to declare that
   function (tree-switch-conversion.h, and include that in
   tree-ssa-reassoc.c, or somewhere else)?
2) much more importantly, it right now doesn't actually fix up the IL, so
   instead of the desirable jump around the shift we have there just a
   BIT_IOR_EXPR (and the shift actually happens to be done first, before the
   range test).  Richard, any preference how to represent it in the IL in
   between the optimization and the fixup once all bbs are reassociated?
   I've been thinking about e.g. some pass-through internal call on the
   TRUTH_ORIF_EXPR operand.  Another option might be, if we'd adjust
   update_range_test so that it would not only emit a gimplification of the
   range test but also an optional gimple_seq before that, to push all the
   SSA_NAMEs that would otherwise be passed to the internal call into some
   vector, and just split the bb after the def stmt of those SSA_NAMEs and
   also before the (single, hopefully) user of that SSA_NAME, adding an edge
   around the middle of the bb and a PHI.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #8 from Jakub Jelinek ---
Author: jakub
Date: Fri Oct 10 12:15:30 2014
New Revision: 216072

URL: https://gcc.gnu.org/viewcvs?rev=216072&root=gcc&view=rev
Log:
        PR tree-optimization/63464
        * tree-switch-conversion.c (struct case_bit_test): Remove hi and
        lo fields, add wide_int mask field.
        (emit_case_bit_tests): Add MAXVAL argument, rewrite uses of hi/lo
        fields into wide_int mask operations, optimize by pretending minval
        to be 0 if maxval is small enough.
        (process_switch): Adjust caller.

Modified:
    trunk/gcc/ChangeLog
    trunk/gcc/tree-switch-conversion.c
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #7 from Marc Glisse ---
The SLP version is slightly slower than the bit test in this case (at least
on my old desktop), but it can more easily handle testing for characters that
are not within 64 of each other.

__m128i b = _mm_set1_epi8 (*s);
__m128i m = _mm_set_epi8 ('\n', '\r', ',', ' ', ' ', ' ', ' ', ' ',
                          ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ');
__m128i e = _mm_cmpeq_epi8 (b, m);
bool found = _mm_movemask_epi8 (e) != 0;

Though we are missing REDUC_TRUTH_OR_EXPR to model the last line.
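
For reference, a minimal self-contained version of that fragment (the wrapper
function name and the surrounding boilerplate are additions for illustration,
not part of the comment):

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdbool.h>

/* Returns true if *s is one of the four separator characters; the remaining
   twelve lanes are padded with ' ' exactly as in the snippet above, which is
   harmless because ' ' is itself one of the accepted characters.  */
static bool is_separator (const char *s)
{
  __m128i b = _mm_set1_epi8 (*s);
  __m128i m = _mm_set_epi8 ('\n', '\r', ',', ' ', ' ', ' ', ' ', ' ',
                            ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ');
  __m128i e = _mm_cmpeq_epi8 (b, m);
  return _mm_movemask_epi8 (e) != 0;
}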
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #6 from Marc Glisse ---
(In reply to Jakub Jelinek from comment #1)
> We have this optimization implemented for switches,

Thanks, that's indeed the most natural place for it, although I hadn't
thought of testing that...

Glibc's strspn has:

__STRING_INLINE size_t
__strspn_c3 (const char *__s, int __accept1, int __accept2, int __accept3)
{
  size_t __result = 0;
  /* Please note that __accept1 to __accept3 never can be '\0'.  */
  while (__s[__result] == __accept1 || __s[__result] == __accept2
         || __s[__result] == __accept3)
    ++__result;
  return __result;
}

This is only called when optimizing and with a second argument that is a
string literal, but it is still inconvenient to turn that into a switch, so
it seems useful to optimize this form as well (well, maybe we could expand
__builtin_strspn in gcc instead so it also works for larger constant second
arguments, but creating a loop is not so nice and there are plenty of other
similar functions).

(By the way, those optimizations are protected by a test
__builtin_constant_p (s), which only seems to be true when passing a string
literal _directly_; maybe __builtin_constant_p could be made to return true a
bit more often, or glibc could test __builtin_constant_p (s[0]) etc.
instead.)

(For completeness, I also compared with a "repne scasb; je" version I found
somewhere in glibc, which was more than 20 times slower, and quite difficult
to generate since we don't allow clobbers on asm goto...)
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #5 from rguenther at suse dot de ---
On Tue, 7 Oct 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464
>
> --- Comment #4 from Jakub Jelinek ---
> Created attachment 33658
>   --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33658&action=edit
> gcc5-pr63464.patch
>
> Updated patch for the switchconv, this time checking rtx costs.
>
> As for reassoc, the problem I see is that this kind of optimization needs
> to split basic blocks, as a left shift by a negative amount or by >= word
> bit size is undefined behavior, so the expected generated code is probably
> a jump around the left shift.
> I think the reassoc pass is not prepared to see splitting of basic blocks,
> nor adding new PHI nodes etc.  In the
>
> int
> foo (int x)
> {
>   return x == 1 || x == 2 || x == 4 || x == 6 || x == 15 || x == 17;
> }
>
> case we actually have 2 basic blocks and there is no other test ored in in
> either of the basic blocks, so we could perform it even without creating a
> new bb, but I'd say that very often we will not be that lucky.

But you could do the preparations and then do the actual transform, splitting
blocks, as a 2nd phase?

Richard.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #4 from Jakub Jelinek ---
Created attachment 33658
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33658&action=edit
gcc5-pr63464.patch

Updated patch for the switchconv, this time checking rtx costs.

As for reassoc, the problem I see is that this kind of optimization needs to
split basic blocks, as a left shift by a negative amount or by >= word bit
size is undefined behavior, so the expected generated code is probably a jump
around the left shift.  I think the reassoc pass is not prepared to see
splitting of basic blocks, nor adding new PHI nodes etc.  In the

int
foo (int x)
{
  return x == 1 || x == 2 || x == 4 || x == 6 || x == 15 || x == 17;
}

case we actually have 2 basic blocks and there is no other test ored in in
either of the basic blocks, so we could perform it even without creating a
new bb, but I'd say that very often we will not be that lucky.
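
A hand-written sketch of the intended shape of the transform for that foo
(illustration only, not the patch's output; the mask is computed by hand from
the values 1, 2, 4, 6, 15 and 17).  The short-circuit && is exactly the "jump
around the left shift": the shift is only evaluated once the range check has
guaranteed it is defined.

/* Illustrative only: mask 0x28056 has bits 1, 2, 4, 6, 15 and 17 set.  */
int
foo_bit_test (int x)
{
  return (unsigned) x <= 17 && ((1u << x) & 0x28056u) != 0;
}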
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #3 from Jakub Jelinek ---
Created attachment 33654
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33654&action=edit
gcc5-pr63464.patch

Untested patch to avoid the subtraction of info.range_min from index.  It
might not always be a net win, if the mask constant(s) is (are) more costly
when we don't subtract minval than when we do.  E.g. on x86_64, without
subtracting we might need a movabsq plus an and with a register, while with
subtracting just a sub plus an and with an immediate mask.  So perhaps we
need to build all the masks for both cases and sum up the rtx cost of all the
masks and the sub.

This is for the switch opt.  In tree-ssa-reassoc, I'll see what can be
reused, but probably not very much (perhaps the rtx_cost code, if we add it
for the switch conversion).
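
To illustrate the trade-off, here is the bit test written by hand for the
character set from this bug, ' ', ',', '\r', '\n' (masks computed by hand:
minval is '\n' = 10; the bit positions are 10, 13, 32, 44 absolute, or 0, 3,
22, 34 after subtracting minval).  Neither function is compiler output.

/* With the subtraction: bits 0, 3, 22 and 34 -> mask 0x400400009
   (35 significant bits).  */
int
test_with_sub (unsigned c)
{
  unsigned d = c - 10;                     /* minval = '\n' = 10 */
  return d <= 34 && ((1ULL << d) & 0x400400009ULL) != 0;
}

/* Without the subtraction: bits 10, 13, 32 and 44 -> mask 0x100100002400
   (45 significant bits).  For other character sets this is where the mask
   can stop fitting in a 32-bit immediate while the subtracted one still
   does, which is what the rtx-cost comparison above is about.  */
int
test_without_sub (unsigned c)
{
  return c <= 44 && ((1ULL << c) & 0x100100002400ULL) != 0;
}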
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

--- Comment #2 from rguenther at suse dot de ---
On Mon, 6 Oct 2014, jakub at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464
>
> Jakub Jelinek changed:
>
>            What    |Removed     |Added
> ----------------------------------------------------------------------------
>              Status|UNCONFIRMED |NEW
>    Last reconfirmed|            |2014-10-06
>                  CC|            |jakub at gcc dot gnu.org,
>                    |            |rguenth at gcc dot gnu.org,
>                    |            |steven at gcc dot gnu.org
>      Ever confirmed|0           |1
>
> --- Comment #1 from Jakub Jelinek ---
> We have this optimization implemented for switches.  If you compile
>
> char *
> f3 (char *s)
> {
>   do
>     {
>       switch (*s)
>         {
>         case ' ':
>         case ',':
>         case '\r':
>         case '\n':
>           ++s;
>           continue;
>         default:
>           return s;
>         }
>     }
>   while (1);
> }
>
> then it will do the bit test, see r189173 (and various follow-up fixes for
> that).
> Now, we can argue whether in this case it is beneficial to perform the
> MINUS_EXPR, or, if maxval is small enough (e.g. smaller than
> BITS_PER_WORD), to just assume minval is 0.
>
> And then the question is whether we should teach the reassoc range
> optimizations to reuse emit_case_bit_tests, or to convert such tests into a
> GIMPLE_SWITCH and expand as such.
>
> Richard/Steven, thoughts about this?

Factoring out the code-gen part of switch-conversion and using it from the
reassoc range optimization?  I don't think creating a GIMPLE_SWITCH from
reassoc is a good idea (given that switch conversion runs early).

Richard.
[Bug tree-optimization/63464] compare one character to many: faster
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63464

Jakub Jelinek changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED |NEW
   Last reconfirmed|            |2014-10-06
                 CC|            |jakub at gcc dot gnu.org,
                   |            |rguenth at gcc dot gnu.org,
                   |            |steven at gcc dot gnu.org
     Ever confirmed|0           |1

--- Comment #1 from Jakub Jelinek ---
We have this optimization implemented for switches.  If you compile

char *
f3 (char *s)
{
  do
    {
      switch (*s)
        {
        case ' ':
        case ',':
        case '\r':
        case '\n':
          ++s;
          continue;
        default:
          return s;
        }
    }
  while (1);
}

then it will do the bit test, see r189173 (and various follow-up fixes for
that).

Now, we can argue whether in this case it is beneficial to perform the
MINUS_EXPR, or, if maxval is small enough (e.g. smaller than BITS_PER_WORD),
to just assume minval is 0.

And then the question is whether we should teach the reassoc range
optimizations to reuse emit_case_bit_tests, or to convert such tests into a
GIMPLE_SWITCH and expand as such.

Richard/Steven, thoughts about this?