Re: [Patch] Improving jump-thread pass for PR 54742
Richard Biener wrote: On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch. This patch fixes the problem: diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 12f83ba..f8c736e 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { if (!loop-header +|| !loop_latch_edge (loop) || !bitmap_bit_p (threaded_blocks, loop-header-index)) continue; retval |= thread_through_loop_header (loop, may_peel_loop_headers); Ok to commit after regstrap? This seems to be indicating that we have with no edge from the latch block to the header block. I'd like to know better how we got into that state. It Also returns null for loops with multiple latches. So the patch looks OK for me. The bug I was seeing has been fixed by the patch for: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64284 Thanks, Sebastian
Re: [Patch] Improving jump-thread pass for PR 54742
On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote: On Sat, 2014-12-06 at 19:21 +, Sebastian Pop wrote: I think it does not make sense to duplicate paths at -Os: I disabled the FSM jump-threading when optimizing for size like this. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 29b20c8..ce70311 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e, return 1; } if (!flag_expensive_optimizations + || optimize_function_for_size_p (cfun) || TREE_CODE (cond) != SSA_NAME || e-dest-loop_father != e-src-loop_father || loop_depth (e-dest-loop_father) == 0) return 0; I will regstrap and commit the attached patch. Bootstrapped and regression tested on x86_64-linux. Committed r218451. Sebastian Thanks for getting all this checked in Sebastian, I have tested it on coremark and I am getting the speed up that I expect. But I am a little confused about turning off jump threading. I am getting the optimization on coremark with -O3 and that is great and if I use '-O3 -fno-expensive-optimizations' then I don't see this part of the jump threading, also great. But I was surprised that if I just used '-O3 -fno-thread-jumps' then I still see this optimization. Is that expected? Should this test also check flag_thread_jumps? Or should that be getting checked somewhere else? -fthread-jumps is an RTL optimization flag and ignored on GIMPLE. Richard. Steve Ellcey
Re: [Patch] Improving jump-thread pass for PR 54742
Richard Biener wrote: On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote: expected? Should this test also check flag_thread_jumps? Or should that be getting checked somewhere else? -fthread-jumps is an RTL optimization flag and ignored on GIMPLE. Does it make sense to add a -f[no-]tree-thread-jumps to enable/disable the tree jump threading? I could also add -f[no-]tree-fsm-thread-jumps. Opinions? On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch. This patch fixes the problem: diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 12f83ba..f8c736e 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { if (!loop-header +|| !loop_latch_edge (loop) || !bitmap_bit_p (threaded_blocks, loop-header-index)) continue; retval |= thread_through_loop_header (loop, may_peel_loop_headers); Ok to commit after regstrap? Thanks, Sebastian
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/09/14 10:38, Sebastian Pop wrote: Richard Biener wrote: On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote: expected? Should this test also check flag_thread_jumps? Or should that be getting checked somewhere else? -fthread-jumps is an RTL optimization flag and ignored on GIMPLE. Does it make sense to add a -f[no-]tree-thread-jumps to enable/disable the tree jump threading? I could also add -f[no-]tree-fsm-thread-jumps. Opinions? Our option handling is a bit of a mess if we look at it from the user standpoint. Given that the vast majority of jump threading happens on gimple, ISTM that -f[no-]thread-jumps ought to be controlling the gimple implementation. One could easily argue that the user doesn't really care about where in the pipeline the optimization is implemented. My vote would be to just make -fthread-jumps control both RTL and gimple jump threading. On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch. This patch fixes the problem: diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 12f83ba..f8c736e 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { if (!loop-header +|| !loop_latch_edge (loop) || !bitmap_bit_p (threaded_blocks, loop-header-index)) continue; retval |= thread_through_loop_header (loop, may_peel_loop_headers); Ok to commit after regstrap? This seems to be indicating that we have with no edge from the latch block to the header block. I'd like to know better how we got into that state. Also, a test for the GCC testsuite would be good. I have no idea what license covers the LLVM testsuite. But given a good analysis of the problem we may be able to write a suitable test independent of the LLVM test. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On December 9, 2014 7:39:48 PM CET, Jeff Law l...@redhat.com wrote: On 12/09/14 10:38, Sebastian Pop wrote: Richard Biener wrote: On Mon, Dec 8, 2014 at 10:49 PM, Steve Ellcey sell...@mips.com wrote: expected? Should this test also check flag_thread_jumps? Or should that be getting checked somewhere else? -fthread-jumps is an RTL optimization flag and ignored on GIMPLE. Does it make sense to add a -f[no-]tree-thread-jumps to enable/disable the tree jump threading? I could also add -f[no-]tree-fsm-thread-jumps. Opinions? Our option handling is a bit of a mess if we look at it from the user standpoint. Given that the vast majority of jump threading happens on gimple, ISTM that -f[no-]thread-jumps ought to be controlling the gimple implementation. One could easily argue that the user doesn't really care about where in the pipeline the optimization is implemented. My vote would be to just make -fthread-jumps control both RTL and gimple jump threading. Works for me. On the llvm test-suite, I have seen one ICE with my fsm jump-thread patch. This patch fixes the problem: diff --git a/gcc/tree-ssa-threadupdate.c b/gcc/tree-ssa-threadupdate.c index 12f83ba..f8c736e 100644 --- a/gcc/tree-ssa-threadupdate.c +++ b/gcc/tree-ssa-threadupdate.c @@ -2564,6 +2564,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) FOR_EACH_LOOP (loop, LI_FROM_INNERMOST) { if (!loop-header +|| !loop_latch_edge (loop) || !bitmap_bit_p (threaded_blocks, loop-header-index)) continue; retval |= thread_through_loop_header (loop, may_peel_loop_headers); Ok to commit after regstrap? This seems to be indicating that we have with no edge from the latch block to the header block. I'd like to know better how we got into that state. It Also returns null for loops with multiple latches. So the patch looks OK for me. Thanks, Richard. Also, a test for the GCC testsuite would be good. I have no idea what license covers the LLVM testsuite. But given a good analysis of the problem we may be able to write a suitable test independent of the LLVM test. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/09/14 12:43, Richard Biener wrote: This seems to be indicating that we have with no edge from the latch block to the header block. I'd like to know better how we got into that state. It Also returns null for loops with multiple latches. So the patch looks OK for me. Ah, OK. Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On Dec 9, 2014, at 10:39 AM, Jeff Law l...@redhat.com wrote: Also, a test for the GCC testsuite would be good. I have no idea what license covers the LLVM testsuite. But given a good analysis of the problem we may be able to write a suitable test independent of the LLVM test. So, the usual engineerings rules should work just fine on it. delta reduce and submit.
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/06/14 06:47, Sebastian Pop wrote: Jeff Law wrote: OK to commit. Thanks for your patience. Can you follow-up with a change which throttles this optimization when -Os is in effect. You can check optimize_function_for_size_p (cfun) and simply avoid the backward traversal or you could allow it in that case if the amount of copying is suitably small. Your call. I think it does not make sense to duplicate paths at -Os: I disabled the FSM jump-threading when optimizing for size like this. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 29b20c8..ce70311 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e, return 1; } if (!flag_expensive_optimizations + || optimize_function_for_size_p (cfun) || TREE_CODE (cond) != SSA_NAME || e-dest-loop_father != e-src-loop_father || loop_depth (e-dest-loop_father) == 0) return 0; I will regstrap and commit the attached patch. Looks good to me. Thanks for taking care of it. Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On Sat, 2014-12-06 at 19:21 +, Sebastian Pop wrote: I think it does not make sense to duplicate paths at -Os: I disabled the FSM jump-threading when optimizing for size like this. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 29b20c8..ce70311 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e, return 1; } if (!flag_expensive_optimizations + || optimize_function_for_size_p (cfun) || TREE_CODE (cond) != SSA_NAME || e-dest-loop_father != e-src-loop_father || loop_depth (e-dest-loop_father) == 0) return 0; I will regstrap and commit the attached patch. Bootstrapped and regression tested on x86_64-linux. Committed r218451. Sebastian Thanks for getting all this checked in Sebastian, I have tested it on coremark and I am getting the speed up that I expect. But I am a little confused about turning off jump threading. I am getting the optimization on coremark with -O3 and that is great and if I use '-O3 -fno-expensive-optimizations' then I don't see this part of the jump threading, also great. But I was surprised that if I just used '-O3 -fno-thread-jumps' then I still see this optimization. Is that expected? Should this test also check flag_thread_jumps? Or should that be getting checked somewhere else? Steve Ellcey
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: OK to commit. Thanks for your patience. Can you follow-up with a change which throttles this optimization when -Os is in effect. You can check optimize_function_for_size_p (cfun) and simply avoid the backward traversal or you could allow it in that case if the amount of copying is suitably small. Your call. I think it does not make sense to duplicate paths at -Os: I disabled the FSM jump-threading when optimizing for size like this. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 29b20c8..ce70311 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e, return 1; } if (!flag_expensive_optimizations + || optimize_function_for_size_p (cfun) || TREE_CODE (cond) != SSA_NAME || e-dest-loop_father != e-src-loop_father || loop_depth (e-dest-loop_father) == 0) return 0; I will regstrap and commit the attached patch. Sebastian From 1e2efaa2e3121170a938cd479d979b55c37cc4a4 Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR tree-optimization/54742 * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges calling duplicate_seme_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New test. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New test. --- gcc/ChangeLog| 29 +++ gcc/doc/invoke.texi | 12 + gcc/params.def | 15 ++ gcc/testsuite/ChangeLog |8 + gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 43 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 127 ++ gcc/tree-cfg.c |2 +- gcc/tree-cfg.h |1 + gcc/tree-ssa-threadedge.c| 278 +- gcc/tree-ssa-threadupdate.c | 203 +++- gcc/tree-ssa-threadupdate.h |1 + 11 files changed, 716 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c diff --git a/gcc/ChangeLog b/gcc/ChangeLog index b340b51..6cfd339 100644 --- a/gcc/ChangeLog +++ b/gcc/ChangeLog @@ -1,3 +1,32 @@ +2014-12-06 James Greenhalgh james.greenha...@arm.com + Sebastian Pop s@samsung.com + Brian Rzycki b.rzy...@samsung.com + + PR tree-optimization/54742 + * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, + max-fsm-thread-paths): New. + + * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, + max-fsm-thread-paths): Documented. + + * tree-cfg.c (split_edge_bb_loc): Export. + * tree-cfg.h (split_edge_bb_loc): Declared extern. + + * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the + original value of cond when simplification fails. + (fsm_find_thread_path): New. + (fsm_find_control_statement_thread_paths): New. + (thread_through_normal_block): Call find_control_statement_thread_paths. + + * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print + EDGE_FSM_THREAD. + (verify_seme): New. + (duplicate_seme_region): New. + (thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges + calling duplicate_seme_region. + + * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD. + 2014-12-06 H.J. Lu hongjiu...@intel.com PR target/64200 diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 82f0794..70d1336 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10623,6 +10623,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: Jeff Law wrote: OK to commit. Thanks for your patience. Can you follow-up with a change which throttles this optimization when -Os is in effect. You can check optimize_function_for_size_p (cfun) and simply avoid the backward traversal or you could allow it in that case if the amount of copying is suitably small. Your call. I think it does not make sense to duplicate paths at -Os: I disabled the FSM jump-threading when optimizing for size like this. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 29b20c8..ce70311 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -1335,8 +1335,9 @@ thread_through_normal_block (edge e, return 1; } if (!flag_expensive_optimizations + || optimize_function_for_size_p (cfun) || TREE_CODE (cond) != SSA_NAME || e-dest-loop_father != e-src-loop_father || loop_depth (e-dest-loop_father) == 0) return 0; I will regstrap and commit the attached patch. Bootstrapped and regression tested on x86_64-linux. Committed r218451. Sebastian
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/04/14 02:14, Sebastian Pop wrote: Sebastian Pop wrote: a fail I have not seen in the past: FAIL: gcc.c-torture/compile/pr27571.c -Os (internal compiler error) I am still investigating why this fails: as far as I can see for now this is because in copying the FSM path we create an internal loop that is then discovered by the loop verifier as a natural loop and is not yet in the existing loop sturctures. I will try to fix this in duplicate_seme by invalidating the loop structure after we code generated all the FSM paths. I will submit an updated patch when it passes regtest. We need at least this patch to fix the fail: @@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) if (duplicate_seme_region (entry, exit, region, len - 1, NULL)) { /* We do not update dominance info. */ free_dominance_info (CDI_DOMINATORS); bitmap_set_bit (threaded_blocks, entry-src-index); + retval = true; } And this will trigger in the end of the code gen function: if (retval) loops_state_set (LOOPS_NEED_FIXUP); That will fix the loop structures. I'm testing this patch on top of the one I have just sent out. That looks correct to me. Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/04/14 07:29, Sebastian Pop wrote: Sebastian Pop wrote: Jeff Law wrote: I'm a bit worried about compile-time impacts of the all the recursion I will also restrict the recursion to the loop in which we look for the FSM thread. The attached patch includes this change. It passed bootstrap and regression test on x86_64-linux. Ok to commit? OK to commit. Thanks for your patience. Can you follow-up with a change which throttles this optimization when -Os is in effect. You can check optimize_function_for_size_p (cfun) and simply avoid the backward traversal or you could allow it in that case if the amount of copying is suitably small. Your call. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. Has there been any tuning on these defaults. I don't have any strong opinions about what they ought to be, this is more to get any such information recorded on the lists for historical purposes. I have not tuned any of these defaults other than making sure that coremark is still jump-threaded. gcc.dg/tree-ssa/ssa-dom-thread-7.c is a test-case that will check that we always optimize coremark. I think it's worth a note in the debug dump anytime you abort threading when you hit a limit. Done. I'm a bit worried about compile-time impacts of the all the recursion, but I'm willing to wait and see if it turns out to be a problem in practice. Done, as Richi suggested, checking the flag_expensive_optimizations. @@ -689,6 +692,12 @@ simplify_control_stmt_condition (edge e, pass specific callback to try and simplify it further. */ if (cached_lhs ! is_gimple_min_invariant (cached_lhs)) cached_lhs = (*simplify) (stmt, stmt); + + /* We couldn't find an invariant. But, callers of this + function may be able to do something useful with the + unmodified destination. */ + if (!cached_lhs) +cached_lhs = original_lhs; } else cached_lhs = NULL; Can't you just use COND rather than stuffing its value away into ORIGINAL_LHS?CACHED_LHS may be better in some cases if it's an SSA_NAME (and it should be), but I doubt it matters in practice. Or is it the case that you have to have the original condition -- without any context sensitive equivalences used to simplify the condition. I think we need to start the search for FSM jump-threads with the original non simplified condition. +/* Return true if there is at least one path from START_BB to END_BB. + VISITED_BBS is used to make sure we don't fall into an infinite loop. */ + +static bool +fsm_find_thread_path (basic_block start_bb, basic_block end_bb, + vecbasic_block, va_gc *path, + hash_setbasic_block *visited_bbs, int n_insns) Update comment to indicate how PATH is used to return a path from START_BB to END_BB. Done. + for (gsi = gsi_start_bb (bb); !gsi_end_p (gsi); + gsi_next (gsi)) +++n_insns; Probably don't want to count labels and GIMPLE_NOPs. Probably do want to count non-virtual PHIs since those may end up as a copies or constant initializations. Done. + if (j == 0) +kind = EDGE_START_FSM_THREAD; + else if (single_pred_p (e-src)) +kind = EDGE_NO_COPY_SRC_BLOCK; + else { +kind = EDGE_COPY_SRC_JOINER_BLOCK; +++joiners; + } Presumably the mis-formatting was added when you tracked the # joiners. AFAICT that is a write-only variable and ought to be removed. Along with the braces on the final ELSE which should restore proper formatting. Done. @@ -2343,6 +2493,55 @@ thread_through_all_blocks (bool may_peel_loop_headers) threaded_blocks = BITMAP_ALLOC (NULL); memset (thread_stats, 0, sizeof (thread_stats)); + for (i = 0; i paths.length ();) Comment before this loop. I can see what you're doing, but I'm already very familiar with this code. Basically what are you looking for in this loop and what do you do? Done. Overall I think this is very very close and I really like the overall direction. There's a few minor issues noted above and with those addressed, I think we should be ready to go. Thanks for your careful review. Please let me know if there still are things I can improve in the attached patch. The path passes bootstrap on x86_64-linux and powerpc64-linux, and regtest except a fail I have not seen in the past: FAIL: gcc.c-torture/compile/pr27571.c -Os (internal compiler error) I am still investigating why this fails: as far as I can see for now this is because in copying the FSM path we create an internal loop that is then discovered by the loop verifier as a natural loop and is not yet in the existing loop sturctures. I will try to fix this in duplicate_seme by invalidating the loop structure after we code generated all the FSM paths. I will submit an updated patch when it passes regtest. Removing most of tree-ssa-threadupdate.c and using SEME duplication would be a huge step forward for making this code more understandable. I look forward to any work you do in this space in the future. I will clean up the patch
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: a fail I have not seen in the past: FAIL: gcc.c-torture/compile/pr27571.c -Os (internal compiler error) I am still investigating why this fails: as far as I can see for now this is because in copying the FSM path we create an internal loop that is then discovered by the loop verifier as a natural loop and is not yet in the existing loop sturctures. I will try to fix this in duplicate_seme by invalidating the loop structure after we code generated all the FSM paths. I will submit an updated patch when it passes regtest. We need at least this patch to fix the fail: @@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) if (duplicate_seme_region (entry, exit, region, len - 1, NULL)) { /* We do not update dominance info. */ free_dominance_info (CDI_DOMINATORS); bitmap_set_bit (threaded_blocks, entry-src-index); + retval = true; } And this will trigger in the end of the code gen function: if (retval) loops_state_set (LOOPS_NEED_FIXUP); That will fix the loop structures. I'm testing this patch on top of the one I have just sent out. Sebastian
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: Sebastian Pop wrote: a fail I have not seen in the past: FAIL: gcc.c-torture/compile/pr27571.c -Os (internal compiler error) I am still investigating why this fails: as far as I can see for now this is because in copying the FSM path we create an internal loop that is then discovered by the loop verifier as a natural loop and is not yet in the existing loop sturctures. I will try to fix this in duplicate_seme by invalidating the loop structure after we code generated all the FSM paths. I will submit an updated patch when it passes regtest. We need at least this patch to fix the fail: @@ -2518,6 +2518,7 @@ thread_through_all_blocks (bool may_peel_loop_headers) if (duplicate_seme_region (entry, exit, region, len - 1, NULL)) { /* We do not update dominance info. */ free_dominance_info (CDI_DOMINATORS); bitmap_set_bit (threaded_blocks, entry-src-index); + retval = true; } And this will trigger in the end of the code gen function: if (retval) loops_state_set (LOOPS_NEED_FIXUP); That will fix the loop structures. I'm testing this patch on top of the one I have just sent out. This passed bootstrap and regression test on x86_64-linux.
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: I'm a bit worried about compile-time impacts of the all the recursion I will also restrict the recursion to the loop in which we look for the FSM thread, like this: diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index a6fb361..9a153bb 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -959,13 +959,17 @@ thread_around_empty_blocks (edge taken_edge, /* Return true if the CFG contains at least one path from START_BB to END_BB. When a path is found, record in PATH the blocks from END_BB to START_BB. - VISITED_BBS is used to make sure we don't fall into an infinite loop. */ + VISITED_BBS is used to make sure we don't fall into an infinite loop. Bound + the recursion to basic blocks belonging to LOOP. */ static bool fsm_find_thread_path (basic_block start_bb, basic_block end_bb, vecbasic_block, va_gc *path, - hash_setbasic_block *visited_bbs, int n_insns) + hash_setbasic_block *visited_bbs, loop_p loop) { + if (loop != start_bb-loop_father) +return false; + if (start_bb == end_bb) { vec_safe_push (path, start_bb); @@ -977,7 +981,7 @@ fsm_find_thread_path (basic_block start_bb, basic_block end_bb, edge e; edge_iterator ei; FOR_EACH_EDGE (e, ei, start_bb-succs) - if (fsm_find_thread_path (e-dest, end_bb, path, visited_bbs, n_insns)) + if (fsm_find_thread_path (e-dest, end_bb, path, visited_bbs, loop)) { vec_safe_push (path, start_bb); return true; @@ -1035,7 +1039,8 @@ fsm_find_control_statement_thread_paths (tree expr, { hash_setbasic_block *visited_bbs = new hash_setbasic_block; - if (fsm_find_thread_path (var_bb, e-src, next_path, visited_bbs, 0)) + if (fsm_find_thread_path (var_bb, e-src, next_path, visited_bbs, + e-src-loop_father)) ++e_count; delete visited_bbs;
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: Jeff Law wrote: I'm a bit worried about compile-time impacts of the all the recursion I will also restrict the recursion to the loop in which we look for the FSM thread. The attached patch includes this change. It passed bootstrap and regression test on x86_64-linux. Ok to commit? Thanks, Sebastian From 0e3312921c07c0e0dd5c1bf5f24050b2336475ef Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_FSM_THREAD edges calling duplicate_seme_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New. --- gcc/doc/invoke.texi | 12 + gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 43 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 127 ++ gcc/tree-cfg.c |2 +- gcc/tree-cfg.h |1 + gcc/tree-ssa-threadedge.c| 277 +- gcc/tree-ssa-threadupdate.c | 203 +++- gcc/tree-ssa-threadupdate.h |1 + 9 files changed, 678 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. + @end table @end table diff --git a/gcc/params.def b/gcc/params.def index 9b21c07..edf3f53 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE, Maximum number of statements to be included into a single static constructor generated by Pointer Bounds Checker, 5000, 100, 0) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS, + max-fsm-thread-path-insns, + Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path, + 100, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH, + max-fsm-thread-length, + Maximum number of basic blocks on a finite state automaton jump thread path, + 10, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS, + max-fsm-thread-paths, + Maximum number of new jump thread paths to create for a finite state automaton, + 50, 1, 99) /* Local variables: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c new file mode 100644 index 000..bb34a74 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom1-details } */ +/* { dg-final { scan-tree-dump-times FSM 6 dom1 } } */ +/* { dg-final { cleanup-tree-dump dom1 } } */ + +int sum0, sum1, sum2, sum3; +int foo (char *s, char **ret) +{ + int state=0; + char c; + + for (; *s state != 4; s++) +{ + c = *s; + if (c == '*') + { + s++; + break; + } + switch (state) + { + case 0: + if (c == '+') + state = 1; + else if (c != '-') + sum0+=c; + break; + case 1: + if (c == '+') + state = 2; + else if (c == '-') + state = 0; + else + sum1+=c; + break; + default: + break; + } + +} + *ret = s; +
Re: [Patch] Improving jump-thread pass for PR 54742
On Mon, Dec 1, 2014 at 10:06 PM, Jeff Law l...@redhat.com wrote: On 11/25/14 14:16, Sebastian Pop wrote: Sebastian Pop wrote: I will bootstrap and regression test this patch on x86_64-linux and powerpc64-linux. I will also run it on our internal benchmarks, coremark, and the llvm test-suite. I will also include a longer testcase that makes sure we do not regress on coremark. Done all the above. Attached is the new patch with a new testcase. I have also added verify_seme inspired by the recent patch adding verify_sese. Sebastian 0001-extend-jump-thread-for-finite-state-automata-PR-5474.patch From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001 From: Sebastian Pops@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 43 + gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 127 + gcc/tree-cfg.c |2 +- gcc/tree-cfg.h |1 + gcc/tree-ssa-threadedge.c| 215 +- gcc/tree-ssa-threadupdate.c | 201 +++- gcc/tree-ssa-threadupdate.h |1 + 9 files changed, 614 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. Has there been any tuning on these defaults. I don't have any strong opinions about what they ought to be, this is more to get any such information recorded on the lists for historical purposes. I think it's worth a note in the debug dump anytime you abort threading when you hit a limit. I'm a bit worried about compile-time impacts of the all the recursion, but I'm willing to wait and see if it turns out to be a problem in practice. Please consider restricting it to -fexpensive-optimizations (-O2+). Richard. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 8b0b7b8..c9fe212 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3. If not see #include params.h #include tree-ssa-threadedge.h #include builtins.h +#include cfg.h +#include cfganal.h /* To avoid code explosion due to jump threading, we limit the number of statements we are going to copy. This variable @@ -661,6 +663,7 @@ simplify_control_stmt_condition (edge e, rather than use a relational operator. These are simpler to handle. */ if (TREE_CODE (cond) == SSA_NAME) { + tree original_lhs = cond; cached_lhs = cond; /* Get the
Re: [Patch] Improving jump-thread pass for PR 54742
On 12/02/14 03:15, Richard Biener wrote: I'm a bit worried about compile-time impacts of the all the recursion, but I'm willing to wait and see if it turns out to be a problem in practice. Please consider restricting it to -fexpensive-optimizations (-O2+). Yea, let's go ahead and do that. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/25/14 14:16, Sebastian Pop wrote: Sebastian Pop wrote: I will bootstrap and regression test this patch on x86_64-linux and powerpc64-linux. I will also run it on our internal benchmarks, coremark, and the llvm test-suite. I will also include a longer testcase that makes sure we do not regress on coremark. Done all the above. Attached is the new patch with a new testcase. I have also added verify_seme inspired by the recent patch adding verify_sese. Sebastian 0001-extend-jump-thread-for-finite-state-automata-PR-5474.patch From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001 From: Sebastian Pops@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 43 + gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 127 + gcc/tree-cfg.c |2 +- gcc/tree-cfg.h |1 + gcc/tree-ssa-threadedge.c| 215 +- gcc/tree-ssa-threadupdate.c | 201 +++- gcc/tree-ssa-threadupdate.h |1 + 9 files changed, 614 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. Has there been any tuning on these defaults. I don't have any strong opinions about what they ought to be, this is more to get any such information recorded on the lists for historical purposes. I think it's worth a note in the debug dump anytime you abort threading when you hit a limit. I'm a bit worried about compile-time impacts of the all the recursion, but I'm willing to wait and see if it turns out to be a problem in practice. diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index 8b0b7b8..c9fe212 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3. If not see #include params.h #include tree-ssa-threadedge.h #include builtins.h +#include cfg.h +#include cfganal.h /* To avoid code explosion due to jump threading, we limit the number of statements we are going to copy. This variable @@ -661,6 +663,7 @@ simplify_control_stmt_condition (edge e, rather than use a relational operator. These are simpler to handle. */ if (TREE_CODE (cond) == SSA_NAME) { + tree original_lhs = cond; cached_lhs = cond; /* Get the variable's current value from the equivalence chains. @@ -689,6 +692,12 @@ simplify_control_stmt_condition (edge e, pass specific callback to try and simplify it further. */ if (cached_lhs ! is_gimple_min_invariant
Re: [Patch] Improving jump-thread pass for PR 54742
On Mon, Nov 24, 2014 at 11:05 PM, Sebastian Pop seb...@gmail.com wrote: Jeff Law wrote: On 11/23/14 15:22, Sebastian Pop wrote: The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) Yea, that was one of the things I was going to poke at as well as a quick scan of your patch gave me the impression it wasn't limited to loops. Again, I haven't looked much at the patch, but I got the impression you're doing a backwards walk through the predecessors to discover the result of the COND_EXPR. Correct? Yes. That's something I'd been wanting to do -- basically start with a COND_EXPR, then walk the dataflow backwards substituting values into the COND_EXPR (possibly creating non-gimple). Ultimately the goal is to substitute and fold, getting to a constant :-) The forward exhaustive stuff we do now is, crazy. The backwards approach could be decoupled from DOM VRP into an independent pass, which I think would be wise. Using a SEME region copier is also something I really wanted to do long term. In fact, I believe a lot of tree-ssa-threadupdate.c ought to be ripped out and replaced with a SEME based copier. I did an experiment around these lines over the week-end, and now that you mention it, I feel less shy to speak about; well the patch does not yet pass bootstrap, and there still are about 20 failing test-cases. I feel better reading the code generation part of jump-threading after this patch ;-) Basically I think all the tree-ssa-threadupdate.c can be replaced by duplicate_seme_region that generalizes the code generation. Btw I once thought about doing on-the-fly lattice use/update and folding during basic-block copying (or even re-generating expressions via simplifying gimple_build ()). Or have a substitute-and-fold like facility that can run on SEME regions and do this. Richard. It appears you've built at least parts of two pieces needed to all this as a Bodik style optimizer. Which is exactly the long term direction I think this code ought to take. One of the reasons I think we see more branches is that in sese region copying we do not use the knowledge of the value of the condition for the last branch in a jump-thread path: we rely on other propagation passes to remove the branch. The last attached patch adds: /* Remove the last branch in the jump thread path. */ remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest); That's certainly a possibility. But I would expect that even with this limitation something would be picking up the fact that the branch is statically computable (even if it's an RTL optimizer). But it's definitely something to look for. Please let me know if the attached patches are producing better results on gcc. For the trunk: instructions:1339016494968 branches :243568982489 First version of your patch: instructions:1339739533291 branches: 243806615986 Latest version of your patch: instructions:1339749122609 branches: 243809838262 I think I got about the same results. I got my scripts installed on the gcc-farm. I first used an x86_64 gcc75 and valgrind was crashing not recognizing how to decode an instruction. Then I moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because there is a bit of noise in all these numbers) $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii all 4 patches: ==153617== I refs: 13,914,038,211 ==153617== ==153617== Branches: 1,926,407,760 (1,879,827,481 cond + 46,580,279 ind) ==153617== Mispredicts: 144,890,904 ( 132,094,105 cond + 12,796,799 ind) ==153617== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==34993== I refs: 13,915,335,629 ==34993== ==34993== Branches: 1,926,597,919 (1,880,017,558 cond + 46,580,361 ind) ==34993== Mispredicts: 144,974,266 ( 132,177,440 cond + 12,796,826 ind) ==34993== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==140841== I refs: 13,915,334,459 ==140841== ==140841== Branches: 1,926,597,819 (1,880,017,458 cond + 46,580,361 ind) ==140841== Mispredicts: 144,974,296 ( 132,177,470 cond + 12,796,826 ind) ==140841== Mispred rate: 7.5% ( 7.0% + 27.4% ) patch 1: ==99902== I refs: 13,915,069,710 ==99902== ==99902== Branches: 1,926,963,813 (1,880,376,148 cond + 46,587,665 ind) ==99902== Mispredicts: 145,501,564 ( 132,656,576 cond + 12,844,988 ind) ==99902== Mispred rate: 7.5% ( 7.0% + 27.5% ) ==3907== I refs: 13,915,082,469 ==3907== ==3907== Branches:
Re: [Patch] Improving jump-thread pass for PR 54742
On 2014.11.24 at 22:05 +, Sebastian Pop wrote: I got my scripts installed on the gcc-farm. I first used an x86_64 gcc75 and valgrind was crashing not recognizing how to decode an instruction. Then I moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because there is a bit of noise in all these numbers) $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii BTW perf is also available on gcc112: trippels@gcc2-power8 ~ % perf list List of pre-defined events (to be used in -e): cpu-cycles OR cycles [Hardware event] instructions [Hardware event] cache-references [Hardware event] cache-misses [Hardware event] branch-instructions OR branches[Hardware event] branch-misses [Hardware event] stalled-cycles-frontend OR idle-cycles-frontend[Hardware event] stalled-cycles-backend OR idle-cycles-backend [Hardware event] cpu-clock [Software event] task-clock [Software event] page-faults OR faults [Software event] context-switches OR cs [Software event] cpu-migrations OR migrations [Software event] minor-faults [Software event] major-faults [Software event] alignment-faults [Software event] emulation-faults [Software event] dummy [Software event] L1-dcache-loads[Hardware cache event] L1-dcache-load-misses [Hardware cache event] L1-dcache-store-misses [Hardware cache event] L1-dcache-prefetches [Hardware cache event] L1-icache-loads[Hardware cache event] L1-icache-load-misses [Hardware cache event] L1-icache-prefetches [Hardware cache event] LLC-loads [Hardware cache event] LLC-load-misses[Hardware cache event] LLC-stores [Hardware cache event] LLC-store-misses [Hardware cache event] LLC-prefetches [Hardware cache event] dTLB-load-misses [Hardware cache event] iTLB-load-misses [Hardware cache event] branch-loads [Hardware cache event] branch-load-misses [Hardware cache event] rNNN [Raw hardware event descriptor] cpu/t1=v1[,t2=v2,t3 ...]/modifier [Raw hardware event descriptor] (see 'man perf-list' on how to encode it) mem:addr[:access][Hardware breakpoint] -- Markus
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/24/14 21:55, Jeff Law wrote: On 11/24/14 18:09, Sebastian Pop wrote: Sebastian Pop wrote: I removed the return -1 and started a bootstrap on powerpc64-linux. Bootstrap passed on top of the 4 previous patches on powerpc64-linux. I will report the valgrind output. The output from valgrind looks closer to the output of master with no other patches: still 1M more instructions executed, and 300K more branches Just ran my suite where we get ~25k more branches, which definitely puts us in the noise. (that's with all 4 patches + fixing the return value ). I'm going to look at little closer at this stuff tomorrow, but I think we've resolved the performance issue. I'll dig deeper into the implementation tomorrow as well. I was running without your followup patches (must have used the wrong bits from my git stash), so those results are bogus, but in a good way. After fixing that goof, I'm seeing consistent improvements with your set of 4 patches and the fix for the wrong return code. Across the suite, ~140M fewer branches, not huge, but definitely not in the noise. So, time to dig into the implementation :-) Jeff ps. In case you're curious about the noise, it's primarily address hashing.
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: On 11/24/14 21:55, Jeff Law wrote: On 11/24/14 18:09, Sebastian Pop wrote: Sebastian Pop wrote: I removed the return -1 and started a bootstrap on powerpc64-linux. Bootstrap passed on top of the 4 previous patches on powerpc64-linux. I will report the valgrind output. The output from valgrind looks closer to the output of master with no other patches: still 1M more instructions executed, and 300K more branches Just ran my suite where we get ~25k more branches, which definitely puts us in the noise. (that's with all 4 patches + fixing the return value ). I'm going to look at little closer at this stuff tomorrow, but I think we've resolved the performance issue. I'll dig deeper into the implementation tomorrow as well. I was running without your followup patches (must have used the wrong bits from my git stash), so those results are bogus, but in a good way. After fixing that goof, I'm seeing consistent improvements with your set of 4 patches and the fix for the wrong return code. Across the suite, ~140M fewer branches, not huge, but definitely not in the noise. Thanks for your testing. So, time to dig into the implementation :-) To ease the review, I squashed all the patches in a single one. I will bootstrap and regression test this patch on x86_64-linux and powerpc64-linux. I will also run it on our internal benchmarks, coremark, and the llvm test-suite. I will also include a longer testcase that makes sure we do not regress on coremark. Sebastian From db0f6817768920b497225484fab24a20e5ddf556 Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 38 gcc/tree-cfg.c | 2 +- gcc/tree-cfg.h | 1 + gcc/tree-ssa-threadedge.c| 215 ++- gcc/tree-ssa-threadupdate.c | 198 - gcc/tree-ssa-threadupdate.h | 1 + 8 files changed, 479 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. + @end table @end table diff --git a/gcc/params.def b/gcc/params.def index 9b21c07..edf3f53 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE, Maximum number of statements to be included into a single static constructor generated by Pointer Bounds Checker, 5000, 100, 0) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS, + max-fsm-thread-path-insns, + Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path, + 100, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH, + max-fsm-thread-length, + Maximum number of basic blocks on a finite state automaton jump thread path, + 10, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS, + max-fsm-thread-paths, + Maximum number of new
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: I will bootstrap and regression test this patch on x86_64-linux and powerpc64-linux. I will also run it on our internal benchmarks, coremark, and the llvm test-suite. I will also include a longer testcase that makes sure we do not regress on coremark. Done all the above. Attached is the new patch with a new testcase. I have also added verify_seme inspired by the recent patch adding verify_sese. Sebastian From ca222d5222fb976c7aa258d3e3c04e593f42f7a2 Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * tree-cfg.c (split_edge_bb_loc): Export. * tree-cfg.h (split_edge_bb_loc): Declared extern. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (verify_seme): New. (duplicate_seme_region): New. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c: New. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 43 + gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c | 127 + gcc/tree-cfg.c |2 +- gcc/tree-cfg.h |1 + gcc/tree-ssa-threadedge.c| 215 +- gcc/tree-ssa-threadupdate.c | 201 +++- gcc/tree-ssa-threadupdate.h |1 + 9 files changed, 614 insertions(+), 3 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-7.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. + @end table @end table diff --git a/gcc/params.def b/gcc/params.def index 9b21c07..edf3f53 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1140,6 +1140,21 @@ DEFPARAM (PARAM_CHKP_MAX_CTOR_SIZE, Maximum number of statements to be included into a single static constructor generated by Pointer Bounds Checker, 5000, 100, 0) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATH_INSNS, + max-fsm-thread-path-insns, + Maximum number of instructions to copy when duplicating blocks on a finite state automaton jump thread path, + 100, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_LENGTH, + max-fsm-thread-length, + Maximum number of basic blocks on a finite state automaton jump thread path, + 10, 1, 99) + +DEFPARAM (PARAM_MAX_FSM_THREAD_PATHS, + max-fsm-thread-paths, + Maximum number of new jump thread paths to create for a finite state automaton, + 50, 1, 99) /* Local variables: diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c new file mode 100644 index 000..bb34a74 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c @@ -0,0 +1,43 @@ +/* { dg-do compile } */ +/* { dg-options -O2 -fdump-tree-dom1-details } */ +/* { dg-final { scan-tree-dump-times FSM 6 dom1 } } */ +/* { dg-final { cleanup-tree-dump dom1 } } */ + +int sum0, sum1, sum2, sum3; +int foo (char *s, char **ret) +{ + int state=0; + char c; + + for (; *s state != 4; s++) +{ + c = *s; + if (c == '*') + { + s++; + break; + } + switch (state) + { + case 0: + if (c == '+') + state = 1; + else if (c != '-') + sum0+=c; + break; + case 1: + if (c == '+') +
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/23/14 15:22, Sebastian Pop wrote: The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) Yea, that was one of the things I was going to poke at as well as a quick scan of your patch gave me the impression it wasn't limited to loops. Again, I haven't looked much at the patch, but I got the impression you're doing a backwards walk through the predecessors to discover the result of the COND_EXPR. Correct? That's something I'd been wanting to do -- basically start with a COND_EXPR, then walk the dataflow backwards substituting values into the COND_EXPR (possibly creating non-gimple). Ultimately the goal is to substitute and fold, getting to a constant :-) The forward exhaustive stuff we do now is, crazy. The backwards approach could be decoupled from DOM VRP into an independent pass, which I think would be wise. Using a SEME region copier is also something I really wanted to do long term. In fact, I believe a lot of tree-ssa-threadupdate.c ought to be ripped out and replaced with a SEME based copier. It appears you've built at least parts of two pieces needed to all this as a Bodik style optimizer. Which is exactly the long term direction I think this code ought to take. One of the reasons I think we see more branches is that in sese region copying we do not use the knowledge of the value of the condition for the last branch in a jump-thread path: we rely on other propagation passes to remove the branch. The last attached patch adds: /* Remove the last branch in the jump thread path. */ remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest); That's certainly a possibility. But I would expect that even with this limitation something would be picking up the fact that the branch is statically computable (even if it's an RTL optimizer). But it's definitely something to look for. Please let me know if the attached patches are producing better results on gcc. For the trunk: instructions:1339016494968 branches :243568982489 First version of your patch: instructions:1339739533291 branches: 243806615986 Latest version of your patch: instructions:1339749122609 branches: 243809838262 Which is in the noise for this test. Which makes me wonder if I botched something on the latest run. It doesn't appear so, but I'm re-running just to be sure. I'm also turning on -g so that I can use cg_annotate to poke a bit deeper and perhaps identify one or more concrete examples where your patch is making this worse. Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: On 11/23/14 15:22, Sebastian Pop wrote: The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) Yea, that was one of the things I was going to poke at as well as a quick scan of your patch gave me the impression it wasn't limited to loops. Again, I haven't looked much at the patch, but I got the impression you're doing a backwards walk through the predecessors to discover the result of the COND_EXPR. Correct? Yes. That's something I'd been wanting to do -- basically start with a COND_EXPR, then walk the dataflow backwards substituting values into the COND_EXPR (possibly creating non-gimple). Ultimately the goal is to substitute and fold, getting to a constant :-) The forward exhaustive stuff we do now is, crazy. The backwards approach could be decoupled from DOM VRP into an independent pass, which I think would be wise. Using a SEME region copier is also something I really wanted to do long term. In fact, I believe a lot of tree-ssa-threadupdate.c ought to be ripped out and replaced with a SEME based copier. I did an experiment around these lines over the week-end, and now that you mention it, I feel less shy to speak about; well the patch does not yet pass bootstrap, and there still are about 20 failing test-cases. I feel better reading the code generation part of jump-threading after this patch ;-) Basically I think all the tree-ssa-threadupdate.c can be replaced by duplicate_seme_region that generalizes the code generation. It appears you've built at least parts of two pieces needed to all this as a Bodik style optimizer. Which is exactly the long term direction I think this code ought to take. One of the reasons I think we see more branches is that in sese region copying we do not use the knowledge of the value of the condition for the last branch in a jump-thread path: we rely on other propagation passes to remove the branch. The last attached patch adds: /* Remove the last branch in the jump thread path. */ remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest); That's certainly a possibility. But I would expect that even with this limitation something would be picking up the fact that the branch is statically computable (even if it's an RTL optimizer). But it's definitely something to look for. Please let me know if the attached patches are producing better results on gcc. For the trunk: instructions:1339016494968 branches :243568982489 First version of your patch: instructions:1339739533291 branches: 243806615986 Latest version of your patch: instructions:1339749122609 branches: 243809838262 I think I got about the same results. I got my scripts installed on the gcc-farm. I first used an x86_64 gcc75 and valgrind was crashing not recognizing how to decode an instruction. Then I moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because there is a bit of noise in all these numbers) $ valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes ./cc1plus -O2 ~/alias.ii all 4 patches: ==153617== I refs: 13,914,038,211 ==153617== ==153617== Branches: 1,926,407,760 (1,879,827,481 cond + 46,580,279 ind) ==153617== Mispredicts: 144,890,904 ( 132,094,105 cond + 12,796,799 ind) ==153617== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==34993== I refs: 13,915,335,629 ==34993== ==34993== Branches: 1,926,597,919 (1,880,017,558 cond + 46,580,361 ind) ==34993== Mispredicts: 144,974,266 ( 132,177,440 cond + 12,796,826 ind) ==34993== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==140841== I refs: 13,915,334,459 ==140841== ==140841== Branches: 1,926,597,819 (1,880,017,458 cond + 46,580,361 ind) ==140841== Mispredicts: 144,974,296 ( 132,177,470 cond + 12,796,826 ind) ==140841== Mispred rate: 7.5% ( 7.0% + 27.4% ) patch 1: ==99902== I refs: 13,915,069,710 ==99902== ==99902== Branches: 1,926,963,813 (1,880,376,148 cond + 46,587,665 ind) ==99902== Mispredicts: 145,501,564 ( 132,656,576 cond + 12,844,988 ind) ==99902== Mispred rate: 7.5% ( 7.0% + 27.5% ) ==3907== I refs: 13,915,082,469 ==3907== ==3907== Branches: 1,926,965,218 (1,880,377,471 cond + 46,587,747 ind) ==3907== Mispredicts: 145,501,569 ( 132,656,554 cond + 12,845,015 ind) ==3907== Mispred rate: 7.5% ( 7.0% + 27.5% ) ==44271== I refs: 13,915,111,997 ==44271== ==44271== Branches: 1,926,968,863 (1,880,380,952 cond + 46,587,911 ind) ==44271== Mispredicts: 145,501,858
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: Using a SEME region copier is also something I really wanted to do long term. In fact, I believe a lot of tree-ssa-threadupdate.c ought to be ripped out and replaced with a SEME based copier. I did an experiment around these lines over the week-end, and now that you mention it, I feel less shy to speak about; well the patch does not yet pass bootstrap, and there still are about 20 failing test-cases. I feel better reading the code generation part of jump-threading after this patch ;-) Basically I think all the tree-ssa-threadupdate.c can be replaced by duplicate_seme_region that generalizes the code generation. For reference, here is the patch I am speaking about. Sebastian From c7213811e2ec2443df9ffc3ca72b3b15a6c9aaf9 Mon Sep 17 00:00:00 2001 From: Sebastian Pop seb...@gmail.com Date: Fri, 21 Nov 2014 13:17:12 -0600 Subject: [PATCH 2/3] use duplicate_seme to generate code for jump threading --- gcc/tree-cfg.c | 142 +++ gcc/tree-cfg.h |2 + gcc/tree-ssa-threadedge.c | 61 +- gcc/tree-ssa-threadupdate.c | 2487 +-- gcc/tree-ssa-threadupdate.h | 23 +- 5 files changed, 222 insertions(+), 2493 deletions(-) diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index 6d96c52..d6dc442 100644 --- a/gcc/tree-cfg.c +++ b/gcc/tree-cfg.c @@ -6120,6 +6120,148 @@ gimple_duplicate_sese_region (edge entry, edge exit, return true; } +/* Duplicates a REGION (set of N_REGION basic blocks). The edge ENTRY is + redirected to the duplicate of the region. Dominance and loop information is + updated if UPDATE_DOMINANCE is true, but not the SSA web. If + UPDATE_DOMINANCE is false then we assume that the caller will update the + dominance information after calling this function. The new basic blocks are + stored to REGION_COPY in the same order as they had in REGION, provided that + REGION_COPY is not NULL. + + Returns false if it is unable to copy the region, true otherwise. */ + +bool +gimple_duplicate_seme_region (edge entry, + basic_block *region, unsigned n_region, + basic_block *region_copy, + bool update_dominance) +{ + unsigned i; + bool free_region_copy = false, copying_header = false; + struct loop *loop = entry-dest-loop_father; + vecbasic_block doms; + edge redirected; + int memo_loop_header_no = 0, memo_loop_latch_no = 0; + int total_freq = 0, entry_freq = 0; + gcov_type total_count = 0, entry_count = 0; + + if (!can_copy_bbs_p (region, n_region)) +return false; + + /* Some sanity checking. Note that we do not check for all possible + missuses of the functions. I.e. if you ask to copy something weird, + it will work, but the state of structures probably will not be + correct. */ + for (i = 0; i n_region; i++) +{ + /* We do not handle subloops, i.e. all the blocks must belong to the + same loop. */ + if (region[i]-loop_father != loop) + return false; + + /* If we are copying a region that starts and ends in an arbirary place in + the loop: keep track of which block will become our loop header. */ + if (region[i] != entry-dest region[i] == loop-header) + memo_loop_header_no = i; + + /* And which block will become our loop latch. */ + if (region[i] != entry-src region[i] == loop-latch) + memo_loop_latch_no = i; +} + + initialize_original_copy_tables (); + + if (copying_header) +set_loop_copy (loop, loop_outer (loop)); + else +set_loop_copy (loop, loop); + + if (!region_copy) +{ + region_copy = XNEWVEC (basic_block, n_region); + free_region_copy = true; +} + + /* Record blocks outside the region that are dominated by something + inside. */ + if (update_dominance) +{ + doms.create (0); + doms = get_dominated_by_region (CDI_DOMINATORS, region, n_region); +} + + if (entry-dest-count) +{ + total_count = entry-dest-count; + entry_count = entry-count; + /* Fix up corner cases, to avoid division by zero or creation of negative + frequencies. */ + if (entry_count total_count) + entry_count = total_count; +} + else +{ + total_freq = entry-dest-frequency; + entry_freq = EDGE_FREQUENCY (entry); + /* Fix up corner cases, to avoid division by zero or creation of negative + frequencies. */ + if (total_freq == 0) + total_freq = 1; + else if (entry_freq total_freq) + entry_freq = total_freq; +} + + copy_bbs (region, n_region, region_copy, NULL, 0, NULL, loop, + split_edge_bb_loc (entry), update_dominance); + if (total_count) +{ + scale_bbs_frequencies_gcov_type (region, n_region, + total_count - entry_count, + total_count); + scale_bbs_frequencies_gcov_type (region_copy, n_region, entry_count, + total_count); +} + else +{ + scale_bbs_frequencies_int (region, n_region, total_freq - entry_freq, + total_freq); +
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/24/14 15:05, Sebastian Pop wrote: I did an experiment around these lines over the week-end, and now that you mention it, I feel less shy to speak about; well the patch does not yet pass bootstrap, and there still are about 20 failing test-cases. I feel better reading the code generation part of jump-threading after this patch ;-) Basically I think all the tree-ssa-threadupdate.c can be replaced by duplicate_seme_region that generalizes the code generation. Clearly next stage1 stuff, but definitely the right direction IMHO. If you get the chance look at Bodik's thesis.As far as I know he's the only person to really look at how to structure context sensitive optimizations in a sane way. I got my scripts installed on the gcc-farm. I first used an x86_64 gcc75 and valgrind was crashing not recognizing how to decode an instruction. Then I moved to gcc112 a powerpc64-linux where I got this data from stage2 cc1plus compiling the same file alias.ii at -O2: (I got 3 runs of each mostly because there is a bit of noise in all these numbers) Yea, glibc valgrind really need to update in lock-step as glibc gains support for new ISAs. Certain instructions are supposed to be interpreted as nops, but valgrind instead raises an illegal instruction. There's a bit of noise when using valgrind like this, but it has definitely proven useful in the past. I'm looking at bitmap_ior_and_compl right now. Not sure if cg_annotate is sending me on a wild goose chase yet or not. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/23/14 15:22, Sebastian Pop wrote: Jeff Law wrote: PS: I have run some perf analysis with the patch: - on a bootstrap of GCC I see 3209 FSM jump threads - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1% (measured on simulators and reduced data sets) - coremark gets jump threaded (as expected) - I'm setting up the llvm test-suite and I will report perf differences So that's *far* more jump threads than I would expect this to find in a bootstrap of GCC -- like 3 orders of magnitude more than I'd expect to find. The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) [ ... ] So why are we returning -1 (block should not be duplicated and not suitable for a joiner) at the end of thread_through_normal_block? /* When COND cannot be simplified, try to find paths from a control statement back through the PHI nodes which would affect that control statement. */ vecbasic_block, va_gc *bb_path; vec_alloc (bb_path, n_basic_blocks_for_fn (cfun)); vec_safe_push (bb_path, e-dest); hash_setgimple *visited_phis = new hash_setgimple; max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS); fsm_find_control_statement_thread_paths (cond, visited_phis, bb_path); delete visited_phis; vec_free (bb_path); return -1; Returning -1 (instead of 0) says stop, there's no possibility to threading something on that path. I think that's suppressing some profitable jump threads. I haven't done more than verify the bitmap code returns to its prior state with that change. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: On 11/23/14 15:22, Sebastian Pop wrote: Jeff Law wrote: PS: I have run some perf analysis with the patch: - on a bootstrap of GCC I see 3209 FSM jump threads - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1% (measured on simulators and reduced data sets) - coremark gets jump threaded (as expected) - I'm setting up the llvm test-suite and I will report perf differences So that's *far* more jump threads than I would expect this to find in a bootstrap of GCC -- like 3 orders of magnitude more than I'd expect to find. The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) [ ... ] So why are we returning -1 (block should not be duplicated and not suitable for a joiner) at the end of thread_through_normal_block? /* When COND cannot be simplified, try to find paths from a control statement back through the PHI nodes which would affect that control statement. */ vecbasic_block, va_gc *bb_path; vec_alloc (bb_path, n_basic_blocks_for_fn (cfun)); vec_safe_push (bb_path, e-dest); hash_setgimple *visited_phis = new hash_setgimple; max_threaded_paths = PARAM_VALUE (PARAM_MAX_FSM_THREAD_PATHS); fsm_find_control_statement_thread_paths (cond, visited_phis, bb_path); delete visited_phis; vec_free (bb_path); return -1; Returning -1 (instead of 0) says stop, there's no possibility to threading something on that path. I think that's suppressing some profitable jump threads. Thanks for spotting this. I haven't done more than verify the bitmap code returns to its prior state with that change. I removed the return -1 and started a bootstrap on powerpc64-linux. I will report the valgrind output. Thanks, Sebastian
Re: [Patch] Improving jump-thread pass for PR 54742
Sebastian Pop wrote: I removed the return -1 and started a bootstrap on powerpc64-linux. Bootstrap passed on top of the 4 previous patches on powerpc64-linux. I will report the valgrind output. The output from valgrind looks closer to the output of master with no other patches: still 1M more instructions executed, and 300K more branches master no-patch: ==129233== I refs: 13,910,221,913 ==129233== ==129233== Branches: 1,925,715,095 (1,879,277,776 cond + 46,437,319 ind) ==129233== Mispredicts: 144,133,332 ( 131,510,534 cond + 12,622,798 ind) ==129233== Mispred rate: 7.4% ( 6.9% + 27.1% ) 4 previous patches + patch to return 0: ==149012== I refs: 13,911,870,743 ==149012== ==149012== Branches: 1,926,092,629 (1,879,657,768 cond + 46,434,861 ind) ==149012== Mispredicts: 145,551,513 ( 132,827,091 cond + 12,724,422 ind) ==149012== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==4492== I refs: 13,911,899,691 ==4492== ==4492== Branches: 1,926,096,214 (1,879,661,186 cond + 46,435,028 ind) ==4492== Mispredicts: 145,551,707 ( 132,827,231 cond + 12,724,476 ind) ==4492== Mispred rate: 7.5% ( 7.0% + 27.4% ) ==19521== I refs: 13,911,855,711 ==19521== ==19521== Branches: 1,926,090,982 (1,879,656,202 cond + 46,434,780 ind) ==19521== Mispredicts: 145,551,343 ( 132,826,948 cond + 12,724,395 ind) ==19521== Mispred rate: 7.5% ( 7.0% + 27.4% )
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/24/14 18:09, Sebastian Pop wrote: Sebastian Pop wrote: I removed the return -1 and started a bootstrap on powerpc64-linux. Bootstrap passed on top of the 4 previous patches on powerpc64-linux. I will report the valgrind output. The output from valgrind looks closer to the output of master with no other patches: still 1M more instructions executed, and 300K more branches Just ran my suite where we get ~25k more branches, which definitely puts us in the noise. (that's with all 4 patches + fixing the return value ). I'm going to look at little closer at this stuff tomorrow, but I think we've resolved the performance issue. I'll dig deeper into the implementation tomorrow as well. Cheers, Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
Jeff Law wrote: PS: I have run some perf analysis with the patch: - on a bootstrap of GCC I see 3209 FSM jump threads - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1% (measured on simulators and reduced data sets) - coremark gets jump threaded (as expected) - I'm setting up the llvm test-suite and I will report perf differences So that's *far* more jump threads than I would expect this to find in a bootstrap of GCC -- like 3 orders of magnitude more than I'd expect to find. The second patch attached limits the search for FSM jump threads to loops. With that patch, we are now down to 470 jump threads in an x86_64-linux bootstrap (and 424 jump threads on powerpc64-linux bootstrap.) I haven't dug deep, but the first level analysis is not encouraging. Basically I used the trunk compiler with and without your patch to build gcc-4.7.3's cc1 (4.7.3 simply because that's what I last used this testing framework). So that gives me two cc1s that I then use to compile a bunch of .i files under valgrind's (cachegrind) control. valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes .. That gives me two hunks of data for each input file I test. Specifically I get the dynamic number of instructions and the dynamic number of branches executed. For jump threading those values correspond directly to the effect we're looking for -- a reduction in dynamic conditional jumps and a reduction in dynamic instructions executed. Typically the change in dynamic instructions executed is 2-3X the change in dynamic conditional jumps -- which makes sense as removing the conditional jump usually means we remove a comparison and possibly some setup code as well. Consistently with your patch those values get worse. Across the entire set of .i files I get For the trunk: instructions:1339016494968 branches: 243568982489 With your patch: instructions:1339739533291 branches: 243806615986 So that's 723038323 more instructions and 237633497 more branches after installing your patch. While we're looking a just under .1% regression in dynamic branches, that's a terrible result for this work. One of the reasons I think we see more branches is that in sese region copying we do not use the knowledge of the value of the condition for the last branch in a jump-thread path: we rely on other propagation passes to remove the branch. The last attached patch adds: /* Remove the last branch in the jump thread path. */ remove_ctrl_stmt_and_useless_edges (region_copy[n_region - 1], exit-dest); Please let me know if the attached patches are producing better results on gcc. I rebased the original patch on trunk and all patches bootstrapped together on x86_64-linux and powerpc64-linux. Thanks, Sebastian From 120d5490598b1a09a06c04796b4fda46be7fd7db Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH 1/4] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * tree-cfg.c (gimple_duplicate_sese_region): Save and restore loop header and latch. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 38 + gcc/tree-cfg.c | 26 ++- gcc/tree-ssa-threadedge.c| 203 ++- gcc/tree-ssa-threadupdate.c | 52 +- gcc/tree-ssa-threadupdate.h | 1 + 7 files changed, 342 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 89edddb..074183f 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10624,6 +10624,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/18/14 15:19, Sebastian Pop wrote: The regions that we duplicate start inside a loop and stay inside the same loop, and the jump threading path is not allowed to go in deeper nested loops. The reason why we need to modify the sese duplication function is that the sese region that we need to duplicate starts at an arbitrary place inside the loop, whereas the current user of the sese duplication function tree-ssa-loop-ch.c:245 starts at the edge entering the loop and exits at the latch edge. I'll leave the rest to Jeff but it looks good to me from an overall structure. Thanks for your review. Sebastian PS: Patch passed bootstrap and regtest on x86_64-linux. PS: I have run some perf analysis with the patch: - on a bootstrap of GCC I see 3209 FSM jump threads - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1% (measured on simulators and reduced data sets) - coremark gets jump threaded (as expected) - I'm setting up the llvm test-suite and I will report perf differences So that's *far* more jump threads than I would expect this to find in a bootstrap of GCC -- like 3 orders of magnitude more than I'd expect to find. I haven't dug deep, but the first level analysis is not encouraging. Basically I used the trunk compiler with and without your patch to build gcc-4.7.3's cc1 (4.7.3 simply because that's what I last used this testing framework). So that gives me two cc1s that I then use to compile a bunch of .i files under valgrind's (cachegrind) control. valgrind --tool=cachegrind --cache-sim=no --branch-sim=yes .. That gives me two hunks of data for each input file I test. Specifically I get the dynamic number of instructions and the dynamic number of branches executed. For jump threading those values correspond directly to the effect we're looking for -- a reduction in dynamic conditional jumps and a reduction in dynamic instructions executed. Typically the change in dynamic instructions executed is 2-3X the change in dynamic conditional jumps -- which makes sense as removing the conditional jump usually means we remove a comparison and possibly some setup code as well. Consistently with your patch those values get worse. Across the entire set of .i files I get For the trunk: instructions:1339016494968 branches: 243568982489 With your patch: instructions:1339739533291 branches: 243806615986 So that's 723038323 more instructions and 237633497 more branches after installing your patch. While we're looking a just under .1% regression in dynamic branches, that's a terrible result for this work. I'm not sure if the threads you're optimizing are somehow hiding other jump threading opportunities or somehow hiding CSE-able jump conditions, mucking up a loop structure or something else but something very bad is happening here. If I put Steve's patch through the same testing I get: instructions:1339006760834 branches: 243565768224 Which you'll note is a *very* slight decrease of 3214265 dynamic branches as 9734134 total instructions executed. So I think we need to dig deeper into why the branching behaviour of GCC gets noticeably worse with your patch when it should be as good as or better than without your patch. I know when i was analyzing the last update to this code, I found cases where we're much better off taking the shorter jump threading path without a joiner rather than preferring the long path with a joiner. IIRC, the issue was that if we selected the joiner path, then the duplication would create another jump threading opportunity (the original, shorter path without a join) that wouldn't be seen until the next pass of jump threading. Jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On 10/26/14 15:34, Sebastian Pop wrote: I have tried to understand why the code generation part ICEs on coremark: the first problem that I have seen is that tree-ssa-threadupdate.c does not handle more than a joiner block per path to be threaded, so we would not be able to jump thread accross the joiners of the if condition and the joiner of the switch condition: i.e., these paths Right. There's nothing I can think of inherently that makes that impossible, it's just not something the code currently tries to support. I suspect there's a few places that need light generalization. I can offhand think of one or two. It's not lost on me that what we're building is a specialized region cloner. I keep wanting to go back and see if there's a representation where we have an incoming edge, series of blocks and an outgoing edge. The concept of a join block really isn't important. It's just a block that we want to copy for which we don't know its destination. Anyway, within that representation, I think the way to wire up the edges is simple if we build a map at the beginning of the process. The set of blocks in the path all get clones. There's a single edge into the cloned path (the incoming edge) and a single edge out (the edge corresponding to the statically computed destination of the path). Edges that were interior in the original path are kept interior in the clone. Edges that reached outside the original path go from the clone to the original destinations outside the path. It's a SEME region. Another problem is that we attach the path to be threaded to the -aux field of the first edge in the path, such that we would have to cancel some of the paths because we cannot keep track of all the paths to be threaded. What I can easily see is cases where you have two paths starting at a particular node where one path is a strict subset of another path. Assuming the amount of copying we're doing is reasonable, then you'd want to keep just the longer path But I don't think that's the case you're struggling with. We could (for example) have a case where all the successors of a join block become threadable, possibly to different destinations. That would argue that we really want to handle the threads off a different data structure than e-aux. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
On Mon, 2014-11-17 at 09:24 +, James Greenhalgh wrote: For what it is worth, I've bootstrapped and tested this patch on aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and both targets get the expected speedup in the interesting benchmark. I've also thrown some of the larger popular benchmark suites at it, and they've compiled and run without any compilation issues or miscompares. I'm happy to help out with the testing and bug-triaging effort once this patch goes in. Some very shallow comments: you should document the new parameters in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh on the patch and clean up the coding style issues it highlights. Thanks, James Greenhalgh I tested the patch on MIPS and things looked good there too. I got the desired speedup and did not see any regressions. Steve Ellcey
Re: [Patch] Improving jump-thread pass for PR 54742
On 11/18/14 12:25, Steve Ellcey wrote: On Mon, 2014-11-17 at 09:24 +, James Greenhalgh wrote: For what it is worth, I've bootstrapped and tested this patch on aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and both targets get the expected speedup in the interesting benchmark. I've also thrown some of the larger popular benchmark suites at it, and they've compiled and run without any compilation issues or miscompares. I'm happy to help out with the testing and bug-triaging effort once this patch goes in. Some very shallow comments: you should document the new parameters in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh on the patch and clean up the coding style issues it highlights. Thanks, James Greenhalgh I tested the patch on MIPS and things looked good there too. I got the desired speedup and did not see any regressions. It's on my list of things to look at. The patch clearly was in prior to stage1 close and is thus eligible for inclusion in GCC 5. Slogging through stuff as quickly as I can. jeff
Re: [Patch] Improving jump-thread pass for PR 54742
Richard Biener wrote: On Tue, Nov 11, 2014 at 2:14 AM, Sebastian Pop seb...@gmail.com wrote: Hi Jeff, I have adapted the code generation part from James' patch to current trunk, and the resulting patch gets the 30% speedup on coremark and passes bootstrap of GCC. Ok for trunk? In addition to missing documentation for the parameters + /* If we are copying an abnormally shaped region, keep track of +which block will become our loop header. */ + if ((region[i] != entry-dest region[i] == loop-header) + || (region[i] != entry-src region[i] == loop-latch)) + { + save_loop_details = true; + memo_loop_latch_no = i; + memo_loop_header_no = i; + } this looks bogus as you overwrite latch/header. Right, this should be: if (region[i] != entry-dest region[i] == loop-header) { save_loop_details = true; memo_loop_header_no = i; } if (region[i] != entry-src region[i] == loop-latch) { save_loop_details = true; memo_loop_latch_no = i; } I wonder what you tried to fix with this as abnormally shaped isn't sth we support given the check before (all blocks must belong to the same loop and thus entry is always the loop header or there is no loop)? The regions that we duplicate start inside a loop and stay inside the same loop, and the jump threading path is not allowed to go in deeper nested loops. The reason why we need to modify the sese duplication function is that the sese region that we need to duplicate starts at an arbitrary place inside the loop, whereas the current user of the sese duplication function tree-ssa-loop-ch.c:245 starts at the edge entering the loop and exits at the latch edge. I'll leave the rest to Jeff but it looks good to me from an overall structure. Thanks for your review. Sebastian PS: Patch passed bootstrap and regtest on x86_64-linux. PS: I have run some perf analysis with the patch: - on a bootstrap of GCC I see 3209 FSM jump threads - libpng and libjpeg contain FSM jump threads, the perf increase is in the 1% (measured on simulators and reduced data sets) - coremark gets jump threaded (as expected) - I'm setting up the llvm test-suite and I will report perf differences From aee8e01469c05e370b757b20c357a1c9dae57950 Mon Sep 17 00:00:00 2001 From: Sebastian Pop s@samsung.com Date: Fri, 26 Sep 2014 14:54:20 -0500 Subject: [PATCH] extend jump thread for finite state automata PR 54742 Adapted from a patch from James Greenhalgh. * params.def (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): New. * doc/invoke.texi (max-fsm-thread-path-insns, max-fsm-thread-length, max-fsm-thread-paths): Documented. * testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c: New. * tree-cfg.c (gimple_duplicate_sese_region): Save and restore loop header and latch. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Restore the original value of cond when simplification fails. (fsm_find_thread_path): New. (fsm_find_control_statement_thread_paths): New. (fsm_thread_through_normal_block): Call find_control_statement_thread_paths. * tree-ssa-threadupdate.c (dump_jump_thread_path): Pretty print EDGE_START_FSM_THREAD. (thread_through_all_blocks): Generate code for EDGE_START_FSM_THREAD edges calling gimple_duplicate_sese_region. * tree-ssa-threadupdate.h (jump_thread_edge_type): Add EDGE_START_FSM_THREAD. --- gcc/doc/invoke.texi | 12 ++ gcc/params.def | 15 ++ gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c | 38 + gcc/tree-cfg.c | 26 ++- gcc/tree-ssa-threadedge.c| 201 ++- gcc/tree-ssa-threadupdate.c | 52 +- gcc/tree-ssa-threadupdate.h | 1 + 7 files changed, 340 insertions(+), 5 deletions(-) create mode 100644 gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 13270bc..613edbb 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/doc/invoke.texi @@ -10586,6 +10586,18 @@ large and significantly increase compile time at optimization level @option{-O1} and higher. This parameter is a maximum nubmer of statements in a single generated constructor. Default value is 5000. +@item max-fsm-thread-path-insns +Maximum number of instructions to copy when duplicating blocks on a +finite state automaton jump thread path. The default is 100. + +@item max-fsm-thread-length +Maximum number of basic blocks on a finite state automaton jump thread +path. The default is 10. + +@item max-fsm-thread-paths +Maximum number of new jump thread paths to create for a finite state +automaton. The default is 50. + @end table @end table diff --git a/gcc/params.def b/gcc/params.def index d2d2add..55c287e 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -1125,6
Re: [Patch] Improving jump-thread pass for PR 54742
On Tue, Nov 11, 2014 at 01:14:04AM +, Sebastian Pop wrote: Hi Jeff, I have adapted the code generation part from James' patch to current trunk, and the resulting patch gets the 30% speedup on coremark and passes bootstrap of GCC. For what it is worth, I've bootstrapped and tested this patch on aarch64-none-linux-gnu and arm-none-linux-gnueabi with no issues, and both targets get the expected speedup in the interesting benchmark. I've also thrown some of the larger popular benchmark suites at it, and they've compiled and run without any compilation issues or miscompares. I'm happy to help out with the testing and bug-triaging effort once this patch goes in. Some very shallow comments: you should document the new parameters in doc/invoke.texi and you ought to run contrib/check_GNU_style.sh on the patch and clean up the coding style issues it highlights. Thanks, James Greenhalgh diff --git a/gcc/params.def b/gcc/params.def index d2d2add..749f962 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -123,6 +123,25 @@ DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY, Maximum probability of the entry BB of split region (in percent relative to entry BB of the function) to make partial inlining happen, 70, 0, 0) +/* Maximum number of instructions to copy when duplicating blocks + on a jump thread path. */ +DEFPARAM (PARAM_MAX_THREAD_PATH_INSNS, + max-thread-path-insns, + Maximum number of instructions to copy when duplicating blocks on a jump thread path, + 100, 1, 99) + +/* Maximum length of a jump thread path. */ +DEFPARAM (PARAM_MAX_THREAD_LENGTH, + max-thread-length, + Maximum number of basic blocks on a jump thread path, + 10, 1, 99) + +/* Maximum number of jump thread paths to duplicate. */ +DEFPARAM (PARAM_MAX_THREAD_PATHS, + max-thread-paths, + Maximum number of new jump thread paths to create, + 50, 1, 99) + /* Limit the number of expansions created by the variable expansion optimization to avoid register pressure. */ DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS, diff --git a/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c new file mode 100644 index 000..f3ef725 --- /dev/null +++ b/gcc/testsuite/gcc.dg/tree-ssa/ssa-dom-thread-6.c @@ -0,0 +1,32 @@ +int sum0, sum1, sum2, sum3; +int foo(char * s, char** ret) +{ + int state=0; + char c; + + for (; *s state != 4; s++) +{ + c = *s; + if (c == '*') + { + s++; + break; + } + switch (state) { + case 0: + if (c == '+') state = 1; + else if (c != '-') sum0+=c; + break; + case 1: + if (c == '+') state = 2; + else if (c == '-') state = 0; + else sum1+=c; + break; + default: + break; + } + +} + *ret = s; + return state; +} diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index ee10bc6..565cfe3 100644 --- a/gcc/tree-cfg.c +++ b/gcc/tree-cfg.c @@ -5949,10 +5949,12 @@ gimple_duplicate_sese_region (edge entry, edge exit, { unsigned i; bool free_region_copy = false, copying_header = false; + bool save_loop_details = false; struct loop *loop = entry-dest-loop_father; edge exit_copy; vecbasic_block doms; edge redirected; + int memo_loop_header_no = 0, memo_loop_latch_no = 0; int total_freq = 0, entry_freq = 0; gcov_type total_count = 0, entry_count = 0; @@ -5970,9 +5972,15 @@ gimple_duplicate_sese_region (edge entry, edge exit, if (region[i]-loop_father != loop) return false; - if (region[i] != entry-dest -region[i] == loop-header) - return false; + /* If we are copying an abnormally shaped region, keep track of + which block will become our loop header. */ + if ((region[i] != entry-dest region[i] == loop-header) + || (region[i] != entry-src region[i] == loop-latch)) + { + save_loop_details = true; + memo_loop_latch_no = i; + memo_loop_header_no = i; + } } /* In case the function is used for loop header copying (which is the primary @@ -6055,6 +6063,13 @@ gimple_duplicate_sese_region (edge entry, edge exit, loop-latch = exit-src; } + /* Restore loop details if we were asked to save them. */ + if (save_loop_details) +{ + loop-header = region_copy[memo_loop_header_no]; + loop-latch = region_copy[memo_loop_latch_no]; +} + /* Redirect the entry and add the phi node arguments. */ redirected = redirect_edge_and_branch (entry, get_bb_copy (entry-dest)); gcc_assert (redirected != NULL); diff --git a/gcc/tree-ssa-threadedge.c b/gcc/tree-ssa-threadedge.c index d5b9696..de9b3fe 100644 --- a/gcc/tree-ssa-threadedge.c +++ b/gcc/tree-ssa-threadedge.c @@ -56,6 +56,8 @@ along with GCC; see the file COPYING3. If not see
Re: [Patch] Improving jump-thread pass for PR 54742
On Tue, Nov 11, 2014 at 2:14 AM, Sebastian Pop seb...@gmail.com wrote: Hi Jeff, I have adapted the code generation part from James' patch to current trunk, and the resulting patch gets the 30% speedup on coremark and passes bootstrap of GCC. Ok for trunk? In addition to missing documentation for the parameters + /* If we are copying an abnormally shaped region, keep track of +which block will become our loop header. */ + if ((region[i] != entry-dest region[i] == loop-header) + || (region[i] != entry-src region[i] == loop-latch)) + { + save_loop_details = true; + memo_loop_latch_no = i; + memo_loop_header_no = i; + } this looks bogus as you overwrite latch/header. I wonder what you tried to fix with this as abnormally shaped isn't sth we support given the check before (all blocks must belong to the same loop and thus entry is always the loop header or there is no loop)? I'll leave the rest to Jeff but it looks good to me from an overall structure. Thanks, Richard. Thanks, Sebastian Sebastian Pop wrote: Sebastian Pop wrote: Jeff Law wrote: On 08/21/14 04:30, Richard Biener wrote: It turns Jeff's jump-threading code in to a strange franken-pass of bits and pieces of detection and optimisation, and would need some substantial reworking to fit in with Jeff's changes last Autumn, but if it is more likely to be acceptable for trunk then perhaps we could look to revive it. It would be nice to reuse the path copy code Jeff added last year, but I don't have much intuition as to how feasible that is. Was this the sort of thing that you were imagining? Yeah, didn't look too closely though. It'd be pretty ugly I suspect. But it's probably worth pondering since that approach would eliminate the concerns about the cost of detection (which is problematical for the jump threader) by using Steve's code for that. On the update side, I suspect most, if not all of the framework is in place to handle this kind of update if the right threading paths were passed to the updater. I can probably cobble together that by-hand and see what the tree-ssa-threadupdate does with it. But it'll be a week or so before I could look at it. I adapted the patch James has sent last year to use the new update paths Attached an updated version of the patch. mechanism. I verified that the attached patch does register all the paths that need to be threaded. Thread updater seems to have some problems handling the attached testcase (a simplified version of the testcase attached to the bug.) Jeff, could you please have a look at why the jump-thread updater is crashing? I have tried to understand why the code generation part ICEs on coremark: the first problem that I have seen is that tree-ssa-threadupdate.c does not handle more than a joiner block per path to be threaded, so we would not be able to jump thread accross the joiners of the if condition and the joiner of the switch condition: i.e., these paths patch: Registering jump thread: (7, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 14) nocopy; patch: Registering jump thread: (28, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 11) nocopy; patch: Registering jump thread: (8, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 17) nocopy; patch: Registering jump thread: (9, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 25) nocopy; Another problem is that we attach the path to be threaded to the -aux field of the first edge in the path, such that we would have to cancel some of the paths because we cannot keep track of all the paths to be threaded. For coremark, we would discover some jump-thread paths from one of the switch cases over the loop exit condition, either to bb_27 outside the loop, or to bb_4 staying inside the loop. Then with the patch: we would discover jump threads that would thread switch cases to switch cases, and because these paths start with the same edges for which we have already assigned a path to e-aux, we would have to cancel the interesting threads added by the patch: Registering jump thread: (12, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (13, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (29, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (31, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (16, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (15, 25) incoming edge; (25, 26) joiner;
Re: [Patch] Improving jump-thread pass for PR 54742
Hi Jeff, I have adapted the code generation part from James' patch to current trunk, and the resulting patch gets the 30% speedup on coremark and passes bootstrap of GCC. Ok for trunk? Thanks, Sebastian Sebastian Pop wrote: Sebastian Pop wrote: Jeff Law wrote: On 08/21/14 04:30, Richard Biener wrote: It turns Jeff's jump-threading code in to a strange franken-pass of bits and pieces of detection and optimisation, and would need some substantial reworking to fit in with Jeff's changes last Autumn, but if it is more likely to be acceptable for trunk then perhaps we could look to revive it. It would be nice to reuse the path copy code Jeff added last year, but I don't have much intuition as to how feasible that is. Was this the sort of thing that you were imagining? Yeah, didn't look too closely though. It'd be pretty ugly I suspect. But it's probably worth pondering since that approach would eliminate the concerns about the cost of detection (which is problematical for the jump threader) by using Steve's code for that. On the update side, I suspect most, if not all of the framework is in place to handle this kind of update if the right threading paths were passed to the updater. I can probably cobble together that by-hand and see what the tree-ssa-threadupdate does with it. But it'll be a week or so before I could look at it. I adapted the patch James has sent last year to use the new update paths Attached an updated version of the patch. mechanism. I verified that the attached patch does register all the paths that need to be threaded. Thread updater seems to have some problems handling the attached testcase (a simplified version of the testcase attached to the bug.) Jeff, could you please have a look at why the jump-thread updater is crashing? I have tried to understand why the code generation part ICEs on coremark: the first problem that I have seen is that tree-ssa-threadupdate.c does not handle more than a joiner block per path to be threaded, so we would not be able to jump thread accross the joiners of the if condition and the joiner of the switch condition: i.e., these paths patch: Registering jump thread: (7, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 14) nocopy; patch: Registering jump thread: (28, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 11) nocopy; patch: Registering jump thread: (8, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 17) nocopy; patch: Registering jump thread: (9, 10) incoming edge; (10, 25) joiner; (25, 26) joiner; (26, 4) nocopy; (4, 37) nocopy; (37, 36) nocopy; (36, 25) nocopy; Another problem is that we attach the path to be threaded to the -aux field of the first edge in the path, such that we would have to cancel some of the paths because we cannot keep track of all the paths to be threaded. For coremark, we would discover some jump-thread paths from one of the switch cases over the loop exit condition, either to bb_27 outside the loop, or to bb_4 staying inside the loop. Then with the patch: we would discover jump threads that would thread switch cases to switch cases, and because these paths start with the same edges for which we have already assigned a path to e-aux, we would have to cancel the interesting threads added by the patch: Registering jump thread: (12, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (13, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (29, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (31, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (16, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (15, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (32, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (19, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (18, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (22, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (21, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (34, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (33, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; Registering jump thread: (35, 25) incoming edge; (25, 26) joiner; (26, 27) nocopy; Registering jump thread: (24, 25) incoming edge; (25, 26) joiner; (26, 4) nocopy; patch: Registering jump thread: (12, 25)