Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Fri, Jun 21, 2013 at 11:21 PM, Steve Ellcey sell...@mips.com wrote: On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote: So to my mind, this is all far too tied up in pass ordering details to resolve. Given that all the threading opportunities for my patch are found in dom1 and how fragile the positioning of dom1 is, there is not a great deal I can do to modify the ordering. The biggest improvement I could find comes from rerunning pass_ch immediately after dom1, though I'm not sure what the cost of that would be. I wonder if you or others have any thoughts on what the right thing to do would be? I am not sure if path threading in general is turned off for -Os but it probably should be. I agree, jump threading is on at -Os, path threading should not be. Thanks, James I think that since it seems to be more a space issue then a time issue, the way you currently have it is reasonable. As long as the optimization does not happen at -Os then I think we can probably live with the space increase. I suppose with Jeffs recent work on jump-threading through paths this case in handled and the patch in this thread is obsolete or can be reworked? Thanks, Richard. Steve Ellcey sell...@mips.com
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Fri, Oct 18, 2013 at 11:55:08AM +0100, Richard Biener wrote: I suppose with Jeffs recent work on jump-threading through paths this case in handled and the patch in this thread is obsolete or can be reworked? Yes, this patch is now obsolete, Jeff's solution is much more elegant :-) Thanks, James
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On 10/18/13 04:55, Richard Biener wrote: On Fri, Jun 21, 2013 at 11:21 PM, Steve Ellcey sell...@mips.com wrote: On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote: So to my mind, this is all far too tied up in pass ordering details to resolve. Given that all the threading opportunities for my patch are found in dom1 and how fragile the positioning of dom1 is, there is not a great deal I can do to modify the ordering. The biggest improvement I could find comes from rerunning pass_ch immediately after dom1, though I'm not sure what the cost of that would be. I wonder if you or others have any thoughts on what the right thing to do would be? I am not sure if path threading in general is turned off for -Os but it probably should be. I agree, jump threading is on at -Os, path threading should not be. Thanks, James I think that since it seems to be more a space issue then a time issue, the way you currently have it is reasonable. As long as the optimization does not happen at -Os then I think we can probably live with the space increase. I suppose with Jeffs recent work on jump-threading through paths this case in handled and the patch in this thread is obsolete or can be reworked? It trivially falls out of the work I'm doing. The biggest question will be what to do about the loop structures. For the FSA case from coremark the loop structures get mucked up way beyond repair. My thoughts were to generally continue to not allow major damage to the loop structures when threading jumps -- with the exception being if we can eliminate an indirect or tablejump at the top of a loop. Possibly further restricting it to the jump threading done after our loop optimizers have completed. I haven't looked at that heuristic yet as I've primarily focused on having the right infrastructure to correctly update the SSA and CFG in the general case. jeff
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On 10/18/13 05:05, James Greenhalgh wrote: On Fri, Oct 18, 2013 at 11:55:08AM +0100, Richard Biener wrote: I suppose with Jeffs recent work on jump-threading through paths this case in handled and the patch in this thread is obsolete or can be reworked? Yes, this patch is now obsolete, Jeff's solution is much more elegant :-) Not sure if I'd use the term elegant :-) My solution is certainly more general though. There's still a couple rounds of staging bits as I clean up the rough edges. The need to store a full thread path to get SSA graph update correct in cases where we have a joiner block, followed by multiple threadable blocks without side effects, followed by a threadable block with side effects was unexpected. Jeff
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
While testing it I noticed that the final executable is larger with your patch then with mine. Here are the sizes of the bare-metal executables I created using the same flags I sent you earlier, the first has no switch optimization, the second one uses my plugin optimization, and the third uses your latest patch. I haven't looked into why the size difference for your patch and mine exists, do you see a size difference on your platforms? Yes I do, but after playing around with it, this seems very dependant on pass ordering. I've built various arm-none-eabi compilers to test with: clean: is a compiler without path threading. steve.pass: is your original pass patch. james: is my patch (which will be called within vrp and dom passes) steve.after-vrp1: moves your pass to immediately after the first call to vrp steve.before, steve.after, steve.after-vrp-before-dom, steve.before-vrp-after-dom: run your pass immediately before or after both vrp and both dom passes. james.ch is my patch, rerunning pass_ch after dom1. Then, building with flags: -finline-limit=1000 -funroll-all-loops -finline-functions [[-ftree-switch-shortcut]] -O3 -mthumb And passing the resulting binary through: $ arm-none-eabi-strip blob.* I see: $ size blob.arm.* | sort -n textdata bss dec hex filename 539842548 296 56828ddfc ../blobs/blob.arm.clean 544642548 296 57308dfdc ../blobs/blob.arm.steve.pass 544962548 296 57340dffc ../blobs/blob.arm.steve.after 544962548 296 57340dffc ../blobs/blob.arm.steve.after-vrp-before-dom 545042548 296 57348e004 ../blobs/blob.arm.james.ch 545042548 296 57348e004 ../blobs/blob.arm.steve.only-after-vrp1 546562548 296 57500e09c ../blobs/blob.arm.james 547042548 296 57548e0cc ../blobs/blob.arm.steve.before-vrp-after-dom 547362548 296 57580e0ec ../blobs/blob.arm.steve.before So to my mind, this is all far too tied up in pass ordering details to resolve. Given that all the threading opportunities for my patch are found in dom1 and how fragile the positioning of dom1 is, there is not a great deal I can do to modify the ordering. The biggest improvement I could find comes from rerunning pass_ch immediately after dom1, though I'm not sure what the cost of that would be. I wonder if you or others have any thoughts on what the right thing to do would be? I am not sure if path threading in general is turned off for -Os but it probably should be. I agree, jump threading is on at -Os, path threading should not be. Thanks, James
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Fri, 2013-06-21 at 17:43 +0100, James Greenhalgh wrote: So to my mind, this is all far too tied up in pass ordering details to resolve. Given that all the threading opportunities for my patch are found in dom1 and how fragile the positioning of dom1 is, there is not a great deal I can do to modify the ordering. The biggest improvement I could find comes from rerunning pass_ch immediately after dom1, though I'm not sure what the cost of that would be. I wonder if you or others have any thoughts on what the right thing to do would be? I am not sure if path threading in general is turned off for -Os but it probably should be. I agree, jump threading is on at -Os, path threading should not be. Thanks, James I think that since it seems to be more a space issue then a time issue, the way you currently have it is reasonable. As long as the optimization does not happen at -Os then I think we can probably live with the space increase. Steve Ellcey sell...@mips.com
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
-Original Message- From: Steve Ellcey [mailto:sell...@mips.com] Sent: 14 June 2013 19:07 With my version the compiler calls gimple_duplicate_sese_region from duplicate_blocks 60 times. With your patch it calls gimple_duplicate_sese_region from duplicate_thread_path 16 times. Hi Steve, You are quite right. With -finline-limit=1000 I see a big difference in performance. The cause of this is the code added to tree-ssa-threadedge.c (simplify_control_stmt_condition). If we have failed to simplify to a gimple_min_invariant, we want to look for thread paths to the value given by gimple_goto_dest, rather than the SSA_NAME_VALUE of that value. This improves the performance on my x86_64 toolchain to the same level as your patch. I see Registered 20 jump paths printed 3 times in dom1, for a total of 60 thread paths. I've also fixed another couple of bugs I spotted, improved logging of results and added the parameters that were in your patch. I did investigate changing the search strategy back to yours, but I saw no impact on the thread paths found. Please let me know if this fixes the performance issues you were seeing and if you have any other comments. FWIW I've bootstrapped and regression tested this version of the patch on x86_64 and ARM with no regressions. Thanks, James Greenhalgh --- Changes from v1: --- gcc/ 2013-06-19 James Greenhalgh james.greenha...@arm.com * params.def (PARAM_MAX_THREAD_PATH_INSNS): New. (PARAM_THREAD_PATHS): Likewise. * tree-ssa-threadedge.c (simplify_control_stmt_condition): If we can't simplify cond, return it unmodified. (max_insn_count): Do not initialize. (max_path_count): Likewise. (find_control_statement_thread_paths): Catch case where path has already been computed (thus no further path exists), add sanity checking. (thread_across_edge): Initialize max_{insn, path}_count; * tree-ssa-threadupdate.c (duplicate_thread_path): Add sanity check, logging. (thread_through_all_blocks): Thread paths, even if no threaded_edges were found. (register_thread_paths): Improve logging. --- Changelog: gcc/ 2013-06-19 James Greenhalgh james.greenha...@arm.com PR tree-optimization/54742 * params.def (PARAM_MAX_THREAD_PATH_INSNS): New. (PARAM_THREAD_PATHS): Likewise. * tree.cfg (gimple_duplicate_sese_region): Memoize loop latch and loop header blocks if copying across a latch/header. * tree-flow.h (thread_paths): New struct. (register_thread_paths): Declare. * tree-ssa-threadedge.c (simplify_control_stmt_condition): Permit returning something not in gimple_min_invariant form. (max_insn_count): Declare. (max_path_count): Likewise. (find_thread_path_1): New function. (find_thread_path): Likewise. (save_new_thread_path): Likewise. (find_control_statement_thread_paths): Likewise. (thread_across_edge): Handle non gimple_min_invariant cases. * tree-ssa-threadupdate.c (thread_paths_vec): New. (remove_edge_from_thread_candidates): New function. (duplicate_thread_path): Likewise. (copy_control_statement_thread_paths): Likewise. (thread_through_all_blocks): Handle thread_paths. (register_thread_paths): New function. diff --git a/gcc/params.def b/gcc/params.def index 3c52651..25d36a6 100644 --- a/gcc/params.def +++ b/gcc/params.def @@ -123,6 +123,19 @@ DEFPARAM (PARAM_PARTIAL_INLINING_ENTRY_PROBABILITY, Maximum probability of the entry BB of split region (in percent relative to entry BB of the function) to make partial inlining happen, 70, 0, 0) +/* Maximum number of instructions to copy when duplicating blocks + on a jump thread path. */ +DEFPARAM (PARAM_MAX_THREAD_PATH_INSNS, + max-thread-path-insns, + Maximum number of instructions to copy when duplicating blocks on a jump thread path, + 100, 1, 99) + +/* Maximum number of jump thread paths to duplicate. */ +DEFPARAM (PARAM_MAX_THREAD_PATHS, + max-thread-paths, + Maximum number of new jump thread paths to create, + 50, 1, 99) + /* Limit the number of expansions created by the variable expansion optimization to avoid register pressure. */ DEFPARAM (PARAM_MAX_VARIABLE_EXPANSIONS, diff --git a/gcc/tree-cfg.c b/gcc/tree-cfg.c index 4b91a35..6dcd2e4 100644 --- a/gcc/tree-cfg.c +++ b/gcc/tree-cfg.c @@ -5717,10 +5717,12 @@ gimple_duplicate_sese_region (edge entry, edge exit, { unsigned i; bool free_region_copy = false, copying_header = false; + bool save_loop_details = false; struct loop *loop = entry-dest-loop_father; edge exit_copy; vecbasic_block doms; edge redirected; + int memo_loop_header_no = 0, memo_loop_latch_no = 0; int total_freq = 0, entry_freq = 0; gcov_type total_count = 0, entry_count = 0; @@ -5738,9 +5740,15 @@ gimple_duplicate_sese_region
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Wed, 2013-06-19 at 14:19 +0100, James Greenhalgh wrote: Please let me know if this fixes the performance issues you were seeing and if you have any other comments. FWIW I've bootstrapped and regression tested this version of the patch on x86_64 and ARM with no regressions. Thanks, James Greenhalgh James, This patch does give me the same performance as my original patch, so that is excellent. While testing it I noticed that the final executable is larger with your patch then with mine. Here are the sizes of the bare-metal executables I created using the same flags I sent you earlier, the first has no switch optimization, the second one uses my plugin optimization, and the third uses your latest patch. I haven't looked into why the size difference for your patch and mine exists, do you see a size difference on your platforms? I am not sure if path threading in general is turned off for -Os but it probably should be. % ll -art coremark.fsf*elf -rwxr-xr-x 1 sellcey src 413812 Jun 19 11:11 coremark.fsf.1.elf -rwxr-xr-x 1 sellcey src 414676 Jun 19 11:11 coremark.fsf.2.elf -rwxr-xr-x 1 sellcey src 416402 Jun 19 11:11 coremark.fsf.3.elf Steve Ellcey sell...@mips.com
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Fri, 2013-06-14 at 00:08 +0100, James Greenhalgh wrote: I see only minor cosmetic differences (Placement of labels, numbering of labels) between the two generated assembly files. Perhaps you could share the flags you were using and I can try to figure out which paths I seem to be missing. If I can't find them, I'll happily revert to using your search strategy. Regards, James Greenhalgh I am building a cross compiler running on x86_64 and targeting mips-mti-elf. I ran the compiler with: -mips32r2 -EL -msoft-float -O3 -funroll-all-loops -finline-functions -finline-limit=1000 -DRTEYAMON With my version the compiler calls gimple_duplicate_sese_region from duplicate_blocks 60 times. With your patch it calls gimple_duplicate_sese_region from duplicate_thread_path 16 times. Steve Ellcey sell...@mips.com
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Fri, 2013-06-07 at 16:14 +0100, James Greenhalgh wrote: Beyond that, the path search code is modified from Steve's patch to only perform one depth first search, and the path copy code is modified to purge the region entry and exit edges for those which traditional jump-threading may consider. Opinions? Thanks, James Greenhalgh Hi James, I tried out your patch and I am not getting the speed up in coremark that I got with my patch. I wonder if restricting it to one depth first search is the reason. With my patch I was able to completely get rid of the switch statement, but with this patch it still exists. Maybe we need an option to do multiple searches? Steve Ellcey sell...@mips.com
Re: [Patch tree-ssa] RFC: Enable path threading for control variables (PR tree-optimization/54742).
On Thu, Jun 13, 2013 at 08:29:08PM +0100, Steve Ellcey wrote: On Fri, 2013-06-07 at 16:14 +0100, James Greenhalgh wrote: Beyond that, the path search code is modified from Steve's patch to only perform one depth first search, and the path copy code is modified to purge the region entry and exit edges for those which traditional jump-threading may consider. Opinions? Thanks, James Greenhalgh Hi James, I tried out your patch and I am not getting the speed up in coremark that I got with my patch. I wonder if restricting it to one depth first search is the reason. With my patch I was able to completely get rid of the switch statement, but with this patch it still exists. Maybe we need an option to do multiple searches? Steve Ellcey sell...@mips.com Hi Steve, Thanks for having a look at the patch, though you appear to get very different results to my local build. Comparing a bootstrapped native x86_64 compiler with my patch and these flags: /work/gcc-build-jg-threading/build-x86/install/bin/gcc -S -O3 -Ilinux64 core_state.c And a bootstrapped native x86_64 compiler with your patch and these flags: /work/gcc-build-sje-threading/build-x86/install/bin/gcc -S -O3 -Ilinux64 -ftree-switch-shortcut core_state.c I see only minor cosmetic differences (Placement of labels, numbering of labels) between the two generated assembly files. Perhaps you could share the flags you were using and I can try to figure out which paths I seem to be missing. If I can't find them, I'll happily revert to using your search strategy. Regards, James Greenhalgh