[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #14 from law at redhat dot com 2006-03-30 17:15 --- Just a note on the compile-time regressions for tramp3d... After fixing the timevars it was pretty clear that it isn't the cprop code itself that is slow, it is in fact very fast. THe slowdowns for tramp3d are in operand scanning and incremental SSA updates. I built a little instrumentation code and was rewarded with some insight into why tramp3d behaves differently for operand scanning. When we propagate into a statement, we (of course) fold and rescan the operands for the use statement. Clearly if we propagate several distinct copies into a single use statement, then we end up wasting time rescanning the use statement. My instrumentation recorded how often we perform more than one propagation into a statement vs how often we only propagate into a statement one time. In my test bucket that ratio is a little less than 1:1. I would have expected significantly smaller, but it is what it is. What's interesting is that for tramp3d that ratio is about 3:1 -- primarily from copy propagating into VUSE and V_MAY_DEF operands. I'm currently experimenting with some code to queue folding/rescanning of use statements in cases where there's a reasonable chance we're going to do more than one propagation into the statement. Initial experiments are showing a measurable compile-time improvement for both tramp3d as well as my testbucket. [ Note that Diego's memory tag rewrite work may make all this moot one day... ] Jeff -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #13 from law at redhat dot com 2006-03-28 19:13 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On Wed, 2006-03-22 at 16:06 +0100, Richard Guenther wrote: > On 3/22/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > > On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote: > > > On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > > > > It turns out this specialized PHI optimization pass is as effective > > > > as running copy-prop and CCP on PHI nodes after DOM. Better yet, it > > > > is a teeny tiny slowdown compared to just running the stripped down > > > > copyprop. ie, for an almost unmeasurable slowdown we can do both > > > > constant and copy propagation instead of just copy propagation. > > > > > > This patch caused a compile-time regression from 139s to 143s, resp. > > > 192s to 197s (leafify) accounted by increases of operand scan / SSA > > > incremental > > > and tree CCP times for compiling tramp3d. Also memory usage during > > > compiling > > > went up from 655494 kB to 660626kB (this may be due to the VRP patch, > > > though). > > > > > > Runtime of tramp3d did not improve but regress slightly (but that > > > might be in the > > > noise - we'll see). > > > > > > For this simple cleanup pass can you try updating SSA form manually > > > please? > > I'm more than happy to look at it; however, be aware that if you're > > seeing increased time in CCP then either you're seeing some truly > > bizzarre secondary effect or your testing methodology is suspect. > > The patch did not affect CCP. In fact, the changes only affect > > passes which run *after* CCP in the optimization pipeline. > > struct tree_opt_pass pass_phi_only_cprop = > { > "phicprop", /* name */ > gate_dominator, /* gate */ > eliminate_degenerate_phis,/* execute */ > NULL, /* sub */ > NULL, /* next */ > 0,/* static_pass_number */ > TV_TREE_CCP, /* tv_id */ > PROP_cfg | PROP_ssa | PROP_alias, /* properties_required */ > 0,/* properties_provided */ > PROP_smt_usage, /* properties_destroyed */ > 0,/* todo_flags_start */ > TODO_cleanup_cfg | TODO_dump_func > | TODO_ggc_collect | TODO_verify_ssa > | TODO_verify_stmts | TODO_update_smt_usage > | TODO_update_ssa, /* todo_flags_finish */ > 0 /* letter */ > }; > > see tv_id - so I guess increased CCP times are expected. I've created a separate timevar for the phi-only cprop code so at least we can see how much time's taking. Note that while we should be seeing a small amount of time in phi-cprop, we should be seeing a reduction in TV_TREE_CCP and TV_TREE_COPY_PROP. Can you send me a recent tramp3d.ii file so that I can poke around and see if there's anything we can do about the compile time regression? Thanks, jeff -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #12 from law at redhat dot com 2006-03-22 15:36 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On Wed, 2006-03-22 at 16:06 +0100, Richard Guenther wrote: > ; > > see tv_id - so I guess increased CCP times are expected. Nuts. I should have made a separate tv_id. I'll fix that shortly. I probably won't be able to look at the compile-time regression for your testcase until Friday. jeff -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #11 from richard dot guenther at gmail dot com 2006-03-22 15:06 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On 3/22/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote: > > On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > > > It turns out this specialized PHI optimization pass is as effective > > > as running copy-prop and CCP on PHI nodes after DOM. Better yet, it > > > is a teeny tiny slowdown compared to just running the stripped down > > > copyprop. ie, for an almost unmeasurable slowdown we can do both > > > constant and copy propagation instead of just copy propagation. > > > > This patch caused a compile-time regression from 139s to 143s, resp. > > 192s to 197s (leafify) accounted by increases of operand scan / SSA > > incremental > > and tree CCP times for compiling tramp3d. Also memory usage during > > compiling > > went up from 655494 kB to 660626kB (this may be due to the VRP patch, > > though). > > > > Runtime of tramp3d did not improve but regress slightly (but that > > might be in the > > noise - we'll see). > > > > For this simple cleanup pass can you try updating SSA form manually please? > I'm more than happy to look at it; however, be aware that if you're > seeing increased time in CCP then either you're seeing some truly > bizzarre secondary effect or your testing methodology is suspect. > The patch did not affect CCP. In fact, the changes only affect > passes which run *after* CCP in the optimization pipeline. struct tree_opt_pass pass_phi_only_cprop = { "phicprop", /* name */ gate_dominator, /* gate */ eliminate_degenerate_phis,/* execute */ NULL, /* sub */ NULL, /* next */ 0,/* static_pass_number */ TV_TREE_CCP, /* tv_id */ PROP_cfg | PROP_ssa | PROP_alias, /* properties_required */ 0,/* properties_provided */ PROP_smt_usage, /* properties_destroyed */ 0,/* todo_flags_start */ TODO_cleanup_cfg | TODO_dump_func | TODO_ggc_collect | TODO_verify_ssa | TODO_verify_stmts | TODO_update_smt_usage | TODO_update_ssa, /* todo_flags_finish */ 0 /* letter */ }; see tv_id - so I guess increased CCP times are expected. Richard. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #10 from law at redhat dot com 2006-03-22 14:01 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote: > On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > > It turns out this specialized PHI optimization pass is as effective > > as running copy-prop and CCP on PHI nodes after DOM. Better yet, it > > is a teeny tiny slowdown compared to just running the stripped down > > copyprop. ie, for an almost unmeasurable slowdown we can do both > > constant and copy propagation instead of just copy propagation. > > This patch caused a compile-time regression from 139s to 143s, resp. > 192s to 197s (leafify) accounted by increases of operand scan / SSA > incremental > and tree CCP times for compiling tramp3d. Also memory usage during compiling > went up from 655494 kB to 660626kB (this may be due to the VRP patch, though). > > Runtime of tramp3d did not improve but regress slightly (but that > might be in the > noise - we'll see). > > For this simple cleanup pass can you try updating SSA form manually please? I'm more than happy to look at it; however, be aware that if you're seeing increased time in CCP then either you're seeing some truly bizzarre secondary effect or your testing methodology is suspect. The patch did not affect CCP. In fact, the changes only affect passes which run *after* CCP in the optimization pipeline. Jeff -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #9 from richard dot guenther at gmail dot com 2006-03-22 11:14 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote: > It turns out this specialized PHI optimization pass is as effective > as running copy-prop and CCP on PHI nodes after DOM. Better yet, it > is a teeny tiny slowdown compared to just running the stripped down > copyprop. ie, for an almost unmeasurable slowdown we can do both > constant and copy propagation instead of just copy propagation. This patch caused a compile-time regression from 139s to 143s, resp. 192s to 197s (leafify) accounted by increases of operand scan / SSA incremental and tree CCP times for compiling tramp3d. Also memory usage during compiling went up from 655494 kB to 660626kB (this may be due to the VRP patch, though). Runtime of tramp3d did not improve but regress slightly (but that might be in the noise - we'll see). For this simple cleanup pass can you try updating SSA form manually please? Thanks, Richard. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #8 from law at redhat dot com 2006-03-21 05:10 --- Today's patch picks up the missed const-propagation and allows simplification of the modulo operation. THere's still the opportunitity to use range information to simplify a conditional. However, fixing that is just basically iterating VRP in response to CFG changes -- not something we're going to do as it's simply too expensive. Thus I'm changing the state of this bug to WONTFIX. -- law at redhat dot com changed: What|Removed |Added Status|NEW |RESOLVED Resolution||WONTFIX http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #6 from law at redhat dot com 2006-03-21 05:09 --- Subject: Re: [4.1/4.2 Regression] missed jump threading after unroller On Sat, 2006-02-11 at 00:59 +, pinskia at gcc dot gnu dot org wrote: > > --- Comment #5 from pinskia at gcc dot gnu dot org 2006-02-11 00:59 > --- > The problem with this one after Jeff's recent patches is that we have: > :; > D.1402_17 = 0; > if (D.1402_17 == 1) goto ; else goto ; > > :; > x_18 = 1; > > # x_19 = PHI <0(2), 0(3), x_18(4)>; > :; > > Which causes us not to be able to the jump threading as we do a CCP in VRP and > then we get: > : > if (v_8 < 0) goto ; else goto ; > > :; > D.1402_17 = 0; > goto (); > > # x_19 = PHI <0(2)>; > :; > u_20 = 1; > ivtmp.26_21 = 1; > ivtmp.26_3 = 1; > u_14 = 1; > x_13 = 0; > if (v_8 <= 0) goto ; else goto ; > > So we need to be able to do some CCP and some DCE before invoking VRP. As I mentioned earlier, the problem is inherent with non-iterating dominator optimizations -- namely that we can't guarantee that for block BB that we will visit all of BB's predecessors before visiting BB when BB is at a dominance frontier. The net result is that we may still be propagating values into a PHI node for a block which has already been visited. Those propagations may result in the PHI turning into a degenerate too late for the dominator optimizer to discover the degenerate PHI and record the appropriate equivalence for the LHS of the degenerate PHI. I had added support for a stripped down copy-prop to run after DOM to pick up these kind of optimization opportunities rooted at degenerate PHI nodes that were trivial copies. (previously we had been running the full-blown copy-prop pass). However, that code does not catch this case because our PHI looks like x_2 = PHI (0 (BB1), 0 (BB2)) ie, it's a constant initialization, not a PHI-copy. I had played with running a similarly stripped down CCP after DOM, but that's really expensive compile-time wise. I then experimented with speeding up the propagation engine. While I may have found a couple of micro-opts, I wasn't able to find anything which was going to give us a big enough improvement to make running a phi-only CCP a viable option from a compile-time standpoint. I then proceeded to implement a concept that had been floating around in the back of my mind. Namely a specialized PHI const/copy optimizer which used a dominator walk plus a worklist of statements/phis to revisit after the DOM walk (which allows us to detect secondary effects). It turns out this specialized PHI optimization pass is as effective as running copy-prop and CCP on PHI nodes after DOM. Better yet, it is a teeny tiny slowdown compared to just running the stripped down copyprop. ie, for an almost unmeasurable slowdown we can do both constant and copy propagation instead of just copy propagation. Net result is in this PR we're able to clean up the extraneous modulo operation and propagate its result as well. Note that the resulting code could be simplified even more, namely iterating VRP could allow simplification of a relational into an equality test. That's simply not going to happen. We'll have to live with less than 100% optimized code. Bootstrapped and regression tested on i686-pc-linux-gnu. --- Comment #7 from law at redhat dot com 2006-03-21 05:09 --- Created an attachment (id=11077) --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11077&action=view) -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
-- mmitchel at gcc dot gnu dot org changed: What|Removed |Added Target Milestone|4.1.0 |4.1.1 http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #5 from pinskia at gcc dot gnu dot org 2006-02-11 00:59 --- The problem with this one after Jeff's recent patches is that we have: :; D.1402_17 = 0; if (D.1402_17 == 1) goto ; else goto ; :; x_18 = 1; # x_19 = PHI <0(2), 0(3), x_18(4)>; :; Which causes us not to be able to the jump threading as we do a CCP in VRP and then we get: : if (v_8 < 0) goto ; else goto ; :; D.1402_17 = 0; goto (); # x_19 = PHI <0(2)>; :; u_20 = 1; ivtmp.26_21 = 1; ivtmp.26_3 = 1; u_14 = 1; x_13 = 0; if (v_8 <= 0) goto ; else goto ; So we need to be able to do some CCP and some DCE before invoking VRP. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829
[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller
--- Comment #4 from law at redhat dot com 2006-02-09 03:19 --- I'll note this really isn't a jump threading issue. This is a fundamental weakness in a dominator based optimizer vs a truly global optimizer. What we've got is a block which looks something like this: # u_18 = PHI ; :; D.1528_10 = u_18 % 2; if (D.1528_10 == 1) goto ; else goto ; Note carefully that this block is at a dominance frontier (thus the PHI node). At a dominance frontier, a dominator walk does _not_ guarantee that the CFG parents are visited before the node itself. ie, we may visit this block before its CFG parents.That is in fact precisely what happens. Thus we do not discover that u_13 and u_26 both have the value 1 until after we have processed this block. Possible solutions: 1. Iterating DOM. I don't want to do that. 2. Run CCP before DOM. It's something I've pondered, mostly because I want to remove constant propagation from DOM. Compile-time issues worry me. 3. Simple constant/copy propagation in the renaming phase. I've always been a fan of this to avoid clutter in the IL, but I've been overruled on this as a design decision. 4. Somehow arrange to sort the et-tree dominance information so that we visit immediate descendants of a CFG node which are dominated by the node before visiting more distant CFG descendants. This could be horribly ugly. 5. Somehow get really really smart in the block duplication code and arrange for it to propagate constants. I've pondered this, but I don't think it's really that feasible both from a design and implementation standpoint. 6. Post-dom cleanup phase using worklists seeded by degenerate PHI nodes to identify statements that can be simplified as a result of discovery of the degenerate PHI. [Would replace the mini-copyprop pass we do after DOM. ] I suspect #2 or #6 are the most likely long term solutions. jeff -- law at redhat dot com changed: What|Removed |Added CC||law at redhat dot com http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829