[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-30 Thread law at redhat dot com


--- Comment #14 from law at redhat dot com  2006-03-30 17:15 ---
Just a note on the compile-time regressions for tramp3d...

After fixing the timevars it was pretty clear that it isn't the cprop code
itself that is slow, it is in fact very fast.  THe slowdowns for tramp3d are in
operand scanning and incremental SSA updates.

I built a little instrumentation code and was rewarded with some insight into
why tramp3d behaves differently for operand scanning.   When we propagate into
a statement, we (of course) fold and rescan the operands for the use statement.
 Clearly if we propagate several distinct copies into a single use statement,
then we end up wasting time rescanning the use statement.

My instrumentation recorded how often we perform more than one propagation into
a statement vs how often we only propagate into a statement one time.  In my
test bucket that ratio is a little less than 1:1.  I would have expected
significantly smaller, but it is what it is.

What's interesting is that for tramp3d that ratio is about 3:1 -- primarily
from  copy propagating into VUSE and V_MAY_DEF operands.

I'm currently experimenting with some code to queue folding/rescanning of use
statements in cases where there's a reasonable chance we're going to do more
than one propagation into the statement.  Initial experiments are showing a
measurable compile-time improvement for both tramp3d as well as my testbucket.

[ Note that Diego's memory tag rewrite work may make all this moot one day... ]

Jeff


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-28 Thread law at redhat dot com


--- Comment #13 from law at redhat dot com  2006-03-28 19:13 ---
Subject: Re:  [4.1/4.2 Regression] missed jump
threading after unroller

On Wed, 2006-03-22 at 16:06 +0100, Richard Guenther wrote:
> On 3/22/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> > On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote:
> > > On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> > > > It turns out this specialized PHI optimization pass is as effective
> > > > as running copy-prop and CCP on PHI nodes after DOM.  Better yet, it
> > > > is a teeny tiny slowdown compared to just running the stripped down
> > > > copyprop.  ie, for an almost unmeasurable slowdown we can do both
> > > > constant and copy propagation instead of just copy propagation.
> > >
> > > This patch caused a compile-time regression from 139s to 143s, resp.
> > > 192s to 197s (leafify) accounted by increases of operand scan / SSA 
> > > incremental
> > > and tree CCP times for compiling tramp3d.  Also memory usage during 
> > > compiling
> > > went up from 655494 kB to 660626kB (this may be due to the VRP patch, 
> > > though).
> > >
> > > Runtime of tramp3d did not improve but regress slightly (but that
> > > might be in the
> > > noise - we'll see).
> > >
> > > For this simple cleanup pass can you try updating SSA form manually 
> > > please?
> > I'm more than happy to look at it; however, be aware that if you're
> > seeing increased time in CCP then either you're seeing some truly
> > bizzarre secondary effect or your testing methodology is suspect.
> > The patch did not affect CCP.  In fact, the changes only affect
> > passes which run *after* CCP in the optimization pipeline.
> 
> struct tree_opt_pass pass_phi_only_cprop =
> {
>   "phicprop",   /* name */
>   gate_dominator,   /* gate */
>   eliminate_degenerate_phis,/* execute */
>   NULL, /* sub */
>   NULL, /* next */
>   0,/* static_pass_number */
>   TV_TREE_CCP,  /* tv_id */
>   PROP_cfg | PROP_ssa | PROP_alias, /* properties_required */
>   0,/* properties_provided */
>   PROP_smt_usage,   /* properties_destroyed */
>   0,/* todo_flags_start */
>   TODO_cleanup_cfg | TODO_dump_func
> | TODO_ggc_collect | TODO_verify_ssa
> | TODO_verify_stmts | TODO_update_smt_usage
> | TODO_update_ssa, /* todo_flags_finish */
>   0 /* letter */
> };
> 
> see tv_id - so I guess increased CCP times are expected.
I've created a separate timevar for the phi-only cprop code so
at least we can see how much time's taking.  Note that while we
should be seeing a small amount of time in phi-cprop, we should
be seeing a reduction in TV_TREE_CCP and TV_TREE_COPY_PROP.

Can you send me a recent tramp3d.ii file so that I can poke
around and see if there's anything we can do about the compile
time regression?

Thanks,
jeff


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-22 Thread law at redhat dot com


--- Comment #12 from law at redhat dot com  2006-03-22 15:36 ---
Subject: Re:  [4.1/4.2 Regression] missed jump
threading after unroller

On Wed, 2006-03-22 at 16:06 +0100, Richard Guenther wrote:
> ;
> 
> see tv_id - so I guess increased CCP times are expected.
Nuts.  I should have made a separate tv_id.  I'll fix that
shortly.  

I probably won't be able to look at the compile-time regression
for your testcase until Friday.

jeff


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-22 Thread richard dot guenther at gmail dot com


--- Comment #11 from richard dot guenther at gmail dot com  2006-03-22 
15:06 ---
Subject: Re:  [4.1/4.2 Regression] missed jump threading after unroller

On 3/22/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote:
> > On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> > > It turns out this specialized PHI optimization pass is as effective
> > > as running copy-prop and CCP on PHI nodes after DOM.  Better yet, it
> > > is a teeny tiny slowdown compared to just running the stripped down
> > > copyprop.  ie, for an almost unmeasurable slowdown we can do both
> > > constant and copy propagation instead of just copy propagation.
> >
> > This patch caused a compile-time regression from 139s to 143s, resp.
> > 192s to 197s (leafify) accounted by increases of operand scan / SSA 
> > incremental
> > and tree CCP times for compiling tramp3d.  Also memory usage during 
> > compiling
> > went up from 655494 kB to 660626kB (this may be due to the VRP patch, 
> > though).
> >
> > Runtime of tramp3d did not improve but regress slightly (but that
> > might be in the
> > noise - we'll see).
> >
> > For this simple cleanup pass can you try updating SSA form manually please?
> I'm more than happy to look at it; however, be aware that if you're
> seeing increased time in CCP then either you're seeing some truly
> bizzarre secondary effect or your testing methodology is suspect.
> The patch did not affect CCP.  In fact, the changes only affect
> passes which run *after* CCP in the optimization pipeline.

struct tree_opt_pass pass_phi_only_cprop =
{
  "phicprop",   /* name */
  gate_dominator,   /* gate */
  eliminate_degenerate_phis,/* execute */
  NULL, /* sub */
  NULL, /* next */
  0,/* static_pass_number */
  TV_TREE_CCP,  /* tv_id */
  PROP_cfg | PROP_ssa | PROP_alias, /* properties_required */
  0,/* properties_provided */
  PROP_smt_usage,   /* properties_destroyed */
  0,/* todo_flags_start */
  TODO_cleanup_cfg | TODO_dump_func
| TODO_ggc_collect | TODO_verify_ssa
| TODO_verify_stmts | TODO_update_smt_usage
| TODO_update_ssa, /* todo_flags_finish */
  0 /* letter */
};

see tv_id - so I guess increased CCP times are expected.

Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-22 Thread law at redhat dot com


--- Comment #10 from law at redhat dot com  2006-03-22 14:01 ---
Subject: Re:  [4.1/4.2 Regression] missed jump
threading after unroller

On Wed, 2006-03-22 at 12:14 +0100, Richard Guenther wrote:
> On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> > It turns out this specialized PHI optimization pass is as effective
> > as running copy-prop and CCP on PHI nodes after DOM.  Better yet, it
> > is a teeny tiny slowdown compared to just running the stripped down
> > copyprop.  ie, for an almost unmeasurable slowdown we can do both
> > constant and copy propagation instead of just copy propagation.
> 
> This patch caused a compile-time regression from 139s to 143s, resp.
> 192s to 197s (leafify) accounted by increases of operand scan / SSA 
> incremental
> and tree CCP times for compiling tramp3d.  Also memory usage during compiling
> went up from 655494 kB to 660626kB (this may be due to the VRP patch, though).
> 
> Runtime of tramp3d did not improve but regress slightly (but that
> might be in the
> noise - we'll see).
> 
> For this simple cleanup pass can you try updating SSA form manually please?
I'm more than happy to look at it; however, be aware that if you're
seeing increased time in CCP then either you're seeing some truly
bizzarre secondary effect or your testing methodology is suspect.
The patch did not affect CCP.  In fact, the changes only affect
passes which run *after* CCP in the optimization pipeline.

Jeff


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-22 Thread richard dot guenther at gmail dot com


--- Comment #9 from richard dot guenther at gmail dot com  2006-03-22 11:14 
---
Subject: Re:  [4.1/4.2 Regression] missed jump threading after unroller

On 3/21/06, Jeffrey A Law <[EMAIL PROTECTED]> wrote:
> It turns out this specialized PHI optimization pass is as effective
> as running copy-prop and CCP on PHI nodes after DOM.  Better yet, it
> is a teeny tiny slowdown compared to just running the stripped down
> copyprop.  ie, for an almost unmeasurable slowdown we can do both
> constant and copy propagation instead of just copy propagation.

This patch caused a compile-time regression from 139s to 143s, resp.
192s to 197s (leafify) accounted by increases of operand scan / SSA incremental
and tree CCP times for compiling tramp3d.  Also memory usage during compiling
went up from 655494 kB to 660626kB (this may be due to the VRP patch, though).

Runtime of tramp3d did not improve but regress slightly (but that
might be in the
noise - we'll see).

For this simple cleanup pass can you try updating SSA form manually please?

Thanks,
Richard.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-20 Thread law at redhat dot com


--- Comment #8 from law at redhat dot com  2006-03-21 05:10 ---
Today's patch picks up the missed const-propagation and allows 
simplification of the modulo operation.

THere's still the opportunitity to use range information to simplify
a conditional.  However, fixing that is just basically iterating
VRP in response to CFG changes -- not something we're going to
do as it's simply too expensive.

Thus I'm changing the state of this bug to WONTFIX.


-- 

law at redhat dot com changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution||WONTFIX


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-03-20 Thread law at redhat dot com


--- Comment #6 from law at redhat dot com  2006-03-21 05:09 ---
Subject: Re:  [4.1/4.2 Regression] missed jump
threading after unroller

On Sat, 2006-02-11 at 00:59 +, pinskia at gcc dot gnu dot org wrote:
> 
> --- Comment #5 from pinskia at gcc dot gnu dot org  2006-02-11 00:59 
> ---
> The problem with this one after Jeff's recent patches is that we have:
> :;
>   D.1402_17 = 0;
>   if (D.1402_17 == 1) goto ; else goto ;
> 
> :;
>   x_18 = 1;
> 
>   # x_19 = PHI <0(2), 0(3), x_18(4)>;
> :;
> 
> Which causes us not to be able to the jump threading as we do a CCP in VRP and
> then we get:
> :
>   if (v_8 < 0) goto ; else goto ;
> 
> :;
>   D.1402_17 = 0;
>   goto  ();
> 
>   # x_19 = PHI <0(2)>;
> :;
>   u_20 = 1;
>   ivtmp.26_21 = 1;
>   ivtmp.26_3 = 1;
>   u_14 = 1;
>   x_13 = 0;
>   if (v_8 <= 0) goto ; else goto ;
> 
> So we need to be able to do some CCP and some DCE before invoking VRP.
As I mentioned earlier, the problem is inherent with non-iterating
dominator optimizations -- namely that we can't guarantee that for
block BB that we will visit all of BB's predecessors before visiting
BB when BB is at a dominance frontier.

The net result is that we may still be propagating values into a PHI
node for a block which has already been visited.  Those propagations
may result in the PHI turning into a degenerate too late for the
dominator optimizer to discover the degenerate PHI and record the
appropriate equivalence for the LHS of the degenerate PHI.

I had added support for a stripped down copy-prop to run after DOM
to pick up these kind of optimization opportunities rooted at degenerate
PHI nodes that were trivial copies.  (previously we had been running
the full-blown copy-prop pass).

However, that code does not catch this case because our PHI looks like

x_2 = PHI (0 (BB1), 0 (BB2))

ie, it's a constant initialization, not a PHI-copy.

I had played with running a similarly stripped down CCP after DOM,
but that's really expensive compile-time wise.

I then experimented with speeding up the propagation engine.  While
I may have found a couple of micro-opts, I wasn't able to find anything
which was going to give us a big enough improvement to make running
a phi-only CCP a viable option from a compile-time standpoint.

I then proceeded to implement a concept that had been floating around
in the back of my mind.  Namely a specialized PHI const/copy optimizer
which used a dominator walk plus a worklist of statements/phis to
revisit after the DOM walk (which allows us to detect secondary
effects).

It turns out this specialized PHI optimization pass is as effective
as running copy-prop and CCP on PHI nodes after DOM.  Better yet, it
is a teeny tiny slowdown compared to just running the stripped down
copyprop.  ie, for an almost unmeasurable slowdown we can do both
constant and copy propagation instead of just copy propagation.

Net result is in this PR we're able to clean up the extraneous
modulo operation and propagate its result as well.

Note that the resulting code could be simplified even more, namely
iterating VRP could allow simplification of a relational into an
equality test.  That's simply not going to happen.  We'll have to
live with less than 100% optimized code.

Bootstrapped and regression tested on i686-pc-linux-gnu.


--- Comment #7 from law at redhat dot com  2006-03-21 05:09 ---
Created an attachment (id=11077)
 --> (http://gcc.gnu.org/bugzilla/attachment.cgi?id=11077&action=view)


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-02-28 Thread mmitchel at gcc dot gnu dot org


-- 

mmitchel at gcc dot gnu dot org changed:

   What|Removed |Added

   Target Milestone|4.1.0   |4.1.1


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-02-10 Thread pinskia at gcc dot gnu dot org


--- Comment #5 from pinskia at gcc dot gnu dot org  2006-02-11 00:59 ---
The problem with this one after Jeff's recent patches is that we have:
:;
  D.1402_17 = 0;
  if (D.1402_17 == 1) goto ; else goto ;

:;
  x_18 = 1;

  # x_19 = PHI <0(2), 0(3), x_18(4)>;
:;

Which causes us not to be able to the jump threading as we do a CCP in VRP and
then we get:
:
  if (v_8 < 0) goto ; else goto ;

:;
  D.1402_17 = 0;
  goto  ();

  # x_19 = PHI <0(2)>;
:;
  u_20 = 1;
  ivtmp.26_21 = 1;
  ivtmp.26_3 = 1;
  u_14 = 1;
  x_13 = 0;
  if (v_8 <= 0) goto ; else goto ;

So we need to be able to do some CCP and some DCE before invoking VRP.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829



[Bug tree-optimization/21829] [4.1/4.2 Regression] missed jump threading after unroller

2006-02-08 Thread law at redhat dot com


--- Comment #4 from law at redhat dot com  2006-02-09 03:19 ---
I'll note this really isn't a jump threading issue.  This is a fundamental
weakness in a dominator based optimizer vs a truly global optimizer.

What we've got is a block which looks something like this:

# u_18 = PHI ;
:;
D.1528_10 = u_18 % 2;
if (D.1528_10 == 1) goto ; else goto ;

Note carefully that this block is at a dominance frontier (thus the PHI node). 
At a dominance frontier, a dominator walk does _not_ guarantee that the CFG
parents are visited before the node itself.  ie, we may visit this block before
its CFG parents.That is in fact precisely what happens.  Thus we do not
discover that u_13 and u_26 both have the value 1 until after we have processed
this block.

Possible solutions:

  1. Iterating DOM.  I don't want to do that.

  2. Run CCP before DOM.  It's something I've pondered, mostly because I want
 to remove constant propagation from DOM.  Compile-time issues worry me.

  3. Simple constant/copy propagation in the renaming phase.  I've always
 been a fan of this to avoid clutter in the IL, but I've been overruled
 on this as a design decision.

  4. Somehow arrange to sort the et-tree dominance information so that we
 visit immediate descendants of a CFG node which are dominated by the
 node before visiting more distant CFG descendants.  This could be
 horribly ugly.

  5. Somehow get really really smart in the block duplication code and arrange
 for it to propagate constants.  I've pondered this, but I don't think
 it's really that feasible both from a design and implementation
standpoint.

  6. Post-dom cleanup phase using worklists seeded by degenerate PHI nodes to
 identify statements that can be simplified as a result of discovery of
 the degenerate PHI.  [Would replace the mini-copyprop pass we do after
 DOM. ]


I suspect #2 or #6 are the most likely long term solutions.

jeff


-- 

law at redhat dot com changed:

   What|Removed |Added

 CC||law at redhat dot com


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21829