[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-07 17:50 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-07 15:35 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On Tue, 7 Dec 2004, Richard Guenther wrote:
> 
> > static inline void foo() {}
> > void bar() { foo(); }
> >
> > which for -O2 -fprofile-generate produces
> >
> > bar:
> >     addl    $1, .LPBX1
> >     pushl   %ebp
> >     movl    %esp, %ebp
> >     adcl    $0, .LPBX1+4
> >     addl    $1, .LPBX1+16
> >     popl    %ebp
> >     adcl    $0, .LPBX1+20
> >     addl    $1, .LPBX1+8
> >     adcl    $0, .LPBX1+12
> >     ret
> 
> Mainline manages to produce
> 
> bar:
>     addl    $1, .LPBX1
>     pushl   %ebp
>     movl    %esp, %ebp
>     adcl    $0, .LPBX1+4
>     popl    %ebp
>     ret
> 
> but that's RTL instrumentation?

It is instrumentation after inlining.  Before inlining you have two
functions, so you get two entry points.
Doing a little inlining before profiling would do the trick here, but it
needs some restructuring first.

Honza
> 
> 
> 
> -- 
> 
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
> 
> --- You are receiving this mail because: ---
> You are on the CC list for the bug, or are watching someone who is.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-07 15:35 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On Tue, 7 Dec 2004, Richard Guenther wrote:

> static inline void foo() {}
> void bar() { foo(); }
>
> which for -O2 -fprofile-generate produces
>
> bar:
> addl$1, .LPBX1
> pushl   %ebp
> movl%esp, %ebp
> adcl$0, .LPBX1+4
> addl$1, .LPBX1+16
> popl%ebp
> adcl$0, .LPBX1+20
> addl$1, .LPBX1+8
> adcl$0, .LPBX1+12
> ret

Mainline manages to produce

bar:
addl$1, .LPBX1
pushl   %ebp
movl%esp, %ebp
adcl$0, .LPBX1+4
popl%ebp
ret

but that's RTL instrumentation?





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-07 15:09 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 7 Dec 2004, hubicka at ucw dot cz wrote:

> > Yes, it seems so.  Really nice improvement.  Though profiling is
> > sloow.  I guess you avoid doing any CFG changing transformation
> > for the profiling stage?  I.e. not even inline the simplest functions?
>
> I can inline, but only after actually instrumenting the functions.  That
> should minimize the costs, but I also noticed that tramp3d is
> surprisingly a lot slower with profiling.
>
> > That would be the reason the Intel compiler is unusable with profiling
> > for me.  -fprofile-generate comes with a 50fold increase in runtime!
>
> -fprofile-generate is actually a package of
> -fprofile-arcs/-fprofile-values + -fprofile-values-transformations.
> It might be interesting to figure out whether -fprofile-arcs itself
> brings a similar slowdown.  The only reason I can think of for this is
> that after instrumenting we again inline a lot less, or we produce too
> many redundant counters.  Perhaps it would make sense to think about
> inlining functions that reduce code size before instrumenting, as we
> would do that anyway, but it will be tricky to get gcov output and
> -f* flag independence right then.

Hm.  There are a lot of counters - maybe it is possible to merge
the counters themselves?  The resulting asm of tramp3d-v3 consists
of 30% addl/adcl lines for adding the profiling counts - where
the total number of lines is just wc -l of a -S -fverbose-asm compilation.
That is an awful lot.  And the additions are in a cache-unfriendly
sequence, too - I don't know which optimization pass could improve this,
though.  Consider

static inline void foo() {}
void bar() { foo(); }

which for -O2 -fprofile-generate produces

bar:
    addl    $1, .LPBX1
    pushl   %ebp
    movl    %esp, %ebp
    adcl    $0, .LPBX1+4
    addl    $1, .LPBX1+16
    popl    %ebp
    adcl    $0, .LPBX1+20
    addl    $1, .LPBX1+8
    adcl    $0, .LPBX1+12
    ret

that should be

bar:
    addl    $1, .LPBX1
    pushl   %ebp
    movl    %esp, %ebp
    adcl    $0, .LPBX1+4
    addl    $1, .LPBX1+8
    adcl    $0, .LPBX1+12
    addl    $1, .LPBX1+16
    adcl    $0, .LPBX1+20
    ret

And of course all three counters could be merged.  But that
would need a changed gcov file format that somehow represents a
callgraph with merged edges.

The Intel compiler is so much worse here because all the
counter updating is done thread-safely in a library (i.e. they
have an extra call for every edge and do not do any inlining).

> How does our profiling performance compare to ICC?

ICC is a lot worse.  ICC with -prof_gen causes a 1 fold slowdown
(if the current snapshot of icc doesn't segfault compiling the tramp3d
testcase) - ICC is completely unusable for me.  So - GCC is great!

> > > It would be nice to experiment with this a little - in general the
> > > heuristics can be viewed as having three players.  There are the limits
> > > (specified via --param) that it must obey, there is the cost model
> > > (estimated growth for inlining into all callees without profiling, and
> > > the ratio of execute_count to estimated growth for inlining into one
> > > call with profiling), and the bin-packing algorithm optimizing the
> > > gains while obeying the limits.
> > >
> > > With profiling, the cost model is pretty realistic, and it would be
> > > nice to figure out how the performance behaves when the individual
> > > limits are changed and why.  If you have some time for experimentation,
> > > it would be very useful.  I am trying to do the same with SPEC and GCC,
> > > but I have difficulty playing with POOMA or Gerald's application as I
> > > have little understanding of what is going on there.  I will try it
> > > myself next, but any feedback would be very useful here.
> >
> > I can produce some numbers for the tramp testcase.
> Thanks!  Note that when changing the flags you should not need to
> re-profile now, so you can save quite a lot of time.

Ah, that's indeed nice.

Richard.

--
Richard Guenther 
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-07 14:52 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-07 14:35 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On 6 Dec 2004, hubicka at ucw dot cz wrote:
> 
> > Looks like I get 4fold speedup on tree profiling with profiling compared
> > to tree profiling on mainline that is equivalent to speedup you are
> > seeing for leafify patch. That sounds pretty prommising (so the new
> > heuristics can get the leafify idea without the hint from user and
> > hitting the code growth problems).
> 
> Yes, it seems so.  Really nice improvement.  Though profiling is
> sloow.  I guess you avoid doing any CFG changing transformation
> for the profiling stage?  I.e. not even inline the simplest functions?
> That would be the reason the Intel compiler is unusable with profiling
> for me.  -fprofile-generate comes with a 50fold increase in runtime!

Also it might be possible to change
  NEXT_PASS (pass_tree_profile);
  NEXT_PASS (pass_cleanup_cfg);
into
  NEXT_PASS (pass_cleanup_cfg);
  NEXT_PASS (pass_tree_profile);
  NEXT_PASS (pass_cleanup_cfg);
in tree-optimize.c to get the CFG cleaned up.  In theory it should not
have much of an effect, since the profiling code is already smart enough
not to instrument edges that are redundant control-flow-wise, but perhaps
it is not doing that all the time.  The cleanup is prevented there to
avoid problems with inexact coverage info, but it is not unthinkable to
extend cfgcleanup to be coverage-info safe, or to execute it when
-fprofile-generate is used without -ftest-coverage, if it makes any
difference.

Honza
> 
> > It would be nice to experiment with this a little - in general the
> > heuristics can be viewed as having three players.  There are the limits
> > (specified via --param) that it must obey, there is the cost model
> > (estimated growth for inlining into all callees without profiling, and
> > the ratio of execute_count to estimated growth for inlining into one
> > call with profiling), and the bin-packing algorithm optimizing the
> > gains while obeying the limits.
> >
> > With profiling, the cost model is pretty realistic, and it would be
> > nice to figure out how the performance behaves when the individual
> > limits are changed and why.  If you have some time for experimentation,
> > it would be very useful.  I am trying to do the same with SPEC and GCC,
> > but I have difficulty playing with POOMA or Gerald's application as I
> > have little understanding of what is going on there.  I will try it
> > myself next, but any feedback would be very useful here.
> 
> I can produce some numbers for the tramp testcase.
> 
> > My plan is to try to understand the limits first and then try to get
> > the cost model better without profiling, as it is a bit too clumsy to
> > do both at once.
> 
> Do you have some written overview of the cost model?
> 
> Richard.
> 
> --
> Richard Guenther 
> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
> 
> 
> 




[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-07 14:49 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-07 14:35 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On 6 Dec 2004, hubicka at ucw dot cz wrote:
> 
> > Looks like I get 4fold speedup on tree profiling with profiling compared
> > to tree profiling on mainline that is equivalent to speedup you are
> > seeing for leafify patch. That sounds pretty prommising (so the new
> > heuristics can get the leafify idea without the hint from user and
> > hitting the code growth problems).
> 
> Yes, it seems so.  Really nice improvement.  Though profiling is
> sloow.  I guess you avoid doing any CFG changing transformation
> for the profiling stage?  I.e. not even inline the simplest functions?

I can inline, but only after actually instrumenting the functions.  That
should minimize the costs, but I also noticed that tramp3d is
surprisingly a lot slower with profiling.

> That would be the reason the Intel compiler is unusable with profiling
> for me.  -fprofile-generate comes with a 50fold increase in runtime!

-fprofile-generate is actually a package of
-fprofile-arcs/-fprofile-values + -fprofile-values-transformations.
It might be interesting to figure out whether -fprofile-arcs itself
brings a similar slowdown.  The only reason I can think of for this is
that after instrumenting we again inline a lot less, or we produce too
many redundant counters.  Perhaps it would make sense to think about
inlining functions that reduce code size before instrumenting, as we
would do that anyway, but it will be tricky to get gcov output and
-f* flag independence right then.

How does our profiling performance compare to ICC?
> 
> > It would be nice to experiment with this a little - in general the
> > heuristics can be viewed as having three players.  There are the limits
> > (specified via --param) that it must obey, there is the cost model
> > (estimated growth for inlining into all callees without profiling, and
> > the ratio of execute_count to estimated growth for inlining into one
> > call with profiling), and the bin-packing algorithm optimizing the
> > gains while obeying the limits.
> >
> > With profiling, the cost model is pretty realistic, and it would be
> > nice to figure out how the performance behaves when the individual
> > limits are changed and why.  If you have some time for experimentation,
> > it would be very useful.  I am trying to do the same with SPEC and GCC,
> > but I have difficulty playing with POOMA or Gerald's application as I
> > have little understanding of what is going on there.  I will try it
> > myself next, but any feedback would be very useful here.
> 
> I can produce some numbers for the tramp testcase.
Thanks!  Note that when changing the flags you should not need to
re-profile now, so you can save quite a lot of time.
> 
> > My plan is to try undersand the limits first and then try to get the
> > cost model better without profiling as it is bit too clumpsy to do both
> > at once.
> 
> Do you have some written overview of the cost model?

Not really, but it is simple for the moment.  To estimate the size of a
function I use a simple walk of the function body, counting most nodes
as 1, divisions, calls and similar baddies as 10, and NOPs and constants
as 0.  When profiling, the priority of an inlining edge is the number of
executions divided by the estimated growth (size of the callee minus 10);
when not profiling, it is the overall growth after inlining into all
callees (i.e. I count the number of callees one can inline into and
multiply it by the size of the callee minus 10).

You can see the inlining decisions with -fdump-ipa-inline.

Honza
> 
> Richard.
> 
> --
> Richard Guenther 
> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
> 
> 
> 




[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-07 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-07 14:35 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> Looks like I get 4fold speedup on tree profiling with profiling compared
> to tree profiling on mainline that is equivalent to speedup you are
> seeing for leafify patch. That sounds pretty prommising (so the new
> heuristics can get the leafify idea without the hint from user and
> hitting the code growth problems).

Yes, it seems so.  Really nice improvement.  Though profiling is
sloow.  I guess you avoid doing any CFG-changing transformation
for the profiling stage?  I.e. not even inlining the simplest functions?
That would be the reason the Intel compiler is unusable with profiling
for me.  -fprofile-generate comes with a 50-fold increase in runtime!

> It would be nice to experiment with this a little - in general the
> heuristics can be viewed as having three players.  There are the limits
> (specified via --param) that it must obey, there is the cost model
> (estimated growth for inlining into all callees without profiling, and
> the ratio of execute_count to estimated growth for inlining into one
> call with profiling), and the bin-packing algorithm optimizing the
> gains while obeying the limits.
>
> With profiling, the cost model is pretty realistic, and it would be
> nice to figure out how the performance behaves when the individual
> limits are changed and why.  If you have some time for experimentation,
> it would be very useful.  I am trying to do the same with SPEC and GCC,
> but I have difficulty playing with POOMA or Gerald's application as I
> have little understanding of what is going on there.  I will try it
> myself next, but any feedback would be very useful here.

I can produce some numbers for the tramp testcase.

> My plan is to try to understand the limits first and then try to get
> the cost model better without profiling, as it is a bit too clumsy to
> do both at once.

Do you have some written overview of the cost model?

Richard.

--
Richard Guenther 
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-06 15:03 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-06 14:31 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On 6 Dec 2004, hubicka at ucw dot cz wrote:
> 
> > > > the order of inlining decisions affecting this.  I would be curious how
> > > > those results compare to leafify and whether the 0m27s is not caused by
> > > > missoptimization.
> > >
> > > You can check for misoptimization by looking at the final output.
> > > I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> > > will increase with the number of iterations.
> > >
> > > With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> > > -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> > > unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> > > leafification turned on, with it turned off, runtime increases
> > > to 0m31s with --param inline-unit-growth=175.
> >
> > I compiled with -O3, would be possible for you to measure how much
> > speedup you get on mainline with -O3 and -O3+lefify?  That would
> > probably allow me relate those numbers somehow.
> 
> 0m23s for -O3+leafify, 1m54s for -O3, 0m35s for -O3 --param
> inline-unit-growth=150.

Looks like I get a 4-fold speedup on tree-profiling with profiling
compared to tree-profiling on mainline, which is equivalent to the
speedup you are seeing with the leafify patch.  That sounds pretty
promising (so the new heuristics can get the leafify idea without the
hint from the user and without hitting the code growth problems).

It would be nice to experiment with this a little - in general the
heuristics can be viewed as having three players.  There are the limits
(specified via --param) that it must obey, there is the cost model
(estimated growth for inlining into all callees without profiling, and
the ratio of execute_count to estimated growth for inlining into one
call with profiling), and the bin-packing algorithm optimizing the
gains while obeying the limits.

With profiling, the cost model is pretty realistic, and it would be
nice to figure out how the performance behaves when the individual
limits are changed and why.  If you have some time for experimentation,
it would be very useful.  I am trying to do the same with SPEC and GCC,
but I have difficulty playing with POOMA or Gerald's application as I
have little understanding of what is going on there.  I will try it
myself next, but any feedback would be very useful here.

My plan is to try to understand the limits first and then try to get the
cost model better without profiling, as it is a bit too clumsy to do both
at once.

Honza
> 
> Richard.
> 
> 
> 




[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-06 14:31 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> > > the order of inlining decisions affecting this.  I would be curious how
> > > those results compare to leafify and whether the 0m27s is not caused by
> > > missoptimization.
> >
> > You can check for misoptimization by looking at the final output.
> > I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> > will increase with the number of iterations.
> >
> > With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> > -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> > unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> > leafification turned on, with it turned off, runtime increases
> > to 0m31s with --param inline-unit-growth=175.
>
> I compiled with -O3; would it be possible for you to measure how much
> speedup you get on mainline with -O3 and -O3+leafify?  That would
> probably allow me to relate those numbers somehow.

0m23s for -O3+leafify, 1m54s for -O3, 0m35s for -O3 --param
inline-unit-growth=150.

Richard.





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-06 13:40 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-06 13:18 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On 6 Dec 2004, hubicka at ucw dot cz wrote:
> 
> > The cfg inliner per se is not too interesting.  What matters here is the
> > code size esitmation and profitability estimation.  I am playing with
> > this now and trying to get profile based inlining working.
> 
> Yes, I guess the cfg inliner and some early dead code removal passes
> should improve code size metrics for stuff like
> 
> template <class X>
> struct Foo
> {
>   enum { val = X::val };
>   void foo()
>   {
>     if (val)
>       ...
>     else
>       ...
>   }
> };
> 
> with val being const.
> 
> > For -n10 and tramp3d.cc I need 2m14s on mainline, 1m31s on the current
> > tree-profiling.  With my new implementation I need 0m27s with profile
> > feedback and 2m53s without.  I wonder what makes the new heuristics work
> > worse without profiling, but just increasing the inline-unit-growth very
> > slightly (to 155) I get 0m42s.  This might be just a little instability in
> 
> Note that inline-unit-growth is 50 by default, so 155 is not slightly
> increased.
OK, I will play around with 55 then :)
> 
> > the order of inlining decisions affecting this.  I would be curious how
> > those results compare to leafify and whether the 0m27s is not caused by
> > misoptimization.
> 
> You can check for misoptimization by looking at the final output.
> I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> will increase with the number of iterations.
> 
> With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> leafification turned on, with it turned off, runtime increases
> to 0m31s with --param inline-unit-growth=175.

I compiled with -O3; would it be possible for you to measure how much
speedup you get on mainline with -O3 and -O3+leafify?  That would
probably allow me to relate those numbers somehow.
> 
> > Unless I will observe it otherwise (on SPEC with intermodule), I will
> > apply my current patch and try to improve the profitability analysis
> > without profiling incrementally.  Ideally we ought to build estimated
> > profile and use it, but that needs some work so for the moment I guess I
> > will try to experiment with making loop depth available to the cgraph
> > code.
> 
> Yes, loops could be "auto-leafified", but it will be difficult to
> statically check if that is worthwhile.

I guess just increasing the priority for calls inside loops (something
like dividing the current cost estimate by the loop nest depth) would do
a good job for now, but first I need to convince myself that the new
rewrite does a reasonable job even for the current cost metric before
moving on.

Honza
> 
> Richard.
> 
> --
> Richard Guenther 
> WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/
> 
> 
> 




[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-06 13:18 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> The cfg inliner per se is not too interesting.  What matters here is the
> code size estimation and profitability estimation.  I am playing with
> this now and trying to get profile based inlining working.

Yes, I guess the cfg inliner and some early dead code removal passes
should improve code size metrics for stuff like

template <class X>
struct Foo
{
  enum { val = X::val };
  void foo()
  {
    if (val)
      ...
    else
      ...
  }
};

with val being const.

> For -n10 and tramp3d.cc I need 2m14s on mainline, 1m31s on the current
> tree-profiling.  With my new implementation I need 0m27s with profile
> feedback and 2m53s without.  I wonder what makes the new heuristics work
> worse without profiling, but just increasing the inline-unit-growth very
> slightly (to 155) I get 0m42s.  This might be just a little instability in

Note that inline-unit-growth is 50 by default, so 155 is not slightly
increased.

> the order of inlining decisions affecting this.  I would be curious how
> those results compare to leafify and whether the 0m27s is not caused by
> misoptimization.

You can check for misoptimization by looking at the final output.
I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
will increase with the number of iterations.

With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
-D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
leafification turned on, with it turned off, runtime increases
to 0m31s with --param inline-unit-growth=175.

> Unless I will observe it otherwise (on SPEC with intermodule), I will
> apply my current patch and try to improve the profitability analysis
> without profiling incrementally.  Ideally we ought to build estimated
> profile and use it, but that needs some work so for the moment I guess I
> will try to experiment with making loop depth available to the cgraph
> code.

Yes, loops could be "auto-leafified", but it will be difficult to
statically check if that is worthwhile.

Richard.

--
Richard Guenther 
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-12-06 12:44 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
> dot de  2004-12-06 09:53 ---
> Subject: Re:  [4.0 Regression] Inlining limits
>  cause 340% performance regression
> 
> On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:
> 
> > No reason to keep this one open, there is PR 17863 still.  Also note I 
> > heard from Honza that the tree
> > profiling branch with feedback can optimizate better than with your leafy 
> > patch.
> 
> Wow, that would be cool.  Does the tree-profiling branch contain the
> cfg inliner?  I'll try it asap.

The cfg inliner per se is not too interesting.  What matters here is the
code size estimation and profitability estimation.  I am playing with
this now and trying to get profile-based inlining working.

For -n10 and tramp3d.cc I need 2m14s on mainline and 1m31s on the
current tree-profiling.  With my new implementation I need 0m27s with
profile feedback and 2m53s without.  I wonder what makes the new
heuristics work worse without profiling, but just by increasing the
inline-unit-growth very slightly (to 155) I get 0m42s.  This might be
just a little instability in the order of inlining decisions affecting
this.  I would be curious how those results compare to leafify and
whether the 0m27s is not caused by misoptimization.

Unless I observe otherwise (on SPEC with intermodule), I will apply my
current patch and try to improve the profitability analysis without
profiling incrementally.  Ideally we ought to build an estimated profile
and use it, but that needs some work, so for the moment I guess I will
try to experiment with making the loop depth available to the cgraph
code.

Honza
> 
> 
> 




[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-06 12:33 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:

> No reason to keep this one open, there is PR 17863 still.
> Also note I heard from Honza that the tree
> profiling branch with feedback can optimize better than with your
> leafy patch.

I tried the tree-profiling branch, and profile-based inlining is actually
worse than "normal" inlining with inline-unit-growth=150.  Worse by
a factor of four.  So, no cigar yet.

And btw. profile-based inlining seems to be ignorant of inline-unit-growth
(at least it doesn't improve for greater values).

And generating the profile is _very_ slow (for the tramp3d testcase).
Runtime increases about 100-fold - not very good for creating a meaningful
profile.

Richard.





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-06 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-12-06 09:53 ---
Subject: Re:  [4.0 Regression] Inlining limits
 cause 340% performance regression

On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:

> No reason to keep this one open, there is PR 17863 still.  Also note I heard 
> from Honza that the tree
> profiling branch with feedback can optimize better than with your leafy 
> patch.

Wow, that would be cool.  Does the tree-profiling branch contain the
cfg inliner?  I'll try it asap.





[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-12-05 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2004-12-06 
05:20 ---
No reason to keep this one open, there is PR 17863 still.  Also note I heard
from Honza that the tree-profiling branch with feedback can optimize better
than with your leafy patch.

*** This bug has been marked as a duplicate of 17863 ***

-- 
   What|Removed |Added

 Status|UNCONFIRMED |RESOLVED
 Resolution||DUPLICATE


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-29 Thread hubicka at ucw dot cz

--- Additional Comments From hubicka at ucw dot cz  2004-11-29 14:06 ---
Subject: Re:  [4.0 Regression] Inlining limits cause 340% performance regression

> 
> --- Additional Comments From giovannibajo at libero dot it  2004-11-29 
> 11:36 ---
> Honza is the one who plays with inlining; I'm CC:ing him on this bug.

Well, I am not quite sure how much we can do here.  The Pooma testcase is
a very unusual piece of code from the inliner's point of view, and the
current default of 50% compile-unit growth is already much higher than
what most other compilers use (with intermodule compilation the defaults
tend to be somewhere around 15%).

As the compile unit gets bigger, the overall unit growth becomes more
important (for instance, for SPEC we never hit the overall growth limit
when compiling it file by file, but when doing IMA we hit it in most
programs), so making the limit arbitrarily high has a very large effect on
code size and compilation time, and sometimes it degrades speed as well
because we run out of icache.

The fact that we fail to do a reasonable job on Pooma is more a result of
very poor analysis of the benefits of inlining (we probably inline
things that don't matter and miss the things that do).  This is slowly
getting better on the tree-profiling branch (adding infrastructure for
this is one of its main points), so it might be interesting to see
how it scales on this testcase.

I don't think bumping the overall unit growth too high is a good idea,
but perhaps we can figure out if something is getting overestimated...

Honza
> 
> -- 
>What|Removed |Added
> 
>  CC||hubicka at gcc dot gnu dot
>||org
> 
> 
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
> 


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-29 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-11-29 12:10 ---
Documentation patches for 3.4 and mainline are here:

http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02457.html
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02551.html

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-29 Thread giovannibajo at libero dot it

--- Additional Comments From giovannibajo at libero dot it  2004-11-29 
11:36 ---
Honza is the one who plays with inlining; I'm CC:ing him on this bug.

-- 
   What|Removed |Added

 CC||hubicka at gcc dot gnu dot
   ||org


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-29 Thread rguenth at tat dot physik dot uni-tuebingen dot de

--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen 
dot de  2004-11-29 11:04 ---
Looking at the 3.4 branch, the defaults for the relevant inlining parameters are
the same.  So the difference in performance has to be attributed to different
tree-node counting (or to differences in the accounting during inlining).

As we throttle inlining params if -Os is specified in opts.c:

  if (optimize_size)
    {
      /* Inlining of very small functions usually reduces total size.  */
      set_param_value ("max-inline-insns-single", 5);
      set_param_value ("max-inline-insns-auto", 5);
      flag_inline_functions = 1;
    }

may I suggest throttling inline-unit-growth there, too (though it
shouldn't have an effect with max-inline-insns-single this small), and
then providing the documented limit (150) for inline-unit-growth?
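Concretely, the suggested throttle could look like the following sketch of the opts.c block quoted above (the value 20 is an illustrative guess on my part, not a tuned number):

```c
  if (optimize_size)
    {
      /* Inlining of very small functions usually reduces total size.  */
      set_param_value ("max-inline-insns-single", 5);
      set_param_value ("max-inline-insns-auto", 5);
      /* Also keep overall unit growth small when optimizing for size.  */
      set_param_value ("inline-unit-growth", 20);
      flag_inline_functions = 1;
    }
```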

One may even argue that limiting overall unit growth is not important,
as it is already limited by max-inline-insns-* and large-function-*.
Also, both inline-unit-growth and large-function-growth cause inlining
to stop at the threshold, leaving one with unbalanced inlining decisions.

Why were these (growth) limits invented?  Were there some particular testcases
that broke down otherwise?

-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-28 Thread pinskia at gcc dot gnu dot org

--- Additional Comments From pinskia at gcc dot gnu dot org  2004-11-28 
18:22 ---
This is most likely the same as PR 17863.

-- 
   What|Removed |Added

  BugsThisDependsOn||17863


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704


[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression

2004-11-28 Thread pinskia at gcc dot gnu dot org


-- 
   What|Removed |Added

 CC||pinskia at gcc dot gnu dot
   ||org
   Keywords||missed-optimization
Summary|Inlining limits cause 340%  |[4.0 Regression] Inlining
   |performance regression  |limits cause 340%
   ||performance regression
   Target Milestone|--- |4.0.0


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704