[Bug tree-optimization/18704] [4.0 Regression] Inlining limits cause 340% performance regression
--- Additional Comments From hubicka at ucw dot cz 2004-12-07 17:50 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> On Tue, 7 Dec 2004, Richard Guenther wrote:
>
> > static inline void foo() {}
> > void bar() { foo(); }
> >
> > which for -O2 -fprofile-generate produces
> >
> > bar:
> >         addl    $1, .LPBX1
> >         pushl   %ebp
> >         movl    %esp, %ebp
> >         adcl    $0, .LPBX1+4
> >         addl    $1, .LPBX1+16
> >         popl    %ebp
> >         adcl    $0, .LPBX1+20
> >         addl    $1, .LPBX1+8
> >         adcl    $0, .LPBX1+12
> >         ret
>
> Mainline manages to produce
>
> bar:
>         addl    $1, .LPBX1
>         pushl   %ebp
>         movl    %esp, %ebp
>         adcl    $0, .LPBX1+4
>         popl    %ebp
>         ret
>
> but that's RTL instrumentation?

It is instrumentation after inlining. Before inlining you have two
functions, so you get two entry points. Doing a little inlining before
profiling would do the trick here, but it needs some restructuring first.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-07 15:35 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On Tue, 7 Dec 2004, Richard Guenther wrote:

> static inline void foo() {}
> void bar() { foo(); }
>
> which for -O2 -fprofile-generate produces
>
> bar:
>         addl    $1, .LPBX1
>         pushl   %ebp
>         movl    %esp, %ebp
>         adcl    $0, .LPBX1+4
>         addl    $1, .LPBX1+16
>         popl    %ebp
>         adcl    $0, .LPBX1+20
>         addl    $1, .LPBX1+8
>         adcl    $0, .LPBX1+12
>         ret

Mainline manages to produce

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        popl    %ebp
        ret

but that's RTL instrumentation?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-07 15:09 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 7 Dec 2004, hubicka at ucw dot cz wrote:

> > Yes, it seems so. Really nice improvement. Though profiling is
> > sloow. I guess you avoid doing any CFG changing transformation
> > for the profiling stage? I.e. not even inline the simplest functions?
>
> I can inline, but only after actually instrumenting the functions. That
> should minimize the costs, but I also noticed that tramp3d is
> surprisingly a lot slower with profiling.
>
> > That would be the reason the Intel compiler is unusable with profiling
> > for me. -fprofile-generate comes with a 50fold increase in runtime!
>
> -fprofile-generate is actually a package of
> -fprofile-arcs/-fprofile-values + -fprofile-values-transformations.
> It might be interesting to figure out whether -fprofile-arcs itself
> brings a similar slowdown. The only reason why this can happen I can
> think of is the fact that after instrumenting we again inline a lot
> less, or we produce too many redundant counters. Perhaps it would make
> sense to think about inlining functions that reduce code size before
> instrumenting, as we would do that anyway, but it will be tricky to get
> gcov output and -f* flags independence right then.

Hm. There are a lot of counters - maybe it is possible to merge the
counters themselves? The resulting asm of tramp3d-v3 consists of 30%
addl/adcl lines for adding the profiling counts - where the total number
of lines is just wc -l of a -S -fverbose-asm compilation. That is a lot.
And the additions are in a cache unfriendly sequence, too - dunno which
optimization pass could improve this though.

Consider

static inline void foo() {}
void bar() { foo(); }

which for -O2 -fprofile-generate produces

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+16
        popl    %ebp
        adcl    $0, .LPBX1+20
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        ret

that should be

bar:
        addl    $1, .LPBX1
        pushl   %ebp
        movl    %esp, %ebp
        adcl    $0, .LPBX1+4
        addl    $1, .LPBX1+8
        adcl    $0, .LPBX1+12
        addl    $1, .LPBX1+16
        adcl    $0, .LPBX1+20
        ret

And of course all three counters could be merged. But that would need a
changed gcov file format somehow representing a callgraph with merged
edges.

The Intel compiler is so much worse here because all the counter adding
is done thread-safe in a library (i.e. they have an extra call for every
edge and do not do any inlining).

> How does our profiling performance compare to ICC?

ICC is a lot worse. ICC with -prof_gen causes a 1 fold slowdown (if the
current snapshot of icc doesn't segfault compiling the tramp3d testcase)
- ICC is completely unusable for me. So - GCC is great!

> > I can produce some numbers for the tramp testcase.
>
> Thanks! Note that with changing the flags you should not need to
> re-profile now so you can save quite a lot of time.

Ah, that's indeed nice.

Richard.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-12-07 14:52 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> Yes, it seems so. Really nice improvement. Though profiling is
> sloow. I guess you avoid doing any CFG changing transformation
> for the profiling stage? I.e. not even inline the simplest functions?
> That would be the reason the Intel compiler is unusable with profiling
> for me. -fprofile-generate comes with a 50fold increase in runtime!

Also it might be possible to change

  NEXT_PASS (pass_tree_profile);
  NEXT_PASS (pass_cleanup_cfg);

into

  NEXT_PASS (pass_cleanup_cfg);
  NEXT_PASS (pass_tree_profile);
  NEXT_PASS (pass_cleanup_cfg);

in tree-optimize.c to get the cfg cleaned up. In theory it should not
have much of an effect, since the profiling code is already smart enough
not to instrument edges that are redundant control-flow wise, but
perhaps it is not doing it all the time. The cleanup is prevented there
to avoid problems with inexact coverage info, but it is not unthinkable
to extend cfgcleanup to be coverage-info safe, or to execute it when
-fprofile-generate is used without -ftest-coverage, if it makes any
difference.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-12-07 14:49 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> On 6 Dec 2004, hubicka at ucw dot cz wrote:
>
> > Looks like I get 4fold speedup on tree profiling with profiling
> > compared to tree profiling on mainline, that is equivalent to the
> > speedup you are seeing for the leafify patch. That sounds pretty
> > promising (so the new heuristics can get the leafify idea without the
> > hint from the user and without hitting the code growth problems).
>
> Yes, it seems so. Really nice improvement. Though profiling is
> sloow. I guess you avoid doing any CFG changing transformation
> for the profiling stage? I.e. not even inline the simplest functions?

I can inline, but only after actually instrumenting the functions. That
should minimize the costs, but I also noticed that tramp3d is
surprisingly a lot slower with profiling.

> That would be the reason the Intel compiler is unusable with profiling
> for me. -fprofile-generate comes with a 50fold increase in runtime!

-fprofile-generate is actually a package of
-fprofile-arcs/-fprofile-values + -fprofile-values-transformations.
It might be interesting to figure out whether -fprofile-arcs itself
brings a similar slowdown. The only reason why this can happen I can
think of is the fact that after instrumenting we again inline a lot
less, or we produce too many redundant counters. Perhaps it would make
sense to think about inlining functions that reduce code size before
instrumenting, as we would do that anyway, but it will be tricky to get
gcov output and -f* flags independence right then.

How does our profiling performance compare to ICC?

> I can produce some numbers for the tramp testcase.

Thanks! Note that with changing the flags you should not need to
re-profile now, so you can save quite a lot of time.

> Do you have some written overview of the cost model?

Not really, but it is simple for the moment. To estimate the size of a
function I use a simple walk of the function body, counting most nodes
as 1; divisions, calls and similar baddies as 10; and NOPs and constants
as 0. When profiling, the priority of an inlining edge is the number of
executions divided by the estimated growth (size of the callee minus
10); when not profiling, it is the overall growth after inlining into
all callees (i.e. I count the number of callees one can inline into and
multiply it by the size of the callee minus 10). You can see the
inlining decisions with -fdump-ipa-inline.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-07 14:35 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> Looks like I get 4fold speedup on tree profiling with profiling
> compared to tree profiling on mainline, that is equivalent to the
> speedup you are seeing for the leafify patch. That sounds pretty
> promising (so the new heuristics can get the leafify idea without the
> hint from the user and without hitting the code growth problems).

Yes, it seems so. Really nice improvement. Though profiling is
sloow. I guess you avoid doing any CFG changing transformation
for the profiling stage? I.e. not even inline the simplest functions?
That would be the reason the Intel compiler is unusable with profiling
for me. -fprofile-generate comes with a 50fold increase in runtime!

> It would be nice to experiment with this a little - in general the
> heuristics can be viewed as having three players. There are the limits
> (specified via --param) that it must obey, there is the cost model
> (estimated growth for inlining into all callees without profiling, and
> the execute_count to estimated growth ratio for inlining into one call
> with profiling) and the bin packing algorithm optimizing the gains
> while obeying the limits.
>
> With profiling in, the cost model is pretty much realistic and it
> would be nice to figure out how the performance behaves when the
> individual limits are changed and why. If you have some time for
> experimentation, it would be very useful. I am trying to do the same
> with SPEC and GCC but I have difficulty playing with pooma or Gerald's
> application as I have little understanding of what is going on there.
> I will try it myself next, but any feedback can be very useful here.

I can produce some numbers for the tramp testcase.

> My plan is to try to understand the limits first and then try to get
> the cost model better without profiling, as it is a bit too clumsy to
> do both at once.

Do you have some written overview of the cost model?

Richard.

--
Richard Guenther
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-12-06 15:03 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> > > > the order of inlining decisions affecting this. I would be
> > > > curious how those results compare to leafify and whether the
> > > > 0m27s is not caused by misoptimization.
> > >
> > > You can check for misoptimization by looking at the final output.
> > > I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> > > will increase with the number of iterations.
> > >
> > > With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> > > -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> > > unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> > > leafification turned on; with it turned off, runtime increases
> > > to 0m31s with --param inline-unit-growth=175.
> >
> > I compiled with -O3; would it be possible for you to measure how
> > much speedup you get on mainline with -O3 and -O3+leafify? That
> > would probably allow me to relate those numbers somehow.
>
> 0m23s for -O3+leafify, 1m54s for -O3, 0m35s for -O3 --param
> inline-unit-growth=150.

Looks like I get a 4fold speedup on tree profiling with profiling
compared to tree profiling on mainline, that is equivalent to the
speedup you are seeing for the leafify patch. That sounds pretty
promising (so the new heuristics can get the leafify idea without the
hint from the user and without hitting the code growth problems).

It would be nice to experiment with this a little - in general the
heuristics can be viewed as having three players. There are the limits
(specified via --param) that it must obey, there is the cost model
(estimated growth for inlining into all callees without profiling, and
the execute_count to estimated growth ratio for inlining into one call
with profiling) and the bin packing algorithm optimizing the gains while
obeying the limits.

With profiling in, the cost model is pretty much realistic and it would
be nice to figure out how the performance behaves when the individual
limits are changed and why. If you have some time for experimentation,
it would be very useful. I am trying to do the same with SPEC and GCC
but I have difficulty playing with pooma or Gerald's application as I
have little understanding of what is going on there. I will try it
myself next, but any feedback can be very useful here.

My plan is to try to understand the limits first and then try to get the
cost model better without profiling, as it is a bit too clumsy to do
both at once.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-06 14:31 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> > > the order of inlining decisions affecting this. I would be curious
> > > how those results compare to leafify and whether the 0m27s is not
> > > caused by misoptimization.
> >
> > You can check for misoptimization by looking at the final output.
> > I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> > will increase with the number of iterations.
> >
> > With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> > -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> > unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> > leafification turned on; with it turned off, runtime increases
> > to 0m31s with --param inline-unit-growth=175.
>
> I compiled with -O3; would it be possible for you to measure how much
> speedup you get on mainline with -O3 and -O3+leafify? That would
> probably allow me to relate those numbers somehow.

0m23s for -O3+leafify, 1m54s for -O3, 0m35s for -O3 --param
inline-unit-growth=150.

Richard.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-12-06 13:40 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> On 6 Dec 2004, hubicka at ucw dot cz wrote:
>
> > The cfg inliner per se is not too interesting. What matters here is
> > the code size estimation and profitability estimation. I am playing
> > with this now and trying to get profile based inlining working.
>
> Yes, I guess the cfg inliner and some early dead code removal passes
> should improve code size metrics for stuff like
>
> template <class X>
> struct Foo
> {
>   enum { val = X::val };
>   void foo()
>   {
>     if (val)
>       ...
>     else
>       ...
>   }
> };
>
> with val being const.
>
> > For -n10 and tramp3d.cc I need 2m14s on mainline, 1m31s on the
> > current tree-profiling. With my new implementation I need 0m27s with
> > profile feedback and 2m53s without. I wonder what makes the new
> > heuristics work worse without profiling, but just increasing the
> > inline-unit-growth very slightly (to 155) I get 0m42s. This might be
> > just a little instability in
>
> Note that inline-unit-growth is 50 by default, so 155 is not slightly
> increased.

OK, I will play around with 55 then :)

> > the order of inlining decisions affecting this. I would be curious
> > how those results compare to leafify and whether the 0m27s is not
> > caused by misoptimization.
>
> You can check for misoptimization by looking at the final output.
> I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
> will increase with the number of iterations.
>
> With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
> -D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
> unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
> leafification turned on; with it turned off, runtime increases
> to 0m31s with --param inline-unit-growth=175.

I compiled with -O3; would it be possible for you to measure how much
speedup you get on mainline with -O3 and -O3+leafify? That would
probably allow me to relate those numbers somehow.

> > Unless I will observe it otherwise (on SPEC with intermodule), I
> > will apply my current patch and try to improve the profitability
> > analysis without profiling incrementally. Ideally we ought to build
> > an estimated profile and use it, but that needs some work, so for
> > the moment I guess I will try to experiment with making loop depth
> > available to the cgraph code.
>
> Yes, loops could be "auto-leafified", but it will be difficult to
> statically check if that is worthwhile.

I guess just increasing the priority for calls inside loops (something
like dividing the current cost estimation by the loop nest) would do a
good job for now, but first I need to convince myself that the new
rewrite does a reasonable job even for the current cost metric before
moving on.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-06 13:18 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 6 Dec 2004, hubicka at ucw dot cz wrote:

> The cfg inliner per se is not too interesting. What matters here is the
> code size estimation and profitability estimation. I am playing with
> this now and trying to get profile based inlining working.

Yes, I guess the cfg inliner and some early dead code removal passes
should improve code size metrics for stuff like

template <class X>
struct Foo
{
  enum { val = X::val };
  void foo()
  {
    if (val)
      ...
    else
      ...
  }
};

with val being const.

> For -n10 and tramp3d.cc I need 2m14s on mainline, 1m31s on the current
> tree-profiling. With my new implementation I need 0m27s with profile
> feedback and 2m53s without. I wonder what makes the new heuristics
> work worse without profiling, but just increasing the
> inline-unit-growth very slightly (to 155) I get 0m42s. This might be
> just a little instability in

Note that inline-unit-growth is 50 by default, so 155 is not slightly
increased.

> the order of inlining decisions affecting this. I would be curious how
> those results compare to leafify and whether the 0m27s is not caused
> by misoptimization.

You can check for misoptimization by looking at the final output.
I.e. the rh,vx,vy and vz sums should be nearly zero, the T sum
will increase with the number of iterations.

With mainline, -O2 -fpeel-loops -march=pentium4 -ffast-math
-D__NO_MATH_INLINES (we still need explicit -fpeel-loops for
unrolling for (i=0;i<3;++i) a[i]=0;), I need 0m17s for -n 10 with
leafification turned on; with it turned off, runtime increases
to 0m31s with --param inline-unit-growth=175.

> Unless I will observe it otherwise (on SPEC with intermodule), I will
> apply my current patch and try to improve the profitability analysis
> without profiling incrementally. Ideally we ought to build an
> estimated profile and use it, but that needs some work, so for the
> moment I guess I will try to experiment with making loop depth
> available to the cgraph code.

Yes, loops could be "auto-leafified", but it will be difficult to
statically check if that is worthwhile.

Richard.

--
Richard Guenther
WWW: http://www.tat.physik.uni-tuebingen.de/~rguenth/

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-12-06 12:44 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:
>
> > No reason to keep this one open, there is PR 17863 still. Also note
> > I heard from Honza that the tree profiling branch with feedback can
> > optimize better than with your leafy patch.
>
> Wow, that would be cool. Does the tree-profiling branch contain the
> cfg inliner? I'll try it asap.

The cfg inliner per se is not too interesting. What matters here is the
code size estimation and profitability estimation. I am playing with
this now and trying to get profile based inlining working.

For -n10 and tramp3d.cc I need 2m14s on mainline, 1m31s on the current
tree-profiling. With my new implementation I need 0m27s with profile
feedback and 2m53s without. I wonder what makes the new heuristics work
worse without profiling, but just increasing the inline-unit-growth very
slightly (to 155) I get 0m42s. This might be just a little instability
in the order of inlining decisions affecting this. I would be curious
how those results compare to leafify and whether the 0m27s is not caused
by misoptimization.

Unless I will observe it otherwise (on SPEC with intermodule), I will
apply my current patch and try to improve the profitability analysis
without profiling incrementally. Ideally we ought to build an estimated
profile and use it, but that needs some work, so for the moment I guess
I will try to experiment with making loop depth available to the cgraph
code.

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-06 12:33 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:

> No reason to keep this one open, there is PR 17863 still.
> Also note I heard from Honza that the tree profiling branch with
> feedback can optimize better than with your leafy patch.

I tried the tree-profiling branch, and profile-based inlining is
actually worse than "normal" inlining with inline-unit-growth=150. Worse
by a factor of four. So, no cigar yet. And btw. profile based inlining
seems to be ignorant of inline-unit-growth (at least it doesn't improve
for greater values).

And generating the profile is _very_ slow (for the tramp3d testcase).
Runtime increases about 100 fold - not very good for creating a
meaningful profile.

Richard.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-12-06 09:53 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

On 6 Dec 2004, pinskia at gcc dot gnu dot org wrote:

> No reason to keep this one open, there is PR 17863 still. Also note I
> heard from Honza that the tree profiling branch with feedback can
> optimize better than with your leafy patch.

Wow, that would be cool. Does the tree-profiling branch contain the
cfg inliner? I'll try it asap.

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From pinskia at gcc dot gnu dot org 2004-12-06 05:20 ---
No reason to keep this one open, there is PR 17863 still. Also note I
heard from Honza that the tree profiling branch with feedback can
optimize better than with your leafy patch.

*** This bug has been marked as a duplicate of 17863 ***

--
          What       |Removed      |Added
----------------------------------------------------------------------
          Status     |UNCONFIRMED  |RESOLVED
          Resolution |             |DUPLICATE

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From hubicka at ucw dot cz 2004-11-29 14:06 ---
Subject: Re: [4.0 Regression] Inlining limits cause 340% performance regression

> Honza is the one that plays with inlining, I'm CC:ing him on this bug.

Well, I am not quite sure how much we can do here. The Pooma testcase is
a very unusual piece of code from the inliner's point of view, and the
current default of 50% compile unit growth is already much higher than
what most other compilers use (with intermodule the defaults tend to be
somewhere around 15%).

As the compile unit gets bigger, the overall unit growth limit matters
more (for instance for SPEC we never hit the overall growth limit when
compiling it one file at a time, but when doing IMA we hit it in most
programs), so making the limit arbitrarily high has a very large effect
on code size and compilation time, and sometimes it degrades speed as
well, as we run out of icache.

The fact that we fail to do a reasonable job on Pooma is more a result
of very poor analysis of the benefits of inlining (we probably inline
things that don't matter and miss the things that do). This is slowly
getting better on the tree-profiling branch (and adding infrastructure
for this is one of its main points), so it might be interesting to see
how it scales on this testcase.

I don't think bumping the overall unit growth too high is a good idea,
but perhaps we can figure out whether something is getting
overestimated...

Honza

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-11-29 12:10 ---
Documentation patches for 3.4 and mainline are here:

http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02457.html
http://gcc.gnu.org/ml/gcc-patches/2004-11/msg02551.html

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From giovannibajo at libero dot it 2004-11-29 11:36 ---
Honza is the one that plays with inlining, I'm CC:ing him on this bug.

--
          What |Removed |Added
----------------------------------------------------------------------
          CC   |        |hubicka at gcc dot gnu dot org

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From rguenth at tat dot physik dot uni-tuebingen dot de 2004-11-29 11:04 ---
Looking at the 3.4 branch, the defaults for the relevant inlining
parameters are the same. So the difference in performance has to be
accounted to different tree-node counting (or to differences in the
accounting during inlining).

As we throttle inlining params if -Os is specified in opts.c:

  if (optimize_size)
    {
      /* Inlining of very small functions usually reduces total size.  */
      set_param_value ("max-inline-insns-single", 5);
      set_param_value ("max-inline-insns-auto", 5);
      flag_inline_functions = 1;

may I suggest throttling inline-unit-growth there, too (though it
shouldn't have an effect with such a small max-inline-insns-single),
and then providing the documented limit (150) for inline-unit-growth?

One may even argue that limiting overall unit growth is not important,
as it is already limited by max-inline-insns-* and large-function-*.
Also, both inline-unit-growth and large-function-growth cause inlining
to stop at the threshold, leaving one with an unbalanced inlining
decision.

Why were these (growth) limits invented? Were there some particular
testcases that broke down otherwise?

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--- Additional Comments From pinskia at gcc dot gnu dot org 2004-11-28 18:22 ---
This is most likely the same as PR 17863.

--
          What              |Removed |Added
----------------------------------------------------------------------
          BugsThisDependsOn |        |17863

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704
--
          What             |Removed                    |Added
----------------------------------------------------------------------
          CC               |                           |pinskia at gcc dot gnu dot org
          Keywords         |                           |missed-optimization
          Summary          |Inlining limits cause 340% |[4.0 Regression] Inlining
                           |performance regression     |limits cause 340%
                           |                           |performance regression
          Target Milestone |---                        |4.0.0

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=18704