Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/29/07, Mark Mitchell [EMAIL PROTECTED] wrote:
> It doesn't need to be a small testcase. If you have a preprocessed source file and a command-line, I'm sure one of the GCC developers would be able to analyze the situation. We're all good at isolating problems, even starting with big complicated inputs.

This is now known as PR 30627:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30627

PS: Thanks to Vladimir for his input.
remarks about g++ 4.3 and some comparison to msvc icc on ia32
Let it be clear from the start this is a potshot and while those trends aren't exactly new or specific to my code, i haven't tried to provide anything but specific data from one of my apps, on win32/cygwin.

Primo, gcc getting much better wrt inlining exacerbates the fact that it's not as good as other compilers at shrinking the stack frame size, and perhaps, as was suggested by Uros when discussing that point, a pass to address that would make sense. As i'm too lazy to properly measure cruft across multiple compilers, i'll use my rtrt app where i mostly control large scale inlining by hand.

    objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10

    msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
    icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
    gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

That's with msvc8 sp1, icc 9.1.033, g++ 4.3-20070119, each compiler being configured to optimize as much as possible for speed. That confirms what i see when checking codegen for specific functions.

Secundo, while i very much appreciate the brand new string ops, it seems that on ia32 some array initialization cases were left out, hence i still see oodles of 'movl $0x0' when generating code for k8. Also those zeroings get coalesced at the top of functions on ia32, and i have a function where there's 3 pages of those right after prologue. See the attached 'movl $0x0' grep dump.

Attachment: movl0.S.bz2 (BZip2 compressed data)
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, tbp [EMAIL PROTECTED] wrote:
> Let it be clear from the start this is a potshot and while those trends aren't exactly new or specific to my code, i haven't tried to provide anything but specific data from one of my apps, on win32/cygwin.
>
> Primo, gcc getting much better wrt inlining exacerbates the fact that it's not as good as other compilers at shrinking the stack frame size, and perhaps, as was suggested by Uros when discussing that point, a pass to address that would make sense. As i'm too lazy to properly measure cruft across multiple compilers, i'll use my rtrt app where i mostly control large scale inlining by hand.
>
>     objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
>
>     msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
>     icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
>     gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.

(If you are interested in stack size related to inlining you may want to tune --param large-stack-frame and --param large-stack-frame-growth).

Richard.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:
> On 1/28/07, tbp [EMAIL PROTECTED] wrote:
> >     objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
> >
> >     msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
> >     icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
> >     gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
>
> It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.

I should have known better than to post on a sunday morning. Sorry.

Those are the 10 largest stack allocations, sorted, in the binaries produced by each compiler (presuming most everything falls in place). Each time i verify codegen for a function across all 3, gcc always has the largest frame by a substantial amount (on ia32). And that's what that rigorous table is trying to demonstrate ;)

Basically i'm wondering if a stack frame shrinking pass
[ ] is possible,
[ ] makes no sense,
[ ] has been done,
[ ] is planned
etc...

> (If you are interested in stack size related to inlining you may want to tune --param large-stack-frame and --param large-stack-frame-growth).

Recently g++ 4.3 has started to complain about

    warning: inlining failed in call to 'xxx': --param large-stack-frame-growth limit reached [-Winline]

Bumping said large-function-growth by an ungodly amount did the trick. But it was the sure sign inlining was being fixed. There's much less need to babysit it, thanks a lot to whoever wrote those patches.
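For readers who would rather not decode the objdump/perl pipeline, here is a minimal Python sketch of the same measurement. It assumes AT&T-syntax disassembly text as produced by objdump -d; the function name largest_frames and the sample text are illustrative and not from the thread. Unlike the original pipeline, it sorts numerically rather than textually.

```python
import re

# Matches AT&T-syntax stack allocations such as "sub    $0x82c,%esp".
FRAME_RE = re.compile(r"\bsub\s+\$(0x[0-9a-fA-F]+),%esp")


def largest_frames(disassembly, count=10):
    """Return the `count` largest immediate stack allocations, largest first."""
    sizes = [int(m.group(1), 16) for m in FRAME_RE.finditer(disassembly)]
    return sorted(sizes, reverse=True)[:count]


# Hypothetical disassembly fragment, only for demonstration.
sample = (
    "  401000:\tsub    $0x82c,%esp\n"
    "  401010:\tsub    $0x10,%esp\n"
    "  401020:\tadd    $0x82c,%esp\n"
)

for size in largest_frames(sample):
    print(f"{size:4d}")
```

Like the original one-liner, this counts every `sub $imm,%esp` it sees, so a function that adjusts the stack pointer more than once contributes more than one entry.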
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
> On 1/28/07, tbp [EMAIL PROTECTED] wrote:
> > Let it be clear from the start this is a potshot and while those trends aren't exactly new or specific to my code, i haven't tried to provide anything but specific data from one of my apps, on win32/cygwin.
> >
> > Primo, gcc getting much better wrt inlining exacerbates the fact that it's not as good as other compilers at shrinking the stack frame size, and perhaps, as was suggested by Uros when discussing that point, a pass to address that would make sense. As i'm too lazy to properly measure cruft across multiple compilers, i'll use my rtrt app where i mostly control large scale inlining by hand.
> >
> >     objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
> >
> >     msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
> >     icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
> >     gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
>
> It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.
>
> (If you are interested in stack size related to inlining you may want to tune --param large-stack-frame and --param large-stack-frame-growth).

Also having some testcases showing inlining defects in GCC would be very interesting for me. Now after IPA-SSA has been merged, I plan to do some retuning of the inliner for the 4.3 release, since a lot has changed about the properties of its input and it was originally designed to operate well on the IL used by early tree-ssa.

Considering information about stack frame size in the inlining costs is one of the things I believe we should do, but it is also difficult to tune without interesting testcases for it.

Honza

> Richard.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
> On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:
> > On 1/28/07, tbp [EMAIL PROTECTED] wrote:
> > >     objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
> > >
> > >     msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
> > >     icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
> > >     gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
> >
> > It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.
>
> I should have known better than to post on a sunday morning. Sorry.
>
> Those are the 10 largest stack allocations, sorted, in the binaries produced by each compiler (presuming most everything falls in place). Each time i verify codegen for a function across all 3, gcc always has the largest frame by a substantial amount (on ia32). And that's what that rigorous table is trying to demonstrate ;)
>
> Basically i'm wondering if a stack frame shrinking pass
> [ ] is possible,
> [ ] makes no sense,
> [ ] has been done,
> [ ] is planned
> etc...

Actually we do have one stack frame shrinking pass already. It depends on where the bloat is coming from - we can pack (with some limitations) memory used by structures/arrays used by different inline functions or lexical blocks. We don't do any packing of spilled registers, nor the shrink wrapping other compilers sometimes implement.

Honza

> > (If you are interested in stack size related to inlining you may want to tune --param large-stack-frame and --param large-stack-frame-growth).
>
> Recently g++ 4.3 has started to complain about
>
>     warning: inlining failed in call to 'xxx': --param large-stack-frame-growth limit reached [-Winline]
>
> Bumping said large-function-growth by an ungodly amount did the trick. But it was the sure sign inlining was being fixed. There's much less need to babysit it, thanks a lot to whoever wrote those patches.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> Actually we do have one stack frame shrinking pass already. It depends on where the bloat is coming from - we can pack (with some limitations) memory used by structures/arrays used by different inline functions or lexical blocks. We don't do any packing of spilled registers, nor the shrink wrapping other compilers sometimes implement.

Ah. So there's already some shrinkage. I don't think i can blame spilling for all that waste, but then i also have no idea what that shrink wrapping involves.

Also i think it's only a bit worse with C++, where some idioms appear to cause more trouble. It would be nice to have a cheat sheet of dos and don'ts :)

It seems my previous obese mail got axed a bit,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> Also having some testcases showing inlining defects in GCC would be very interesting for me. Now after IPA-SSA has been merged, I plan to do some retuning of the inliner for the 4.3 release, since a lot has changed about the properties of its input and it was originally designed to operate well on the IL used by early tree-ssa.

Gcc, well g++ really, used to be so bad at the inlining game, ie single op functions/ctors suddenly left out, that there was no other option than to explicitly direct inlining if one cared about performance. So i don't have much to show, for what i monitored wasn't under g++ jurisdiction. Now i know it has improved (much) because obviously other parts are being stressed.

> Considering information about stack frame size in the inlining costs is one of the things I believe we should do, but it is also difficult to tune without interesting testcases for it.

I have no idea what would make such a testcase interesting to you. But i can try.

You'll find 2 preprocessed GPLed sources attached, with:

frontend.cc, app::frontend_loop()
(i don't particularly care about that function, but on ia32 - x86-64 is immune - g++ is quite creative about it (large frame, oodles of upfront zeroing, even if it's a bit better with the gcc-4.3-20070119 snapshot))
frame size: msvc 1152 bytes, icc 2108, g++ 2604

rt_render_packet.cc, horde::grunt_render_tiles_packet(...)
(this one i care about, inlining is controlled)
frame size: msvc 1688, icc 1804, gcc 1932
Performance wise on that one msvc lags by 25% and gcc has a slight lead of a couple percent on icc.

note: take 2,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
> On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> > Also having some testcases showing inlining defects in GCC would be very interesting for me. Now after IPA-SSA has been merged, I plan to do some retuning of the inliner for the 4.3 release, since a lot has changed about the properties of its input and it was originally designed to operate well on the IL used by early tree-ssa.
>
> Gcc, well g++ really, used to be so bad at the inlining game, ie single op functions/ctors suddenly left out, that there was no other option than to explicitly direct inlining if one cared about performance. So i don't have much to show, for what i monitored wasn't

I am not quite sure what you mean by direct inlining here. At -O2 G++ still doesn't inline any functions not marked as inline by the user; at -O3 we do. I plan to propose changing this behaviour to make -O2 auto-inline everything that is expected to reduce code size. I am just about to leave for 5 days, but I plan to run a couple of benchmarks after that and propose this.

I would be interested to know about obvious mistakes GCC makes - GCC now has logic to set the cost of inlining wrapper functions (ie functions doing just one extra call and casts) to at most 0. It might be interesting to know if some common scenarios are missed.

> under g++ jurisdiction. Now i know it has improved (much) because obviously other parts are being stressed.

Well, we are working on it ;) You can take a look at the c++ benchmarks at http://www.suse.de/~gcctest - the work has been ongoing since cgraph was implemented in 2003, another retuning happened in about the 4.0 timeframe, and 4.3 has the SSA based IPA that should be another improvement.

> > Considering information about stack frame size in the inlining costs is one of the things I believe we should do, but it is also difficult to tune without interesting testcases for it.
>
> I have no idea what would make such a testcase interesting to you. But i can try.
>
> You'll find 2 preprocessed GPLed sources attached, with:
>
> frontend.cc, app::frontend_loop()
> (i don't particularly care about that function, but on ia32 - x86-64 is immune - g++ is quite creative about it (large frame, oodles of upfront zeroing, even if it's a bit better with the gcc-4.3-20070119 snapshot))
> frame size: msvc 1152 bytes, icc 2108, g++ 2604
>
> rt_render_packet.cc, horde::grunt_render_tiles_packet(...)
> (this one i care about, inlining is controlled)
> frame size: msvc 1688, icc 1804, gcc 1932
> Performance wise on that one msvc lags by 25% and gcc has a slight lead of a couple percent on icc.
>
> note: take 2,
> http://ompf.org/vault/frontend.ii.bz2
> http://ompf.org/vault/rt_render_packet.ii.bz2

Thanks, what is definitely most interesting for me is a self contained testcase I can easily compile and run, like we have with tramp3d. I will definitely take a look at your testcases, but perhaps only after returning from my trip next weekend, since I am running out of time for all my TODOs today ;)

Concerning the frame sizes, we really need some kind of analysis of where it is coming from - ie whether GCC simply inlines too much together, or fails to pack the structures well using the existing algorithm, or it is a register pressure problem.

Thanks,
Honza
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> I am not quite sure what you mean by direct inlining here. At -O2 G++

Decorating everything in sight with attribute always_inline/noinline (flatten wasn't an option because it used to be troublesome and not as 'portable' across compilers).

> I would be interested to know about obvious mistakes GCC makes - GCC now has logic to set the cost of inlining wrapper functions (ie functions doing just one extra call and casts) to at most 0. It might be interesting to know if some common scenarios are missed.

I guess i should remove those attributes and see what it looks like.

> Well, we are working on it ;) You can take a look at the c++ benchmarks at http://www.suse.de/~gcctest - the work has been ongoing since cgraph was implemented in 2003, another retuning happened in about the 4.0 timeframe, and 4.3 has the SSA based IPA that should be another improvement.

I'm aware of that progression and some of my code is already being tested http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

4.2 made a substantial difference for me, and it seems 4.3 is well on its way (even if it's a bit chaotic at times); IPA when enabled used to ICE on me and recently started to work, but i've failed to notice a difference (efficiency wise) yet. I guess i should wait a bit more. I very much appreciate the string op stuff, and i'm eagerly waiting for the assume() directive (wink wink).

> Thanks, what is definitely most interesting for me is a self contained testcase I can easily compile and run, like we have with tramp3d. I will definitely take a look at your testcases, but perhaps only after returning from my trip next weekend, since I am running out of time for all my TODOs today ;)

It's still very much in flux, but once it stabilizes a bit i'll dump everything into a self contained black box of doom.

> Concerning the frame sizes, we really need some kind of analysis of where it is coming from - ie whether GCC simply inlines too much together, or fails to pack the structures well using the existing algorithm, or it is a register pressure problem.

I'm out of my league. I know the frontend_loop function isn't as horrible on x86-64, giving some credit to the register pressure hypothesis, but then that code isn't doing anything fancy.

For the other function, which heavily uses SSE vector intrinsics, g++ is really doing a good job, if only for the, sometimes, duplicated structures here and there and the larger frame. But you can rule out g++'s inlining heuristic, as it has no (or shouldn't have any) freedom.

If there's anything i can do, do not hesitate. And thanks for taking notice.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
tbp wrote:
> Secundo, while i very much appreciate the brand new string ops, it seems that on ia32 some array initialization cases were left out, hence i still see oodles of 'movl $0x0' when generating code for k8. Also those zeroings get coalesced at the top of functions on ia32, and i have a function where there's 3 pages of those right after prologue. See the attached 'movl $0x0' grep dump.

It looks like Jan and Richard have answered some of your questions about inlining (or are in the process of doing so), but I haven't seen a response to this point.

Certainly, if we're generating zillions of zero-initializations to contiguous memory, rather than using memset, or an inline loop, that seems unfortunate. Would you please file a bug report?

Thanks,

--
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
> tbp wrote:
> > Secundo, while i very much appreciate the brand new string ops, it seems that on ia32 some array initialization cases were left out, hence i still see oodles of 'movl $0x0' when generating code for k8. Also those zeroings get coalesced at the top of functions on ia32, and i have a function where there's 3 pages of those right after prologue. See the attached 'movl $0x0' grep dump.
>
> It looks like Jan and Richard have answered some of your questions about inlining (or are in the process of doing so), but I haven't seen a response to this point.
>
> Certainly, if we're generating zillions of zero-initializations to contiguous memory, rather than using memset, or an inline loop, that seems unfortunate. Would you please file a bug report?

I thought the comment was more referring to the fact that we will happily generate

    movl $0x0, place1
    movl $0x0, place2
    ...
    movl $0x0, placeMillion

rather than the shorter

    xor %eax, %eax
    movl %eax, ...

but indeed both of those issues should be addressed (and it would be interesting to know where we fail to synthesize memset in real scenarios).

With the repeated mov issue, unfortunately I don't know what would be the best place: we obviously don't want to constrain register allocation too much, and after regalloc I guess only a machine dependent pass is the hope, which is pretty ugly (but not that difficult to code, at least at the local level).

Honza

> Thanks,
>
> --
> Mark Mitchell
> CodeSourcery
> [EMAIL PROTECTED]
> (650) 331-3385 x713
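To put rough numbers on the size difference between the two sequences, here is a back-of-the-envelope Python sketch. It is not from the thread. The byte counts are the standard IA-32 encodings for %esp-relative 32-bit stores with an 8-bit or 32-bit displacement (an assumption about the addressing mode; %ebp-relative stores are one byte shorter each, and a store at 0(%esp) needs no displacement). The helper names are hypothetical, and the sketch says nothing about speed, only code size.

```python
XOR_REG_REG = 2  # xor %eax,%eax encodes as 31 C0


def store_sizes(disp_bytes):
    """Return (immediate_store, register_store) sizes in bytes.

    disp_bytes is the displacement width: 1 (disp8) or 4 (disp32).
    """
    if disp_bytes not in (1, 4):
        raise ValueError("displacement must be 1 or 4 bytes")
    base = 3 + disp_bytes      # opcode + ModRM + SIB + displacement
    return base + 4, base      # the immediate form also carries a 4-byte zero


def zeroing_bytes(n_stores, disp_bytes=4):
    """Code size of n zero stores: (all-immediate form, xor + register form)."""
    imm, reg = store_sizes(disp_bytes)
    return n_stores * imm, XOR_REG_REG + n_stores * reg


for n in (1, 10, 100):
    print(n, zeroing_bytes(n, disp_bytes=1), zeroing_bytes(n, disp_bytes=4))
```

With 32-bit displacements, as in the large frames discussed in this thread, every store after the first saves 4 bytes, so the register form wins as soon as there are two or more stores (22 bytes versus 16 for two stores).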
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> > tbp wrote:
> > > Secundo, while i very much appreciate the brand new string ops, it seems that on ia32 some array initialization cases were left out, hence i still see oodles of 'movl $0x0' when generating code for k8. Also those zeroings get coalesced at the top of functions on ia32, and i have a function where there's 3 pages of those right after prologue. See the attached 'movl $0x0' grep dump.
> >
> > It looks like Jan and Richard have answered some of your questions about inlining (or are in the process of doing so), but I haven't seen a response to this point.
> >
> > Certainly, if we're generating zillions of zero-initializations to contiguous memory, rather than using memset, or an inline loop, that seems unfortunate. Would you please file a bug report?
>
> I thought the comment was more referring to the fact that we will happily generate
>
>     movl $0x0, place1
>     movl $0x0, place2
>     ...
>     movl $0x0, placeMillion
>
> rather than the shorter
>
>     xor %eax, %eax
>     movl %eax, ...
>
> but indeed both of those issues should be addressed (and it would be interesting to know where we fail to synthesize memset in real scenarios).
>
> With the repeated mov issue, unfortunately I don't know what would be the best place: we obviously don't want to constrain register allocation too much, and after regalloc I guess only a machine dependent pass is the hope, which is pretty ugly (but not that difficult to code, at least at the local level).

One source of these patterns is SRA decomposing structures and initialization. But the structure size we do that for is limited (I also believe we already have bug reports about this, but cannot find them right now).

Richard.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
Jan Hubicka wrote:
> I thought the comment was more referring to the fact that we will happily generate
>
>     movl $0x0, place1
>     movl $0x0, place2
>     ...
>     movl $0x0, placeMillion
>
> rather than the shorter
>
>     xor %eax, %eax
>     movl %eax, ...

Yes, that would be an improvement, but, as you say, at some point we want to call memset.

> With the repeated mov issue, unfortunately I don't know what would be the best place: we obviously don't want to constrain register allocation too much, and after regalloc I guess only a machine dependent pass

I would hope that we could notice this much earlier than that. Wouldn't this be evident even at the tree level, or at least after stack-allocation in the RTL layer? I wouldn't expect the zeroing to be coming from machine-dependent code.

One possibility is that we're doing something dumb with arrays. Another possibility is that we're SRA-ing a lot of small structures, which add up to a ton of stack space.

I realize that we need a full bug report to be sure, though.

--
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
> Jan Hubicka wrote:
> > I thought the comment was more referring to the fact that we will happily generate
> >
> >     movl $0x0, place1
> >     movl $0x0, place2
> >     ...
> >     movl $0x0, placeMillion
> >
> > rather than the shorter
> >
> >     xor %eax, %eax
> >     movl %eax, ...
>
> Yes, that would be an improvement, but, as you say, at some point we want to call memset.
>
> > With the repeated mov issue, unfortunately I don't know what would be the best place: we obviously don't want to constrain register allocation too much, and after regalloc I guess only a machine dependent pass
>
> I would hope that we could notice this much earlier than that. Wouldn't this be evident even at the tree level, or at least after stack-allocation in the RTL layer? I wouldn't expect the zeroing to be coming from machine-dependent code.

What I meant is the generic problem that constants on i386 (especially for moves) increase the instruction encoding size, and thus when multiple copies of the same constant appear in the instruction stream and a register is available, one can add an extra move and use that register instead.

Of course we can also have a pass detecting large sets of unwound mov instructions and packing them into a memset. We can do it either at the early RTL level or with some lowering of initializers at the tree level too (I guess many of those sequences actually come from expanding initializers, which are sort of black boxes for most tree optimizers).

A sort of similar transformation is done by Tomas, who can use the vectorizer infrastructure to detect loops doing memset/memcpy. Those are pretty common, especially for floats/doubles, and unrolling them also leads to such sequences. I hope he will polish and send the patch soonish.

Honza

> One possibility is that we're doing something dumb with arrays. Another possibility is that we're SRA-ing a lot of small structures, which add up to a ton of stack space.
>
> I realize that we need a full bug report to be sure, though.
>
> --
> Mark Mitchell
> CodeSourcery
> [EMAIL PROTECTED]
> (650) 331-3385 x713
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
> BTW when inlining seems to make such a noticeable difference, did you try to use profile feedback?

Once a year, i try. But then it boils down to the fact that as a programmer i have no way to express how/where i want gcc to put its nose into. And i get back to fixing branches, inlining and unrolling (wink) by hand.

> > I'm aware of that progression and some of my code is already being tested http://www.suse.de/~gcctest/c++bench/raytracer/ ;)
>
> I see, we didn't seem to make that much progress on this testcase performance wise yet ;)

It's a silly 100 LOC raytracer and historically g++ already did the Right Thing[tm] (inlining everything), there's not much left to be gained.

> > For the other function, which heavily uses SSE vector intrinsics, g++ is really doing a good job, if only for the, sometimes, duplicated structures here and there and the larger frame. But you can rule out g++'s inlining heuristic, as it has no (or shouldn't have any) freedom.
>
> Hmm, so then it should be either structure packing or regalloc. I will be able to take a look only after returning from a course.
>
> Honza

Regalloc is a lost cause on ia32 :)

Note that nowadays g++ is up to the point where, despite those wastes, it's still faster to inline it all in one rendering function than to split it. And i think you can also put gcse on the culprit list.
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
tbp wrote:
> On 1/28/07, Mark Mitchell [EMAIL PROTECTED] wrote:
> > Certainly, if we're generating zillions of zero-initializations to contiguous memory, rather than using memset, or an inline loop, that seems unfortunate. Would you please file a bug report?
>
> Because it takes, or seems to, a large function with small structure sprinkled around to trigger proper condition, i can't make a convincing reduced testcase, I guess that goes along with what Richard said.

It doesn't need to be a small testcase. If you have a preprocessed source file and a command-line, I'm sure one of the GCC developers would be able to analyze the situation. We're all good at isolating problems, even starting with big complicated inputs.

--
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713
Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32
tbp wrote:
> On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:
> > On 1/28/07, tbp [EMAIL PROTECTED] wrote:
> > >     objdump -wdrfC --no-show-raw-insn $1|perl -pe 's/^\s+\w+:\s+//'|perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/'|sort -r| head -n 10
> > >
> > >     msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
> > >     icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
> > >     gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
> >
> > It would have been nice to tell us what the particular columns in this table mean - now we have to decrypt objdump params and perl postprocessing ourselves.
>
> I should have known better than to post on a sunday morning. Sorry.
>
> Those are the 10 largest stack allocations, sorted, in the binaries produced by each compiler (presuming most everything falls in place). Each time i verify codegen for a function across all 3, gcc always has the largest frame by a substantial amount (on ia32). And that's what that rigorous table is trying to demonstrate ;)
>
> Basically i'm wondering if a stack frame shrinking pass
> [ ] is possible,
> [ ] makes no sense,
> [ ] has been done,
> [ ] is planned
> etc...

The current gcc register allocator (more correctly, reload) reserves a separate stack slot for each pseudo register which did not get a hard register. Stack slot sharing (or stack slot coloring) has been implemented on the IRA branch. This is very useful for x86 and x86_64. Besides a smaller stack and better data locality, it results in smaller displacements, which means smaller insns (code) and even better code locality.

Another thing is sharing slots for saving call used hard registers across calls. The current gcc register allocator reserves a separate slot for each call used register assigned to a pseudo living through a call. It is not so important for x86, but it is more important for x86_64 (there are more call used hard registers and they are bigger, 64-bit). Such stack slot sharing has also been implemented on IRA.

I am focused on making IRA available for gcc 4.4.
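As an illustration of what stack slot sharing buys, here is a toy Python sketch. It is not IRA's actual algorithm. It assumes each spilled pseudo has a single live range expressed as a half-open interval [start, end) and that all slots have the same size; the function name, pseudo names and ranges are made up. Pseudos whose ranges do not overlap are given the same slot.

```python
def assign_slots(live_ranges):
    """Greedily map each spilled pseudo to a stack slot index.

    live_ranges: dict of name -> (start, end), with end exclusive.
    Two pseudos may share a slot only if their ranges do not overlap.
    """
    slot_free_at = []  # slot index -> point at which the slot becomes free
    assignment = {}
    # Visit pseudos in order of increasing start point.
    for name, (start, end) in sorted(live_ranges.items(), key=lambda kv: kv[1][0]):
        for slot, free_at in enumerate(slot_free_at):
            if free_at <= start:           # previous occupant is already dead
                slot_free_at[slot] = end
                assignment[name] = slot
                break
        else:                              # no reusable slot, open a new one
            slot_free_at.append(end)
            assignment[name] = len(slot_free_at) - 1
    return assignment


# Hypothetical spilled pseudos and their live ranges.
ranges = {"p1": (0, 4), "p2": (2, 6), "p3": (4, 8), "p4": (6, 9)}
slots = assign_slots(ranges)
print(slots)
print("slots without sharing:", len(ranges), "with sharing:", len(set(slots.values())))
```

With one slot per pseudo, as described above for reload, these four pseudos would need four slots; with sharing, two are enough, which also keeps the displacements smaller.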