Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-29 Thread tbp

On 1/29/07, Mark Mitchell [EMAIL PROTECTED] wrote:

It doesn't need to be a small testcase.  If you have a preprocessed
source file and a command-line, I'm sure one of the GCC developers would
be able to analyze the situation.  We're all good at isolating problems,
even starting with big complicated inputs.

This is now known as PR 30627
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=30627

PS: Thanks to Vladimir for his input.


remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

Let it be clear from the start that this is a potshot: while those
trends aren't exactly new or specific to my code, i haven't tried to
provide anything but specific data from one of my apps, on
win32/cygwin.

Primo, gcc getting much better wrt inlining exacerbates the fact that
it's not as good as other compilers at shrinking the stack frame size,
and perhaps, as Uros suggested when discussing that point, a pass
to address that would make sense.
As i'm too lazy to properly measure cruft across multiple compilers,
i'll use my rtrt app where i mostly control large scale inlining by
hand.
objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
  | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
  | sort -r | head -n 10

msvc: 2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
icc:  2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
gcc:  2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

That's with msvc8 sp1, icc 9.1.033, g++ 4.3-20070119, each compiler
being configured to optimize as much as possible for speed. That
confirms what i see when checking codegen for specific functions.
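
For readers who would rather not decode the objdump/perl pipeline, here is a rough Python sketch of the same measurement. It scans AT&T-syntax disassembly text for `sub $imm,%esp` and reports the largest immediates in bytes. The function name `largest_frames` and the sample disassembly are made up for illustration; like the one-liner, it assumes every frame is reserved with a single `sub` from `%esp`.

```python
import re

# Matches stack reservations such as "sub    $0xa2c,%esp" (AT&T syntax).
SUB_ESP = re.compile(r"\bsubl?\s+\$(0x[0-9a-fA-F]+),\s*%esp")


def largest_frames(disassembly, count=10):
    """Return the `count` largest immediates subtracted from %esp, in bytes."""
    sizes = [int(m.group(1), 16) for m in SUB_ESP.finditer(disassembly)]
    return sorted(sizes, reverse=True)[:count]


if __name__ == "__main__":
    # Hypothetical objdump excerpt; only the "sub ...,%esp" lines count.
    sample = (
        "  401000:\tsub    $0xa2c,%esp\n"
        "  401234:\tsub    $0x1c,%esp\n"
        "  401300:\tadd    $0xa2c,%esp\n"
        "  401400:\tsub    $0x78c,%esp\n"
    )
    print(largest_frames(sample, 2))  # [2604, 1932]
```

Sorting numerically avoids relying on the fixed-width `%4d` padding that makes the textual `sort -r` work in the shell version.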

Secundo, while i very much appreciate the brand new string ops, it
seems that on ia32 some array initialization cases were left out,
hence i still see oodles of 'movl $0x0' when generating code for k8.
Also those zeroings get coalesced at the top of functions on ia32, and
i have a function where there are 3 pages of those right after the prologue.
See the attached dump of grep 'movl   $0x0'.


movl0.S.bz2
Description: BZip2 compressed data


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Richard Guenther

On 1/28/07, tbp [EMAIL PROTECTED] wrote:

Let it be clear from the start this is a potshot and while those
trends aren't exactly new or specific to my code, i haven't tried to
provide anything but specific data from one of my app, on
win32/cygwin.

Primo, gcc getting much better wrt inling exacerbates the fact that
it's not as good as other compilers at shrinking the stack frame size,
and perhaps as was suggested by Uros when discussing that point a pass
to address that would make sense.
As i'm too lazy to properly measure cruft across multiple compilers,
i'll use my rtrt app where i mostly control large scale inlining by
hand.
objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
  | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
  | sort -r | head -n 10

msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132


It would have been nice to tell us what the particular columns in
this table mean - now we have to decrypt objdump params and
perl postprocessing ourselves.

(If you are interested in stack size related to inlining you may want
to tune --param large-stack-frame and --param large-stack-frame-growth).

Richard.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:

On 1/28/07, tbp [EMAIL PROTECTED] wrote:
 objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
   | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
   | sort -r | head -n 10

 msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
 icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
 gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

It would have been nice to tell us what the particular columns in
this table mean - now we have to decrypt objdump params and
perl postprocessing ourselves.

I should have known better than to post on a sunday morning. Sorry.
That's the sorted 10 largest stack allocations in binaries produced by
each compiler (presuming most everything falls in place).
Each time i verify codegen for a function across all 3, gcc always has
the largest frame by a substantial amount (on ia32). And that's what
that rigorous table is trying to demonstrate ;)

Basically i'm wondering if a stack frame shrinking pass [ ] is
possible, [ ] makes no sense, [ ] has been done, [ ] is planned etc...


(If you are interested in stack size related to inlining you may want
to tune --param large-stack-frame and --param large-stack-frame-growth).

Recently g++ 4.3 has started to complain with "warning: inlining
failed in call to 'xxx': --param large-stack-frame-growth limit
reached [-Winline]". Bumping said large-stack-frame-growth by an ungodly
amount did the trick. But it was the sure sign inlining was being
fixed.
There's much less need to babysit it, thanks a lot to whoever wrote
those patches.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Jan Hubicka
 On 1/28/07, tbp [EMAIL PROTECTED] wrote:
 Let it be clear from the start this is a potshot and while those
 trends aren't exactly new or specific to my code, i haven't tried to
 provide anything but specific data from one of my app, on
 win32/cygwin.
 
 Primo, gcc getting much better wrt inling exacerbates the fact that
 it's not as good as other compilers at shrinking the stack frame size,
 and perhaps as was suggested by Uros when discussing that point a pass
 to address that would make sense.
 As i'm too lazy to properly measure cruft across multiple compilers,
 i'll use my rtrt app where i mostly control large scale inlining by
 hand.
 objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
   | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
   | sort -r | head -n 10
 
 msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
 icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
 gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
 
 It would have been nice to tell us what the particular columns in
 this table mean - now we have to decrypt objdump params and
 perl postprocessing ourselves.
 
 (If you are interested in stack size related to inlining you may want
 to tune --param large-stack-frame and --param large-stack-frame-growth).

Also having some testcases showing inlining defects in GCC would be
very interesting for me.  Now that IPA-SSA has been merged, I plan to
do some retuning of the inliner for the 4.3 release, since a lot has
changed about the properties of its input and it was originally designed
to operate well on the IL used by early tree-ssa.

Taking stack frame size into account in the inlining costs is
one of the things I believe we should do, but it is also difficult to tune
without interesting testcases for it.

Honza
 


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Jan Hubicka
 On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:
 On 1/28/07, tbp [EMAIL PROTECTED] wrote:
  objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
    | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
    | sort -r | head -n 10
 
  msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
  icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
  gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132
 
 It would have been nice to tell us what the particular columns in
 this table mean - now we have to decrypt objdump params and
 perl postprocessing ourselves.
 I should have known better than to post on a sunday morning. Sorry.
 That's the sorted 10 largest stack allocations in binaries produced by
 each compiler (presuming most everything falls in place).
 Each time i verify codegen for a function across all 3, gcc always has
 the largest frame by a substantial amount (on ia32). And that's what
 that rigorous table is trying to demonstrate ;)
 
 Basically i'm wondering if a stack frame shrinking pass [ ] is
 possible, [ ] makes no sense, [ ] has been done, [ ] is planed etc...

Actually we do have one stack frame shrinking pass already.  It depends
on where the bloat is coming from - we can pack (with some limitations)
memory used by structures/arrays belonging to different inline functions or
lexical blocks.  We don't do any packing of spilled registers, nor the shrink
wrapping other compilers sometimes implement.
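
To make the packing idea concrete, here is a toy Python model. It is only an illustration, not GCC's actual algorithm. Each scope is a hypothetical `(own_sizes, child_scopes)` pair; sibling scopes (different inlined bodies or lexical blocks) are assumed never to be live at the same time, so their storage may overlap. Alignment and the "limitations" mentioned above are ignored.

```python
def frame_without_sharing(scope):
    """Every variable gets a private slot: sum over the whole scope tree."""
    own, children = scope
    return sum(own) + sum(frame_without_sharing(c) for c in children)


def frame_with_sharing(scope):
    """Sibling scopes overlap, so only the largest child adds to the frame."""
    own, children = scope
    return sum(own) + max((frame_with_sharing(c) for c in children), default=0)


if __name__ == "__main__":
    # A function with 16 bytes of its own locals and two inlined callees
    # needing 256 and 512 bytes of temporaries (made-up numbers).
    scope = ([16], [([256], []), ([512], [])])
    print(frame_without_sharing(scope))  # 784
    print(frame_with_sharing(scope))     # 528
```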

Honza
 


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

Actually we do have one stack frame shrinking pass already.  It depends
on where the bloat is coming from - we can pack (with some limitations)
memory used by structures/arrays used by different inline functions or
lexical blocks.  We don't do any packing of spilled registers nor shrink
wrapping other compilers sometimes implement.

Ah. So there's already some shrinkage.
I don't think i can blame spilling for all that waste, but then i also
have no idea what that shrink wrapping involves.
Also i think it's only a bit worse with C++ where some idioms appear
to cause more trouble.

It would be nice to have a cheat sheet of dos and don'ts :)

It seems my previous obese mail got axed a bit,
http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

Also having some testcases showing inlining deffects in GCC would be
very interesting for me.  Now after IPA-SSA has been merged, I plan to
do some retuning of inliner for 4.3 release since a lot has changes
about properties of it's input and it was originally designed to operate
well on IL used by early tree-ssa.

Gcc, well g++ really, used to be so bad at the inlining game, ie
single op functions/ctors suddenly left out, that there was no other
option than to explicitly direct inlining if one cared about
performance. So i don't have much to show, for what i monitored wasn't
under g++ jurisdiction.
Now i know it has improved (much) because obviously other parts are
being stressed.


Considering information about stack frame size in the inlining costs is
one of things I believe we should do but it is also dificult to tune
without interesting testcases for it.

I have no idea what would make such testcase interesting to you.
But i can try.
You'll find 2 preprocessed GPLed sources attached with

frontend.cc, app::frontend_loop()
(i don't particularly care about that function, but on ia32 - x86-64
is immune - g++ is quite creative about it (large frame, oodles of
upfront zeroing, even if it's a bit better with the gcc-4.3-20070119
snapshot))
frame size, msvc 1152 bytes, icc 2108, g++ 2604

rt_render_packet.cc, horde::grunt_render_tiles_packet(...)
(this one i care about, inlining is controlled)
frame size, msvc 1688, icc 1804, gcc 1932
Performance wise on that one msvc lags by 25% and gcc has a slight
lead of a couple percent on icc.

note: take 2, http://ompf.org/vault/frontend.ii.bz2
http://ompf.org/vault/rt_render_packet.ii.bz2


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Jan Hubicka
 On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:
 Also having some testcases showing inlining deffects in GCC would be
 very interesting for me.  Now after IPA-SSA has been merged, I plan to
 do some retuning of inliner for 4.3 release since a lot has changes
 about properties of it's input and it was originally designed to operate
 well on IL used by early tree-ssa.
 Gcc, well g++ really, used to be so bad at the inlining game, ie
 single op functions/ctors suddendly left out, there was no other
 options than to explicitly direct inlining if one cared about
 performance. So i don't have much to show, for what i monitored wasn't

I am not quite sure what you mean by direct inlining here.  At -O2 G++
still doesn't inline any functions not marked as inline by the user; at -O3 we
do.  I plan to propose changing this behaviour to make -O2 auto-inline
everything that is expected to reduce code size.

I am just about to leave for 5 days but I plan to run a couple of
benchmarks after that and propose this.

I would be interested to know about obvious mistakes GCC makes - GCC now
has logic to set the cost of inlining wrapper functions (ie functions
doing just one extra call and casts) to at most 0. It might be
interesting to know if some common scenarios are missed.

 under g++ juridiction.
 Now i know it has improved (much) because obviously other parts are
 being stressed.

Well, we are working on it ;)
You can take a look at the c++ benchmarks at http://www.suse.de/~gcctest.
The work has been ongoing since cgraph was implemented in 2003, another
retuning happened in about the 4.0 timeframe, and 4.3 has the SSA based IPA
that should be another improvement.
 
 Considering information about stack frame size in the inlining costs is
 one of things I believe we should do but it is also dificult to tune
 without interesting testcases for it.
 I have no idea what would make such testcase interesting to you.
 But i can try.
 You'll find 2 preprocessed GPLed sources attached with
 
 frontend.cc, app::frontend_loop()
 (i don't particularly care about that function, but on ia32 - x86-64
 is immune - g++ is quite creative about it (large frame, oodles of
 upfront zeroing, even if it's a bit better with the gcc-4.3-20070119
 snapshot))
 frame size, msvc 1152 bytes, icc 2108, g++ 2604
 
 rt_render_packet.cc, horde::grunt_render_tiles_packet(...)
 (this one i care about, inlining is controlled)
 frame size, msvc 1688, icc 1804, gcc 1932
 Performance wise on that one msvc lags by 25% and gcc has a slight
 lead of a couple percent on icc.
 
 note: take 2, http://ompf.org/vault/frontend.ii.bz2
 http://ompf.org/vault/rt_render_packet.ii.bz2

Thanks, what is definitely most interesting for me is a self contained
testcase I can easily compile and run, like we have with tramp3d. I will
definitely take a look at your testcases, but perhaps only after returning
from my trip next weekend since I am running out of time for all my
TODOs today ;)

Concerning the frame sizes, we really need some kind of analysis of
where it is coming from - ie whether GCC simply inlines too much together,
fails to pack the structures well using the existing algorithm, or it is
a register pressure problem.

Thanks,
Honza


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

I am not quite sure what you mean by direct inlining here.  At -O2 G++

Decorating everything in sight with attribute always_inline/noinline
(flatten wasn't an option because it used to be troublesome and not as
'portable' across compilers).


I would be interested to know about obvious mistakes GCC do - GCC now
has logic to set cost of inlining wrapper functions (ie functions
doing just one extra call and casts) to at most 0. It might be
interesting to know if some common scenarios are missed.

I guess i should remove those attributes and see what it looks like.


Well, we are working on it ;)
You can take a look at c++ benchmarks http://www.suse.de/~gcctest the
work is ongoing since cgraph was implemented in 2003, another retunning
happen at about 4.0 timeframe, 4.3 has the SSA based IPA that should be
another improvement.

I'm aware of that progression and some of my code is already being tested
http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

4.2 made a substantial difference for me, and it seems 4.3 is well on
its way (even if it's a bit chaotic at times); IPA when enabled used
to ICE on me and recently started to work, but i've failed to notice a
difference (efficiency wise) yet. I guess i should wait a bit more.

I very much appreciate the string op stuff, and i'm eagerly waiting
for the assume() directive (wink wink).


Thanks, what is definitly most interesting for me is self contained
testcase I can easilly compile and run, like we have tramp3d. I will
definitly take a lok at your testcases, but perhaps only after returning
from trip at next weekend since I am running out of time for all my
TODOs today ;)

It's still very much in flux, but once it stabilizes a bit i'll dump
everything into a self contained black box of doom.


Concerning the frame sizes, we really need some kind of analysis from
where it is comming - ie whether GCC simply inline too much together, or
fail to pack well the structures using existing algorithm or it is
register pressure problem.

I'm out of my league. I know the frontend_loop function isn't as
horrible on x86-64, giving some credit to the register pressure
hypothesis, but then that code isn't doing anything fancy.

For the other function, which heavily uses SSE vector intrinsics, g++
is really doing a good job, apart from the, sometimes, duplicated
structures here & there and the larger frame. But you can rule out
g++'s inlining heuristic as it has (or should have) no freedom.

If there's anything i can do, do not hesitate.
And thanks for taking notice.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Mark Mitchell
tbp wrote:

 Secundo, while i very much appreciate the brand new string ops, it
 seems that on ia32 some array initialization cases where left out,
 hence i still see oodles of 'movl $0x0' when generating code for k8.
 Also those zeroings get coalesced at the top of functions on ia32, and
 i have a function where there's 3 pages of those right after prologue.
 See the attached 'grep 'movl   $0x0' dump.

It looks like Jan and Richard have answered some of your questions about
inlining (or are in the process of doing so), but I haven't seen a
response to this point.

Certainly, if we're generating zillions of zero-initializations to
contiguous memory, rather than using memset, or an inline loop, that
seems unfortunate.  Would you please file a bug report?

Thanks,

-- 
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Jan Hubicka
 tbp wrote:
 
  Secundo, while i very much appreciate the brand new string ops, it
  seems that on ia32 some array initialization cases where left out,
  hence i still see oodles of 'movl $0x0' when generating code for k8.
  Also those zeroings get coalesced at the top of functions on ia32, and
  i have a function where there's 3 pages of those right after prologue.
  See the attached 'grep 'movl   $0x0' dump.
 
 It looks like Jan and Richard have answered some of your questions about
 inlining (or are in the process of doing so), but I haven't seen a
 response to this point.
 
 Certainly, if we're generating zillions of zero-initializations to
 contiguous memory, rather than using memset, or an inline loop, that
 seems unfortunate.  Would you please file a bug report?

I thought the comment was more referring to the fact that we will happily
generate
movl $0x0,  place1
movl $0x0,  place2
...
movl $0x0,  placeMillion

rather than the shorter
xor %eax, %eax
movl %eax, ...
but indeed both of those issues should be addressed (and it would be
interesting to know where we fail to synthesize memset in real
scenarios).

With the repeated mov issue, unfortunately I don't know what would be the
best place: we obviously don't want to constrain register allocation too
much, and after regalloc I guess only a machine dependent pass is the hope,
which is pretty ugly (but not that difficult to code, at least at the local
level).

Honza
 


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Richard Guenther

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

 tbp wrote:

  Secundo, while i very much appreciate the brand new string ops, it
  seems that on ia32 some array initialization cases where left out,
  hence i still see oodles of 'movl $0x0' when generating code for k8.
  Also those zeroings get coalesced at the top of functions on ia32, and
  i have a function where there's 3 pages of those right after prologue.
  See the attached 'grep 'movl   $0x0' dump.

 It looks like Jan and Richard have answered some of your questions about
 inlining (or are in the process of doing so), but I haven't seen a
 response to this point.

 Certainly, if we're generating zillions of zero-initializations to
 contiguous memory, rather than using memset, or an inline loop, that
 seems unfortunate.  Would you please file a bug report?

I though the comment was more reffering to fact that we will happily
generate
movl $0x0,  place1
movl $0x0,  place2
...
movl $0x0,  placeMillion

rather than shorter
xor %eax, %eax
movl %eax, ...
but indeed both of those issues should be addressed (and it would be
interesting to know where we fail ty synthetize memset in real
scenarios).

With the repeated mov issue unforutnately I don't know what would be the
best place: we obviously don't want to constrain register allocation too
much and after regalloc I guess only machine dependent pass is the hope
that is pretty ugly (but not that difiuclt to code at least at local
level).


One source of these patterns is SRA decomposing structures and
initialization.  But the structure size we do that for is limited (I also
believe we already have bugreports about this, but cannot find them
right now).

Richard.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Mark Mitchell
Jan Hubicka wrote:

 I though the comment was more reffering to fact that we will happily
 generate
 movl $0x0,  place1
 movl $0x0,  place2
 ...
 movl $0x0,  placeMillion
 
 rather than shorter
 xor %eax, %eax
 movl %eax, ...

Yes, that would be an improvement, but, as you say, at some point we
want to call memset.

 With the repeated mov issue unforutnately I don't know what would be the
 best place: we obviously don't want to constrain register allocation too
 much and after regalloc I guess only machine dependent pass

I would hope that we could notice this much earlier than that.  Wouldn't
this be evident even at the tree level or at least after
stack-allocation in the RTL layer?  I wouldn't expect the zeroing to be
coming from machine-dependent code.

One possibility is that we're doing something dumb with arrays.  Another
possibility is that we're SRA-ing a lot of small structures, which add
up to a ton of stack space.

I realize that we need a full bug report to be sure, though.

-- 
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Jan Hubicka
 Jan Hubicka wrote:
 
  I though the comment was more reffering to fact that we will happily
  generate
  movl $0x0,  place1
  movl $0x0,  place2
  ...
  movl $0x0,  placeMillion
  
  rather than shorter
  xor %eax, %eax
  movl %eax, ...
 
 Yes, that would be an improvement, but, as you say, at some point we
 want to call memset.
 
  With the repeated mov issue unforutnately I don't know what would be the
  best place: we obviously don't want to constrain register allocation too
  much and after regalloc I guess only machine dependent pass
 
 I would hope that we could notice this much earlier than that.  Wouldn't
 this be evident even at the tree level or at least after
 stack-allocation in the RTL layer?  I wouldn't expect the zeroing to be
 coming from machine-dependent code.

What I meant is the generic problem that constants on i386 (especially
for moves) increase the instruction encoding size, and thus when multiple
copies of the same constant appear in the instruction stream and a register
is available, one can add an extra move and use that register instead.
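
As a back-of-the-envelope illustration of the encoding cost, the Python sketch below counts bytes for zeroing stack slots in 32-bit mode. It assumes %esp-relative addressing (which needs a SIB byte) and that a scratch register is free; the function names are made up for this example. An 8-bit displacement only covers offsets from -128 to 127, so large frames pay for 32-bit displacements on top of the 32-bit immediate.

```python
def mov_imm32_to_stack(disp_bytes):
    """Size of 'movl $imm32, disp(%esp)': opcode C7, ModRM, SIB, disp, imm32."""
    return 1 + 1 + 1 + disp_bytes + 4


def mov_reg_to_stack(disp_bytes):
    """Size of 'movl %reg, disp(%esp)': opcode 89, ModRM, SIB, disp."""
    return 1 + 1 + 1 + disp_bytes


XOR_REG_REG = 2  # "xor %eax,%eax" encodes as the two bytes 31 C0


def bytes_saved(stores, disp_bytes):
    """Code bytes saved by zeroing one register and storing it `stores` times."""
    direct = stores * mov_imm32_to_stack(disp_bytes)
    via_register = XOR_REG_REG + stores * mov_reg_to_stack(disp_bytes)
    return direct - via_register


if __name__ == "__main__":
    print(bytes_saved(100, 1))  # 398 (8-bit displacements)
    print(bytes_saved(100, 4))  # 398 (32-bit displacements)
```

The saving per store is the 4-byte immediate either way, so in this model the transformation already pays off from the first store whenever a register is available.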

Of course we can also have a pass detecting large sets of unwound mov
instructions and packing them into memset. We can do it either at the early
RTL level or with some lowering of initializers at the tree level too (I guess
many of those sequences actually come from expanding initializers that
are sort of black boxes for most tree optimizers).
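
A minimal sketch of what such a detection pass might do, written in Python over an abstract list of zero stores rather than real RTL. Everything here is an assumption for illustration: the `(offset, size)` representation, the `min_bytes` threshold, and above all the premise that the stores can be freely reordered, i.e. that nothing reads or writes the affected memory in between.

```python
def coalesce_zero_stores(stores, min_bytes=32):
    """Merge adjacent or overlapping zero stores into contiguous runs.

    `stores` is an iterable of (offset, size) pairs. Runs of at least
    `min_bytes` become ("memset", start, length); shorter runs are
    reported as ("stores", start, length) and would be left alone.
    """
    runs = []
    for offset, size in sorted(stores):
        if runs and offset <= runs[-1][1]:
            # Touches or overlaps the previous run: extend it.
            runs[-1][1] = max(runs[-1][1], offset + size)
        else:
            runs.append([offset, offset + size])

    ops = []
    for start, end in runs:
        kind = "memset" if end - start >= min_bytes else "stores"
        ops.append((kind, start, end - start))
    return ops


if __name__ == "__main__":
    zeroing = [(0, 4), (4, 4), (8, 4), (12, 4), (64, 4)]
    print(coalesce_zero_stores(zeroing, min_bytes=16))
    # [('memset', 0, 16), ('stores', 64, 4)]
```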

A sort of similar transformation is done by Tomas, who can use the vectorizer
infrastructure to detect loops doing memset/memcpy.  Those are pretty
common, especially for floats/doubles, and after unrolling they also lead to
such sequences.  I hope he will polish and send the patch soonish.

Honza
 


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread tbp

On 1/28/07, Jan Hubicka [EMAIL PROTECTED] wrote:

BTW when inlining seems to make so noticeable difference, did you try to
use profile feedback?

Once a year, i try.
But then it boils down to the fact that as a programmer i have no way
to express how/where i want gcc to put its nose into. And i get back
to fixing branches, inlining and unrolling (wink) by hand.


 I'm aware of that progression and some of my code is already being tested
 http://www.suse.de/~gcctest/c++bench/raytracer/ ;)

I see, we didn't seem to make that much progress on this testcase
performance wise yet ;)

It's a silly 100 LOC raytracer and historically g++ already did the
Right Thing[tm] (inlining everything), there's not much left to be
gained.


 For the other function, which heavily uses SSE vector intrinsics, g++
 is really doing a good job, if only for the, sometimes, duplicated
 structures here  there and the larger frame. But you can rule out
 g++'s inlining heuristic as it has no (or shouldn't have) any freedom.

Hmm, so then it should be either structure packing or regalloc. I will
be able to take a look only after returning from a course.
Honza

Regalloc is a lost cause on ia32 :)
Note that nowadays g++ is up to the point where despite those wastes,
it's still faster to inline it all in one rendering function than
splitting. And i think you can also put gcse on the culprit list.


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Mark Mitchell
tbp wrote:
 On 1/28/07, Mark Mitchell [EMAIL PROTECTED] wrote:
 Certainly, if we're generating zillions of zero-initializations to
 contiguous memory, rather than using memset, or an inline loop, that
 seems unfortunate.  Would you please file a bug report?
 Because it takes, or seems to, a large function with small structure
 sprinkled around to trigger proper condition, i can't make a
 convincing reduced testcase, I guess that goes along with what Richard
 said.

It doesn't need to be a small testcase.  If you have a preprocessed
source file and a command-line, I'm sure one of the GCC developers would
be able to analyze the situation.  We're all good at isolating problems,
even starting with big complicated inputs.

-- 
Mark Mitchell
CodeSourcery
[EMAIL PROTECTED]
(650) 331-3385 x713


Re: remarks about g++ 4.3 and some comparison to msvc icc on ia32

2007-01-28 Thread Vladimir N. Makarov

tbp wrote:


On 1/28/07, Richard Guenther [EMAIL PROTECTED] wrote:


On 1/28/07, tbp [EMAIL PROTECTED] wrote:
 objdump -wdrfC --no-show-raw-insn $1 | perl -pe 's/^\s+\w+:\s+//' \
   | perl -ne 'printf "%4d\n", hex($1) if /sub\s+\$(0x\w+),%esp/' \
   | sort -r | head -n 10

 msvc:2196 2100 1772 1692 1688 1444 1428 1312 1308 1160
 icc: 2412 2280 2172 2044 1928 1848 1820 1588 1428 1396
 gcc: 2604 2596 2412 2076 2028 1932 1900 1756 1720 1132

It would have been nice to tell us what the particular columns in
this table mean - now we have to decrypt objdump params and
perl postprocessing ourselves.


I should have known better than to post on a sunday morning. Sorry.
That's the sorted 10 largest stack allocations in binaries produced by
each compiler (presuming most everything falls in place).
Each time i verify codegen for a function across all 3, gcc always has
the largest frame by a substantial amount (on ia32). And that's what
that rigorous table is trying to demonstrate ;)

Basically i'm wondering if a stack frame shrinking pass [ ] is
possible, [ ] makes no sense, [ ] has been done, [ ] is planed etc...

The current gcc register allocator (more correctly, reload) reserves a
separate stack slot for each pseudo register which did not get a hard
register.  Stack slot sharing (stack slot coloring) has been
implemented on the IRA branch.  This is very useful for x86 and x86_64.
Besides a smaller stack and better code locality, it results in smaller
displacements, which means smaller insns (code) and even better code
locality.
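
Slot sharing can be pictured as interval coloring: two spilled pseudos may use the same slot when their live ranges do not overlap. The Python sketch below is a simplified illustration of that idea, not the IRA implementation. It assumes each pseudo has a single half-open live range `(start, end)` and that all slots have the same size; `count_slots` is a made-up name.

```python
import heapq


def count_slots(live_ranges):
    """Number of stack slots needed when non-overlapping pseudos share."""
    busy_until = []  # min-heap of end points, one entry per allocated slot
    slots = 0
    for start, end in sorted(live_ranges):
        if busy_until and busy_until[0] <= start:
            # The slot that frees up earliest is free again: reuse it.
            heapq.heapreplace(busy_until, end)
        else:
            heapq.heappush(busy_until, end)
            slots += 1
    return slots


if __name__ == "__main__":
    spilled = [(0, 10), (5, 15), (10, 20)]
    print(len(spilled))          # 3 slots with one slot per pseudo
    print(count_slots(spilled))  # 2 slots with sharing
```

With 4-byte slots this toy case shrinks the spill area from 12 to 8 bytes; the displacement effect described above comes from the same shrinkage on real frames.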


Another thing is sharing slots for saving call used hard registers
through calls.  The current gcc register allocator reserves a separate
slot for each call used register assigned to a pseudo living through a
call.  It is not so important for x86 but it is more important for
x86_64 (there are more call used hard registers and they are bigger,
64-bit).  Such stack slot sharing has also been implemented on IRA.


I am focused on making IRA available for gcc 4.4.