Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Mike Stump

On Mar 13, 2006, at 12:16 AM, Paolo Bonzini wrote:
> PR/21195 is about inlining the SSE builtins.  These are special
> because, for example, you probably would prefer GDB to not step
> into them, but just execute them.


:-)  We have an APPLE LOCAL patch to remove the debug information
associated with them so that the debugger never steps `into' them.
:-(  attr (__nodebug)


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> On 3/13/06, Dan Kegel <[EMAIL PROTECTED]> wrote:
> > Is there a bugzilla entry describing the bug Richard is fixing?
> > If not, it'd be nice to have, if for no other reason than
> > it would show up naturally when people look for bugs fixed in gcc-4.1.1.
> >
> > I can create one, but it'd be better if someone actually
> > involved in the action did.
>
> I can do it.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26667

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, Dan Kegel <[EMAIL PROTECTED]> wrote:
> Is there a bugzilla entry describing the bug Richard is fixing?
> If not, it'd be nice to have, if for no other reason than
> it would show up naturally when people look for bugs fixed in gcc-4.1.1.
>
> I can create one, but it'd be better if someone actually
> involved in the action did.

I can do it.

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Dan Kegel
Is there a bugzilla entry describing the bug Richard is fixing?
If not, it'd be nice to have, if for no other reason than
it would show up naturally when people look for bugs fixed in gcc-4.1.1.

I can create one, but it'd be better if someone actually
involved in the action did.
- Dan

--
Wine for Windows ISVs: http://kegel.com/wine/isv


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> I don't think this is related, and a quick check with the patch shows
> still unaligned moves to the stack.
Patience is a virtue i guess :)
Are there good chances your inlining fix will hit mainline soon?


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html
> > /me ventilates.
> > You're my hero.
> A double+ hero on top of that.
> http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html
> I think i've hit that one too; reported here:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650

I don't think this is related, and a quick check with the patch shows
still unaligned moves to the stack.

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html
> /me ventilates.
> You're my hero.
A double+ hero on top of that.
http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html
I think i've hit that one too; reported here:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650

Well, i can always dream.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html
/me ventilates.
You're my hero.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > I see the bug and will have a fix in a moment.
> You made my day. Or you're about to. Unless you're lying and i'll have
> to curse you for 7 generations.

http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html

;)


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> I see the bug and will have a fix in a moment.
You made my day. Or you're about to. Unless you're lying and i'll have
to curse you for 7 generations.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> Of course from 4.1.0 on you can more easily stick an
> __attribute__((flatten)) on the function you want everything inlined to
> (finalblow) and get everything inlined into it.
But that's not really what i'm after: i expect trivial functions to
get inlined no matter what at a given -Ox.

> With always_inline on it, the wrappers are no longer inlined - this is a bug
> and should be reported.
> Can you report a bugzilla for the bad interaction between always_inline
> and inlining of simple wrappers?
I will report it again then.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > > Starting with gcc 4.1.0 we have inline heuristics in place that will
> > > _always_ inline such simple "wrappers".  So, if this still happens, there
> > > is a bug in the heuristics and that should be reported.  Before 4.1.0 the
> > > heuristics were bogus and wrappers were not inlined all the time.
> > > So, can you verify you are happy with the heuristics in 4.1.0
> > No i'm not, and i've used a pristine 4.1.0 in
> > http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html
>
> For the testcase in this message, I get (I removed the always_inline)
> all wrappers inlined to bloatit.  Of course bloatit does not get inlined
> w/o always_inline - it's a huge function and not a simple wrapper.  With
> always_inline on it, the wrappers are no longer inlined - this is a bug and
> should be reported.  Of course from 4.1.0 on you can more easily stick an
> __attribute__((flatten)) on the function you want everything inlined to
> (finalblow) and get everything inlined into it.

I see the bug and will have a fix in a moment.

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > Starting with gcc 4.1.0 we have inline heuristics in place that will
> > _always_ inline such simple "wrappers".  So, if this still happens, there
> > is a bug in the heuristics and that should be reported.  Before 4.1.0 the
> > heuristics were bogus and wrappers were not inlined all the time.
> > So, can you verify you are happy with the heuristics in 4.1.0
> No i'm not, and i've used a pristine 4.1.0 in
> http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html

For the testcase in this message, I get (I removed the always_inline)
all wrappers inlined to bloatit.  Of course bloatit does not get inlined
w/o always_inline - it's a huge function and not a simple wrapper.  With
always_inline on it, the wrappers are no longer inlined - this is a bug and
should be reported.  Of course from 4.1.0 on you can more easily stick an
__attribute__((flatten)) on the function you want everything inlined to
(finalblow) and get everything inlined into it.

Can you report a bugzilla for the bad interaction between always_inline
and inlining of simple wrappers?

Thanks,
Richard.
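
A minimal sketch of the flatten suggestion above, assuming GCC 4.1 or later
(or another compiler that accepts the attribute).  The wrapper and function
names are hypothetical and are not taken from the testcase in this thread.

```cpp
// Hypothetical trivial wrappers, standing in for the thin wrappers
// discussed in the thread.
static int add_wrapper(int a, int b) { return a + b; }
static int mul_wrapper(int a, int b) { return a * b; }

// flatten asks the compiler to inline, where possible, every call made
// in this function's body, so the wrappers above should not remain as
// out-of-line calls here.
__attribute__((flatten))
int combine(int a, int b)
{
    return mul_wrapper(add_wrapper(a, b), add_wrapper(b, a));
}
```

Unlike always_inline, the attribute goes on the caller rather than on each
callee, so the small functions themselves stay untagged.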


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> Starting with gcc 4.1.0 we have inline heuristics in place that will _always_
> inline such simple "wrappers".  So, if this still happens, there is a bug in
> the heuristics and that should be reported.  Before 4.1.0 the heuristics were
> bogus and wrappers were not inlined all the time.
> So, can you verify you are happy with the heuristics in 4.1.0
No i'm not, and i've used a pristine 4.1.0 in
http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html
I haven't tried that particular testcase on 4.2.x, but some weeks ago
i had to go thru all my code again to put always_inline in some
forgotten places because i was seeing even empty ctors not being
inlined (to the effect of having a call to a ret). So in this regard,
4.1.0 & 4.2.x still exhibit that kind of behaviour.

It seems to trigger when some particular threshold is met, either for
a function or unit, then nothing at all gets inlined but functions
tagged with always_inline; of course major performance regression
ensues.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Richard Guenther
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote:
> On 3/13/06, Paolo Bonzini <[EMAIL PROTECTED]> wrote:
> > Wait wait.  PR/21195 is about inlining the SSE builtins.
> No. PR/21195 was really about inline heuristic going ballistic.
> Those intrinsics are thin wrappers around builtins, and ultimately
> resolve to a couple of operations. Typical C++ (accessors/ctors) also
> presents lots of such small functions.
> And guess what, same cause same symptom.

Starting with gcc 4.1.0 we have inline heuristics in place that will _always_
inline such simple "wrappers".  So, if this still happens, there is a bug in the
heuristics and that should be reported.  Before 4.1.0 the heuristics were bogus
and wrappers were not inlined all the time.

So, can you verify you are happy with the heuristics in 4.1.0 (not talking about
inlining of memcpy/memset that are really not function inlining, but the
SSE/altivec inline function implementations).

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread tbp
On 3/13/06, Paolo Bonzini <[EMAIL PROTECTED]> wrote:
> Wait wait.  PR/21195 is about inlining the SSE builtins.
No. PR/21195 was really about inline heuristic going ballistic.
Those intrinsics are thin wrappers around builtins, and ultimately
resolve to a couple of operations. Typical C++ (accessors/ctors) also
presents lots of such small functions.
And guess what, same cause same symptom.

There's no sensible metric by which code i've quoted in previous mail
makes sense. Size? Nope. Execution time? Certainly not.

Again whether or not SSE ops are involved was and is still irrelevant.

> Your case seems to be different, because it involves inlining user
> routines.  Again, you need to give us the preprocessed source code for
> us to look at your bug effectively.
Thanks for the tip, but i'll pass. I've done my duty already.
Months ago there were 2 options for fixing PR/21195:
a) Fix the inlining heuristic.
b) Kludge all intrinsics with always_inline.

I've tried to argue a bit but to no avail. So, while you remain
convinced everything's fine with the  inliner, i'll keep tagging every
function in my code with always_inline/noinline where performance
matters.
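
A sketch of the tagging workaround described above, assuming a GCC-compatible
compiler.  The macro names FORCE_INLINE and NO_INLINE and the Vec2 type are
hypothetical illustrations, not code from this thread.

```cpp
// Hypothetical convenience macros around the GCC attributes.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE    __attribute__((noinline))
#else
#  define FORCE_INLINE inline
#  define NO_INLINE
#endif

struct Vec2 {
    float x, y;
    // Trivial ctor and accessor-style member: the kind of small function
    // the message above expects to be inlined at any -Ox.
    FORCE_INLINE Vec2(float x_, float y_) : x(x_), y(y_) {}
    FORCE_INLINE float dot(const Vec2& o) const { return x * o.x + y * o.y; }
};

// Explicitly kept out of line; the forced-inline members are still
// expanded inside it.
NO_INLINE float slow_path(const Vec2& a, const Vec2& b) { return a.dot(b); }
```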


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-13 Thread Paolo Bonzini

tbp wrote:
> On 3/13/06, Andrew Pinski <[EMAIL PROTECTED]> wrote:
> > Actually the best way of improving the inline heuristics is to get
> > a real testcase (and not some benchmark) where  the inline heuristics
> > is messed up.
> Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

Wait wait.  PR/21195 is about inlining 
the SSE builtins.  These are special because, for example, you probably 
would prefer GDB to not step into them, but just execute them.  As 
Andrew said, it is only an implementation choice (subject to revision) 
that they are implemented as inline functions at all.  For example, if 
an older GCC had a similar bug with Altivec intrinsics, it would have 
showed up only in C++ (because Altivec intrinsics were never implemented 
as inlines in C) and would not show up anymore in GCC 4.1 except for a 
handful of intrinsics (because most Altivec intrinsics are not inlines 
at all anymore).


memset/memcpy is different from SSE builtins because the choice of 
whether to inline or not is target dependent, and because glibc also 
decides whether or not to provide its own inlining, depending on the GCC 
version you're using.  So the best way to report the problem is to file 
a *preprocessed* testcase into Bugzilla (i.e. the output of "gcc -E 
testcase.c > testcase.i" or equivalently "gcc -save-temps testcase.c"), 
and to include the output of

   gcc -v testcase.c -O2

in the bug report.  Using preprocessed source code at least makes sure 
that the glibc choices are not influencing the comparison between 3.4.x 
and 4.0.x.  This information is present in the "how to file a bug" 
chapter of the manual.


Your case seems to be different, because it involves inlining user 
routines.  Again, you need to give us the preprocessed source code for 
us to look at your bug effectively.


Paolo


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread tbp
On 3/13/06, Andrew Pinski <[EMAIL PROTECTED]> wrote:
> Actually the best way of improving the inline heuristics is to get
> a real testcase (and not some benchmark) where  the inline heuristics
> is messed up.
Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

$ /usr/local/gcc-4.1.0/bin/g++ -v
Using built-in specs.
Target: i686-pc-cygwin
Configured with: ../configure --prefix=/usr/local/gcc-4.1.0
--enable-languages=c,c++ --enable-threads=posix --with-system-zlib
--disable-checking --disable-nls --disable-shared
--disable-win32-registry --verbose --enable-bootstrap --with-gcc
--with-gnu-ld --with-gnu-as --with-cpu=k8
Thread model: posix
gcc version 4.1.0

/usr/local/gcc-4.1.0/bin/g++ -g -O3 -march=k8 -msse2   -o pr-inline.o
pr-inline.cc

#include <xmmintrin.h>

static __m128 mm_max_ps(const __m128 a, const __m128 b) { return _mm_max_ps(a,b); }
static __m128 mm_min_ps(const __m128 a, const __m128 b) { return _mm_min_ps(a,b); }
static __m128 mm_mul_ps(const __m128 a, const __m128 b) { return _mm_mul_ps(a,b); }
static __m128 mm_div_ps(const __m128 a, const __m128 b) { return _mm_div_ps(a,b); }
static __m128 mm_or_ps(const __m128 a, const __m128 b) { return _mm_or_ps(a,b); }
static int mm_movemask_ps(const __m128 a) { return _mm_movemask_ps(a); }

static __attribute__ ((always_inline)) bool bloatit(const __m128 a, const __m128 b) {
  const __m128
    v0 = mm_max_ps(a,b),
    v1 = mm_min_ps(a,b),
    v2 = mm_mul_ps(a,b),
    v3 = mm_div_ps(a,b),
    g0 = mm_or_ps(_mm_or_ps(_mm_or_ps(v0,v1), v2), v3),
    v4 = mm_min_ps(mm_or_ps(a,b),mm_div_ps(b,a)),
    v5 = mm_max_ps(mm_min_ps(a,mm_div_ps(b,a)), mm_or_ps(b, mm_div_ps(b,g0))),
    g1 = mm_or_ps(g0,mm_or_ps(v4,v5));
  return mm_movemask_ps(g1);
}

bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d,
               const __m128 e, const __m128 f) {
  return
    bloatit(a,b) & bloatit(c,d) & bloatit(e,f) & bloatit(a,c) &
    bloatit(b,d) & bloatit(c,e) & bloatit(d,f) &
    bloatit(b,a) & bloatit(d,c) & bloatit(f,e) & bloatit(c,a) &
    bloatit(d,b) & bloatit(e,c) & bloatit(f,d);
}

int main() { return 0; }

00401080 :
  401080:   push   %ebp
  401081:   mulps  %xmm1,%xmm0
  401084:   mov    %esp,%ebp
  401086:   sub    $0x8,%esp
  401089:   leave
  40108a:   ret
  40108b:   nop
  40108c:   lea    0x0(%esi),%esi

00401090 :
  401090:   push   %ebp
  401091:   orps   %xmm1,%xmm0
  401094:   mov    %esp,%ebp
  401096:   sub    $0x8,%esp
  401099:   leave
  40109a:   ret
  40109b:   nop
  40109c:   lea    0x0(%esi),%esi

004010a0 :
  4010a0:   divps  %xmm1,%xmm0
  4010a3:   push   %ebp
  4010a4:   mov    %esp,%ebp
  4010a6:   sub    $0x8,%esp
  4010a9:   leave
  4010aa:   ret
  4010ab:   nop

...
004010e0 :
...
  401101:   call   4010c0 
  401106:   movaps %xmm0,0xf958(%ebp)
  40110d:   movaps 0xf8f8(%ebp),%xmm1
  401114:   movaps 0xf908(%ebp),%xmm0
  40111b:   call   4010b0 
  401120:   movaps 0xf8f8(%ebp),%xmm1
  401127:   movaps %xmm0,0xf948(%ebp)
  40112e:   movaps 0xf908(%ebp),%xmm0
  401135:   call   401080 
  40113a:   movaps 0xf8f8(%ebp),%xmm1
  401141:   movaps %xmm0,0xf938(%ebp)
  401148:   movaps 0xf908(%ebp),%xmm0
  40114f:   call   4010a0 
  401154:   movaps 0xf958(%ebp),%xmm1
  40115b:   orps   0xf948(%ebp),%xmm1
  401162:   movaps %xmm1,0xf958(%ebp)
  401169:   movaps %xmm0,%xmm1
  40116c:   movaps 0xf958(%ebp),%xmm0
  401173:   orps   0xf938(%ebp),%xmm0
  40117a:   call   401090 
  40117f:   movaps 0xf908(%ebp),%xmm1
  401186:   movaps %xmm0,0xf928(%ebp)
  40118d:   movaps 0xf8f8(%ebp),%xmm0
  401194:   call   4010a0 
  401199:   movaps 0xf8f8(%ebp),%xmm1


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Gabriel Dos Reis
Andrew Pinski <[EMAIL PROTECTED]> writes:

| > 
| > On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
| > > > Yes, why is the benchmark not valid?
| > >
| > > It is valid.  We should understand why this behavior has changed so
| > > drastically.
| > This benchmark maybe useless, it still exposes a weakness of gcc4. At
| > least it's not news to me:
| > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195
| > 
| > So that PR has been closed when gcc-devs marked all those intrinsics
| > as force_inline. That's also the kludge i use with my code. The real
| > problem is once you start marking some functions as force_inline, you
| > upset the inlining heuristic even more creating even more silly
| > inlining misses, rinse, repeat.
| > At the end of the day, everything is marked either force_inline or
| > noinline and you'd be better off without a heuristic at all.
| 
| Actually the best way of improving the inline heuristics is to get
| a real testcase (and not some benchmark) where  the inline heuristics
| is messed up.

I suppose that is part of the problem.  When users send feedback (for
example, by filing a PR) they feel the PRs are closed pretty quickly
with no acknowledgement of what is happening and the way that affects
program overall structures, and they can't do anything anyway --
because they feel the "masters" would close the PR with no possibility
of appeal.  When they send code snippets that demonstrate the
particular aspect of the compiler making their life miserable, they
are told their testcases are not "real".   However, we should also
acknowledge that aspects of the inlines are tuned based on
benchmarks.  For users, that course of action appears discouraging or
irritating, if not incoherent.

-- Gaby


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Gabriel Dos Reis
tbp <[EMAIL PROTECTED]> writes:

| On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
| > > Yes, why is the benchmark not valid?
| >
| > It is valid.  We should understand why this behavior has changed so
| > drastically.
| This benchmark maybe useless, it still exposes a weakness of gcc4. At
| least it's not news to me:
| http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195
| 
| So that PR has been closed when gcc-devs marked all those intrinsics
| as force_inline. That's also the kludge i use with my code. The real
| problem is once you start marking some functions as force_inline, you
| upset the inlining heuristic even more creating even more silly
| inlining misses, rinse, repeat.
| At the end of the day, everything is marked either force_inline or
| noinline and you'd be better off without a heuristic at all.

so force_inline is like a virus :-)

-- Gaby


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Andrew Pinski
> 
> On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
> > > Yes, why is the benchmark not valid?
> >
> > It is valid.  We should understand why this behavior has changed so 
> > drastically.
> This benchmark maybe useless, it still exposes a weakness of gcc4. At
> least it's not news to me:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195
> 
> So that PR has been closed when gcc-devs marked all those intrinsics
> as force_inline. That's also the kludge i use with my code. The real
> problem is once you start marking some functions as force_inline, you
> upset the inlining heuristic even more creating even more silly
> inlining misses, rinse, repeat.
> At the end of the day, everything is marked either force_inline or
> noinline and you'd be better off without a heuristic at all.

Actually the best way of improving the inline heuristics is to get
a real testcase (and not some benchmark) where  the inline heuristics
is messed up.  Now SSE intrinsics are special in that they should be
always inlined and that fact should be hidden from the user.  Maybe
they should be rewritten so that they are just like the altivec
intrinsics in that it is just a plain #define and nothing special to
the user and no worrying about the inlining heuristic.  I should
note that always inline was added for altivec intrinsics in the 
first place and they have now since been rewritten.  Also the
kernel uses always inline but I and others feel that is a mistake.

-- Pinski
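
To illustrate the two implementation styles contrasted above, here is a
minimal sketch using the GCC builtin __builtin_popcount as a stand-in for an
SSE or altivec builtin.  The names popcount_fn and POPCOUNT_MACRO are
hypothetical; this is not how any real intrinsics header is written.

```cpp
// Style 1: an inline function wrapper around the builtin.  Whether the
// call disappears depends on the inliner (or on always_inline).
static inline int popcount_fn(unsigned x) { return __builtin_popcount(x); }

// Style 2: a plain #define.  The preprocessor substitutes the builtin
// directly, so the inline heuristics never see a wrapper function.
#define POPCOUNT_MACRO(x) __builtin_popcount(x)
```

The macro form gives up the type checking of the function form, which is
part of the trade-off between the two styles.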



Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread tbp
On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
> > Yes, why is the benchmark not valid?
>
> It is valid.  We should understand why this behavior has changed so 
> drastically.
This benchmark maybe useless, it still exposes a weakness of gcc4. At
least it's not news to me:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195

So that PR has been closed when gcc-devs marked all those intrinsics
as force_inline. That's also the kludge i use with my code. The real
problem is once you start marking some functions as force_inline, you
upset the inlining heuristic even more creating even more silly
inlining misses, rinse, repeat.
At the end of the day, everything is marked either force_inline or
noinline and you'd be better off without a heuristic at all.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Richard Guenther
On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> So, I tried to reproduce the slowdown and on i686 get all
> memcpy/memset inlined on 3.3, 3.4, 4.0 and 4.1.  On ppc I get calls to
> memcpy/memset in all cases.  This might be more a glibc issue I think.

So my suggestion is to file a bugzilla PR about this problem including
_preprocessed_ source to get glibc dependencies out of the way.

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Richard Guenther
On 12 Mar 2006 18:09:26 +0100, Gabriel Dos Reis
<[EMAIL PROTECTED]> wrote:
> "Richard Guenther" <[EMAIL PROTECTED]> writes:
>
> [...]
>
> | this one should be measured.  But note that the benchmark is a
> | no-op and can be validly optimized to int main() { return 0; } by the
> | compiler.  This is why I call it a stupid benchmark.
>
> please let's refrain from getting into that black hole.
>
> Different people measure different things that they perceive important
> for them.  I doubt that the "optimization to int main() { return 0; }"
> would be useful to everybody.
>
> | Also you are measuring exclusively cache performance.
>
> that may be a decisive criterion under given circumstances; it takes
> more justification to qualify it as "stupid benchmark".  We can either
> acknowledge "oops, we fumbled that case; but we are not going to fix
> it" or "well, we should not have done that; it should be fixed".
> But handwaving with "stupid" qualification is not helpful.

So, I tried to reproduce the slowdown and on i686 get all
memcpy/memset inlined on 3.3, 3.4, 4.0 and 4.1.  On ppc I get calls to
memcpy/memset in all cases.  This might be more a glibc issue I think.

Richard.


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Gabriel Dos Reis
"Richard Guenther" <[EMAIL PROTECTED]> writes:

[...]

| this one should be measured.  But note that the benchmark is a
| no-op and can be validly optimized to int main() { return 0; } by the
| compiler.  This is why I call it a stupid benchmark.

please let's refrain from getting into that black hole.

Different people measure different things that they perceive important
for them.  I doubt that the "optimization to int main() { return 0; }"
would be useful to everybody.

| Also you are measuring exclusively cache performance.

that may be a decisive criterion under given circumstances; it takes
more justification to qualify it as "stupid benchmark".  We can either
acknowledge "oops, we fumbled that case; but we are not going to fix
it" or "well, we should not have done that; it should be fixed".  
But handwaving with "stupid" qualification is not helpful.

-- Gaby


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Richard Guenther
On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote:
> On Sun, 2006-03-12 at 15:17 +0100, Richard Guenther wrote:
> > On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote:
> > > > In any case: memcpy/memset inlining is broken in current GCC at least
> > > > on athlon arch.
> >
> > let's say it changed.  Also memcpy/memset "inlining" is not regular inlining
> > but driven by completely different heuristics.
> >
> > > Yes, why is the benchmark not valid?
> > > Then we would appreciate if the developers could recommend a valid test.
> >
> > What is the benchmark supposed to measure?
>
> The following is from the website mentioned previously:
> =
>
> What does it benchmark?

I asked about the specific benchmark, I guess

> Bashmark is testing the things that most applications need. It is trying
> to show you how well your hardware works together.
> Currently the things which are being tested are:
> -Calculations with types of different range
> -Calculations with floating point types of different range
> -Read and write into the memory with different size.

this one should be measured.  But note that the benchmark is a
no-op and can be validly optimized to int main() { return 0; } by the
compiler.  This is why I call it a stupid benchmark.  Also you are
measuring exclusively cache performance.

Richard.
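
One common way to keep such a loop from being a no-op is to make its result
observable.  The sketch below is an assumption about how that could look,
not the bashmark code: the function name copy_and_checksum and its
parameters are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Fills src, copies it to dst, and folds one copied byte per iteration
// into a checksum.  Returning (and then using) the checksum gives the
// copies an observable effect; n must be greater than zero.
std::uint32_t copy_and_checksum(std::uint8_t* dst, std::uint8_t* src,
                                std::size_t n, std::uint32_t loops)
{
    std::uint32_t sum = 0;
    for (std::uint32_t i = 0; i < loops; ++i) {
        std::memset(src, static_cast<int>(i & 0xff), n);
        std::memcpy(dst, src, n);
        sum += dst[i % n];
    }
    return sum;
}
```

The caller would print or otherwise consume the returned value.  This does
not stop the compiler from simplifying the loop, it only removes the option
of deleting it outright.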


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Ernest L. Williams Jr.
On Sun, 2006-03-12 at 15:17 +0100, Richard Guenther wrote:
> On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote:
> > > In any case: memcpy/memset inlining is broken in current GCC at least
> > > on athlon arch.
> 
> let's say it changed.  Also memcpy/memset "inlining" is not regular inlining
> but driven by completely different heuristics.
> 
> > Yes, why is the benchmark not valid?
> > Then we would appreciate if the developers could recommend a valid test.
> 
> What is the benchmark supposed to measure?

The following is from the website mentioned previously:
=

What does it benchmark?
Bashmark is testing the things that most applications need. It is trying
to show you how well your hardware works together.
Currently the things which are being tested are:
-Calculations with types of different range
-Calculations with floating point types of different range
-Read and write into the memory with different size.
-Calling your system for memory and give it free with different size.
-Tryout the speed of one main part of multithreading
=


Ernesto


> 
> Richard.



Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Nickolay Kolchin
On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
>
> It is valid.  We should understand why this behavior has changed so 
> drastically.
>

I've attached assembler output from different compiler versions:

3.4.5-athlon-xp: gcc-3.4.5 -O3 -march=athlon-xp
3.4.5-pentium4: gcc-3.4.5 -O3 -march=pentium4
4.1.0-athlon-xp: gcc-4.1.0 -O3 -march=athlon-xp

As you can see, gcc-3.4.5 generates fastest code for
"-march=athlon-xp". This code should also run faster on any pentium
machine.

gcc-4.1.0 generates "same" slow code for "pentium" and "athlon" arch.

--
Nickolay


test_cmd-3.4.5-athlon-xp.s
Description: Binary data


test_cmd-3.4.5-pentium4.s
Description: Binary data


test_cmd-4.1.0-athlon-xp.s
Description: Binary data


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Richard Guenther
On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote:
> > In any case: memcpy/memset inlining is broken in current GCC at least
> > on athlon arch.

let's say it changed.  Also memcpy/memset "inlining" is not regular inlining
but driven by completely different heuristics.

> Yes, why is the benchmark not valid?
> Then we would appreciate if the developers could recommend a valid test.

What is the benchmark supposed to measure?

Richard.
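
A small sketch of the distinction made above, with hypothetical function
names.  Expansion of memcpy is a builtin decision made per call site and
per target, so the generated assembly (for example from compiling with -S)
is the only reliable way to see what a given compiler version did.

```cpp
#include <cstddef>
#include <cstring>

// Size is a compile-time constant: a candidate for the compiler's
// builtin expansion (a short sequence of moves instead of a call).
void copy_fixed(char* dst, const char* src)
{
    std::memcpy(dst, src, 64);
}

// Size is only known at run time: more likely to stay a call to the
// library memcpy.
void copy_any(char* dst, const char* src, std::size_t n)
{
    std::memcpy(dst, src, n);
}
```

Compiling the same file with -fno-builtin-memcpy and comparing the two
assembly outputs shows whether the builtin expansion was in effect.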


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Steven Bosscher
> Yes, why is the benchmark not valid?

It is valid.  We should understand why this behavior has changed so drastically.

Gr.
Steven


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Ernest L. Williams Jr.
On Sun, 2006-03-12 at 16:55 +0300, Nickolay Kolchin wrote:
> On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > > During "bashmark" memory benchmark performance analysis, I found a 100x
> > > performance regression between gcc 3.4.5 and gcc 4.X.
> > >
> > > -- test_cmd.cpp (simplified bashmark memory RW test) ---
> > > #include 
> > > #include 
> > >
> > > template 
> > > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > > {
> > >   for(uint32_t i = 0; i < Loops; i+=1)
> > >   {
> > > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > > T T T T T
> > > T T T T T
> > > #undef T
> > >   }
> > > }
> > >
> > > template 
> > > static void membench()
> > > {
> > >   static uint8_t mb1[Buf_Size];
> > >   static uint8_t mb2[Buf_Size];
> > >   for(uint32_t i = 0; i < 1; i+=1)
> > > int_membench(mb1, mb2);
> > > }
> > >
> > > int main()
> > > {
> > >   membench<128, 4000>();
> > >   return 0;
> > > }
> > >
> > > ---
> > > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> > >
> > > Compiler options:
> > > -march=athlon-xp
> > > -O3
> > > -fomit-frame-pointer
> > > -mfpmath=sse -msse
> > > -ftracer -fweb
> > > -maccumulate-outgoing-args
> > > -ffast-math
> > >
> > > I've played with various settings (-O2, -O1, without march, without
> > > tracer and web, etc) without any serious difference. I.e. GCC4 is always
> > > many times slower than GCC 3.4.5.
> > >
> > > Looking at the generated assembler showed that GCC4 doesn't inline
> > > memcpy and memset calls.
> > >
> > > -- test.c (uber simplified problem demonstration) -
> > > #include <string.h>
> > >
> > > char* f(char* b)
> > > {
> > >   static char a[64];
> > >   memcpy(a, b, 64);
> > >   memset(a, 0, 64);
> > >   return a;
> > > }
> > > 
> > >
> > > GCC4 will generate calls to memcpy and memset in this example. GCC3 will
> > > inline all calls.
> > >
> > > So, it looks like GCC4 inliner is broken at some point.
> >
> > Inlining of memcpy/memset is architecture dependent (I see calls
> > on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
> > not worth optimizing for.
> >
> 
> bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
> just a test to demonstrate problem and as such can't be stupid. :)
> 
> Situation when compiler generates code from simple test that run 100
> times slower, than code from previous compiler version is not normal
> anyway.  (and GCC3 generates smaller code, too)
> 
> I thought that this regression was caused by different "max-inline-*"
> params setting in 4.X.
> 
> In any case: memcpy/memset inlining is broken in current GCC at least
> on athlon arch.

Yes, why is the benchmark not valid?
Then we would appreciate if the developers could recommend a valid test.

Here is what I get on my platform:
==
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)
Architecture = i686
OS: Linux 
Kernel: 2.6.15-1.1833_FC4

[EMAIL PROTECTED] src]$ time ./test_cmd

real    0m50.583s
user    0m50.003s
sys     0m0.220s
===

Thanks,
Ernesto


> 
> --
> Nickolay



Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Nickolay Kolchin
On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > While analyzing the performance of the "bashmark" memory benchmark, I
> > found a 100x performance regression between gcc 3.4.5 and gcc 4.X.
> >
> > -- test_cmd.cpp (simplified bashmark memory RW test) ---
> > #include 
> > #include 
> >
> > template 
> > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > {
> >   for(uint32_t i = 0; i < Loops; i+=1)
> >   {
> > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > T T T T T
> > T T T T T
> > #undef T
> >   }
> > }
> >
> > template 
> > static void membench()
> > {
> >   static uint8_t mb1[Buf_Size];
> >   static uint8_t mb2[Buf_Size];
> >   for(uint32_t i = 0; i < 1; i+=1)
> > int_membench(mb1, mb2);
> > }
> >
> > int main()
> > {
> >   membench<128, 4000>();
> >   return 0;
> > }
> >
> > ---
> > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> >
> > Compiler options:
> > -march=athlon-xp
> > -O3
> > -fomit-frame-pointer
> > -mfpmath=sse -msse
> > -ftracer -fweb
> > -maccumulate-outgoing-args
> > -ffast-math
> >
> > I've played with various settings (-O2, -O1, without -march, without
> > -ftracer and -fweb, etc.) without any serious difference, i.e. GCC4 is
> > always many times slower than GCC 3.4.5.
> >
> > Looking at the generated assembly showed that GCC4 doesn't inline the
> > memcpy and memset calls.
> >
> > -- test.c (uber simplified problem demonstration) -
> > #include <string.h>
> >
> > char* f(char* b)
> > {
> >   static char a[64];
> >   memcpy(a, b, 64);
> >   memset(a, 0, 64);
> >   return a;
> > }
> > 
> >
> > GCC4 generates calls to memcpy and memset in this example, while GCC3
> > inlines both calls.
> >
> > So it looks like the GCC4 inliner is broken at some point.
>
> Inlining of memcpy/memset is architecture dependent (I see calls
> on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
> not worth optimizing for.
>

bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
just a test to demonstrate the problem and as such can't be stupid. :)

A situation where the compiler generates code for a simple test that
runs 100 times slower than the code from the previous compiler version
is not normal anyway.  (And GCC3 generates smaller code, too.)

I thought that this regression was caused by different "max-inline-*"
param settings in 4.X.

In any case, memcpy/memset inlining is broken in current GCC, at least
on the athlon arch.

--
Nickolay
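
For anyone who wants to time the loop without the C++ templates, here is
a plain C sketch of the same copy-then-fill pattern. BLOCK_SIZE, the
loops parameter and time_membench are assumptions made for illustration;
the template arguments were lost from the archived message, so this is
not the exact bashmark test.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Assumed block size; the original template parameters are not known. */
enum { BLOCK_SIZE = 128 };

static uint8_t mb1[BLOCK_SIZE];
static uint8_t mb2[BLOCK_SIZE];

/* Copy mb2 into mb1, then refill mb2 with the low byte of i, `loops` times. */
static void int_membench(uint32_t loops)
{
    for (uint32_t i = 0; i < loops; i += 1) {
        memcpy(mb1, mb2, BLOCK_SIZE);
        memset(mb2, (int)i, BLOCK_SIZE);
    }
}

/* Hypothetical helper: CPU seconds spent in int_membench(loops). */
static double time_membench(uint32_t loops)
{
    clock_t start = clock();
    int_membench(loops);
    clock_t end = clock();
    assert(start != (clock_t)-1 && end != (clock_t)-1);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Building this once with each compiler at the same options and comparing
the value returned by time_membench should show whether the gap comes
from the memcpy/memset expansion or from something else in the test.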


Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X

2006-03-12 Thread Richard Guenther
On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> While analyzing the performance of the "bashmark" memory benchmark, I
> found a 100x performance regression between gcc 3.4.5 and gcc 4.X.
>
> -- test_cmd.cpp (simplified bashmark memory RW test) ---
> #include 
> #include 
>
> template 
> static void int_membench(uint8_t* mb1, uint8_t* mb2)
> {
>   for(uint32_t i = 0; i < Loops; i+=1)
>   {
> #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> T T T T T
> T T T T T
> #undef T
>   }
> }
>
> template 
> static void membench()
> {
>   static uint8_t mb1[Buf_Size];
>   static uint8_t mb2[Buf_Size];
>   for(uint32_t i = 0; i < 1; i+=1)
> int_membench(mb1, mb2);
> }
>
> int main()
> {
>   membench<128, 4000>();
>   return 0;
> }
>
> ---
> GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
>
> Compiler options:
> -march=athlon-xp
> -O3
> -fomit-frame-pointer
> -mfpmath=sse -msse
> -ftracer -fweb
> -maccumulate-outgoing-args
> -ffast-math
>
> I've played with various settings (-O2, -O1, without -march, without
> -ftracer and -fweb, etc.) without any serious difference, i.e. GCC4 is
> always many times slower than GCC 3.4.5.
>
> Looking at the generated assembly showed that GCC4 doesn't inline the
> memcpy and memset calls.
>
> -- test.c (uber simplified problem demonstration) -
> #include <string.h>
>
> char* f(char* b)
> {
>   static char a[64];
>   memcpy(a, b, 64);
>   memset(a, 0, 64);
>   return a;
> }
> 
>
> GCC4 generates calls to memcpy and memset in this example, while GCC3
> inlines both calls.
>
> So it looks like the GCC4 inliner is broken at some point.

Inlining of memcpy/memset is architecture dependent (I see calls
on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
not worth optimizing for.

Richard.
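
Whether the calls are expanded inline on a given target can be checked
directly from the reduced test case. A minimal sketch, assuming the
include that was stripped from the archived message is <string.h>;
check_f is a hypothetical helper, added only to confirm that the
function behaves the same whether or not the calls are inlined.

```c
#include <assert.h>
#include <string.h>

/* Reduced test case from the report; the include is assumed to be <string.h>. */
char *f(char *b)
{
    static char a[64];
    memcpy(a, b, 64);   /* candidate for inline expansion */
    memset(a, 0, 64);   /* overwrites the copy, so a ends up all zero */
    return a;
}

/* Hypothetical helper: returns 1 if f() leaves all 64 bytes zero. */
int check_f(void)
{
    char src[64];
    memset(src, 'x', sizeof src);
    char *a = f(src);
    assert(a != NULL);
    for (int i = 0; i < 64; i++)
        if (a[i] != 0)
            return 0;
    return 1;
}
```

Compiling with "gcc -O2 -S test.c" and searching the resulting test.s
for memcpy and memset shows whether library calls were emitted.
Adding -fno-builtin forces the library calls, which gives a baseline
to compare against.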