Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On Mar 13, 2006, at 12:16 AM, Paolo Bonzini wrote:
> PR/21195 is about inlining the SSE builtins. These are special because,
> for example, you probably would prefer GDB to not step into them, but
> just execute them.

:-) We have an APPLE LOCAL patch to remove the debug information
associated with them so that the debugger never steps `into' them. :-(

attr (__nodebug)
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > On 3/13/06, Dan Kegel <[EMAIL PROTECTED]> wrote: > > Is there a bugzilla entry describing the bug Richard is fixing? > > If not, it'd be nice to have, if for no other reason than > > it would show up naturally when people look for bugs fixed in gcc-4.1.1. > > > > I can create one, but it'd be better if someone actually > > involved in the action did. > > I can do it. http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26667 Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Dan Kegel <[EMAIL PROTECTED]> wrote: > Is there a bugzilla entry describing the bug Richard is fixing? > If not, it'd be nice to have, if for no other reason than > it would show up naturally when people look for bugs fixed in gcc-4.1.1. > > I can create one, but it'd be better if someone actually > involved in the action did. I can do it. Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
Is there a bugzilla entry describing the bug Richard is fixing? If not, it'd be nice to have, if for no other reason than it would show up naturally when people look for bugs fixed in gcc-4.1.1. I can create one, but it'd be better if someone actually involved in the action did. - Dan -- Wine for Windows ISVs: http://kegel.com/wine/isv
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > I don't think this is related, and a quick check with the patch shows > still unaligned > moves to the stack. Patience is a virtue i guess :) Is there good chances your inlining fix will hit mainline soon?
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > > > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html > > /me ventilates. > > You're my hero. > A double+ hero on top of that. > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html > I think i've hit that one that one too; reported here: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650 I don't think this is related, and a quick check with the patch shows still unaligned moves to the stack. Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html > /me ventilates. > You're my hero. A double+ hero on top of that. http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00737.html I think i've hit that one that one too; reported here: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=26650 Well, i can always dream.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html /me ventilates. You're my hero.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > > I see the bug and will have a fix in a moment. > You made my day. Or you're about to. Unless you're lying and i'll have > to curse you for 7 generations. http://gcc.gnu.org/ml/gcc-patches/2006-03/msg00739.html ;)
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > I see the bug and will have a fix in a moment. You made my day. Or you're about to. Unless you're lying and i'll have to curse you for 7 generations.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > Of course from 4.1.0 on you can easier stick an > __attribute__((flatten)) on the function you want everything inlined to > (finalblow) and get everything inlined into it. But that's not really what i'm after: i expect trivial functions to get inlined no matter what at a given -Ox. > With always_inline on it, the wrappers are no longer inlined - this is a bug > and > should be reported. > Can you report a bugzilla for the bad interaction between always_inline > and inlining of simple wrappers? I will report it again then.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > > > Starting with gcc 4.1.0 we have inline heuristics in place that will > > _always_ > > > inline such simple "wrappers". So, if this still happens, there is a bug > > > in the > > > heuristics and that should be reported. Before 4.1.0 the heuristics were > > > bogus > > > and wrappers were not inlined all the time. > > > So, can you verify you are happy with the heuristics in 4.1.0 > > No i'm not, and i've used a pristine 4.1.0 in > > http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html > > For the testcase in this message, I get (I removed the always_inline) > all wrappers inlined to bloatit. Of course bloatit does not get inlined > w/o always_inline - it's a huge function and not a simple wrapper. With > always_inline on it, the wrappers are no longer inlined - this is a bug and > should be reported. Of course from 4.1.0 on you can easier stick an > __attribute__((flatten)) on the function you want everything inlined to > (finalblow) and get everything inlined into it. I see the bug and will have a fix in a moment. Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > > Starting with gcc 4.1.0 we have inline heuristics in place that will > _always_ > > inline such simple "wrappers". So, if this still happens, there is a bug > > in the > > heuristics and that should be reported. Before 4.1.0 the heuristics were > > bogus > > and wrappers were not inlined all the time. > > So, can you verify you are happy with the heuristics in 4.1.0 > No i'm not, and i've used a pristine 4.1.0 in > http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html For the testcase in this message, I get (I removed the always_inline) all wrappers inlined to bloatit. Of course bloatit does not get inlined w/o always_inline - it's a huge function and not a simple wrapper. With always_inline on it, the wrappers are no longer inlined - this is a bug and should be reported. Of course from 4.1.0 on you can easier stick an __attribute__((flatten)) on the function you want everything inlined to (finalblow) and get everything inlined into it. Can you report a bugzilla for the bad interaction between always_inline and inlining of simple wrappers? Thanks, Richard.
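The flatten suggestion above can be sketched roughly as follows. This is only an illustration, not the testcase from the thread: the names `wrap_max`, `wrap_min`, `combine` and `hot_path` are hypothetical, plain `int` arithmetic stands in for the SSE intrinsics, and it assumes a compiler that accepts GCC's `flatten` attribute (GCC 4.1.0 or later, going by the message above).

```cpp
// Sketch: __attribute__((flatten)) on the caller asks the compiler to
// inline the calls made in its body, where possible, instead of tagging
// every callee by hand. All names below are hypothetical.

// Thin wrappers, similar in spirit to the mm_*_ps wrappers in the thread.
static int wrap_max(int a, int b) { return a > b ? a : b; }
static int wrap_min(int a, int b) { return a < b ? a : b; }

// A larger helper built from the wrappers (stand-in for "bloatit").
static int combine(int a, int b)
{
    const int hi = wrap_max(a, b);
    const int lo = wrap_min(a, b);
    return hi - lo;
}

// The attribute goes on the function everything should be inlined into
// (stand-in for "finalblow").
__attribute__((flatten))
int hot_path(int a, int b, int c, int d)
{
    return combine(a, b) + combine(c, d);
}
```

Comparing the output of `g++ -O2 -S` with and without the attribute is one way to check whether calls to `combine` and the wrappers remain in `hot_path`.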
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > Starting with gcc 4.1.0 we have inline heuristics in place that will _always_ > inline such simple "wrappers". So, if this still happens, there is a bug in > the > heuristics and that should be reported. Before 4.1.0 the heuristics were > bogus > and wrappers were not inlined all the time. > So, can you verify you are happy with the heuristics in 4.1.0 No i'm not, and i've used a pristine 4.1.0 in http://gcc.gnu.org/ml/gcc/2006-03/msg00410.html I haven't tried that particular testcase on 4.2.x, but some weeks ago i had to go thru all my code again to put always_inline in some forgotten places because i was seeing even empty ctors not being inlined (to the effect of having a call to a ret). So in this regard, 4.1.0 & 4.2.x still exhibit that kind of behaviour. It seems to trigger when some particular threshold is met, either for a function or unit, then nothing at all gets inlined but functions tagged with always_inline; of course major performance regression ensues.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, tbp <[EMAIL PROTECTED]> wrote: > On 3/13/06, Paolo Bonzini <[EMAIL PROTECTED]> wrote: > >Wait wait. PR/21195 is about inlining > > the SSE builtins. > No. PR/21195 was really about inline heuristic going ballistic. > Those intrinsics are thin wrappers around builtins, and ultimately > resolve to a couple of operations. Typical C++ (accessors/ctors) also > presents lots of such small functions. > And guess what, same cause same symptom. Starting with gcc 4.1.0 we have inline heuristics in place that will _always_ inline such simple "wrappers". So, if this still happens, there is a bug in the heuristics and that should be reported. Before 4.1.0 the heuristics were bogus and wrappers were not inlined all the time. So, can you verify you are happy with the heuristics in 4.1.0 (not talking about inlining of memcpy/memset that are really not function inlining, but the SSE/altivec inline function implementations). Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Paolo Bonzini <[EMAIL PROTECTED]> wrote: >Wait wait. PR/21195 is about inlining > the SSE builtins. No. PR/21195 was really about inline heuristic going ballistic. Those intrinsics are thin wrappers around builtins, and ultimately resolve to a couple of operations. Typical C++ (accessors/ctors) also presents lots of such small functions. And guess what, same cause same symptom. There's no sensible metric by which code i've quoted in previous mail makes sense. Size? Nope. Execution time? Certainly not. Again whether or not SSE ops are involved was and is still irrelevant. > Your case seems to be different, because it involves inlining user > routines. Again, you need to give us the preprocessed source code for > us to look at your bug effectively. Thanks for the tip, but i'll pass. I've done my duty already. Months ago there was 2 options for fixing PR/21195: a) Fix the inlining heuristic. b) Kludge all intrinsics with always_inline. I've tried to argue a bit but to no avail. So, while you remain convinced everything's fine with the inliner, i'll keep tagging every function in my code with always_inline/noinline where performance matters.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
tbp wrote:
> On 3/13/06, Andrew Pinski <[EMAIL PROTECTED]> wrote:
> > Actually the best way of improving the inline heuristics is to get
> > a real testcase (and not some benchmark) where the inline heuristics
> > is messed up.
> Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

Wait wait. PR/21195 is about inlining the SSE builtins. These are special
because, for example, you probably would prefer GDB to not step into them,
but just execute them. As Andrew said, it is only an implementation choice
(subject to revision) that they are implemented as inline functions at all.

For example, if an older GCC had a similar bug with Altivec intrinsics, it
would have shown up only in C++ (because Altivec intrinsics were never
implemented as inlines in C) and would not show up anymore in GCC 4.1 except
for a handful of intrinsics (because most Altivec intrinsics are not inlines
at all anymore).

memset/memcpy is different from SSE builtins because the choice of whether
to inline or not is target dependent, and because glibc also decides whether
or not to provide its own inlining, depending on the GCC version you're using.

So the best way to report the problem is to file a *preprocessed* testcase
into Bugzilla (i.e. the output of "gcc -E testcase.c > testcase.i" or
equivalently "gcc -save-temps testcase.c"), and to include the output of
"gcc -v testcase.c -O2" in the bug report. Using preprocessed source code at
least makes sure that the glibc choices are not influencing the comparison
between 3.4.x and 4.0.x. This information is present in the "how to file a
bug" chapter of the manual.

Your case seems to be different, because it involves inlining user routines.
Again, you need to give us the preprocessed source code for us to look at
your bug effectively.

Paolo
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/13/06, Andrew Pinski <[EMAIL PROTECTED]> wrote:
> Actually the best way of improving the inline heuristics is to get
> a real testcase (and not some benchmark) where the inline heuristics
> is messed up.

Ah, you mean a brand new testcase because PR-21195 wasn't good enough?

$ /usr/local/gcc-4.1.0/bin/g++ -v
Using built-in specs.
Target: i686-pc-cygwin
Configured with: ../configure --prefix=/usr/local/gcc-4.1.0 --enable-languages=c,c++ --enable-threads=posix --with-system-zlib --disable-checking --disable-nls --disable-shared --disable-win32-registry --verbose --enable-bootstrap --with-gcc --with-gnu-ld --with-gnu-as --with-cpu=k8
Thread model: posix
gcc version 4.1.0

/usr/local/gcc-4.1.0/bin/g++ -g -O3 -march=k8 -msse2 -o pr-inline.o pr-inline.cc

#include
static __m128 mm_max_ps(const __m128 a, const __m128 b) { return _mm_max_ps(a,b); }
static __m128 mm_min_ps(const __m128 a, const __m128 b) { return _mm_min_ps(a,b); }
static __m128 mm_mul_ps(const __m128 a, const __m128 b) { return _mm_mul_ps(a,b); }
static __m128 mm_div_ps(const __m128 a, const __m128 b) { return _mm_div_ps(a,b); }
static __m128 mm_or_ps(const __m128 a, const __m128 b) { return _mm_or_ps(a,b); }
static int mm_movemask_ps(const __m128 a) { return _mm_movemask_ps(a); }

static __attribute__ ((always_inline)) bool bloatit(const __m128 a, const __m128 b) {
    const __m128
        v0 = mm_max_ps(a,b),
        v1 = mm_min_ps(a,b),
        v2 = mm_mul_ps(a,b),
        v3 = mm_div_ps(a,b),
        g0 = mm_or_ps(_mm_or_ps(_mm_or_ps(v0,v1), v2), v3),
        v4 = mm_min_ps(mm_or_ps(a,b),mm_div_ps(b,a)),
        v5 = mm_max_ps(mm_min_ps(a,mm_div_ps(b,a)), mm_or_ps(b, mm_div_ps(b,g0))),
        g1 = mm_or_ps(g0,mm_or_ps(v4,v5));
    return mm_movemask_ps(g1);
}

bool finalblow(const __m128 a, const __m128 b, const __m128 c, const __m128 d, const __m128 e, const __m128 f) {
    return
        bloatit(a,b) & bloatit(c,d) & bloatit(e,f) &
        bloatit(a,c) & bloatit(b,d) & bloatit(c,e) & bloatit(d,f) &
        bloatit(b,a) & bloatit(d,c) & bloatit(f,e) &
        bloatit(c,a) & bloatit(d,b) & bloatit(e,c) & bloatit(f,d);
}

int main() { return 0; }

00401080 :
  401080: push   %ebp
  401081: mulps  %xmm1,%xmm0
  401084: mov    %esp,%ebp
  401086: sub    $0x8,%esp
  401089: leave
  40108a: ret
  40108b: nop
  40108c: lea    0x0(%esi),%esi

00401090 :
  401090: push   %ebp
  401091: orps   %xmm1,%xmm0
  401094: mov    %esp,%ebp
  401096: sub    $0x8,%esp
  401099: leave
  40109a: ret
  40109b: nop
  40109c: lea    0x0(%esi),%esi

004010a0 :
  4010a0: divps  %xmm1,%xmm0
  4010a3: push   %ebp
  4010a4: mov    %esp,%ebp
  4010a6: sub    $0x8,%esp
  4010a9: leave
  4010aa: ret
  4010ab: nop
...
004010e0 :
...
  401101: call   4010c0
  401106: movaps %xmm0,0xf958(%ebp)
  40110d: movaps 0xf8f8(%ebp),%xmm1
  401114: movaps 0xf908(%ebp),%xmm0
  40111b: call   4010b0
  401120: movaps 0xf8f8(%ebp),%xmm1
  401127: movaps %xmm0,0xf948(%ebp)
  40112e: movaps 0xf908(%ebp),%xmm0
  401135: call   401080
  40113a: movaps 0xf8f8(%ebp),%xmm1
  401141: movaps %xmm0,0xf938(%ebp)
  401148: movaps 0xf908(%ebp),%xmm0
  40114f: call   4010a0
  401154: movaps 0xf958(%ebp),%xmm1
  40115b: orps   0xf948(%ebp),%xmm1
  401162: movaps %xmm1,0xf958(%ebp)
  401169: movaps %xmm0,%xmm1
  40116c: movaps 0xf958(%ebp),%xmm0
  401173: orps   0xf938(%ebp),%xmm0
  40117a: call   401090
  40117f: movaps 0xf908(%ebp),%xmm1
  401186: movaps %xmm0,0xf928(%ebp)
  40118d: movaps 0xf8f8(%ebp),%xmm0
  401194: call   4010a0
  401199: movaps 0xf8f8(%ebp),%xmm1
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
Andrew Pinski <[EMAIL PROTECTED]> writes: | > | > On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote: | > > > Yes, why is the benchmark not valid? | > > | > > It is valid. We should understand why this behavior has changed so drastically. | > This benchmark maybe useless, it still exposes a weakness of gcc4. At | > least it's not news to me: | > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195 | > | > So that PR has been closed when gcc-devs marked all those intrinsics | > as force_inline. That's also the kludge i use with my code. The real | > problem is once you start marking some functions as force_inline, you | > upset the inlining heuristic even more creating even more silly | > inlining misses, rince, repeat. | > At the end of the day, everything is marked either force_inline or | > noinline and you'd be better off without a heuristic at all. | | Actually the best way of improving the inline heuristics is to get | a real testcase (and not some benchmark) where the inline heuristics | is messed up. I suppose that is part of the problem. When users send feedback (for example, by filling a PR) they feel the PRs are closed pretty quickly with no acknowledgement of what is happening and the way that affects program overall structures, and they can't do anything anyway -- because they feel the "masters" would close the PR with no possibility of appeal. When they send code snippets that demonstrate the particular aspect of the compiler making their life miserable, they are told their testcases are not "real". However, we should also acknowledge that aspects of the inlines are tuned based on benchmarks. For users, that course of action appears discouraging or irritating, if not incoherent. -- Gaby
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
tbp <[EMAIL PROTECTED]> writes: | On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote: | > > Yes, why is the benchmark not valid? | > | > It is valid. We should understand why this behavior has changed so drastically. | This benchmark maybe useless, it still exposes a weakness of gcc4. At | least it's not news to me: | http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195 | | So that PR has been closed when gcc-devs marked all those intrinsics | as force_inline. That's also the kludge i use with my code. The real | problem is once you start marking some functions as force_inline, you | upset the inlining heuristic even more creating even more silly | inlining misses, rince, repeat. | At the end of the day, everything is marked either force_inline or | noinline and you'd be better off without a heuristic at all. so force_inline is like a virus :-) -- Gaby
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
> > On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote: > > > Yes, why is the benchmark not valid? > > > > It is valid. We should understand why this behavior has changed so > > drastically. > This benchmark maybe useless, it still exposes a weakness of gcc4. At > least it's not news to me: > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195 > > So that PR has been closed when gcc-devs marked all those intrinsics > as force_inline. That's also the kludge i use with my code. The real > problem is once you start marking some functions as force_inline, you > upset the inlining heuristic even more creating even more silly > inlining misses, rince, repeat. > At the end of the day, everything is marked either force_inline or > noinline and you'd be better off without a heuristic at all. Actually the best way of improving the inline heuristics is to get a real testcase (and not some benchmark) where the inline heuristics is messed up. Now SSE intrinsics are special in that they should be always inlined and that fact should be hidden from the user. Maybe they should be rewritten so that they are just like the altivec intrinsics in that it is just a plain #define and nothing special to the user and no worrying about the inlining heuristic. I should note that always inline was added for altivec intrinsics in the first place and they have now since been rewritten. Also the kernel uses always inline but I and other feels that is a mistake. -- Pinski
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote: > > Yes, why is the benchmark not valid? > > It is valid. We should understand why this behavior has changed so > drastically. This benchmark maybe useless, it still exposes a weakness of gcc4. At least it's not news to me: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21195 So that PR has been closed when gcc-devs marked all those intrinsics as force_inline. That's also the kludge i use with my code. The real problem is once you start marking some functions as force_inline, you upset the inlining heuristic even more creating even more silly inlining misses, rince, repeat. At the end of the day, everything is marked either force_inline or noinline and you'd be better off without a heuristic at all.
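The hand-tagging workaround described above might look something like the sketch below. `FORCE_INLINE` and `NO_INLINE` are hypothetical macro names (the thread only says "force_inline" informally); here they are assumed to expand to GCC's `always_inline` and `noinline` attributes. `Vec2`, `dot` and `length_squared` are likewise made up for illustration.

```cpp
// Sketch of tagging functions by hand instead of relying on the inline
// heuristic. FORCE_INLINE / NO_INLINE are hypothetical names.
#if defined(__GNUC__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#  define NO_INLINE    __attribute__((noinline))
#else
#  define FORCE_INLINE inline
#  define NO_INLINE
#endif

struct Vec2 {
    float x, y;
    // Trivial constructor and accessors: the kind of small function the
    // thread expects to be inlined at -O2/-O3 without any tagging.
    FORCE_INLINE Vec2(float x_, float y_) : x(x_), y(y_) {}
    FORCE_INLINE float get_x() const { return x; }
    FORCE_INLINE float get_y() const { return y; }
};

// Hot helper: forced inline into every caller.
static FORCE_INLINE float dot(const Vec2& a, const Vec2& b)
{
    return a.get_x() * b.get_x() + a.get_y() * b.get_y();
}

// Cold path: kept out of line so it does not bloat its callers.
static NO_INLINE float length_squared(const Vec2& v)
{
    return dot(v, v);
}
```

The drawback raised in the message still applies: once some functions are tagged this way, the untagged ones are left to the same heuristic, so the tagging tends to spread.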
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote: > So, I tried to reproduce the slowdown and on i686 get all > memcpy/memset inlined on 3.3, 3.4, 4.0 and 4.1. On ppc I get calls to > memcpy/memset in all cases. This might be more a glibc issue I think. So my suggestion is to file a bugzilla PR about this problem including _preprocessed_ source to get glibc dependencies out of the way. Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 12 Mar 2006 18:09:26 +0100, Gabriel Dos Reis <[EMAIL PROTECTED]> wrote: > "Richard Guenther" <[EMAIL PROTECTED]> writes: > > [...] > > | this one should be measured. But note that the benchmark is a > | no-op and can be validly optimizes to int main() { return 0; } by the > | compiler. This is why I call it a stupid benchmark. > > please let's refrain from getting into that back hole. > > Different people measure different things that they perceive important > for them. I doubt that the "optimization to int main() { return 0; }" > would be useful to everybody. > > | Also you are measuring exclusively cache performance. > > that may be a decisive criteria under given circumstances; it takes > more justification to qualify it as "stupid benchmark". We can either > acknowledge "oops, we fumbled that case; but we are not going to fix > it" or "well, we should not have done that; it should be fixed". > But handwaving with "stupid" qualification is not helpful. So, I tried to reproduce the slowdown and on i686 get all memcpy/memset inlined on 3.3, 3.4, 4.0 and 4.1. On ppc I get calls to memcpy/memset in all cases. This might be more a glibc issue I think. Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
"Richard Guenther" <[EMAIL PROTECTED]> writes: [...] | this one should be measured. But note that the benchmark is a | no-op and can be validly optimizes to int main() { return 0; } by the | compiler. This is why I call it a stupid benchmark. please let's refrain from getting into that back hole. Different people measure different things that they perceive important for them. I doubt that the "optimization to int main() { return 0; }" would be useful to everybody. | Also you are measuring exclusively cache performance. that may be a decisive criteria under given circumstances; it takes more justification to qualify it as "stupid benchmark". We can either acknowledge "oops, we fumbled that case; but we are not going to fix it" or "well, we should not have done that; it should be fixed". But handwaving with "stupid" qualification is not helpful. -- Gaby
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote: > On Sun, 2006-03-12 at 15:17 +0100, Richard Guenther wrote: > > On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote: > > > > In any case: memcpy/memset inlining is broken in current GCC at least > > > > on athlon arch. > > > > let's say it changed. Also memcpy/memset "inlining" is not regular inlining > > but driven by completely different heuristics. > > > > > Yes, why is the benchmark not valid? > > > Then we would appreciate if the developers could recommend a valid test. > > > > What is the benchmark supposed to measure? > > The following is from the website mentioned previously: > = > > What does it benchmark? I asked about the specific benchmark, I guess > Bashmark is testing the things that most applications need. It is trying > to show you how well your hardware works together. > Currently the things which are being tested are: > -Calculations with types of different range > -Calculations with floating point types of different range > -Read and write into the memory with different size. this one should be measured. But note that the benchmark is a no-op and can be validly optimizes to int main() { return 0; } by the compiler. This is why I call it a stupid benchmark. Also you are measuring exclusively cache performance. Richard.
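The "no-op" objection above can be illustrated with a variant of the loop whose result the program actually depends on. This is a sketch under assumptions, not bashmark's code: `membench_checksum` is a hypothetical helper, and summing the destination buffer is just one simple way to make the work observable.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: runs `loops` rounds of memcpy/memset over two
// caller-provided buffers of `block_size` bytes each, then returns the
// byte sum of the destination so the copies feed an observable result.
std::uint32_t membench_checksum(std::uint8_t* dst, std::uint8_t* src,
                                std::size_t block_size, std::uint32_t loops)
{
    for (std::uint32_t i = 0; i < loops; ++i) {
        std::memcpy(dst, src, block_size);
        // Fill the source with the low byte of the loop counter.
        std::memset(src, static_cast<int>(i & 0xffu), block_size);
    }
    std::uint32_t sum = 0;
    for (std::size_t k = 0; k < block_size; ++k)
        sum += dst[k];
    return sum;
}
```

A caller would print the checksum or return it from `main`. This does not prevent the compiler from simplifying the loop; it only means the result can no longer be discarded outright, and the point about measuring mostly cache performance is unaffected.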
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On Sun, 2006-03-12 at 15:17 +0100, Richard Guenther wrote:
> On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote:
> > > In any case: memcpy/memset inlining is broken in current GCC at least
> > > on athlon arch.
>
> let's say it changed. Also memcpy/memset "inlining" is not regular inlining
> but driven by completely different heuristics.
>
> > Yes, why is the benchmark not valid?
> > Then we would appreciate if the developers could recommend a valid test.
>
> What is the benchmark supposed to measure?

The following is from the website mentioned previously:
=
What does it benchmark?
Bashmark is testing the things that most applications need. It is trying
to show you how well your hardware works together.
Currently the things which are being tested are:
-Calculations with types of different range
-Calculations with floating point types of different range
-Read and write into the memory with different size.
-Calling your system for memory and give it free with different size.
-Tryout the speed of one main part of multithreading
=

Ernesto

>
> Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Steven Bosscher <[EMAIL PROTECTED]> wrote:
>
> It is valid. We should understand why this behavior has changed so
> drastically.
>

I've attached assembler output from different compiler versions:

3.4.5-athlon-xp: gcc-3.4.5 -O3 -march=athlon-xp
3.4.5-pentium4: gcc-3.4.5 -O3 -march=pentium4
4.1.0-athlon-xp: gcc-4.1.0 -O3 -march=athlon-xp

As you can see, gcc-3.4.5 generates fastest code for "-march=athlon-xp".
This code should also run faster on any pentium machine.

gcc-4.1.0 generates "same" slow code for "pentium" and "athlon" arch.

--
Nickolay

test_cmd-3.4.5-athlon-xp.s
test_cmd-3.4.5-pentium4.s
test_cmd-4.1.0-athlon-xp.s
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Ernest L. Williams Jr. <[EMAIL PROTECTED]> wrote: > > In any case: memcpy/memset inlining is broken in current GCC at least > > on athlon arch. let's say it changed. Also memcpy/memset "inlining" is not regular inlining but driven by completely different heuristics. > Yes, why is the benchmark not valid? > Then we would appreciate if the developers could recommend a valid test. What is the benchmark supposed to measure? Richard.
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
> Yes, why is the benchmark not valid? It is valid. We should understand why this behavior has changed so drastically. Gr. Steven
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On Sun, 2006-03-12 at 16:55 +0300, Nickolay Kolchin wrote:
> On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > > During "bashmark" memory benchmark perfomance analyze, I found 100x perfomance
> > > regression between gcc 3.4.5 and gcc 4.X.
> > >
> > > -- test_cmd.cpp (simplified bashmark memory RW test) ---
> > > #include
> > > #include
> > >
> > > template
> > > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > > {
> > >     for(uint32_t i = 0; i < Loops; i+=1)
> > >     {
> > > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > >         T T T T T
> > >         T T T T T
> > > #undef T
> > >     }
> > > }
> > >
> > > template
> > > static void membench()
> > > {
> > >     static uint8_t mb1[Buf_Size];
> > >     static uint8_t mb2[Buf_Size];
> > >     for(uint32_t i = 0; i < 1; i+=1)
> > >         int_membench(mb1, mb2);
> > > }
> > >
> > > int main()
> > > {
> > >     membench<128, 4000>();
> > >     return 0;
> > > }
> > >
> > > ---
> > > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> > >
> > > Compiler options:
> > > -march=athlon-xp
> > > -O3
> > > -fomit-frame-pointer
> > > -mfpmath=sse -msse
> > > -ftracer -fweb
> > > -maccumulate-outgoing-args
> > > -ffast-math
> > >
> > > I've played with various settings (-O2, -O1, without march, without tracer and
> > > web, etc) without any serious difference. I.e. GCC4 is always many times slower
> > > than GCC 3.4.5.
> > >
> > > Lurking inside assembler generation showed that GCC4 don't inline memcpy and
> > > memset calls.
> > >
> > > -- test.c (uber simplified problem demonstration) -
> > > #include
> > >
> > > char* f(char* b)
> > > {
> > >     static char a[64];
> > >     memcpy(a, b, 64);
> > >     memset(a, 0, 64);
> > >     return a;
> > > }
> > >
> > >
> > > GCC4 will generate calls to memcpy and memset in this example. GCC3 will inline
> > > all calls.
> > >
> > > So, it looks like GCC4 inliner is broken at some point.
> >
> > Inlining of memcpy/memset is architecture dependent (I see calls
> > on ppc for gcc 3.4, too). This is a stupid benchmark and as such
> > not worth optimizing for.
> >
>
> bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
> just a test to demonstrate problem and as such can't be stupid. :)
>
> Situation when compiler generates code from simple test that run 100
> times slower, than code from previous compiler version is not normal
> anyway. (and GCC3 generates smaller code, too)
>
> I thought that this regression was caused by different "max-inline-*"
> params setting in 4.X.
>
> In any case: memcpy/memset inlining is broken in current GCC at least
> on athlon arch.

Yes, why is the benchmark not valid?
Then we would appreciate if the developers could recommend a valid test.

Here is what I get on my platform:
==
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)
Architecture = i686
OS: Linux
Kernel: 2.6.15-1.1833_FC4

[EMAIL PROTECTED] src]$ time ./test_cmd

real    0m50.583s
user    0m50.003s
sys     0m0.220s
===

Thanks,
Ernesto

>
> --
> Nickolay
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > During "bashmark" memory benchmark perfomance analyze, I found 100x perfomance
> > regression between gcc 3.4.5 and gcc 4.X.
> >
> > -- test_cmd.cpp (simplified bashmark memory RW test) ---
> > #include
> > #include
> >
> > template
> > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > {
> >     for(uint32_t i = 0; i < Loops; i+=1)
> >     {
> > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> >         T T T T T
> >         T T T T T
> > #undef T
> >     }
> > }
> >
> > template
> > static void membench()
> > {
> >     static uint8_t mb1[Buf_Size];
> >     static uint8_t mb2[Buf_Size];
> >     for(uint32_t i = 0; i < 1; i+=1)
> >         int_membench(mb1, mb2);
> > }
> >
> > int main()
> > {
> >     membench<128, 4000>();
> >     return 0;
> > }
> >
> > ---
> > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> >
> > Compiler options:
> > -march=athlon-xp
> > -O3
> > -fomit-frame-pointer
> > -mfpmath=sse -msse
> > -ftracer -fweb
> > -maccumulate-outgoing-args
> > -ffast-math
> >
> > I've played with various settings (-O2, -O1, without march, without tracer and
> > web, etc) without any serious difference. I.e. GCC4 is always many times slower
> > than GCC 3.4.5.
> >
> > Lurking inside assembler generation showed that GCC4 don't inline memcpy and
> > memset calls.
> >
> > -- test.c (uber simplified problem demonstration) -
> > #include
> >
> > char* f(char* b)
> > {
> >     static char a[64];
> >     memcpy(a, b, 64);
> >     memset(a, 0, 64);
> >     return a;
> > }
> >
> >
> > GCC4 will generate calls to memcpy and memset in this example. GCC3 will inline
> > all calls.
> >
> > So, it looks like GCC4 inliner is broken at some point.
>
> Inlining of memcpy/memset is architecture dependent (I see calls
> on ppc for gcc 3.4, too). This is a stupid benchmark and as such
> not worth optimizing for.
>

bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
just a test to demonstrate problem and as such can't be stupid. :)

Situation when compiler generates code from simple test that run 100
times slower, than code from previous compiler version is not normal
anyway. (and GCC3 generates smaller code, too)

I thought that this regression was caused by different "max-inline-*"
params setting in 4.X.

In any case: memcpy/memset inlining is broken in current GCC at least
on athlon arch.

--
Nickolay
Re: 100x perfomance regression between gcc 3.4.5 and gcc 4.X
On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> During "bashmark" memory benchmark perfomance analyze, I found 100x perfomance
> regression between gcc 3.4.5 and gcc 4.X.
>
> -- test_cmd.cpp (simplified bashmark memory RW test) ---
> #include
> #include
>
> template
> static void int_membench(uint8_t* mb1, uint8_t* mb2)
> {
>     for(uint32_t i = 0; i < Loops; i+=1)
>     {
> #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
>         T T T T T
>         T T T T T
> #undef T
>     }
> }
>
> template
> static void membench()
> {
>     static uint8_t mb1[Buf_Size];
>     static uint8_t mb2[Buf_Size];
>     for(uint32_t i = 0; i < 1; i+=1)
>         int_membench(mb1, mb2);
> }
>
> int main()
> {
>     membench<128, 4000>();
>     return 0;
> }
>
> ---
> GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
>
> Compiler options:
> -march=athlon-xp
> -O3
> -fomit-frame-pointer
> -mfpmath=sse -msse
> -ftracer -fweb
> -maccumulate-outgoing-args
> -ffast-math
>
> I've played with various settings (-O2, -O1, without march, without tracer and
> web, etc) without any serious difference. I.e. GCC4 is always many times slower
> than GCC 3.4.5.
>
> Lurking inside assembler generation showed that GCC4 don't inline memcpy and
> memset calls.
>
> -- test.c (uber simplified problem demonstration) -
> #include
>
> char* f(char* b)
> {
>     static char a[64];
>     memcpy(a, b, 64);
>     memset(a, 0, 64);
>     return a;
> }
>
>
> GCC4 will generate calls to memcpy and memset in this example. GCC3 will inline
> all calls.
>
> So, it looks like GCC4 inliner is broken at some point.

Inlining of memcpy/memset is architecture dependent (I see calls
on ppc for gcc 3.4, too). This is a stupid benchmark and as such
not worth optimizing for.

Richard.
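For anyone wanting to repeat the small test quoted above, a self-contained variant might look like the sketch below. It is an approximation, not the original file: the archive dropped the header name from the quoted `#include` line, so `<cstring>` is an assumption, and `copy_then_clear` is a hypothetical name standing in for `f`.

```cpp
#include <cstring>

// Same shape as the quoted test.c: a fixed-size memcpy followed by a
// fixed-size memset on a static buffer. The header and the function
// name are assumptions made for this sketch.
char* copy_then_clear(const char* b)
{
    static char a[64];
    std::memcpy(a, b, sizeof a);   // constant size: candidate for inline expansion
    std::memset(a, 0, sizeof a);   // leaves the buffer zeroed
    return a;
}
```

Compiling this with `g++ -O2 -S` and looking for `call` instructions that reference memcpy or memset shows what a given compiler version and `-march` setting chose to do. As the reply notes, the answer depends on the target.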