On Sun, 2006-03-12 at 16:55 +0300, Nickolay Kolchin wrote:
> On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > > While analyzing the performance of the "bashmark" memory benchmark, I
> > > found a 100x performance regression between gcc 3.4.5 and gcc 4.X.
> > >
> > > ------ test_cmd.cpp (simplified bashmark memory RW test) -------
> > > #include <stdint.h>
> > > #include <cstring>
> > >
> > > template <const uint8_t Block_Size, const uint32_t Loops>
> > > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > > {
> > >   for(uint32_t i = 0; i < Loops; i+=1)
> > >   {
> > > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > >     T T T T T
> > >     T T T T T
> > > #undef T
> > >   }
> > > }
> > >
> > > template <const uint32_t Buf_Size, const uint32_t Loops>
> > > static void membench()
> > > {
> > >   static uint8_t mb1[Buf_Size];
> > >   static uint8_t mb2[Buf_Size];
> > >   for(uint32_t i = 0; i < 10000; i+=1)
> > >     int_membench<Buf_Size, Loops>(mb1, mb2);
> > > }
> > >
> > > int main()
> > > {
> > >   membench<128, 4000>();
> > >   return 0;
> > > }
> > >
> > > ---------------------------------------------------------------
> > > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> > >
> > > Compiler options:
> > >     -march=athlon-xp
> > >     -O3
> > >     -fomit-frame-pointer
> > >     -mfpmath=sse -msse
> > >     -ftracer -fweb
> > >     -maccumulate-outgoing-args
> > >     -ffast-math
> > >
> > > I've played with various settings (-O2, -O1, without -march, without
> > > -ftracer and -fweb, etc.) without any serious difference. I.e. GCC4 is
> > > always many times slower than GCC 3.4.5.
> > >
> > > Looking at the generated assembly showed that GCC4 doesn't inline the
> > > memcpy and memset calls.
> > >
> > > ------ test.c (uber simplified problem demonstration) ---------
> > > #include <string.h>
> > >
> > > char* f(char* b)
> > > {
> > >   static char a[64];
> > >   memcpy(a, b, 64);
> > >   memset(a, 0, 64);
> > >   return a;
> > > }
> > > ----------------------------------------------------------------
> > >
> > > GCC4 will generate calls to memcpy and memset in this example. GCC3
> > > will inline all calls.
> > >
> > > So, it looks like the GCC4 inliner is broken at some point.
> >
> > Inlining of memcpy/memset is architecture dependent (I see calls
> > on ppc for gcc 3.4, too).  This is a stupid benchmark and as such
> > not worth optimizing for.
> >
> 
> bashmark (http://bashmark.coders-net.de/ ) is a benchmark. My code is
> just a test to demonstrate the problem and as such can't be stupid. :)
> 
> A compiler generating code for a simple test that runs 100 times
> slower than the code from the previous compiler version is not normal
> anyway.  (And GCC3 generates smaller code, too.)
> 
> I thought that this regression was caused by different "max-inline-*"
> params setting in 4.X.
> 
> In any case: memcpy/memset inlining is broken in current GCC, at least
> on the athlon arch.

Yes, why is the benchmark not valid?
If it really is not, we would appreciate it if the developers could
recommend a valid test.

Here is what I get on my platform:
======================================================================
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)
Architecture = i686
OS: Linux 
Kernel: 2.6.15-1.1833_FC4

[EMAIL PROTECTED] src]$ time ./test_cmd

real    0m50.583s
user    0m50.003s
sys     0m0.220s
=======================================================================
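
To compare compiler versions without depending on the shell's `time`, the
inner loop of the test can time itself. The following is only a rough
sketch, not code from bashmark: `time_membench` and `checksum` are
hypothetical names, and it assumes a compiler with C++11 <chrono> support.
The checksum is there so the buffer contents are actually used and the
result can be sanity-checked.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of bashmark): runs the memcpy/memset pair
// `loops` times on two Block_Size-byte buffers and returns elapsed seconds.
template <std::size_t Block_Size>
double time_membench(std::uint8_t* mb1, std::uint8_t* mb2, std::uint32_t loops)
{
  assert(mb1 != nullptr && mb2 != nullptr);
  const auto start = std::chrono::steady_clock::now();
  for (std::uint32_t i = 0; i < loops; ++i)
  {
    // Constant size known at compile time, as in the original test.
    std::memcpy(mb1, mb2, Block_Size);
    std::memset(mb2, static_cast<int>(i & 0xff), Block_Size);
  }
  const auto stop = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(stop - start).count();
}

// Hypothetical helper: sums the bytes of a buffer so that the caller
// consumes the data written by the loop above.
std::uint32_t checksum(const std::uint8_t* buf, std::size_t n)
{
  std::uint32_t sum = 0;
  for (std::size_t i = 0; i < n; ++i)
    sum += buf[i];
  return sum;
}
```

Calling `time_membench<128>(mb1, mb2, 4000)` from main() with two 128-byte
buffers mirrors the block size and loop count of test_cmd.cpp. Compiling
with -S and searching the assembly for calls to memcpy/memset shows
whether the calls were expanded inline.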

Thanks,
Ernesto


> 
> --
> Nickolay
