On Sun, 2006-03-12 at 16:55 +0300, Nickolay Kolchin wrote:
> On 3/12/06, Richard Guenther <[EMAIL PROTECTED]> wrote:
> > On 3/12/06, Nickolay Kolchin <[EMAIL PROTECTED]> wrote:
> > > While analyzing the performance of the "bashmark" memory benchmark, I
> > > found a 100x performance regression between gcc 3.4.5 and gcc 4.X.
> > >
> > > ------ test_cmd.cpp (simplified bashmark memory RW test) -------
> > > #include <stdint.h>
> > > #include <cstring>
> > >
> > > template <const uint8_t Block_Size, const uint32_t Loops>
> > > static void int_membench(uint8_t* mb1, uint8_t* mb2)
> > > {
> > >     for(uint32_t i = 0; i < Loops; i+=1)
> > >     {
> > > #define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
> > >         T T T T T
> > >         T T T T T
> > > #undef T
> > >     }
> > > }
> > >
> > > template <const uint32_t Buf_Size, const uint32_t Loops>
> > > static void membench()
> > > {
> > >     static uint8_t mb1[Buf_Size];
> > >     static uint8_t mb2[Buf_Size];
> > >     for(uint32_t i = 0; i < 10000; i+=1)
> > >         int_membench<Buf_Size, Loops>(mb1, mb2);
> > > }
> > >
> > > int main()
> > > {
> > >     membench<128, 4000>();
> > >     return 0;
> > > }
> > >
> > > ---------------------------------------------------------------
> > > GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
> > > GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
> > > GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed
> > >
> > > Compiler options:
> > >   -march=athlon-xp
> > >   -O3
> > >   -fomit-frame-pointer
> > >   -mfpmath=sse -msse
> > >   -ftracer -fweb
> > >   -maccumulate-outgoing-args
> > >   -ffast-math
> > >
> > > I've played with various settings (-O2, -O1, without march, without
> > > tracer and web, etc.) without any serious difference. I.e. GCC4 is
> > > always many times slower than GCC 3.4.5.
> > >
> > > Looking at the generated assembly showed that GCC4 doesn't inline the
> > > memcpy and memset calls.
> > >
> > > ------ test.c (uber simplified problem demonstration) ---------
> > > #include <string.h>
> > >
> > > char* f(char* b)
> > > {
> > >     static char a[64];
> > >     memcpy(a, b, 64);
> > >     memset(a, 0, 64);
> > >     return a;
> > > }
> > > ----------------------------------------------------------------
> > >
> > > GCC4 will generate calls to memcpy and memset in this example. GCC3
> > > will inline all calls.
> > >
> > > So it looks like the GCC4 inliner is broken at some point.
> >
> > Inlining of memcpy/memset is architecture dependent (I see calls
> > on ppc for gcc 3.4, too). This is a stupid benchmark and as such
> > not worth optimizing for.
> >
>
> bashmark (http://bashmark.coders-net.de/) is a benchmark. My code is
> just a test to demonstrate the problem and as such can't be stupid. :)
>
> A situation where the compiler generates code from a simple test that
> runs 100 times slower than the code from the previous compiler version
> is not normal anyway. (And GCC3 generates smaller code, too.)
>
> I thought that this regression was caused by different "max-inline-*"
> param settings in 4.X.
>
> In any case: memcpy/memset inlining is broken in current GCC, at least
> on the athlon arch.

Yes, why is the benchmark not valid? If it is not, we would appreciate
it if the developers could recommend a valid test. Here is what I get
on my platform:

======================================================================
gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)
Architecture = i686
OS: Linux
Kernel: 2.6.15-1.1833_FC4

[EMAIL PROTECTED] src]$ time ./test_cmd

real    0m50.583s
user    0m50.003s
sys     0m0.220s
=======================================================================

Thanks,

Ernesto

>
> --
> Nickolay
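
For anyone who wants to time the same copy/fill pattern without the template machinery, here is a minimal sketch in plain C. It is not part of bashmark: the function name time_copy_fill and the BLOCK_SIZE constant are made up for illustration, and the 128-byte block is only assumed from the membench<128, 4000> call in the test above. Whether the compiler expands these fixed-size calls inline or emits library calls still depends on the compiler version, target and flags, so the numbers are only comparable between builds of the same source.

```c
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical block size, assumed from membench<128, 4000> in the test. */
enum { BLOCK_SIZE = 128 };

/* Runs `loops` rounds of a fixed-size memcpy followed by a memset,
   the same pair of calls the T macro in test_cmd.cpp repeats.
   Returns the CPU time spent in the loop, in seconds. */
double time_copy_fill(uint8_t *dst, uint8_t *src, uint32_t loops)
{
    clock_t start = clock();
    for (uint32_t i = 0; i < loops; i++) {
        memcpy(dst, src, BLOCK_SIZE);             /* constant-size copy */
        memset(src, (int)(i & 0xff), BLOCK_SIZE); /* constant-size fill */
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling this from main() with two 128-byte buffers and a large loop count, then building the same file with each compiler under test, should give a rough per-compiler comparison. Reading the buffers afterwards (for example, printing dst[0]) keeps the copies observable.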