While analyzing the performance of the "bashmark" memory benchmark, I found a
roughly 100x performance regression between GCC 3.4.5 and GCC 4.x.

------ test_cmd.cpp (simplified bashmark memory RW test) -------
#include <stdint.h>
#include <cstring>

template <const uint8_t Block_Size, const uint32_t Loops>
static void int_membench(uint8_t* mb1, uint8_t* mb2)
{
  for(uint32_t i = 0; i < Loops; i+=1)
  {
#define T memcpy(mb1, mb2, Block_Size); memset(mb2, i, Block_Size);
    T T T T T
    T T T T T
#undef T
  }
}

template <const uint32_t Buf_Size, const uint32_t Loops>
static void membench()
{
  static uint8_t mb1[Buf_Size];
  static uint8_t mb2[Buf_Size];
  for(uint32_t i = 0; i < 10000; i+=1)
    int_membench<Buf_Size, Loops>(mb1, mb2);
}

int main()
{
  membench<128, 4000>();
  return 0;
}

---------------------------------------------------------------
GCC 3.4.5: 0.43user 0.00system 0:00.44elapsed
GCC 4.0.2: 34.83user 0.68system 0:36.09elapsed
GCC 4.1.0: 33.86user 0.58system 0:34.96elapsed

Compiler options:
    -march=athlon-xp
    -O3
    -fomit-frame-pointer
    -mfpmath=sse -msse
    -ftracer -fweb
    -maccumulate-outgoing-args
    -ffast-math

I've tried various other settings (-O2, -O1, no -march, no -ftracer/-fweb,
etc.) without any significant difference: GCC 4 is always many times slower
than GCC 3.4.5.

Inspecting the generated assembly shows that GCC 4 doesn't inline the memcpy
and memset calls.

------ test.c (uber simplified problem demonstration) ---------
#include <string.h>

char* f(char* b)
{
  static char a[64];
  memcpy(a, b, 64);
  memset(a, 0, 64);
  return a;
}
----------------------------------------------------------------

For this example GCC 4 generates calls to memcpy and memset, while GCC 3
expands both inline.
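
One possible way to narrow this down (a sketch only, not something I have
measured): call GCC's builtins directly, so that the <string.h> declarations
are out of the picture. The name f_builtin is made up for illustration. Note
that GCC is still free to emit a library call for __builtin_memcpy and
__builtin_memset, so if the assembly for this variant also contains calls, the
problem would seem to be in the builtin expansion itself rather than in the
headers.

```c
/* Hypothetical variant of f() from test.c: uses the GCC builtins
   explicitly instead of the <string.h> functions. */
char* f_builtin(const char* b)
{
  static char a[64];
  __builtin_memcpy(a, b, 64);  /* copy 64 bytes from b into a */
  __builtin_memset(a, 0, 64);  /* then overwrite a with zeros */
  return a;
}
```

Compiling this with -S using the same options as above and looking for calls
to memcpy/memset in the output should show whether the builtins get expanded.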

So it looks like inlining of these calls broke at some point in GCC 4.
