------- Comment #20 from uros at kss-loka dot si  2006-06-26 06:31 -------
(In reply to comment #15)
> Can someone tell me if anyone is looking into this problem with the hopes of
> fixing it?  I just noticed that despite the posted code demonstrating the
> problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
> Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
> to look at it . . .

Hm, I tried your single testcase (SSE) on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

And the results are a bit surprising (this is the exact output of your test):

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -DTYPE=float -c mmbench.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c sgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x single performance:"
GCC 3.x single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     60   1000       0.141     3072.00

echo "GCC 4.x single performance:"
GCC 4.x single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========
  atlasmm     60   1000       0.141     3072.00

where "gcc (GCC) 3.4.6" was tested against "gcc version 4.2.0 20060608
(experimental)".

FYI: there is another pathological testcase (PR target/19780), where SSE code
is 30% slower on AMD64, despite the fact that for SSE, 16 xmm registers were
available and _no_ memory was accessed in a for loop.

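As a quick sanity check on the MFLOPS column in the output above, here is a minimal sketch of how such a figure can be derived from NB, REPS and TIME. It assumes the benchmark counts 2*NB^3 floating-point operations per NB x NB matrix multiply; that is an assumption on my part, not something read from mmbench.c, although it is consistent with the numbers reported. The function names are hypothetical.

```python
def matmul_mflops(nb: int, reps: int, seconds: float) -> float:
    """MFLOPS for `reps` multiplies of NB x NB matrices.

    Assumption: each multiply costs 2 * NB**3 flops (one multiply and
    one add per inner-loop step). This is not taken from mmbench.c.
    """
    flops = 2.0 * nb**3 * reps
    return flops / seconds / 1e6


def implied_seconds(nb: int, reps: int, mflops: float) -> float:
    """Run time implied by a reported MFLOPS figure, under the same assumption."""
    return 2.0 * nb**3 * reps / (mflops * 1e6)


if __name__ == "__main__":
    # Values from the table: NB=60, REPS=1000, MFLOPS=3072.00
    print(f"{implied_seconds(60, 1000, 3072.00):.6f}")  # 0.140625
    print(f"{matmul_mflops(60, 1000, 0.140625):.2f}")   # 3072.00
```

Under that assumption the implied run time is 0.140625 s, which rounds to the 0.141 shown in the TIME column. Since both binaries report the same figures, any difference between them is below what this timer resolves on this machine.
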
> The reason I ask is that I am preparing the next stable release of ATLAS, and
> I'm getting close to having to make a decision on what compilers I will
> support.
> If someone is working feverishly in the background, I will be sure to wait
> for it, in the hopes that there'll be a fix that will allow me to use
> gcc 4, which I think will be what most of my users want.  If this problem
> is not being looked into, I should not delay the ATLAS release for it, and
> just require my users to install gcc 3 in order to get decent performance.
>
> I realize you guys are busy, and fp performance is probably not your main
> concern, so hopefully this message sounds more like a request for info on what
> is going on, than a bitch about help that I'm getting for free :)

Without any other information available, I can only speculate that the gcc4
code perhaps does not fully utilize the multiple FP pipelines in the processors
you listed.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827