------- Comment #20 from uros at kss-loka dot si  2006-06-26 06:31 -------
(In reply to comment #15)

> Can someone tell me if anyone is looking into this problem with the hopes of
> fixing it?  I just noticed that despite the posted code demonstrating the
> problem, and verification on: Pentium Pro, Pentium III, Pentium 4e, Pentium-D,
> Athlon-64 X2 and Opteron, it is still marked as "new", and no one is assigned
> to look at it  . . .

Hm, I tried your single-precision testcase (SSE) on:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Pentium(R) 4 CPU 3.20GHz
stepping        : 9
cpu MHz         : 3191.917
cache size      : 512 KB

And the results are a bit surprising (this is the exact output of your test):

/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -DTYPE=float -c mmbench.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -c sgemm_atlas.c
/usr/local.uros/gcc34/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2
-mfpmath=sse -o xsmm_gcc mmbench.o sgemm_atlas.o
rm -f *.o
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse
-DTYPE=float -c mmbench.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -c
sgemm_atlas.c
/usr/bin/gcc -DREPS=1000 -fomit-frame-pointer -O -msse2 -mfpmath=sse -o
xsmm_gc4 mmbench.o sgemm_atlas.o
rm -f *.o
echo "GCC 3.x     single performance:"
GCC 3.x     single performance:
./xsmm_gcc
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

echo "GCC 4.x     single performance:"
GCC 4.x     single performance:
./xsmm_gc4
ALGORITHM     NB   REPS        TIME      MFLOPS
=========  =====  =====  ==========  ==========

atlasmm       60   1000       0.141     3072.00

where:

"gcc (GCC) 3.4.6" was tested against "gcc version 4.2.0 20060608
(experimental)"
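
As a sanity check on the MFLOPS column above, here is a minimal sketch of how
such a figure can be derived from NB, REPS and TIME. It assumes the benchmark
counts 2*NB^3 floating-point operations per NB x NB matrix multiply; that is
the usual convention for matrix multiply, but I have not verified that
mmbench.c does it this way. The function name is hypothetical and is not taken
from the testcase.

```c
/* Hypothetical helper, not part of mmbench.c.
 * Assumption: one NB x NB matrix multiply costs 2*NB^3 flops
 * (NB^3 multiplies plus NB^3 adds). */
double mflops_estimate(int nb, int reps, double seconds)
{
    /* Total floating-point operations over all repetitions. */
    double flops = 2.0 * nb * nb * nb * reps;

    /* Convert operations per second to millions of operations per second. */
    return flops / seconds / 1.0e6;
}
```

Under that assumption, NB=60 and REPS=1000 with the printed TIME of 0.141 give
roughly 3064 MFLOPS, while a time of 0.140625 s gives 3072. So the printed
TIME is presumably rounded, and both binaries needed the same measured time.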

FYI: there is another pathological testcase (PR target/19780), where SSE code
is 30% slower on AMD64, even though 16 xmm registers were available to the SSE
code and _no_ memory was accessed inside the for loop.

> The reason I ask is that I am preparing the next stable release of ATLAS, and
> I'm getting close to having to make a decision on what compilers I will
> support.
> If someone is working feverishly in the background, I will be sure to wait
> for it, in the hopes that there'll be a fix that will allow me to use
> gcc 4, which I think will be what most of my users want.  If this problem
> is not being looked into, I should not delay the ATLAS release for it, and
> just require my users to install gcc 3 in order to get decent performance.
> 
> I realize you guys are busy, and fp performance is probably not your main
> concern, so hopefully this message sounds more like a request for info on what
> is going on, than a bitch about help that I'm getting for free :)  

Without any other information available, I can only speculate that the code
generated by gcc4 perhaps does not fully utilize the multiple FP pipelines in
the processors you listed.
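
To illustrate what I mean by utilizing multiple FP pipelines, here is a generic
sketch. It is not the ATLAS kernel and it is not a diagnosis of what gcc4
actually generates. Both function names are made up for this example. The first
loop carries a single dependency chain through one accumulator; the second
keeps four independent accumulators, so that consecutive adds do not have to
wait for each other.

```c
#include <stddef.h>

/* One accumulator: every add depends on the result of the previous add. */
float dot_serial(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Four independent accumulators: the four adds in one iteration do not
 * depend on each other, which leaves room for overlapped execution. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    /* Handle the remaining 0..3 elements. */
    for (; i < n; i++)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add in a different order, so for general float
input the results may differ in the last bits. Whether the second form is
actually faster depends on the processor and on what the compiler does with
it.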


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
