------- Comment #36 from whaley at cs dot utsa dot edu  2006-08-06 15:03 -------
Paola,

Thanks for working on this.  We are making progress, but I have some mixed
results.  I timed the assemblies you provided directly.  I added a target
"asgexe" that builds the same benchmark from assembly source instead of C,
to make this more reproducible.  I ran on the Athlon-64X2, where your new
assembly ran *faster* than gcc 3 for double precision.  However, you still lost
for single precision.  I believe the reason is that you still have more
fmuls/fmull (fmul from memory) than does gcc 3:

>animal>fgrep -i fmuls smm_4.s | wc
>    240     480    4051
>animal>fgrep -i fmuls smm_asg.s | wc
>     60     120    1020
>animal>fgrep -i fmuls smm_3.s  | wc
>      0       0       0
>animal>fgrep -i fmull dmm_4.s | wc
>    100     200    1739
>animal>fgrep -i fmull dmm_asg.s | wc
>     20      40     360
>animal>fgrep -i fmuls dmm_3.s | wc
>      0       0       0
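
The counts above can also be reproduced without fgrep.  What follows is a
minimal Python sketch, not part of the benchmark: the helper names
count_lines_with and count_in_file are hypothetical, and it assumes that the
first column of wc (the line count) is the number of interest.

```python
def count_lines_with(asm_text, mnemonic):
    """Count lines containing `mnemonic`, ignoring case.

    This mirrors `fgrep -i MNEMONIC file | wc -l`: it is a plain substring
    match, so a mnemonic that appears in a comment is counted as well.
    """
    needle = mnemonic.lower()
    return sum(1 for line in asm_text.splitlines() if needle in line.lower())


def count_in_file(path, mnemonic):
    """Same count, reading the assembly from a file (hypothetical helper)."""
    with open(path, "r", errors="replace") as handle:
        return count_lines_with(handle.read(), mnemonic)


if __name__ == "__main__":
    # Tiny made-up fragment, only to show the call; not taken from smm_4.s.
    sample = "flds (%eax)\nfmuls (%ebx)\nFMULS 4(%ebx)\nfmull (%ecx)\n"
    print(count_lines_with(sample, "fmuls"))
```

Calling count_in_file("smm_4.s", "fmuls") on the attached assembly should
give the same number as the first column of the wc output quoted above.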


I haven't really scoped out the dmm diff, but in single precision anyway, these
dreaded fmuls are in the inner loop, and this is probably why you are still
losing.  I'm guessing your peephole is missing some cases, and for some reason
is missing more under single.  Any ideas?

As for your assembly actually beating gcc 3 for double, my guess is that it is
some other optimization that gcc 4 has, and you will beat by even more once the
final fmull are removed . . .

On the P4e, your double precision code is faster than stock gcc 4, but still
slower than gcc 3.  Again, I suspect the remaining fmull.  Then comes the thing
I cannot explain at all.  Your single precision results are horrible.  gcc 3
gets 1991 MFLOPS, gcc 4 gets 1664, and the assembly you sent gets 34!  No chance
the mixed fld/fmuls is causing stack overflow, I guess?  I think this might
account for such a catastrophic drop  . . .  That's about the only WAG I've got
for this behavior.
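
One way to test the stack overflow guess mechanically is to track the x87
register stack depth through the inner loop and see whether it ever exceeds
the eight available entries.  The sketch below is only a rough check under
stated assumptions: it reads AT&T-syntax assembly, follows straight-line code
only (no branches or loop iterations), and knows a small, assumed subset of
mnemonics.  The name stack_depths and the mnemonic tables are hypothetical
and are not taken from gcc or from the benchmark.

```python
import re

X87_REGISTERS = 8  # the x87 register stack holds eight entries

# Assumed subset of AT&T-syntax mnemonics.  Anything not listed here is
# treated as leaving the depth unchanged (e.g. fmuls/fmull from memory).
PUSHES = re.compile(r"^(fld[slt]?|fild(s|l|ll|q)?|fldz|fld1)$")
POPS = re.compile(r"^(fstp[slt]?|fistp(s|l|ll|q)?|f(add|mul|sub|subr|div|divr)p)$")
LABEL = re.compile(r"^\s*[\w.$]+:\s*")


def stack_depths(asm_text, start=0):
    """Return (max_depth, final_depth) for straight-line x87 code."""
    depth = max_depth = start
    for line in asm_text.splitlines():
        # Drop '#' comments and a leading "label:", then skip directives.
        line = LABEL.sub("", line.split("#", 1)[0]).strip()
        if not line or line.startswith("."):
            continue
        mnemonic = line.split()[0].lower()
        if PUSHES.match(mnemonic):
            depth += 1
        elif POPS.match(mnemonic):
            depth -= 1
        max_depth = max(max_depth, depth)
    return max_depth, depth


if __name__ == "__main__":
    # Made-up fragment: one load, one multiply from memory, one store-and-pop.
    body = "flds (%eax)\nfmuls (%ebx)\nfstps (%ecx)\n"
    peak, final = stack_depths(body)
    print(peak, final, peak > X87_REGISTERS)
```

If the final depth of a loop body is greater than its starting depth, every
iteration leaves entries behind and the stack will eventually overflow; a
peak above eight within one pass would show the same problem directly.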

Anyway, I think the first order of business may be to get your peephole
grabbing all the cases, and see if that makes you win everywhere on Athlon, and
if it makes single precision P4e better, and we can go from there . . .

If you do that, attach the assemblies again, and I'll redo timings.  Also, if
you could attach (not put in comment) the patch, it'd be nice to get the
compiler, so I could test x86-64 code on Athlon, etc.

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
