------- Comment #21 from whaley at cs dot utsa dot edu  2006-06-26 15:03 -------
Uros,

Thanks for the reply; I think some confusion has set in (see below) :)

>And the results are a bit suprising (this is the exact output of your test):

Note that you are running the opposite of my test case: SSE vs SSE rather than
x87 vs x87.  This whole bug report is about x87 performance.  You can get more
detail on why I want x87 in my messages above, particularly comment #11, but
single precision is precisely where SSE cannot compete with the x87 unit.  To
see it, put the flags back the way I had them in the attachment, and you'll
see that gcc 3 is much faster.  You should also find that in single precision
the x87 unit soundly beats the SSE unit (unlike double precision, where gcc 3's
x87 code is only slightly faster than the best SSE code).  I think the x87
will win even using gcc 4 for both compilations, even though gcc 4's x87
support is crippled by its new register allocation scheme.

So, let me say what I think is going on here, and you can correct me if I've
gotten it wrong.  I think in this last timing you believe you've found an
exception to the problem, but have forgotten that we want to look at the x87
(which is the fastest method in this case anyway).  Try it with my original
flags (essentially, pass '-mfpmath=387' instead of the sse flags), and you
should see that gcc 3 gives far better performance than any use of scalar
sse.  I think even gcc 4 will be better using its de-optimized x87 code,
because x87 is inherently better than scalar sse on these platforms.  Of all
the machines I've tested, only one likes gcc 4's new x87 register usage
pattern, and that is the CoreDuo.

The issue is in x87 register usage: gcc 4 saves a register by doing the FMUL
from memory rather than first loading the value onto the fpstack, and on at
least the Pentium Pro, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and
Opteron, that drops your x87 (which is your best) performance significantly.
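
To make the two patterns concrete, here is a minimal sketch.  The function
dot_f32 is a hypothetical single-precision kernel made up for illustration; it
is not the code from the attachment.  The assembly in the comment is schematic
and only shows the shape of the two multiply sequences, not actual output from
either compiler.

```c
#include <stddef.h>

/*
 * Hypothetical single-precision kernel (NOT the attachment's code).
 * Build for x87 with, e.g.:   gcc -O2 -mfpmath=387 -c dot.c
 * Build for scalar SSE with:  gcc -O2 -msse -mfpmath=sse -c dot.c
 *
 * Schematic x87 code for the multiply (AT&T syntax, illustrative only,
 * not actual compiler output):
 *
 *   separate load (gcc 3 style)      fused memory operand (gcc 4 style)
 *     flds  (%edx)                     fmuls (%edx)
 *     fmul  %st(1), %st
 *
 * The left form uses one more fpstack slot; the right form saves it by
 * multiplying directly from memory.
 */
float dot_f32(const float *x, const float *y, size_t n)
{
    float sum = 0.0f;
    size_t i;

    for (i = 0; i < n; i++)
        sum += x[i] * y[i];   /* the multiply whose operand form differs */
    return sum;
}
```

Comparing the output of 'gcc -S' from each compiler on a kernel like this
should show which of the two forms was chosen for the multiply.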

Note that given gcc 3's register usage, I think a simple peephole step can
transform it to gcc 4's, if you want to maintain that usage for CoreDuo.
Unfortunately, going the other way requires an additional register, and the
load disturbs your stack operands, so it is easier to keep gcc 3's way as the
default, and peephole to gcc 4's when on a machine that likes that usage
(currently, only the CoreDuo).

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827