------- Comment #21 from whaley at cs dot utsa dot edu 2006-06-26 15:03 -------
Uros,

Thanks for the reply; I think some confusion has set in (see below) :)

>And the results are a bit surprising (this is the exact output of your test):

Note that you are running the opposite of my test case: SSE vs SSE rather than x87 vs x87. This whole bug report is about x87 performance. You can get more detail on why I want x87 in my messages above, particularly comment #11, but single precision is indeed the place where SSE cannot compete with the x87 unit. To see it, put the flags back the way I had them in the attachment, and you'll see that gcc 3 is much faster. Also, you should find in single precision that the x87 unit soundly beats the SSE unit (unlike double precision, where gcc 3's x87 code is only slightly faster than the best SSE code). I think the x87 will win even using gcc 4 for both compilations, even though gcc 4's x87 support is crippled by its new register allocation scheme.

So, let me say what I think is going on here, and you can correct me if I've gotten it wrong. I think in this last timing you think you've found an exception to the problem, but have forgotten we want to look at the x87 (which is the fastest method in this case anyway). Try it with my original flags (essentially, use '-mfpmath=387' instead of the sse flags), and you should see that this gives far better performance using gcc 3 than any use of scalar sse. I think even gcc 4 will be better using its de-optimized x87 code, because x87 is inherently better than scalar sse on these platforms.

Of all the machines I've tested, only one likes gcc 4's new x87 register usage pattern, and that is the CoreDuo. The issue is in x87 register usage: gcc 4 saves a register, and does the FMUL from memory rather than first loading the value to the fpstack. On at least the Pentium Pro, Pentium III, Pentium 4e, Pentium-D, Athlon-64 X2 and Opteron, that drops your x87 (which is your best) performance significantly.
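
For readers following along, the two register usage patterns can be sketched on a simple single-precision dot product. This is only an illustration under assumptions: the function name dot_f32 is hypothetical, it is not the kernel from the attachment, and the assembly in the comments is hand-written to show the shape of each pattern, not actual output from either compiler.

```c
#include <stddef.h>

/*
 * Hypothetical single-precision dot product, used only to illustrate
 * the two x87 patterns discussed in this report.
 *
 * Pattern attributed to gcc 3 (both operands loaded to the fp stack,
 * costing one extra stack register):
 *     flds  (%eax)           # push a[i]
 *     flds  (%edx)           # push b[i]
 *     fmulp %st, %st(1)      # a[i] * b[i]
 *     faddp %st, %st(1)      # sum += product
 *
 * Pattern attributed to gcc 4 (FMUL takes its operand from memory,
 * saving a stack register):
 *     flds  (%eax)           # push a[i]
 *     fmuls (%edx)           # multiply by b[i] straight from memory
 *     faddp %st, %st(1)      # sum += product
 *
 * Both sequences are illustrative sketches, not real compiler output.
 */
float dot_f32(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];   /* one multiply and one add per element */
    return sum;
}
```

Building this on a 32-bit x86 target with '-mfpmath=387' and inspecting the generated assembly (for example with -S) should show which pattern a given compiler picks, though the exact instructions will depend on the compiler version and optimization flags.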

Note that given gcc 3's register usage, I think a simple peephole step can transform it to gcc 4's, if you want to maintain that usage for CoreDuo. Unfortunately, going the other way requires an additional register, and the load plays with your stack operands, so it is easier to keep gcc 3's way as the default, and peephole to gcc 4's when on a machine that likes that usage (currently, only the CoreDuo).

Thanks,
Clint

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827