------- Comment #28 from whaley at cs dot utsa dot edu 2006-06-29 04:17 -------
Guys,
If you are looking for the reason that the new code might be slower, my feeling from the benchmark data is that it involves hiding the cost of the loads. Notice that, except for the cases where the double exceeds the cache, the single precision gcc4 code always gets a greater percentage of gcc3's numbers than double does on each platform. This is the opposite of what you expect if the problem is purely computational, but exactly what you expect if the problem is due to memory costs (since single has half the memory cost).

If I were forced to take a WAG as to what's going on, I would guess it has to do with the additional dependencies in the new code sequence confusing Tomasulo's algorithm or register renaming. I haven't worked it out in detail, but scope the two competing code sequences:

        gcc 3                     gcc 4
        ===========               =======
        fldl 32(%edx)             fldl 32(%edx)
        fldl 32(%eax)             fld %st(0)
        fmul %st(1),%st           fmull 32(%eax)
        faddp %st,%st(6)          faddp %st, %st(2)

Note that in gcc 3, both loads are independent, and can be moved past each other and arbitrarily early in the instruction stream. The fmull would need to be broken into two instructions (a load and a multiply) before the gcc 4 sequence gets similar freedom. I'm not sure how the fp stack handling is done in hardware, but the fact that you've replaced two independent loads with 3 forced-order instructions cannot be beneficial.

At the same time, it is difficult for me to see how the new sequence can be better. We've got the same number of loads, the same number of instructions, the same register use (I think), with a forced ordering and loads you cannot advance (critical in load-happy 8-register land). I originally thought that the gcc 4 stream used one less register, but it appears to copy the edx operand twice to the stack, so I'm no longer sure it has even that advantage.

Just my guess,
Clint

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
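
P.S. For anyone who wants to check the stack bookkeeping in the two sequences quoted above, here is a rough Python sketch of a toy x87 register stack. It is only a model under my own assumptions: the class and function names (X87Stack, run_gcc3, run_gcc4) are made up for illustration, the five zeroed accumulators already on the stack are an assumed starting state, and the values 3.0 and 4.0 stand in for whatever lives at 32(%edx) and 32(%eax). It says nothing about timing or scheduling; it only suggests that both sequences leave the edx operand on the stack and reach the same peak depth.

```python
class X87Stack:
    """Toy x87 register stack; index 0 plays the role of %st(0)."""

    def __init__(self, initial):
        self.st = list(initial)
        self.peak = len(self.st)   # deepest the stack has been so far

    def _push(self, value):
        if len(self.st) == 8:
            raise OverflowError("x87 stack holds only 8 registers")
        self.st.insert(0, value)
        self.peak = max(self.peak, len(self.st))

    def fld_mem(self, value):      # fldl mem: push a value loaded from memory
        self._push(value)

    def fld_st(self, i):           # fld %st(i): push a copy of st(i)
        self._push(self.st[i])

    def fmul_st(self, i):          # fmul %st(i),%st: st(0) = st(0) * st(i)
        self.st[0] *= self.st[i]

    def fmul_mem(self, value):     # fmull mem: st(0) = st(0) * mem
        self.st[0] *= value

    def faddp(self, i):            # faddp %st,%st(i): st(i) += st(0), then pop
        self.st[i] += self.st[0]
        self.st.pop(0)


def run_gcc3(a, b, initial):
    s = X87Stack(initial)
    s.fld_mem(a)       # fldl 32(%edx)
    s.fld_mem(b)       # fldl 32(%eax)   (independent of the first load)
    s.fmul_st(1)       # fmul %st(1),%st
    s.faddp(6)         # faddp %st,%st(6)
    return s


def run_gcc4(a, b, initial):
    s = X87Stack(initial)
    s.fld_mem(a)       # fldl 32(%edx)
    s.fld_st(0)        # fld %st(0)      (needs the first load's result)
    s.fmul_mem(b)      # fmull 32(%eax)  (load fused into the multiply)
    s.faddp(2)         # faddp %st, %st(2)
    return s


if __name__ == "__main__":
    accumulators = [0.0] * 5       # assumed starting state, not from the bug report
    for name, run in (("gcc 3", run_gcc3), ("gcc 4", run_gcc4)):
        s = run(3.0, 4.0, accumulators)
        print(name, "final stack:", s.st, "peak depth:", s.peak)
```

The accumulator ends up in a different slot in each run only because the two snippets use different faddp targets (%st(6) versus %st(2)), exactly as quoted.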