------- Comment #2 from dominiq at lps dot ens dot fr 2008-01-08 09:46 ------- I don't think this is a regression either, only a bad side effect. One way to overcome it would be to change the way the Newton-Raphson iteration is computed. Presently it seems to be x1=x0*(2.0-x*x0), which is bad when x*x0=nearest(1.0,-1.0): (2.0-x*x0) then rounds back to 1.0 and the correction is lost. I see two ways to improve the accuracy: x1=2.0*x0-(x*x0*x0) and x1=x0+(x0*(1.0-x*x0)) (assuming the parentheses are obeyed). The first form adds a multiply, but should not increase the latency if the multiply in 2.0*x0 is scheduled between the first and second multiplies of x*x0*x0. The second form adds the 'add' latency to the original one, but has a better balance between adds and multiplies and is probably the most accurate.
Since I am not familiar enough with the x86, I cannot guess precisely what the other effects of these implementations would be: extra moves, register pressure, etc. Naively I would say that the first one would be better in code with a deficit of multiplies, while the second one would be better for long sequences of divisions. If anyone is interested in digging further into this issue, I can test patches and timings on a Core 2 Duo. In any case, I think something should be said about this "feature" in the manual, and there may be a need for a (better?) "cost model" for replacing a division by "recip+NR", as I read in a previous post. -- http://gcc.gnu.org/bugzilla/show_bug.cgi?id=34702