On Thu, 30 Sep 2010, Dimitry Andric wrote:
On 2010-09-30 05:46, Bruce Evans wrote:
...
This file probably shouldn't exist, especially on amd64. There are 4 or 5
versions of ldexp(), and this file implements what seems to be the worst
one, even without the bug.
...
The version in libc/gen/ldexp.c is just a copy of msun/src/s_scalbn.c,
with some things like copysign() directly pasted in. It even has:
/* @(#)fdlibm.h 5.1 93/09/24 */
at the top.
Bah, I missed this sixth version :-).
Testing indicates that the fdlibm C version is 2.5 times faster than the
asm versions on amd64 on a core2 (ref9), while on i386 the C version is
only 1.5 times faster. The C code is a bit larger so benefits more from
being called from a loop. The asm code uses a slow i387 instruction, and
on amd64 it has to do expensive moves from xmm registers to i387 ones and
back.
Times for 100 million calls (seconds):
    amd64 libc ldexp:       3.18
    amd64 libm asm scalbn:  2.96
    amd64 libm C scalbn:    1.30
    i386  libc ldexp:       3.13
    i386  libm asm scalbn:  2.86
    i386  libm C scalbn:    2.11
Seeing these results, I propose to just delete
lib/libc/amd64/gen/ldexp.c and lib/libc/i386/gen/ldexp.c, which will
cause the amd64 and i386 builds to automatically pick up
lib/libc/gen/ldexp.c instead, which effectively is the fdlibm
implementation. (And no more clang workarounds needed. :)
I like this idea.
Does anyone have ideas for better testing? The loop also benefits
machines with multiple pipelines and/or out-of-order execution.
Especially with the latter I think it is possible for several iterations
to be in progress at once (looks like an average of about 1.5 for
AthlonXP and later in other similar loop benchmarks). In other
benchmarks I use a volatile variable to be more sure of defeating
unwanted compiler optimizations, but I don't want to enforce serialization
since non-benchmarks don't do that. In libm functions, the largest
optimizations are from avoiding internal serialization as much as
possible. Using the i387 functions tends to defeat this since there is
only 1 ALU for them (unlike for i387 addition, etc.; there are 2 ALUs
for that on AthlonXP and later). Perhaps the i387 functions will be
relatively faster again someday when there are more ALUs for them and
better microcode in them, but x86 architects apparently consider this
a low priority and/or the microcode is too hard to make better than ordinary
instructions.
I think big functions using ordinary instructions are OK if they are
slightly faster than i387 functions, since if they aren't called much
then it doesn't matter and if they are called much then they will stay
cached. But in the latter case, they will push other code out of caches;
I don't know how to quantify this.
Bruce