On Sun, 4 Sep 2016, Konstantin Belousov wrote:

> On Sun, Sep 04, 2016 at 12:22:14PM +0000, Bruce Evans wrote:
> ...
>> Log:
>>   Add asm versions of fmod(), fmodf() and fmodl() on amd64.  Add asm
>>   versions of fmodf() and fmodl() on i387.
> ...
> It seems that the wrong version of i387/e_fmodf.S was committed; it is
> identical to the amd64 version.

Indeed.  Fixed.

>> Added: head/lib/msun/amd64/e_fmod.S
>> ==============================================================================
>> --- /dev/null   00:00:00 1970   (empty, because file is newly added)
>> +++ head/lib/msun/amd64/e_fmod.S        Sun Sep  4 12:22:14 2016        (r305382)
>> +ENTRY(fmod)
>> +       movsd   %xmm0,-8(%rsp)          /* spill x and y to the red zone */
>> +       movsd   %xmm1,-16(%rsp)
>> +       fldl    -16(%rsp)               /* push y, then x, onto the i387 stack */
>> +       fldl    -8(%rsp)
>> +1:     fprem                           /* %st = partial remainder of %st/%st(1) */
>> +       fstsw   %ax
>> +       testw   $0x400,%ax              /* C2 set: reduction incomplete, redo */
>> +       jne     1b
>> +       fstpl   -8(%rsp)                /* pop the remainder and ... */
>> +       movsd   -8(%rsp),%xmm0          /* ... return it in %xmm0 */
>> +       fstp    %st                     /* pop y to leave the i387 stack clean */
>> +       ret
>> +END(fmod)

> I see that using the x87 FPU on amd64 is not a new approach in the
> amd64 subdirectory.  Please note that it might have non-obvious effects
> on performance, in particular on the speed of context switches and the
> handling of the #NM exception.

For long double functions, the i387 gets used anyway.

This function is very slow even with the i387.  It takes about 500
cycles per call on args uniformly distributed in double precision
space, but this distribution is very non-average since it gives many
huge args.  The loop iterates many times on huge args, since fprem
reduces the exponent difference by at most 63 per iteration.

This is still better than the C code, which takes 3 or more times
longer, or > 1500 cycles.  The C code does a loop over the bits using
integer code.  It is relatively even slower when there are fewer bits
(something like 9 times slower for args uniformly distributed in float
precision space).
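
To illustrate (an analogy only, not msun's actual e_fmod.c): the C
version has the bit-at-a-time structure of the shift-and-subtract loop
below, applied to the hi/lo words of the double, so its iteration count
also grows with the exponent difference:

#include <stdint.h>

/*
 * Analogy for e_fmod.c's inner loop: unsigned integer remainder by
 * shift-and-subtract.  Caller must pass y != 0.
 */
static uint64_t
umod_shift_subtract(uint64_t x, uint64_t y)
{
        int n = 0;

        /* Align the leading bit of y with x (like matching exponents). */
        while ((y << 1) <= x && (y << 1) > y) {
                y <<= 1;
                n++;
        }
        /* One conditional subtraction per bit of exponent difference. */
        while (n-- >= 0) {
                if (x >= y)
                        x -= y;
                y >>= 1;
        }
        return (x);
}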

> Newer Intel and possibly AMD CPUs have an optimization which allows
> the context switch code to skip saving and restoring state which was
> not changed.  In other words, for a typical amd64 binary which uses
> the %xmm register file but has touched neither %st nor %ymm, only the
> %xmm bits are spilled and then loaded.  Touching %st defeats the
> optimization, possibly for the whole lifetime of the thread.
>
> This feature (XSAVEOPT) is available at least starting from the
> Haswell microarchitecture; I am not sure about Ivy Bridge.

Isn't the i387 state too small to matter much?  There should be the
same number of #NM's and just ~100 bytes extra to save.  Avoiding use
of larger register sets by using only the i387 might save more :-).
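
For reference, whether a CPU has XSAVEOPT can be checked from userland
too.  A minimal sketch, assuming a compiler whose <cpuid.h> provides
__get_cpuid_count(); the bit is CPUID leaf 0xD, sub-leaf 1, %eax bit 0:

#include <cpuid.h>
#include <stdio.h>

int
main(void)
{
        unsigned int eax, ebx, ecx, edx;

        /* CPUID.(EAX=0xD, ECX=1):EAX[0] reports XSAVEOPT. */
        if (__get_cpuid_count(0xd, 1, &eax, &ebx, &ecx, &edx) &&
            (eax & 1) != 0)
                printf("XSAVEOPT supported\n");
        else
                printf("XSAVEOPT not supported\n");
        return (0);
}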

The other amd64 asm uses of the i387 for floats and doubles are:
- 3 files for remainder and 3 files for remquo.  Needed for the same
  reason as for fmod.
- s_scalbn.S, s_scalbnf.S.  To use the i387 fscale.  Probably a mistake.
  The functions themselves are also too slow to be very useful.  libm
  almost never uses them internally, and in optimized functions like
  exp* the exponent scaling is done inline using special integer code
  (sketched below).  I have spent many hours fighting the compiler to
  stop it pessimizing the memory accesses for this integer code into
  pipeline stalls.  Using fscale probably tends to give another type
  of pipeline stall.
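
The inline scaling trick is roughly this (a minimal sketch with a
hypothetical helper name; the real msun code also handles k+1023
falling outside the normal exponent range 1..2046):

#include <stdint.h>
#include <string.h>

/* Multiply x by 2**k by building 2**k from its biased exponent bits. */
static double
scale_by_pow2(double x, int k)
{
        uint64_t bits = (uint64_t)(k + 1023) << 52;     /* bits of 2**k */
        double twok;

        memcpy(&twok, &bits, sizeof(twok));     /* safe type pun */
        return (x * twok);
}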

I plan to remove many more i387 uses on i386, but there aren't many more
on amd64.

Bruce