On Thu, 15 Apr 2021 08:33:47 GMT, gregcawthorne 
<[email protected]> wrote:

> Glibc 2.29 onwards provides optimised versions of log, log10 and exp.
> These functions have an accuracy of 0.9 ulp or better in glibc
> 2.29.
> 
> Therefore this patch adds code to parse, store and check
> the runtime glibc version in os_linux.cpp/hpp.
> This is then used to select glibc's implementation of
> log, log10 and exp at runtime for C1 and C2, iff we have
> glibc 2.29 or greater.
> 
> This will ensure OpenJDK can benefit from future improvements
> to glibc.
> 
> Glibc adheres to the IEEE 754 standard, unless stated otherwise
> in its spec.
> 
> As there are no stated exceptions in the current glibc spec
> for dlog, dlog10 and dexp, we can assume they currently follow
> IEEE 754 (which testing confirms). As such, future versions of
> glibc are unlikely to lose this compliance.
> 
> W.r.t. performance this patch sees ~15-30% performance improvements
> for log and log10, with ~50-80% performance improvements for exp, for
> the common input ranges (those that output finite real numbers).
> However for the NaN and inf output ranges we see a slowdown of up to
> a factor of 2 for some functions and architectures.
> 
> Due to this being the uncommon case we assert that this is a
> worthwhile tradeoff.
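
(For context on the quoted change: the runtime selection boils down to a
version check along these lines. This is just a sketch, not the patch's
actual code; gnu_get_libc_version() from <gnu/libc-version.h> is real glibc
API, but the helper is a hypothetical name of mine.)

    // Sketch: decide at runtime whether glibc is new enough (>= 2.29)
    // to use its optimised log/log10/exp. Not the patch's actual code.
    #include <gnu/libc-version.h>
    #include <cstdio>

    static bool glibc_at_least(int want_major, int want_minor) {
      int major = 0, minor = 0;
      // glibc version strings look like "2.31"; parse major.minor.
      if (std::sscanf(gnu_get_libc_version(), "%d.%d", &major, &minor) != 2) {
        return false;  // be conservative if the string is unparseable
      }
      return major > want_major ||
             (major == want_major && minor >= want_minor);
    }

    int main() {
      std::printf("use glibc math: %s\n",
                  glibc_at_least(2, 29) ? "yes" : "no");
      return 0;
    }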

I have been reading up on the monotonicity paper suggested by Andrew Haley:
http://www-leland.stanford.edu/class/ee486/doc/ferguson1991.pdf

in order to see whether I can prove the current glibc implementations of log
and exp monotonic.

However, I have come to the conclusion that the paper calculates the relative
error threshold for monotonicity of an approximation, and then relies on extra
bits of floating-point hardware precision to guarantee monotonicity. These
extra bits of precision exceed the target representation's mantissa bits, so
that when the result is rounded at the end (rounding is semi-monotonic), the
implementation as a whole is monotonic. No extra bits of floating-point
precision are present on AArch64 in between floating-point operations, so
this paper won't help us in this case.
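
To make the shape of the argument concrete (my rough reading; the notation
is mine, not the paper's): for adjacent doubles x1 < x2 and an increasing f,
a computed approximation fhat with total error bounded by eps stays
monotonic as long as eps never swamps the true increase between neighbouring
inputs:

    fhat(x2) - fhat(x1) >= (f(x2) - f(x1)) - 2*eps > 0

The paper achieves a small enough eps by evaluating in a wider intermediate
format; with plain double arithmetic on AArch64, eps is already at the ulp
scale of the result, so the inequality cannot be guaranteed this way.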

I am currently unsure how the existing implementation (from fdlibm, I believe)
was proven to be monotonic. Perhaps there is a proof I am unaware of. The
implementation has obviously stood the test of time, at least.

If anyone has an idea of how the existing implementation was proven to be
monotonic (if it has been), it could help us apply the same argument to the
Arm Optimized Routines (AOR) version.

What we can do for now is compare the Remez approximation used in the current
OpenJDK implementation (take log for example):
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntimeTrans.cpp#L49

It states:
"The maximum error of this polynomial approximation is bounded by 2**-58.45"
which must be the theoretical accuracy, since an actual evaluation on AArch64
is limited by the 52 bits of mantissa, and it is insufficient for a proof of
monotonicity on its own.
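
To put that bound in ulp terms (my arithmetic, assuming a result of
magnitude near 1, where an ulp is 2**-52):

    2**-58.45 / 2**-52 = 2**-6.45 ~= 0.011 ulp

so the polynomial's own error is tiny; it is the ~0.5 ulp rounding of each
double operation during evaluation that dominates, which is why the
2**-58.45 figure alone cannot settle monotonicity.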

Now if you look at the AOR for log (where the glibc implementation comes from):
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/log.sollya
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/log_abs.sollya

and run them, you will see absolute and relative accuracies of ~2**-63 for
log.sollya. For log_abs.sollya, which is used for inputs around 1.0, there is
an absolute accuracy of ~2**-65 and a relative accuracy of ~2**-56.

So for log the theoretical accuracy is actually higher than the current
implementation's, apart from near 1.0, where the relative error is slightly
worse. However, I have confirmed with Szabolcs Nagy at Arm, who worked on the
implementation, that it is the absolute error here which dictates the
effective accuracy, as there is arithmetic afterwards which changes the
magnitude.
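
To spell that out (my gloss on Szabolcs' point): the relative error is the
absolute error divided by |log(x)|, and log(x) -> 0 as x -> 1, so a fixed
absolute error of ~2**-65 shows up as a relative error as large as ~2**-56
near 1.0. Once the reconstruction adds k*ln2 the magnitude of the result
grows, so it is the absolute error of the polynomial that carries through to
the final answer.
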
As for the existing exp, the implementation's code comments state the maximum
theoretical error of its Remez approximation is 2**-59:
https://github.com/openjdk/jdk/blob/master/src/hotspot/share/runtime/sharedRuntimeTrans.cpp#L238

while running the exp sollya script in AOR shows its Remez has a theoretical
accuracy of 2**-66:
https://github.com/ARM-software/optimized-routines/blob/master/math/tools/exp.sollya

Another thing to consider is the reconstruction process of the current fdlibm
implementation versus glibc's. glibc's exp and log use a table lookup
algorithm in order to allow their polynomial to have a smaller principal
domain around 0, whose result is then transformed back to the larger domain,
whereas fdlibm's does not use this method.
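
Roughly, for exp the scheme works like this (my summary, with N the table
size):

    x = k*(ln2/N) + r,  where |r| <= ln2/(2N)
    exp(x) = 2**(k/N) * exp(r)

The factor 2**(j/N), for j = k mod N, comes from a precomputed table (glibc
stores each entry as a high part, the scale, plus a small correction, the
tail), the integer part of k/N is folded into the final exponent, and exp(r)
only needs a short polynomial on the tiny domain |r| <= ln2/(2N).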

A description of the table lookup scheme can be found here:
https://dl.acm.org/doi/pdf/10.1145/63522.214389

and an analysis of the error bounds of table-lookup-type approximations of
functions can be found here:
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=709374

If the Remez polynomial of fdlibm isn't definitively proven to be monotonic,
and glibc's Remez polynomials have comparable accuracy, then it might suffice
to investigate further the effects of the table lookup method with regard to
semi-monotonicity, even if that wouldn't offer a total proof of the overall
implementation.

My initial look at glibc's exp implementation suggests that the relationship
between the scale and tail variables, along with the table values possibly
being monotonic themselves, might indicate that it won't break monotonicity.
However, this would need further investigation, and similarly for log, iff
this subsection's proof of monotonicity suffices for the overall
implementation.
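
In the meantime, a brute-force spot check can at least hunt for
counterexamples, though it can only ever disprove monotonicity, never prove
it. A minimal standalone sketch, testing whichever exp the libm it links
against provides:

    // Spot check: at sampled points x, exp(nextafter(x)) must not be
    // smaller than exp(x), since exp is increasing everywhere.
    #include <cmath>
    #include <cstdio>

    int main() {
      const double lo = -10.0, hi = 10.0;  // arbitrary sample range
      const long samples = 10000000;
      long violations = 0;
      for (long i = 0; i < samples; i++) {
        double x  = lo + (hi - lo) * (double)i / (double)samples;
        double x2 = std::nextafter(x, HUGE_VAL);  // next representable double
        if (std::exp(x2) < std::exp(x)) {
          violations++;
          std::printf("non-monotonic at x=%a: %a then %a\n",
                      x, std::exp(x), std::exp(x2));
        }
      }
      std::printf("%ld adjacent pairs sampled, %ld violations\n",
                  samples, violations);
      return violations != 0;
    }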

-------------

PR: https://git.openjdk.java.net/jdk/pull/3510
