https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105101

--- Comment #21 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to jos...@codesourcery.com from comment #20)
> On Sat, 11 Jun 2022, already5chosen at yahoo dot com via Gcc-bugs wrote:
> 
> > On MSYS2, _Float128 and __float128 appear to be mostly the same thing,
> > mapped to the same library routines, with the significant difference that
> > _Float128 is not accessible from C++. Since all my test benches are written
> > in C++, I can't even validate that what I wrote above is 100% true.
> > 
> > Also, according to my understanding of the glibc docs (not the clearest
> > piece of text that I ever read), the relevant square root routine should be
> > named sqrtf128(). Unfortunately, nothing like that appears to be present in
> > either math.h or in the library. Am I doing something wrong?
> 
> The function should be sqrtf128 (present in glibc 2.26 and later on 
> x86_64, x86, powerpc64le, ia64).  I don't know about support in non-glibc 
> C libraries.

x86-64 gcc on Godbolt does not appear to know about it.
I think Godbolt uses a fairly standard Linux with a quite recent glibc and headers.
https://godbolt.org/z/Y4YecvxK6
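
For reference, a minimal probe of the sort I tried (an illustration, not my
exact Godbolt source). Note that glibc declares the *f128 functions only when
__STDC_WANT_IEC_60559_TYPES_EXT__ is defined before including math.h, so the
probe defines it:

/* Minimal probe for the TS 18661-3 sqrtf128 interface; an illustration,
   not the exact Godbolt source.  glibc only declares the *f128 functions
   when this macro is defined before <math.h>. */
#define __STDC_WANT_IEC_60559_TYPES_EXT__ 1
#include <math.h>
#include <stdio.h>

int main(void)
{
  _Float128 x = 2.0f128;          /* f128 literal suffix, also TS 18661-3 */
  _Float128 r = sqrtf128(x);
  printf("%.20g\n", (double)r);   /* crude: print via double */
  return 0;
}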


> 
> > Right now, there are only two [gcc] platforms with hw binary128 - IBM POWER
> > and IBM z. I am not sure about the latter, but the former has the xssqrtqp
> > instruction, which is likely the right way to do sqrtq()/sqrtf128() on this
> > platform. If z is the same, which sounds likely, then an implementation
> > based on binary128 mul/add/fma by now has no use cases at all.
> 
> That may well be the case for sqrt.
> 
> > > fma is a particularly tricky case because it *is* required to be 
> > > correctly 
> > > rounding, in all rounding modes, and correct rounding implies correct 
> > > exceptions, *and* correct exceptions for fma includes getting right the 
> > > architecture-specific choice of whether tininess is detected before or 
> > > after rounding.
> > 
> > I suspect that by strict IEEE-754 rules sqrt() is the same as fma(), i.e. 
> > you
> > have to calculate a root to infinite precision and then to round with
> > accordance to current mode.
> 
> Yes, but fma has the extra complication of the architecture-specific 
> tininess detection rules (to quote IEEE 754, "The implementer shall choose 
> how tininess is detected [i.e. from the two options listed immediately 
> above this quote in IEEE 754], but shall detect tininess in the same way 
> for all operations in radix two"), which doesn't apply to sqrt because 
> sqrt results can never underflow.
> 
> (I expect the existing soft-fp version of fma in glibc to be rather better 
> optimized than the soft-fp version of sqrt.)
> 

Maybe. I don't know how to get the soft-fp version.
What I do know is that the version of fmaq() shipped with gcc/gfortran on MSYS2
x86-64 is absurdly slow - on the order of 4 microseconds on quite fast modern
CPUs. I don't know what to call this variant: hard, soft, or maybe freezing.
sqrtq() is also slow, but 3 to 5 times faster than that.
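
For what it's worth, the measurement behind those numbers looks roughly like
this (a sketch, not my actual test bench, which is in C++):

/* Rough latency measurement for fmaq(); a sketch, not the actual bench.
   Compile with:  gcc -O2 bench.c -lquadmath                             */
#include <quadmath.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
  enum { N = 100000 };
  __float128 acc = 1.0Q, x = 1.0000001Q, y = 0.9999999Q;

  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < N; ++i)
    acc = fmaq(acc, x, y);      /* serial dependency => latency, not throughput */
  clock_gettime(CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
  printf("fmaq: %.1f ns/call (acc ~ %g)\n", ns / N, (double)acc);
  return 0;
}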

> > I don't quite understand why you say that. First, I don't remember using
> > any double math in the variant of sqrtq() posted here. So, maybe you meant
> > division?
> > Here I use double math, but IMO, I use it in a way that never causes
> > exceptions.
> 
> Yes, the double division.  If that division can ever be inexact when the 
> result of the square root itself is exact, you have a problem (which in 
> turn could lead to actual incorrectly rounded results from other functions 
> such as the square root operations rounding to a narrower type, when the 
> round-to-odd versions of those functions are used, because the 
> round-to-odd technique relies on correct "inexact" exceptions).

It seems you didn't notice that in my later posts I am giving implementations
of binary128 *division* rather than sqrtq().

sqrtq() is in Comment 9. It's pure integer; no FP (double) math is involved.

Comment 12 is a pure-integer implementation of binary128 division. Again, no FP
(double) math is involved.

Comment 13 is another implementation of binary128 division. It does use double
math (division) internally. On modern Intel and AMD CPUs it is a little faster
than the one from Comment 12, but not radically so. Both the arguments and the
result of the double division are well controlled; they never come close to the
subnormal range.
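
To be concrete about that last point, the double division operates on operands
pre-scaled into [1, 2); something along these lines (a hypothetical helper
illustrating the idea, not the actual Comment 13 code):

/* Hypothetical helper, not the Comment 13 code: illustrates why the internal
   double division stays far from the subnormal range.  The binary128
   significands are first scaled into [1, 2), so the double quotient always
   lies in (0.5, 2). */
#include <stdint.h>

static double initial_quotient_estimate(uint64_t x_hi, uint64_t y_hi)
{
  /* x_hi, y_hi: top 64 bits of the two binary128 significands,
     with the implicit leading 1 already placed at bit 63. */
  double xd = (double)x_hi * 0x1p-63;   /* in [1, 2) */
  double yd = (double)y_hi * 0x1p-63;   /* in [1, 2) */
  return xd / yd;                       /* in (0.5, 2); never subnormal */
}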

All three implementations support only one rounding mode, the default (round to
nearest with ties broken to even). They do not generate any exceptions, neither
underflow nor even overflow; in case of overflow the division silently returns
Inf with the proper sign.
Support for exceptions and for non-default rounding could be added relatively
easily, but would undoubtedly cause a slowdown. I'd expect the main performance
cost to be not the additional tests and branches (modern branch predictors are
very good at sorting out uncommon cases), but reading the control "register" in
a thread-safe manner.

BTW, I see no mention of rounding control or of any sort of exceptions in the
GCC libquadmath docs. There are no APIs with names resembling fesetround() or
mpfr_set_default_rounding_mode().
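
For comparison, this is the kind of control that exists for the standard types
via <fenv.h>; I don't see a documented libquadmath equivalent, nor a statement
that its routines honor the <fenv.h> mode:

/* <fenv.h> rounding/exception control for the standard types, shown for
   comparison only.  Compile with -lm (and -frounding-math, so the compiler
   doesn't constant-fold across the mode change). */
#include <fenv.h>
#include <stdio.h>

int main(void)
{
  volatile double one = 1.0, three = 3.0;  /* keep the divisions at run time */

  fesetround(FE_UPWARD);
  double up = one / three;
  fesetround(FE_TONEAREST);                /* restore the default */
  double nearest = one / three;

  printf("up      = %.17g\n", up);
  printf("nearest = %.17g\n", nearest);

  feclearexcept(FE_ALL_EXCEPT);            /* the exception-flag side of the API */
  return 0;
}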
