Hi, I forgot to highlight that the floating-point version for 2x64 only works if long double type is 128-bit (quad precision). Is there a test macro in GMP for that?
Similarly for single 64-bit, if long double actually is double = 64 bits, then error check after sqrtl() is needed. There are probably other subtleties about floating-pout stuff that I am not aware of... So first I'd like to know, what do GMP developers think about using FP there? Adrien On Sat, 2017-01-28 at 17:49 +0100, Adrien Prost-Boucle wrote: > Hi Marco, > > Thank you for the tests, > and for the example of benchmarking with GMP. > > We must be adamant about the correctness issue of course. > Recently I've been contacted separately by Bradley Lucier, > who has discovered an error (-1) for a particular input of 64x2 version... > So yes, correctness issue. > > Bradley Lucier also proposed to try a completely different approach: > simply use floating-point functions sqrtf(), sqrt(), sqrtl(). > > Indeed recent and not-so-recent CPUs have floating-point sqrt instructions. > Also, if I understand correctly, IEEE standard imposes that the sqrt result > is nondecreasing as the input value increases. > This makes it possible to ensure correctness. > > Basically, the algorithm is: > - convert the input value to floating-point (float, double or long double) > - call sqrtf(), sqrt() or sqrtl() > - convert back to integer > - apply correction -1 or +1 if necessary > > At the end of this message is my code for 32b, 64b and 2x64b. > By default I check only for correction -1 because it seems only this is > necessary, > due to the nondecreasing property of the sqrt instructions. > > Here is is quick benchmark on my Core2 Duo mobile CPU: > (speedup is around 2x compared to mpn) > > ./sqrt -noslow bench32 10000000 > null ....... 0.0135976 s > mpn ........ 0.134436 s > inv ........ 0.129447 s > fp-f ....... 0.0614301 s > fp-d ....... 0.0955542 s > > ./sqrt -noslow bench64 10000000 > null ....... 0.0113118 s > mpn ........ 0.238126 s > inv ........ 0.225985 s > fp-d ....... 0.115913 s > fp-ld ...... 0.117514 s > > ./sqrt -noslow bench64x2 10000000 > null ....... 0.0123146 s > mpn ........ 0.709794 s > inv ........ 0.606813 s > fp-ld ...... 0.223781 s > > Here is is quick benchmark on a higher-end i7 CPU: > (speedup is at least 3x compared to mpn) > > ./sqrt -noslow bench32 10000000 > null ....... 0.00276399 s > mpn ........ 0.0957994 s > inv ........ 0.0418732 s > fp-f ....... 0.0194476 s > fp-d ....... 0.0376165 s > > ./sqrt -noslow bench64 10000000 > null ....... 0.00264621 s > mpn ........ 0.137777 s > inv ........ 0.0551834 s > fp-d ....... 0.0389557 s > fp-ld ...... 0.0494387 s > > ./sqrt -noslow bench64x2 10000000 > null ....... 0.00257349 s > mpn ........ 0.181686 s > inv ........ 0.191645 s > fp-ld ...... 0.0610073 s > > When compiled with gcc with some optim, the calls to sqrt() are not even > present in the compiled program. > It is "inlined" by the appropriate CPU sqrt instructions. > So it's very efficient. > > However, floating-point stuff is probably not a good solutions for embedded > systems, > who may not have FP CPU instruction set. > ARM machines without the NEON extension, in particular. > So the current fully interger-base code should be kept. > > I have updated my standalone test program with these algorithms: > http://94.23.21.190/publicshare/sqrt.tar.gz > make > make bench > > About my "inv" algorithm: > I fortuitously discovered that, on some systems and with some compilers, > the generated code is much slower than "mpn" (Debian with gcc 5.4). > This is about using int32_t types. Using int_fast32_t instead solved the > issue. > > For a quick glance at my code, here it is. > Bradley Lucier noted that right shift of negative variables is undefined. > There have already been some discussion in the GMP mailing lists about that, > so I'll just let you guys decide. > I compile with -fno-math-errno to tell gcc that sqrt of a negative number > just returns a NaN without setting ERRNO. > I've not investigated impact on speed. > > > > // Uncomment this to enable check for error correction +1 > //#define FLOAT_CORR_P1 > > uint32_t sqrt32_float(uint32_t val) { > uint32_t root = (uint32_t)sqrtf((float) val); > int32_t rem = val - root * root; > int32_t corr = rem >> 31; > #ifdef FLOAT_CORR_P1 > rem = 2*root - rem; > corr += ((uint32_t)rem) >> 31; > #endif > //printf("%"PRIi32" ", corr); > root += corr; > return root; > } > uint32_t sqrt32_float_double(uint32_t val) { > return (uint32_t)sqrt((double) val); > } > > uint64_t sqrt64_float_double(uint64_t val) { > uint64_t root = (uint64_t)sqrt((double) val); > int64_t rem = val - root * root; > int64_t corr = rem >> 63; > #ifdef FLOAT_CORR_P1 > rem = 2*root - rem; > corr += ((uint64_t)rem) >> 63; > #endif > //printf("%"PRIi64" ", corr); > root += corr; > return root; > } > uint64_t sqrt64_float_longdouble(uint64_t val) { > return (uint64_t)sqrtl((long double) val); > } > > #define CST2POW64 (65536. * 65536. * 65536. * 65536.) > > uint64_t sqrt64x2_float(uint64_t valh, uint64_t vall) { > long double ld = valh; > ld = (ld * CST2POW64) + vall; > uint64_t root = sqrtl(ld); > > uint64_t remh, reml; > int64_t corr; > > // Calculate r^2 > umul64x2(&remh, &reml, root, root); > // Calculate val-r^2 > usub64x2(&remh, &reml, valh, vall, remh, reml); > // Get the correction -1 if the root is too high > corr = ((int64_t)remh) >> 63; > > #ifdef FLOAT_CORR_P1 > // Subtract to 2*r to check if the root is too low > usub64x2(&remh, &reml, root >> 63, root << 1, remh, reml); > // Get the full correction > corr += remh >> 63; > #endif > > // Apply the error correction > //printf("%"PRIi64" ", corr); > root += corr; > > return root; > } > > > > Adrien > > > On Sat, 2017-01-28 at 14:12 +0100, Marco Bodrato wrote: > > Ciao, > > > > Il Ven, 23 Dicembre 2016 5:22 pm, Adrien Prost-Boucle ha scritto: > > > I now have the same API for my implementation and for the code I've taken > > > from GMP. > > > > As a contribution to this discussion, I did the same thing, but the other > > direction. > > > > The attached code contains a copy of mpn_sqrtrem{1,2} from our repository, > > a copy of the sqrtrem2 code that Paul Zimmermann posted to this list (July > > 2016), and an adaptation of Adrien Prost-Boucle's sqrtrem2 to the API and > > sintax currently used by this internal function of GMP. > > > > The attached file is meant as a replacement for the tune/speed-ext.c file > > in a GMP source tree. After replacement: > > > > ./configure ABI=64 && make && (cd tune;make speed-ext) > > > > And the tune/speed-ext program will be ready to compare the speed of the > > three implementations. > > > > Two limitations: it works only with ABI=64, some measurements seem > > sensible to the "precision" required for measures... this is unexpected to > > me. > > > > > [bodrato@shell ~/gmp-repo]$ tune/speed-ext -s2 -cp1000000000 mpn_sqrtrem2 > > > > pz_sqrtrem2 apb_sqrtrem2 mpn_sqrtrem > > overhead 5.84 cycles, precision 1000000000 units of 2.86e-10 secs, CPU > > freq 3500.07 MHz > > mpn_sqrtrem2 pz_sqrtrem2 apb_sqrtrem2 mpn_sqrtrem > > 2 106.08 95.39 #91.21 108.41 > > > > The abp function seems the faster one (at least on shell), but... > > > > > However don't forget that nothing is fully checked nor proven in my > > > implementation! It may still be wrong in some corner cases! > > > > This is the biggest issue with the apb implementation... correctness > > should be a prerequisite :-) but the promised speed is interesting. > > > > Regards, > > m > > > > _______________________________________________ > gmp-devel mailing list > gmp-devel@gmplib.org > https://gmplib.org/mailman/listinfo/gmp-devel _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel