t...@gmplib.org (Torbjörn Granlund) writes:

> Looking at when method 1 or 3 is faster than 2 is more interesting.
> Method 1 and to some extent also method 3 would benefit from asm code,

How would method 1 benefit from asm code? To use instructions providing
both quotient and remainder? For x86_64, it looks like gcc uses

  xor %edx, %edx
  div %rcx

to get quotient and remainder (i.e., it uses the 128/64 division
instruction; there seems to be no 64/64 division instruction?).

Methods 2 and 3 beg for using subtract + cmov; I'm not sure the
compiler is clever enough to do that for the code using masks.
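To spell out what I mean, a division step in the two styles could look
like this (an untested sketch only, not the actual div1/div2 code; the
names step_mask and step_cmov are made up, and mp_limb_t from gmp.h is
assumed):

  #include <gmp.h>  /* mp_limb_t */

  /* One shift-and-subtract division step, written with a mask in the
     style of the code using masks. */
  static inline void
  step_mask (mp_limb_t *n, mp_limb_t *q, mp_limb_t d)
  {
    mp_limb_t mask = - (mp_limb_t) (*n >= d);  /* all ones iff *n >= d */
    *n -= d & mask;
    *q = (*q << 1) - mask;  /* -mask is the quotient bit, 0 or 1 */
  }

  /* The same step written so that gcc and clang typically emit a
     conditional move instead of a branch. */
  static inline void
  step_cmov (mp_limb_t *n, mp_limb_t *q, mp_limb_t d)
  {
    mp_limb_t t = *n - d;  /* wraps harmlessly when *n < d */
    int ge = (*n >= d);
    *n = ge ? t : *n;      /* subtract + cmov */
    *q = (*q << 1) | (mp_limb_t) ge;
  }

Whether the mask variant actually compiles to cmov is up to the
compiler; the second form makes the intent explicit.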
> Would your present tuneup/speed setup allow measuring of asm code?

I think so: ./speed mpn_hgcd2 will measure the mpn_hgcd2 function in
the library, with no extra hacks. It may need some tricks to be able
to select the other methods and ignore the asm version, but I'd expect
that to be fairly easy.

> The current div1 measurements include hgcd2's own time, right? I.e., if
> we found a div1 which runs in zero cycles, the timings would not be
> zero.

Yes. Do you think it makes sense to measure div1 in isolation? Since
performance depends so much on the distribution of the inputs,
measuring hgcd2 makes sense to me.

I'm appending another iteration of the patch, which adds a div2
function based on applying div1 to the high limbs. The variant is
selected via HGCD2_DIV2_METHOD. Benchmarks:

  HGCD2_DIV2_METHOD  mpn_hgcd2_1  mpn_hgcd2_2  mpn_hgcd2_3
  1                     #1504.47      1740.53      1513.56  div1 + unlikely case
  2                     #1739.66      1858.01      1753.73  current code
  3                     #1615.32      1736.69      1621.95  the #if:ed-out version

Regards,
/Niels

*** /tmp/extdiff.C9Ot80/gmp.217337dadd8e/mpn/generic/hgcd2.c	2019-09-15 14:37:42.395279848 +0200
--- /home/nisse/hack/gmp/mpn/generic/hgcd2.c	2019-09-15 14:32:06.745439876 +0200
***************
*** 41,44 ****
--- 41,48 ----
  #endif
  
+ #ifndef HGCD2_DIV2_METHOD
+ #define HGCD2_DIV2_METHOD 3
+ #endif
+ 
  #if GMP_NAIL_BITS != 0
  #error Nails not implemented
***************
*** 139,142 ****
--- 143,201 ----
  #endif
  
+ #if HGCD2_DIV2_METHOD == 1
+ static mp_limb_t
+ div2 (mp_ptr rp,
+       mp_limb_t n1, mp_limb_t n0,
+       mp_limb_t d1, mp_limb_t d0)
+ {
+   mp_double_limb_t rq = div1 (n1, d1);
+   if (UNLIKELY (rq.d1 > d1))
+     {
+       mp_limb_t n2, q, t1, t0;
+       int c;
+ 
+       /* Normalize */
+       count_leading_zeros (c, d1);
+       ASSERT (c > 0);
+ 
+       n2 = n1 >> (GMP_LIMB_BITS - c);
+       n1 = (n1 << c) | (n0 >> (GMP_LIMB_BITS - c));
+       n0 <<= c;
+       d1 = (d1 << c) | (d0 >> (GMP_LIMB_BITS - c));
+       d0 <<= c;
+ 
+       udiv_qrnnd (q, n1, n2, n1, d1);
+       umul_ppmm (t1, t0, q, d0);
+       if (t1 > n1 || (t1 == n1 && t0 > n0))
+         {
+           ASSERT_ALWAYS (q > 0);
+           q--;
+           n1 += d1;
+           sub_ddmmss (t1, t0, t1, t0, 0, d0);
+         }
+       sub_ddmmss (n1, n0, n1, n0, t1, t0);
+ 
+       /* Undo normalization */
+       rp[0] = (n0 >> c) | (n1 << (GMP_LIMB_BITS - c));
+       rp[1] = n1 >> c;
+ 
+       return q;
+     }
+   else
+     {
+       mp_limb_t q, t1, t0;
+       n1 = rq.d0;
+       q = rq.d1;
+       umul_ppmm (t1, t0, q, d0);
+       if (UNLIKELY (t1 >= n1) && (t1 > n1 || t0 > n0))
+         {
+           q--;
+           sub_ddmmss (t1, t0, t1, t0, d1, d0);
+         }
+       sub_ddmmss (rp[1], rp[0], n1, n0, t1, t0);
+       return q;
+     }
+ }
+ #elif HGCD2_DIV2_METHOD == 2
  /* Two-limb division optimized for small quotients. */
  static inline mp_limb_t
***************
*** 198,202 ****
  }
  
! #if 0
  /* This div2 uses less branches, but relies on fast count_leading_zeros. */
  static inline mp_limb_t
--- 257,261 ----
  }
  
! #elif HGCD2_DIV2_METHOD == 3
  /* This div2 uses less branches, but relies on fast count_leading_zeros. */
  static inline mp_limb_t
***************
*** 239,242 ****
--- 298,303 ----
    return q;
  }
  
+ #else
+ #error Unknown HGCD2_DIV2_METHOD
  #endif
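If it helps review, the contract that all three div2 variants should
satisfy can be checked against __int128 reference arithmetic. An
untested sketch, not part of the patch (it assumes 64-bit limbs, gcc's
unsigned __int128 extension, and inputs for which the quotient fits in
a single limb, as in hgcd2; check_div2 is a made-up name):

  #include <assert.h>
  #include <stdint.h>

  typedef unsigned __int128 u128;

  /* Check n = q*d + r with 0 <= r < d, where q is the return value of
     div2 and rp[1], rp[0] are the stored remainder limbs. */
  static void
  check_div2 (uint64_t q, const uint64_t rp[2],
              uint64_t n1, uint64_t n0, uint64_t d1, uint64_t d0)
  {
    u128 n = ((u128) n1 << 64) | n0;
    u128 d = ((u128) d1 << 64) | d0;
    u128 r = ((u128) rp[1] << 64) | rp[0];

    assert (r < d);
    assert (n == (u128) q * d + r);
  }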
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel