Ciao Il Gio, 5 Settembre 2019 10:33 am, Torbjörn Granlund ha scritto: > One of my previous proposal looks like this, when put into its right form: > > mp_limb_t > div1 (mp_limb_t n0, mp_limb_t d0) > { > if (UNLIKELY ((d0 >> 61) != 0)) > return n0 / d0; > > if (UNLIKELY (n0 >= (d0 << 3))) > return n0 / d0; > else { > mp_limb_t q, mask; > > d0 <<= 2; > > mask = -(mp_limb_t) (n0 >= d0); > q = 4 & mask; > n0 -= d0 & mask; > > d0 >>= 1; > mask = -(mp_limb_t) (n0 >= d0); > q += 2 & mask; > n0 -= d0 & mask; > > d0 >>= 1; > mask = -(mp_limb_t) (n0 >= d0); > q += 1 & mask; > n0 -= d0 & mask; > > return q; > } > }
> Some timing tests indicate that it is 6 times faster on 'shell' for > quotients < 8. For hgcd, that would mean almost all quotients. As I realised on the powm thread... On shell, also the following rewrite is compiled without branches: static inline mp_double_limb_t div1 (mp_limb_t n0, mp_limb_t d0) { mp_double_limb_t res; if (UNLIKELY ((d0 >> (GMP_LIMB_BITS - 3)) != 0) || UNLIKELY (n0 >= (d0 << 3))) { res.d1 = n0 / d0; res.d0 = n0 - res.d1 * d0; } else { mp_limb_t q; d0 <<= 2; q = (n0 >= d0); n0 -= (n0 >= d0 ? d0 : 0); d0 >>= 1; q <<= 1; q += (n0 >= d0); n0 -= (n0 >= d0 ? d0 : 0); d0 >>= 1; q <<= 1; q += (n0 >= d0); n0 -= (n0 >= d0 ? d0 : 0); res.d0 = n0; res.d1 = q; } return res; } The last d0 >>= 1; q <<= 1; q += (n0 >= d0); n0 -= (n0 >= d0 ? d0 : 0); block gets compiled (on shell, with -O2) into: shrq %rsi xorl %edx, %edx cmpq %rsi, %rdi setae %dl leaq (%rdx,%rax,2), %rax cmovbq %rcx, %rsi subq %rsi, %rdi Not bad, is it? Must we use masks to obtain branchless code? Or should we let the compiler choose? Ĝis, m -- http://bodrato.it/papers/ _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel