ni...@lysator.liu.se (Niels Möller) writes: In that case, not so surprising that the div1 function loses. Do other architectures also have decent performance for small-quotient division?
Do you think table lookup on high bits should beat 16 cycles? It needs to give good enough accuracy (possibly with an adjustment step) to not result in a large penalty in iteration count. Even with div1 deleted, the code handles q == 1 specially, and only divides when q >= 2. I think 16 cycles should be ueasy to beat. Even with Intel's slow lzcnt (3 cycles latency, 1/cycle throughput) something like this should be faster; count_leading_zeros (ncnt, n0) count_leading_zeros (dcnt, d0) ni = ... extract high k bits from n0 ... di = ... extract high \ell bits from d0 ... q = qtab[ni<<\ell+di]; ... adjust ... One may also keep track of the msb for n0 and d0 for the benefit of machines with slow lzcnt. An alternative might be something along the lines of if (n0 >= (d0 << 3)) q = n0 / d0; else { q = 0; if (n0 > (d0 << 2)) q += 4 n0 -= d0 << 2 if (n0 > (d0 << 1)) q += 2 n0 -= d0 << 1 if (n0 > (d0 << 0)) q += 1 n0 -= d0 } with masking to avoid any branches from the "else" clause and onwards. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel