ni...@lysator.liu.se (Niels Möller) writes:

> I've made a quick try deleting it from the single-limb loop. See patch
> below. Measurements are a bit noisy, but it looks like a slowdown when
> I time it, with hgcd2 time increasing from 1220 cycles to 1290 (this
> time measured on broadwell), which seems to be an increase of more
> than one cycle per iteration of this loop.
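For readers following along, the kind of q = 1 special handling under discussion can be sketched roughly as follows. This is only an illustration, not the code from GMP's hgcd2 or from the patch: the type `qr_t` and the function `div1_q1_special` are hypothetical names, and the sketch assumes 64-bit limbs and n >= d > 0.

```c
#include <stdint.h>

/* Hypothetical quotient/remainder pair; not a GMP type. */
typedef struct {
    uint64_t q;
    uint64_t r;
} qr_t;

/* Sketch of a single-limb division step with a fast path for q = 1.
   Assumes n >= d > 0 (an assumption of this example). */
static qr_t div1_q1_special(uint64_t n, uint64_t d)
{
    qr_t res;
    uint64_t t = n - d;      /* cannot underflow because n >= d */

    if (t < d) {
        /* n < 2d, so the quotient is exactly 1: no division needed. */
        res.q = 1;
        res.r = t;
    } else {
        /* General case: fall back to the hardware division. */
        res.q = n / d;
        res.r = n - res.q * d;
    }
    return res;
}
```

The fast path trades a division instruction for one subtraction and one compare, but it adds a data-dependent branch, so whether it pays off depends on how well that branch predicts on a given CPU.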
With which HGCD_DIV1_METHOD did you make these experiments? For _METHOD 1 one almost surely wants q = 1 special handling, at least for Intel CPUs. (Not as surely with AMD or ARM.)

Incidentally, my mpn_div_11 asm code didn't help any x86-64 CPUs. The speed was about the same. Presumably inlining of the C variants compensates for their slower per-bit speed.

I find it hard to accept that 25 cycles per iteration is as good as it gets. (25 cycles is Intel's best division instruction speed.) I still believe we could beat it soundly with a table-based approach if it only rarely incurs a branch miss.

-- 
Torbjörn
Please encrypt, key id 0xC8601622