"Marco Bodrato" <bodr...@mail.dm.unipi.it> writes: We have an explicit example of this: INV_MULMOD_BNM1_THRESHOLD is typically larger than the MULMOD_BNM1_THRESHOLD, the latter is only used internally .
OK. These are far apart; the quotient between them is about 3 on
average. MULMOD_BNM1_THRESHOLD varies from 10 to 20,
INV_MULMOD_BNM1_THRESHOLD varies from 20 to 80.

  I tested this approach for sqrlo_basecase too, you can find the code
  enclosed by #ifdef SQRLO_SHORTCUT_MULTIPLICATIONS
  But I'm not sure it is faster, so it is currently disabled.

It will obviously be faster for machines where widening multiplication
is expensive, and in some other hardware cases. How it will work for
high-end CPUs, I cannot tell. I would expect a small speedup in most
cases.

But using mul_2/addmul_2 when these exist is more important.

There is a weird reason why addmul_2 will pay off extra here: loop
counter prediction problems on some CPUs. The inner loop count
decreases every time the inner loop is repeated. CPUs which base
prediction on branch count repetitions will pay a pipeline replay for
the last inner-loop back branch, and in many cases again for the
non-taken inner-loop fall-through.

Why does that hurt addmul_1 more than addmul_2? Simply because the
former will be executed twice as many times.

(I believe recent Intel CPUs detect the loop count register and predict
based on its value rather than on branch insn statistics/counts.)

I don't recall how we defined mullo/sqrlo wrt the size of the target
operand. Can we write above the defined result, i.e., can we write the
full 2n product if it is convenient?

--
Torbjörn
Please encrypt, key id 0xC8601622
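
To make the loop-count argument above concrete, here is a minimal sketch
of a low-half product basecase (mullo rather than sqrlo, for simplicity)
written once with an addmul_1-style inner loop and once with an
addmul_2-style inner loop. This is not GMP code: the 32-bit limb_t, the
toy_* function names and the interface of toy_addmul_2_lo are all
assumptions made for illustration, and GMP's internal mpn_addmul_1 and
mpn_addmul_2 have different interfaces. Each driver returns how many
times its inner loop was entered, which is also how many loop exits the
branch predictor has to get right.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t;  /* toy limb type (assumption); not GMP's mp_limb_t */

/* rp[0..len-1] += up[0..len-1] * v; the carry out of the top limb is dropped,
   because in a low-half product it lies above the result. */
static void toy_addmul_1_lo(limb_t *rp, const limb_t *up, size_t len, limb_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < len; i++) {
        uint64_t t = (uint64_t)up[i] * v + rp[i] + carry;  /* cannot overflow */
        rp[i] = (limb_t)t;
        carry = t >> 32;
    }
}

/* rp[0..len-1] += low len limbs of up[0..len-1] * (v0 + v1 * 2^32).
   Two multiplier limbs are handled in one pass, with two carry chains. */
static void toy_addmul_2_lo(limb_t *rp, const limb_t *up, size_t len,
                            limb_t v0, limb_t v1)
{
    uint64_t c0 = 0, c1 = 0;
    limb_t prev = 0;  /* up[i-1], taken as zero for i == 0 */
    for (size_t i = 0; i < len; i++) {
        uint64_t t = (uint64_t)up[i] * v0 + rp[i] + c0;
        c0 = t >> 32;
        t = (uint64_t)prev * v1 + (limb_t)t + c1;
        c1 = t >> 32;
        rp[i] = (limb_t)t;
        prev = up[i];
    }
}

/* Low n limbs of up * vp, one multiplier limb per inner loop.
   Row j only touches limbs j..n-1, so the inner loop length shrinks
   by one on every pass. Returns the number of inner loops entered. */
static unsigned toy_mullo_by_1(limb_t *rp, const limb_t *up,
                               const limb_t *vp, size_t n)
{
    unsigned loops = 0;
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;
    for (size_t j = 0; j < n; j++) {
        toy_addmul_1_lo(rp + j, up, n - j, vp[j]);
        loops++;
    }
    return loops;
}

/* Same result, two multiplier limbs per inner loop; a leftover odd row
   falls back to the one-limb loop. Returns the number of inner loops. */
static unsigned toy_mullo_by_2(limb_t *rp, const limb_t *up,
                               const limb_t *vp, size_t n)
{
    unsigned loops = 0;
    size_t j = 0;
    for (size_t i = 0; i < n; i++)
        rp[i] = 0;
    for (; j + 1 < n; j += 2) {
        toy_addmul_2_lo(rp + j, up, n - j, vp[j], vp[j + 1]);
        loops++;
    }
    if (j < n) {
        toy_addmul_1_lo(rp + j, up, n - j, vp[j]);
        loops++;
    }
    return loops;
}
```

For n = 8, the first variant enters its inner loop 8 times with lengths
8, 7, ..., 1, while the second enters it 4 times with lengths 8, 6, 4, 2.
On a CPU that predicts the loop branch from its recent repetition count,
every one of those exits is a candidate for a mispredict, so halving the
number of inner loops halves that cost as well.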