Dear Torbjörn, > yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it > "Montgomery-Svoboda". The quotient selection becomes trivial, which means > one can reduce the latency between two mpn_addmul_1 calls. > > It really becomes a mul_basecase except that the first round there is > done with mul_1. If we replace that mul_1 by addmul_1, and > then called the resulting function like this, > > mul_basecase'(rp, up, un, rp, rn-un) > > then we will get the desired remainder at mul_basecase speed.
a slight difference is that here the limb multiplier is not known in advance, but only at the beginning of each loop. But maybe it does not make any difference. I tried to implement Montgomery-Svoboda at the C level, but did not manage to beat the mpn_redc_x routines. I'm very interested to see your results! Best regards, Paul _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel