Ciao, Il Gio, 3 Ottobre 2019 8:39 am, Niels Möller ha scritto: > "Marco Bodrato" <bodr...@mail.dm.unipi.it> writes:
>> I implemented it with two variants of that portion: >> >> #ifdef BRANCHLESS >> int cond1 = (r0 >= m0); >> int cond2 = (r0 > (m0 >> 1)); >> >> r0 -= m0 & - (mp_limb_t) cond1; >> r0 <<= 1; >> r0 -= m0 & - (mp_limb_t) (cond1 ^ cond2); >> #else >> if (r0 <= (m0 >> 1)) >> r0 <<= 1; >> else if (r0 >= m0) >> r0 = (r0 - m0) << 1; >> else >> r0 = (r0 << 1) - m0; >> #endif > > What's the range of r0? To me it looks like the last two cases could > overflow if r0 is much larger than m0, so I take it there's some limit? r0 comes from sqr+redc. So, if I understood correctly, it has at most as many bits as m0, but may be r0 >= m0. Since the number of bits is not larger, we can not have r0 >= 2 * m0, and the formula (r0 - m0) << 1 can not overflow. > If r0 <= m0, one could the modular-add-via-sub trick, which reduces the > number of conditions from two to one: Ok, I agree, if cond1 = (r0 >= m0) is always false, then we can get rid of it :-D Anyway, I was simply surprised by the fact that a code with two branches seems faster than the same idea reimplemented in a branchless-way, with masks. Maybe because the compiler translate the branchy code in assembler using cmovs, and obtain a more efficient branchless object code than the code with masks? I will check... > But if m0 has the most significant bit set, one would need some other > trick. If you don't find something clever enough, the easiest way to > organize it may be to work with canonical representation, and For the canonical representation we need a canonical modular reduction. I'm not sure it could be faster than the current code, at least for large enough exponents (let's say 50-60 bits). Ĝis, m -- http://bodrato.it/papers/ _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel