Re: Likely GMP bug
Ciao,

On Wed, 30 May 2018 10:20 am, Niels Möller wrote:
> "Marco Bodrato" writes:

> Thinking about micro optimizations... Consider

>> count_trailing_zeros (c, ulimb);
>> ulimb = (ulimb >> 1) >> c;

> vs

>> count_trailing_zeros (c, ulimb);
>> ulimb >>= (c + 1);

> I wonder if maybe it's preferable to always use the first variant? Both

You are saying that my patch https://gmplib.org/repo/gmp/rev/e4849ae7c974
was not an emergency workaround, but a cleverly subtle optimization? Well,
I have to admit that I was not aware... but thanks :-)

> first variant, the first shift can be issued in parallel with ctz,

If we could hint to the compiler... HINT (c >= 0 && c + 1 < ) then they
should be equivalent. The compiler's optimiser can then choose the best one
for the target environment :-)

If we define c as unsigned, we do not need other hints: the compiler has
enough information to understand that ulimb>>(c+1) can be implemented as
(ulimb>>1)>>c in all non-undefined cases. The other direction is not as
easy.

Of course, it may be that current compilers do not take advantage of this
possible optimisation...

Ĝis,
m

--
http://bodrato.it/
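For illustration, a stand-alone sketch of the two forms (hypothetical
naming, not GMP code), with __builtin_ctzll standing in for GMP's
count_trailing_zeros macro. Comparing a compiler's output for the two
functions, for example with gcc -O2 -S, shows whether it performs the
one-directional rewrite described above.

#include <stdint.h>

/* variant A: defined for every c in [0, 63], so a compiler may not rewrite
   it into a single shift by c + 1 unless it can prove c < 63 */
uint64_t
strip_low_one_a (uint64_t u)    /* u must be nonzero */
{
  unsigned c = (unsigned) __builtin_ctzll (u);
  return (u >> 1) >> c;
}

/* variant B: undefined when c + 1 == 64 (i.e. u == 2^63); wherever it is
   defined it agrees with variant A, so a compiler is free to emit the
   shift-shift sequence for it (the direction described above) */
uint64_t
strip_low_one_b (uint64_t u)    /* u must be nonzero */
{
  unsigned c = (unsigned) __builtin_ctzll (u);
  return u >> (c + 1);
}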
Re: Likely GMP bug
ni...@lysator.liu.se (Niels Möller) writes:

> But you may be right that it's important for performance to avoid a
> redundant count_trailing_zeros on u.

It is slow on several machines, but it has become more common to provide
good leading/trailing bit count for newer microarchs.

> It seems tricky to avoid that, without code duplication or complications
> in the code flow. For my personal preference, the tradeoff threshold
> between goto and duplicated code is around 5-10 duplicated code lines.
>
> Thinking about micro optimizations... Consider
>
>> count_trailing_zeros (c, ulimb);
>> ulimb = (ulimb >> 1) >> c;
>
> vs
>
>> count_trailing_zeros (c, ulimb);
>> ulimb >>= (c + 1);

Right shift has OK performance almost everywhere. (An obsolete exception is
the x86-64 Pentium 4, where it takes some 8 cycles and stalls the
pipeline.) On newer Intel x86-64, shifting is fast for immediate counts but
takes 2 cycles with 1/2 throughput for dynamic counts. The latest
generations provide a new set of shifts for dynamic counts which are fast,
e.g. shrx. AMD chips have great right shift performance with no exceptions.
No RISCs have any problems here.

> I wonder if maybe it's preferable to always use the first variant? Both
> should compile to three instructions (assuming we have a ctz instruction),
> ctz + shift + shift, or ctz + add + shift. But with the first variant, the
> first shift can be issued in parallel with ctz, reducing the critical path
> of the gcd loop. We'd need some benchmarking to see if it makes a
> difference.

Suppressing the redundant count_trailing_zeros should make more of a
difference.

Ref: https://gmplib.org/~tege/x86-timing.pdf

--
Torbjörn
Please encrypt, key id 0xC8601622
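As a rough way to check dynamic-count shift latency on a given machine,
here is a minimal probe sketch (not a rigorous benchmark: no warm-up, and
the AND and ADD in the chain add their own latency; the constants and names
are made up for the illustration). Each iteration's shift count depends on
the previous result, so successive shifts cannot overlap and the loop time
approximates the latency of the chain rather than its throughput.

#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  const long n = 200000000;
  uint64_t x = 0x0123456789abcdefULL;
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < n; i++)
    /* the count (x & 63) depends on the previous shift result, so the
       shifts serialize; the odd constant keeps x from collapsing to 0 */
    x = (x >> (x & 63)) + 0x9e3779b97f4a7c15ULL;
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
  printf ("%.2f ns per and+shift+add step (x = %016llx)\n",
          ns / n, (unsigned long long) x);
  return 0;
}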
Re: Likely GMP bug
"Marco Bodrato" writes: > ... the effect is that in many cases (if U don't need reduction modulo V) > the trailing zeros of U are removed twice. But you may be right that it's important for performance to avoid a redundant count_trailing_zeros on u. It seems tricky to avoid that, without code duplication or complications in the code flow. For my personal preference, tradeoff threshold between goto and duplicated code is around 5-10 duplicated code lines. Thinking about micro optimizations... Consider > count_trailing_zeros (c, ulimb); > ulimb = (ulimb >> 1) >> c; vs > count_trailing_zeros (c, ulimb); > ulimb >>= (c + 1); I wonder if maybe it's preferable to always use the first variant? Both should compile to three instructions (assuming we have a ctz instruction), ctz + shift + shift, or ctz + add + shift. But with the first variant, the first shift can be issued in parallel with ctz, reducing the critical path of the gcd loop. We'd need some benchmarking to see if it makes a difference. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance. ___ gmp-bugs mailing list gmp-bugs@gmplib.org https://gmplib.org/mailman/listinfo/gmp-bugs
Re: Likely GMP bug
Ciao,

On Mon, 28 May 2018 10:00 pm, Niels Möller wrote:
> I'd suggest the below (complete file, I think that's more readable

The code is clean. You removed all gotos...

> The last part of the function requires vlimb odd, but tolerates
> arbitrary u, including 0.

... the effect is that in many cases (if U don't need reduction modulo V)
the trailing zeros of U are removed twice.

> This would be a candidate gcd_11 or gcd_11_odd.

Maybe, if we keep both parts in the same function, some code replication
can be good? Is something like the following too redundant?

mp_limb_t
mpn_gcd_1 (mp_srcptr up, mp_size_t size, mp_limb_t vlimb)
{
  mp_limb_t ulimb;
  unsigned long zero_bits, u_low_zero_bits;
  int c;

  ASSERT (size >= 1);
  ASSERT (vlimb != 0);
  ASSERT_MPN_NONZERO_P (up, size);

  ulimb = up[0];

  /* Need vlimb odd for modexact, want it odd to get common zeros. */
  count_trailing_zeros (zero_bits, vlimb);
  vlimb >>= zero_bits;

  if (size > 1)
    {
      /* Must get common zeros before the mod reduction.  If ulimb == 0
         then vlimb already gives the common zeros.  */
      if (ulimb != 0)
        {
          count_trailing_zeros (u_low_zero_bits, ulimb);
          zero_bits = MIN (zero_bits, u_low_zero_bits);
        }

      ulimb = MPN_MOD_OR_MODEXACT_1_ODD (up, size, vlimb);
      if (ulimb == 0)
        return vlimb << zero_bits;

      /* Note that if ulimb == GMP_LIMB_HIGHBIT, c+1 is an invalid
         shift count.  */
      count_trailing_zeros (c, ulimb);
      ulimb = (ulimb >> 1) >> c;
    }
  else
    {
      /* size==1, so up[0]!=0 */
      count_trailing_zeros (u_low_zero_bits, ulimb);
      ulimb >>= u_low_zero_bits;
      zero_bits = MIN (zero_bits, u_low_zero_bits);

      /* make u bigger */
      if (vlimb > ulimb)
        MP_LIMB_T_SWAP (ulimb, vlimb);

      /* if u is much bigger than v, reduce using a division rather than
         chipping away at it bit-by-bit */
      if ((ulimb >> 16) > vlimb)
        {
          ulimb %= vlimb;
          if (ulimb == 0)
            return vlimb << zero_bits;

          count_trailing_zeros (c, ulimb);
          ulimb >>= (c + 1);
        }
      else
        ulimb >>= 1;
    }

  ASSERT (vlimb & 1);

  /* Represent the odd numbers ulimb and vlimb without the redundant
     least significant one bit.  This reduction in size by one bit
     ensures that the high bit of t, below, is set if and only if
     vlimb > ulimb.  And in addition, t != GMP_LIMB_HIGHBIT.  */
  vlimb >>= 1;

  while (ulimb != vlimb)
    {
      ...

Ĝis,
m

--
http://bodrato.it/
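To make the truncated loop concrete, a hedged sketch of how such an odd-odd
tail typically continues: one branch-free step per iteration, written in
plain C over uint64_t with made-up names and __builtin_ctzll in place of
count_trailing_zeros. It is an illustration only, not a copy of GMP's code.
Both values stay below 2^63 in this representation, so their difference can
never equal GMP_LIMB_HIGHBIT and either shift form is safe.

#include <stdint.h>

static uint64_t
gcd_odd_tail (uint64_t uh, uint64_t vh)   /* uh = u >> 1, vh = v >> 1 */
{
  while (uh != vh)
    {
      uint64_t t = uh - vh;          /* wraps around when vh > uh */
      uint64_t vgtu = - (t >> 63);   /* all ones iff vh > uh, else zero */
      int c;

      vh += vgtu & t;                /* vh <- min (old uh, old vh) */
      uh = (t ^ vgtu) - vgtu;        /* uh <- |t|, nonzero */
      c = __builtin_ctzll (uh);
      uh = (uh >> 1) >> c;           /* or uh >>= c + 1; both fine here */
    }
  return (uh << 1) | 1;              /* put the dropped low bit back */
}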