Re: Likely GMP bug

2018-05-30 Thread Marco Bodrato
Ciao,

On Wed, 30 May 2018 10:20 am, Niels Möller wrote:
> "Marco Bodrato"  writes:

> Thinking about micro optimizations... Consider

>>   count_trailing_zeros (c, ulimb);
>>   ulimb = (ulimb >> 1) >> c;

> vs

>>   count_trailing_zeros (c, ulimb);
>>   ulimb >>= (c + 1);

> I wonder if maybe it's preferable to always use the first variant? Both

You are saying that my patch https://gmplib.org/repo/gmp/rev/e4849ae7c974
was not an emergency workaround, but a cleverly subtle optimization?
Well, I have to admit that I was not aware... but thanks :-)

> first variant, the first shift can be issued in parallel with ctz,

If we could hint to the compiler...
  HINT (c >= 0 && c + 1 < GMP_LIMB_BITS)
then they should be equivalent, and the compiler's optimiser could choose
the best one for the target environment :-)

If we define c as unsigned, we do not need any other hint: the compiler has
enough information to see that ulimb>>(c+1) can be implemented as
(ulimb>>1)>>c in all non-undefined cases. The converse is not as easy.

Of course, it may be that current compilers do not perform this possible
optimisation...
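
To make the two forms concrete, here is a small sketch (mine, not GMP code:
__builtin_ctzl stands in for count_trailing_zeros, unsigned long for
mp_limb_t):

typedef unsigned long limb_t;   /* stand-in for mp_limb_t */

/* Variant 1: the shift by one does not depend on c, so it can be
   scheduled alongside the trailing-zero count.  Well defined for every
   nonzero u, including u == HIGHBIT.  */
limb_t
strip_v1 (limb_t u)
{
  unsigned c = __builtin_ctzl (u);   /* stand-in for count_trailing_zeros */
  return (u >> 1) >> c;
}

/* Variant 2: undefined when c + 1 equals the limb width, i.e. when
   u == HIGHBIT.  With c unsigned, the compiler has the information it
   needs to rewrite this as variant 1; the converse rewrite is not valid.  */
limb_t
strip_v2 (limb_t u)
{
  unsigned c = __builtin_ctzl (u);
  return u >> (c + 1);
}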

Ĝis,
m

-- 
http://bodrato.it/



Re: Likely GMP bug

2018-05-30 Thread Torbjörn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  But you may be right that it's important for performance to avoid a
  redundant count_trailing_zeros on u.

It is slow on several machines, but newer microarchitectures more
commonly provide good leading/trailing bit count instructions.

  It seems tricky to avoid that without code duplication or complications
  in the code flow. For my personal preference, the tradeoff threshold
  between a goto and duplicated code is around 5-10 duplicated lines.

  Thinking about micro optimizations... Consider

  >   count_trailing_zeros (c, ulimb);
  >   ulimb = (ulimb >> 1) >> c;

  vs

  >   count_trailing_zeros (c, ulimb);
  >   ulimb >>= (c + 1);

Right shift has OK performance almost everywhere.  (An obsolete
exception is the x86-64 Pentium 4, where it takes some 8 cycles and
stalls the pipeline.)

On newer Intel x86-64 shifting is fast for immediate counts but takes 2
cycles with 1/2 throughput for dynamic counts.  The latest generations
provide a new set of shift instructions for dynamic counts which are
fast, e.g. shrx.
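
(A quick way to see that, as a throwaway snippet rather than anything from
GMP: compiling a plain variable shift with BMI2 enabled, e.g.
"gcc -O2 -mbmi2 -S shift.c", should emit shrx on capable targets.)

/* shift.c */
unsigned long
var_shift (unsigned long x, unsigned c)
{
  return x >> c;
}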

AMD chips have great right shift performance with no exceptions.

No RISCs have any problems here.

  I wonder if maybe it's preferable to always use the first variant? Both
  should compile to three instructions (assuming we have a ctz
  instruction), ctz + shift + shift, or ctz + add + shift. But with the
  first variant, the first shift can be issued in parallel with ctz,
  reducing the critical path of the gcd loop.

  We'd need some benchmarking to see if it makes a difference.

Suppressing redundant count_trailing_zeros should make more difference.

Ref: https://gmplib.org/~tege/x86-timing.pdf

-- 
Torbjörn
Please encrypt, key id 0xC8601622


Re: Likely GMP bug

2018-05-30 Thread Niels Möller
"Marco Bodrato"  writes:

> ... the effect is that in many cases (if U doesn't need reduction modulo V)
> the trailing zeros of U are removed twice.

But you may be right that it's important for performance to avoid a
redundant count_trailing_zeros on u.

It seems tricky to avoid that without code duplication or complications
in the code flow. For my personal preference, the tradeoff threshold
between a goto and duplicated code is around 5-10 duplicated lines.

Thinking about micro optimizations... Consider

>   count_trailing_zeros (c, ulimb);
>   ulimb = (ulimb >> 1) >> c;

vs

>   count_trailing_zeros (c, ulimb);
>   ulimb >>= (c + 1);

I wonder if maybe it's preferable to always use the first variant? Both
should compile to three instructions (assuming we have a ctz
instruction), ctz + shift + shift, or ctz + add + shift. But with the
first variant, the first shift can be issued in parallel with ctz,
reducing the critical path of the gcd loop.

We'd need some benchmarking to see if it makes a difference.
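
Something along these lines could serve as a throwaway harness for that
benchmarking (just a sketch, not GMP code: __builtin_ctzl and an inline
mask stand in for count_trailing_zeros and the limb-mask macro, unsigned
long for mp_limb_t, and the loop uses the halved representation of the
odd operands discussed elsewhere in this thread):

/* bench-gcd.c -- rough benchmark sketch.  Runs a binary gcd on odd limbs,
   with the shift variant chosen at compile time:

     cc -O2 bench-gcd.c                 for  ulimb = (ulimb >> 1) >> c
     cc -O2 -DVARIANT_ADD bench-gcd.c   for  ulimb >>= c + 1              */
#include <stdio.h>
#include <time.h>

typedef unsigned long limb_t;           /* stand-in for mp_limb_t */

static limb_t
gcd_odd (limb_t ulimb, limb_t vlimb)    /* both arguments odd */
{
  /* Work on (x - 1) / 2, so that the high bit of t = ulimb - vlimb is
     set exactly when vlimb > ulimb.  */
  ulimb >>= 1;
  vlimb >>= 1;

  while (ulimb != vlimb)
    {
      limb_t t = ulimb - vlimb;
      limb_t vgtu = - (t >> (sizeof (limb_t) * 8 - 1));  /* all ones if vlimb > ulimb */
      int c = __builtin_ctzl (t);       /* stand-in for count_trailing_zeros */

      vlimb += vgtu & t;                /* vlimb <-- min (ulimb, vlimb) */
      ulimb = (t ^ vgtu) - vgtu;        /* ulimb <-- |ulimb - vlimb| */
#ifdef VARIANT_ADD
      ulimb >>= c + 1;                  /* ctz -> add -> shift dependency chain */
#else
      ulimb = (ulimb >> 1) >> c;        /* the >> 1 is independent of the ctz */
#endif
    }
  return (ulimb << 1) | 1;
}

int
main (void)
{
  limb_t x = 123456789, a, acc = 0;
  struct timespec t0, t1;
  long i;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (i = 0; i < 10000000; i++)
    {
      /* Cheap xorshift pseudo-random odd operands.  */
      x ^= x << 13; x ^= x >> 7; x ^= x << 17;
      a = x | 1;
      x ^= x << 13; x ^= x >> 7; x ^= x << 17;
      acc += gcd_odd (a, x | 1);        /* accumulate so nothing is optimized away */
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  printf ("acc=%lu  %.3f s\n", acc,
          (double) (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec));
  return 0;
}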

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.


Re: Likely GMP bug

2018-05-30 Thread Marco Bodrato
Ciao,

On Mon, 28 May 2018 10:00 pm, Niels Möller wrote:
> I'd suggest the below (complete file, I think that's more readable

The code is clean.

You removed all gotos...

> The last part of the function requires vlimb odd, but tolerates
> arbitrary u, including 0.

... the effect is that in many cases (if U doesn't need reduction modulo V)
the trailing zeros of U are removed twice.

> This would be a candidate gcd_11 or gcd_11_odd.

Maybe, if we keep both parts in the same function, some code replication
can be good?

Is something like the following too redundant?

mp_limb_t
mpn_gcd_1 (mp_srcptr up, mp_size_t size, mp_limb_t vlimb)
{
  mp_limb_t ulimb;
  unsigned long zero_bits, u_low_zero_bits;
  int c;

  ASSERT (size >= 1);
  ASSERT (vlimb != 0);
  ASSERT_MPN_NONZERO_P (up, size);

  ulimb = up[0];

  /* Need vlimb odd for modexact, want it odd to get common zeros. */
  count_trailing_zeros (zero_bits, vlimb);
  vlimb >>= zero_bits;

  if (size > 1)
    {
      /* Must get common zeros before the mod reduction.  If ulimb==0 then
         vlimb already gives the common zeros.  */
      if (ulimb != 0)
        {
          count_trailing_zeros (u_low_zero_bits, ulimb);
          zero_bits = MIN (zero_bits, u_low_zero_bits);
        }

      ulimb = MPN_MOD_OR_MODEXACT_1_ODD (up, size, vlimb);
      if (ulimb == 0)
        return vlimb << zero_bits;

      /* Note that if ulimb == GMP_LIMB_HIGHBIT, c+1 is an invalid shift
         count. */
      count_trailing_zeros (c, ulimb);
      ulimb = (ulimb >> 1) >> c;
    }
  else
    {
      /* size==1, so up[0] != 0 */
      count_trailing_zeros (u_low_zero_bits, ulimb);
      ulimb >>= u_low_zero_bits;
      zero_bits = MIN (zero_bits, u_low_zero_bits);

      /* make u bigger */
      if (vlimb > ulimb)
        MP_LIMB_T_SWAP (ulimb, vlimb);

      /* if u is much bigger than v, reduce using a division rather than
         chipping away at it bit-by-bit */
      if ((ulimb >> 16) > vlimb)
        {
          ulimb %= vlimb;
          if (ulimb == 0)
            return vlimb << zero_bits;

          count_trailing_zeros (c, ulimb);
          ulimb >>= (c + 1);
        }
      else
        ulimb >>= 1;
    }

  ASSERT (vlimb & 1);

  /* Represent the odd numbers ulimb and vlimb without the redundant
     least significant one bit.  This reduction in size by one bit
     ensures that the high bit of t, below, is set if and only if
     vlimb > ulimb.  And in addition, t != GMP_LIMB_HIGHBIT. */
  vlimb >>= 1;

  while (ulimb != vlimb)
    {
      ...



Ĝis,
m

-- 
http://bodrato.it/
