broot vs brootinv performancs

2012-10-30 Thread Niels Möller
I've tried some more benchmarking (sorry, patches to speed not yet
checked in). To my disapppointment, broot appear to be 20% slower than
brootinv (except for small sizes, where it wins because it does the
initial few iterations without bignum overhead).

I don't understand why. When describing the iterations below, let n be
the target size of the iteraton.

brootinv uses the iteration

  r' <-- ((k+1) r - r^{k+1} y) / k

which is

  mul_1,
  powlo,
  mullo_n,
  sub_n,
  bdiv_q_1,

all of size n. For broot, I've rewritten the iteration as

  r' <-- r - (a^{k-1} (r^2)^{(k+1)/2} - r) / k

which gave a modest improvement. We still have cancellation in the
subtraction in parenthesis (and I don't yet use wraparound, neither does
brootinv). The operations are then

  sqr   (input size n/2, output size n)
  powlo (size n, exponent size one bit smaller)
  mullo_n   (size n)
  bdiv_q_1, (size n/2)
  mpn_neg   (size n/2)

The mullo_n is the same. The sqr + powlo ought to be roughly the same as
the powlo i brootinv (consider small k so windowing is not useful). I
had expected a significant win because a mpn_neg of size n/2 ought to be
faster than the combination of mul_1, sub_n of size n and bdiv_q_1 of
size n/2.

There's also a powlo (a^{k-1}) and mullo outside of the loop. I expected
them to contribute little to the total running time. But mullo seems to
take 13% for large sizes and mpn_powlo is not currently supported by
speed, so it's not so easy to measure).

Maybe I ought to power a and r at the same time as

  a^{k-1} r^{k+1} = (a*r)^{k-1} * r^2

or so. Some wraparound or mulmid tricks may be applicable to

Or maybe I should rather try to optimize brootinv to take advantage of
cancellation and wraparound.

Another note: I also wish for a "bdiv_nq_1" which returns - u / d (mod
B^n) rather than u / d (mod B^n), then one could eliminate the mpn_neg.
(And this is somewhat related to my bdiv work a few months ago, which
also returns a quotient with different sign).

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: A perfect power, and then?

2012-10-30 Thread Niels Möller
Torbjorn Granlund  writes:

> ni...@lysator.liu.se (Niels Möller) writes:
>
>   By the way, I wonder if there are some circumstanses when
>   
> r <-- a^{1/k}  (mod 2^n)
>   
>   is more effiently computed as
>   
> k' <-- k^{-1} (mod 2^{n-1})  [binvert]
> r <-- a^{k'} (mod 2^n)   [powlo]
>
> This is going to help for large enough k and small enough n...  Will
> that occur?

Maybe for the largest k values used in perfect_power_p?

> I suppose that in practice for both bsqrt and broot, we'll get a
> progression of limb sizes that is not optimial (except for root when
> n=2^t and for sqrt when n=2^t+1).

Assuming that the bitsize of the initial approximation is a power of two
(or more generally, a divisor of GMP_NUMB_BITS), and we want at least
one full limb of precision, I think the limb sizes for broot are
optimal.

> But for root is might be the case that iterations are much more
> expensive since they are something like O(f(n)log(k)) for computing
> a^(1/k) mod n.  Even improving an initial 4-bit approximation to say 20
> bits will be significant work.

Remember we can limit k for the current precision (if the current
iteration computes m bits, we need only the least signifivant m-1 bits
of k). For large k, this wil help the first few iterations.

> We don't want to go to 64 bits internally just because we don't know
> the actually needed precision.

For this case, it makes sense to have a broot_limb, with a precision
argument in bits.

> It might make more sense to have a coarser size in bsqrt, since its
> iteration to b bits of precision will be faster.

For the interface, maybe. Internally, I think we ought to count bits.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: A perfect power, and then?

2012-10-30 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  By the way, I wonder if there are some circumstanses when
  
r <-- a^{1/k}  (mod 2^n)
  
  is more effiently computed as
  
k' <-- k^{-1} (mod 2^{n-1})  [binvert]
r <-- a^{k'} (mod 2^n)   [powlo]
  
  We get a typically cheaper newton iteration (for binvert), but a
  typically much larger exponent for powlo. I doubt it's useful for large
  n, since then k' is *much* larger than k. But maybe for some types of
  small operands.
  
This is going to help for large enough k and small enough n...  Will
that occur?

  > I am afraid I don't appreciate the difference.
  
  Say that, at some point, we have n limbs of precision. Then one
  iteration of kth root (k odd) gives us precisely 2n limbs. While one
  iteration of bsqrt gives us 2n limbs minus one bit (or is it minus 2
  bits? I don't remember now). To make use of the valid bits in that top
  limb (rather than just discarding them and saying that we now have 2n-1
  valid limbs, which is particularly bad for n == 1...), we need some
  book-keeping in units of bits.
  
I suppose that in practice for both bsqrt and broot, we'll get a
progression of limb sizes that is not optimial (except for root when
n=2^t and for sqrt when n=2^t+1).

But for root is might be the case that iterations are much more
expensive since they are something like O(f(n)log(k)) for computing
a^(1/k) mod n.  Even improving an initial 4-bit approximation to say 20
bits will be significant work.  We don't want to go to 64 bits
internally just because we don't know the actually needed precision.  It
might make more sense to have a coarser size in bsqrt, since its
iteration to b bits of precision will be faster.

Does this make sense?

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: mpz_combit

2012-10-30 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

  ni...@lysator.liu.se (Niels Möller) writes:
  
  > I've tried rewriting mpz_combit. Two reasons: 1. To reduce overhead int
  > he common case. 2. I just realized that the common case, for both
  > positive and negative numbers, is to complement the corresponding bit
  > of  the absolute value.
  
  Any comments on this code?
  
I thought we had discussed this.  The change is fine, optimising the
common case is a great thing in general.

Did you consider analogous changes for the sibling functions, clrbit and
setbit?

Do you want the suggested incompatible change listed on the incompatible
web page?

-- 
Torbjörn
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel


Re: mpz_combit

2012-10-30 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> I've tried rewriting mpz_combit. Two reasons: 1. To reduce overhead int
> he common case. 2. I just realized that the common case, for both
> positive and negative numbers, is to complement the corresponding bit
> of  the absolute value.

Any comments on this code?

Regards,
/Niels

> Code below (with some additional comments).
>
>  void
>  mpz_combit (mpz_ptr d, mp_bitcnt_t bit_index)
>  {
>mp_size_t dsize = SIZ(d);
>mp_ptr dp = PTR(d);
>  
>mp_size_t limb_index = bit_index / GMP_NUMB_BITS;
>mp_limb_t bit = (CNST_LIMB (1) << (bit_index % GMP_NUMB_BITS));
>  
>/* Check for the most common case: Positive input, no realloc or
>   normalization needed. */
>if (limb_index + 1 < dsize)
>  dp[limb_index] ^= bit;
>
> The above case is an optimization for the common case. If deleted, it
> should still be handled correctly by the common-case code later on.
>
>/* Check for the hairy case. d < 0, and we have all zero bits to the
>   right of the bit to toggle. */
>else if (limb_index < -dsize && mpn_zero_p (dp, limb_index)
> && (dp[limb_index] & (bit - 1)) == 0)
>  {  
>ASSERT (dsize < 0);
>dsize = -dsize;
>  
>if (dp[limb_index] & bit)
>  {
>/* We toggle the least significant one bit.
>   Corresponds to an add, with carry propagation, on the
>   absolute value. */
>dp = MPZ_REALLOC (d, 1 + dsize);
>dp[dsize] = 0;
>MPN_INCR_U (dp + limb_index, 1 + dsize - limb_index, bit);
>SIZ(d) -= dp[dsize];
>  }
>else
>  {
>/* We toggle a zero bit, subtract from the absolute value. */
>MPN_DECR_U (dp + limb_index, dsize - limb_index, bit);
>MPN_NORMALIZE (dp, dsize);
>ASSERT (dsize > 0);
>SIZ(d) = -dsize;
>  }
>  }
>else
>  {
>/* Simple case: Toggle the bit in the absolute value. */
>dsize = ABS(dsize);
>if (limb_index < dsize)
>  {
>dp[limb_index] ^= bit;
>  
>/* Can happen only when limb_index = dsize - 1. Avoid SIZ(d)
>   bookkeeping in the common case. */
>if (dp[dsize-1] == 0)
>  {
>dsize--;
>MPN_NORMALIZE (dp, dsize);
>SIZ (d) = SIZ (d) >= 0 ? dsize : -dsize;
>  }
>
> Here, MPN_NORMALIZE could be called unconditionally. I write it this way
> in order to avoid having to check the sign of SIZ (d) in the common
> case. Not sure if that optimization is worth the two extra lines of code.
>
>  }
>else
>  {
>dp = MPZ_REALLOC (d, limb_index + 1);
>MPN_ZERO(dp + dsize, limb_index - dsize);
>dp[limb_index++] = bit;
>SIZ(d) = SIZ(d) >= 0 ? limb_index : -limb_index;
>  }
>  }
>  }
>
> Another reflection: I think it would make sense to have mpz_combit
> return the value of the bit in question. It might also make sense to
> have setbit and clrbit return the previous value of the bit (and in that
> case, for consistency combit should probably return the *previous* value
> of the bit, not the new value). That would be an interface change, of
> course.
>
> Regards,
> /Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel