Skip to site navigation (Press enter)

Re: GMP used during 3 and a half years to solve MIT's LCS35

Marco Bodrato Sat, 14 Dec 2019 08:15:50 -0800

Ciao,

Il 2019-05-02 10:09 paul zimmermann ha scritto (on gmp-discuss):

That's why a few weeks ago on this same list I suggested:
"What about even sparser exponents? Should we use 2*popcount(n)instead ofsize_in_bits(n) to estimate the best window size? They should bealmost
equivalent on average."
[ https://gmplib.org/list-archives/gmp-discuss/2019-April/006310.html]

I support the (clever) idea of using 2*popcount(exponent).
Could you benchmark this against the current version, on
dense and sparse exponents?

I tried, with the attached code (replace tune/speed-ext.c, then cdtune;make speed-ext). It contains a copy of the currentmpn/generic/powm.c code (and mpz/powm.c) with an additional flag and abranch:


  if (flag != 0)
    windowsize = win_size (mpn_popcount (ep, en) * 2 - 1);
  else
    windowsize = win_size (ebi);

This way we can compare both ways to estimate the window size.

Unfortunately, there are large fluctuations while measuring thisfunction, for each size I measure three _bs and three _pc (bitsize vs.popcount).


I used the command-line

for i in 0 1 2 4 16 256; do tune/speed-ext -rs1-260 -f2 mpz_powm_bs.$impz_powm_pc.$i mpz_powm_bs.$i mpz_powm_pc.$i mpz_powm_bs.$impz_powm_pc.$i; done

Because of the "-r", the first column is the time used, the othercolumns contain the ratio.

With r-flag=0 (random exponent) the speed are comparable. Half the timesthe winner is _bs, half the times the winner is _pc:

overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPUfreq 3500.09 MHz

         _bs.0       _pc.0     _bs.0     _pc.0     _bs.0     _pc.0
1     0.000000849    0.9973    0.9954    0.9968   #0.9929    0.9951
2     0.000002583    1.1107    0.9936    0.9959    0.9868   #0.9862
4     0.000013473    0.9951    0.9952   #0.9947    0.9961    0.9947
8     0.000070850    0.9983    0.9987    0.9978    0.9981   #0.9977
16    0.000481797    0.9624    0.9554   #0.9161    0.9602    0.9537
32    0.003307878    0.9979    0.9971    0.9946    0.9998   #0.9780
64    0.021502000    1.0013    0.9979    0.9972   #0.9960    0.9973
128  #0.126749000    1.0036    1.0104    1.0131    1.0059    1.0032
256   0.738998000    1.0001    0.9961    0.9995   #0.9953    1.0008

With r-flag=1 (exponent is set to binary 111...111), popcount maysuggest a larger window-size, but again no clear winner:

overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPUfreq 3500.09 MHz

         _bs.1       _pc.1     _bs.1     _pc.1     _bs.1     _pc.1
1     0.000000878    0.9967    1.0014   #0.9921    1.0016    0.9978
2     0.000002970    1.0053    0.9846    0.9871   #0.9829    0.9949
4     0.000013400    0.9995    0.9977    0.9973   #0.9973    0.9990
8     0.000071776    0.9948    0.9989    0.9950    0.9998   #0.9947
16    0.000475079    0.9701    0.9993   #0.9605    0.9627    0.9691
32    0.003303542   #0.9968    1.0012    1.0083    1.0123    1.0168
64    0.021803000    1.0093    1.0117    0.9971    1.0008   #0.9928
128  #0.128225000    1.0047    1.0042    1.0129    1.0018    1.0083
256   0.746284000    1.0051   #0.9977    1.0006    0.9988    1.0058

With sparser exponents (with r-flag=2, half of the exponent is randomdata, half is 0; with r-flag=4, one fourth of the exponent is randomdata; and so on...), popcount suggests a smaller window-size, and seemseffective:

overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPUfreq 3500.09 MHz

         _bs.4       _pc.4     _bs.4     _pc.4     _bs.4     _pc.4
1     0.000000765    0.9845    0.9753    0.9818   #0.9736    0.9842
2     0.000002285    0.9839    1.0016    0.9814    1.0022   #0.9810
4     0.000011773    0.9941    1.0026   #0.9820    1.0001    0.9988
8     0.000061283   #0.9884    0.9975    0.9889    1.9493    1.9487
16    0.000394527    1.0343    1.0565   #0.9918    0.9996    1.0378
32    0.002986944    0.9758    0.9840   #0.9663    0.9794    0.9669
64   #0.019095000    1.0025    1.0098    1.0096    1.0194    1.0104
128   0.116289000    0.9958    0.9983    0.9941    0.9994   #0.9933
256   0.681213000    0.9912    0.9948   #0.9902    1.0015    0.9905

overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPUfreq 3500.09 MHz

         _bs.16      _pc.16    _bs.16    _pc.16    _bs.16    _pc.16
1     0.000000729    0.9875    0.9652   #0.9515    0.9536    0.9849
2     0.000002228   #0.9450    0.9953    0.9475    0.9948    0.9477
4     0.000011343   #0.9477    0.9952    0.9480    0.9943    0.9518
8     0.000058805    0.9784    0.9984    0.9760    0.9976   #0.9759
16    0.000381547   #0.9784    1.0186    0.9912    1.0007    0.9927
32    0.002858413   #0.9695    0.9948    0.9793    1.0029    0.9706
64    0.019123000    0.9794    0.9918   #0.9716    0.9963    0.9876
128   0.113480000    0.9892    0.9974   #0.9878    1.0043    0.9907
256   0.668013000    0.9845    0.9979    0.9848    0.9980   #0.9814

overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPUfreq 3500.09 MHz

         _bs.256     _pc.256   _bs.256   _pc.256   _bs.256   _pc.256
1     0.000000730    0.9886    0.9601    0.9521    0.9610   #0.9442
2     0.000002242    0.9355    0.9983   #0.9352    0.9946    0.9369
4     0.000011123    0.9416    1.0002    0.9401    0.9970   #0.9390
8     0.000057803    0.9670    1.0013   #0.9665    1.0005    0.9669
16    0.000378739   #0.9648    0.9993    0.9671    1.0213    0.9670
32    0.002830238   #0.9640    1.0034    0.9669    1.0006    0.9765
64    0.018543000    1.0168    1.0207    1.0020    1.0169   #0.9857
128   0.112874000    0.9867    1.0010   #0.9861    1.0119    0.9893
256   0.664346000   #0.9749    0.9956    0.9766    0.9942    0.9789

I mean, the gain is detectable only for very sparse exponents.

Ĝis,
m

/* An example of extending the speed program to measure routines not in GMP.

Copyright 1999, 2000, 2002, 2003, 2005, 2019 Free Software Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of:

  * the GNU General Public License as published by the Free
    Software Foundation; either version 3 of the License, or (at your
    option) any later version.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */

#include "gmp-impl.h"
#include "longlong.h"

#define SPEED_EXTRA_PROTOS                                              \
  double speed_mpz_powm_bitsize (struct speed_params *s);		\
  double speed_mpz_powm_popcount (struct speed_params *s);

#define SPEED_EXTRA_ROUTINES					     \
  { "mpz_powm_bitzise",  speed_mpz_powm_bitsize,   FLAG_R_OPTIONAL}, \
  { "mpz_powm_popcount", speed_mpz_powm_popcount,  FLAG_R_OPTIONAL},

#include "speed.c"


#undef MPN_REDC_0
#define MPN_REDC_0(rp, up, mp, invm)					\
  do {									\
    mp_limb_t p1, r0, u0, _dummy;					\
    u0 = *(up);								\
    umul_ppmm (p1, _dummy, *(mp), (u0 * (invm)) & GMP_NUMB_MASK);	\
    ASSERT (((u0 + _dummy) & GMP_NUMB_MASK) == 0);			\
    p1 += (u0 != 0);							\
    r0 = (up)[1] + p1;							\
    if (p1 > r0)							\
      r0 -= *(mp);							\
    *(rp) = r0;								\
  } while (0)

#undef MPN_REDC_1
#if HAVE_NATIVE_mpn_sbpi1_bdiv_r
#define MPN_REDC_1(rp, up, mp, n, invm)					\
  do {									\
    mp_limb_t cy;							\
    cy = mpn_sbpi1_bdiv_r (up, 2 * n, mp, n, invm);			\
    if (cy != 0)							\
      mpn_sub_n (rp, up + n, mp, n);					\
    else								\
      MPN_COPY (rp, up + n, n);						\
  } while (0)
#else
#define MPN_REDC_1(rp, up, mp, n, invm)					\
  do {									\
    mp_limb_t cy;							\
    cy = mpn_redc_1 (rp, up, mp, n, invm);				\
    if (cy != 0)							\
      mpn_sub_n (rp, rp, mp, n);					\
  } while (0)
#endif

#undef MPN_REDC_2
#define MPN_REDC_2(rp, up, mp, n, mip)					\
  do {									\
    mp_limb_t cy;							\
    cy = mpn_redc_2 (rp, up, mp, n, mip);				\
    if (cy != 0)							\
      mpn_sub_n (rp, rp, mp, n);					\
  } while (0)

#if HAVE_NATIVE_mpn_addmul_2 || HAVE_NATIVE_mpn_redc_2
#define WANT_REDC_2 1
#endif

#define getbit(p,bi) \
  ((p[(bi - 1) / GMP_LIMB_BITS] >> (bi - 1) % GMP_LIMB_BITS) & 1)

static inline mp_limb_t
getbits (const mp_limb_t *p, mp_bitcnt_t bi, int nbits)
{
  int nbits_in_r;
  mp_limb_t r;
  mp_size_t i;

  if (bi < nbits)
    {
      return p[0] & (((mp_limb_t) 1 << bi) - 1);
    }
  else
    {
      bi -= nbits;			/* bit index of low bit to extract */
      i = bi / GMP_NUMB_BITS;		/* word index of low bit to extract */
      bi %= GMP_NUMB_BITS;		/* bit index in low word */
      r = p[i] >> bi;			/* extract (low) bits */
      nbits_in_r = GMP_NUMB_BITS - bi;	/* number of bits now in r */
      if (nbits_in_r < nbits)		/* did we get enough bits? */
	r += p[i + 1] << nbits_in_r;	/* prepend bits from higher word */
      return r & (((mp_limb_t) 1 << nbits) - 1);
    }
}

static inline int
win_size (mp_bitcnt_t eb)
{
  int k;
  static mp_bitcnt_t x[] = {0,7,25,81,241,673,1793,4609,11521,28161,~(mp_bitcnt_t)0};
  for (k = 1; eb > x[k]; k++)
    ;
  return k;
}

/* Convert U to REDC form, U_r = B^n * U mod M */
static void
redcify (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_srcptr mp, mp_size_t n)
{
  mp_ptr tp, qp;
  TMP_DECL;
  TMP_MARK;

  TMP_ALLOC_LIMBS_2 (tp, un + n, qp, un + 1);

  MPN_ZERO (tp, n);
  MPN_COPY (tp + n, up, un);
  mpn_tdiv_qr (qp, rp, 0L, tp, un + n, mp, n);
  TMP_FREE;
}

/* rp[n-1..0] = bp[bn-1..0] ^ ep[en-1..0] mod mp[n-1..0]
   Requires that mp[n-1..0] is odd.
   Requires that ep[en-1..0] is > 1.
   Uses scratch space at tp of MAX(mpn_binvert_itch(n),2n) limbs.  */
static void
mpn_powm_flag (mp_ptr rp, mp_srcptr bp, mp_size_t bn,
	  mp_srcptr ep, mp_size_t en,
	       mp_srcptr mp, mp_size_t n, mp_ptr tp, int flag)
{
  mp_limb_t ip[2], *mip;
  int cnt;
  mp_bitcnt_t ebi;
  int windowsize, this_windowsize;
  mp_limb_t expbits;
  mp_ptr pp, this_pp;
  long i;
  TMP_DECL;

  ASSERT (en > 1 || (en == 1 && ep[0] > 1));
  ASSERT (n >= 1 && ((mp[0] & 1) != 0));

  TMP_MARK;

  MPN_SIZEINBASE_2EXP(ebi, ep, en, 1);

#if 0
  if (bn < n)
    {
      /* Do the first few exponent bits without mod reductions,
	 until the result is greater than the mod argument.  */
      for (;;)
	{
	  mpn_sqr (tp, this_pp, tn);
	  tn = tn * 2 - 1,  tn += tp[tn] != 0;
	  if (getbit (ep, ebi) != 0)
	    mpn_mul (..., tp, tn, bp, bn);
	  ebi--;
	}
    }
#endif

  if (flag != 0)
    windowsize = win_size (mpn_popcount (ep, en) * 2 - 1);
  else
    windowsize = win_size (ebi);
  /* printf("f=%i, ws=%i (%lu)\n", flag, windowsize, flag? mpn_popcount (ep, en) * 2 - 1 :ebi); */

#if WANT_REDC_2
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
    {
      mip = ip;
      binvert_limb (mip[0], mp[0]);
      mip[0] = -mip[0];
    }
  else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
    {
      mip = ip;
      mpn_binvert (mip, mp, 2, tp);
      mip[0] = -mip[0]; mip[1] = ~mip[1];
    }
#else
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
    {
      mip = ip;
      binvert_limb (mip[0], mp[0]);
      mip[0] = -mip[0];
    }
#endif
  else
    {
      mip = TMP_ALLOC_LIMBS (n);
      mpn_binvert (mip, mp, n, tp);
    }

  pp = TMP_ALLOC_LIMBS (n << (windowsize - 1));

  this_pp = pp;
  redcify (this_pp, bp, bn, mp, n);

  /* Store b^2 at rp.  */
  mpn_sqr (tp, this_pp, n);
#if 0
  if (n == 1) {
    MPN_REDC_0 (rp, tp, mp, mip[0]);
  } else
#endif
#if WANT_REDC_2
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
    MPN_REDC_1 (rp, tp, mp, n, mip[0]);
  else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
    MPN_REDC_2 (rp, tp, mp, n, mip);
#else
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
    MPN_REDC_1 (rp, tp, mp, n, mip[0]);
#endif
  else
    mpn_redc_n (rp, tp, mp, n, mip);

  /* Precompute odd powers of b and put them in the temporary area at pp.  */
  for (i = (1 << (windowsize - 1)) - 1; i > 0; i--)
#if 1
    if (n == 1) {
      umul_ppmm((tp)[1], *(tp), *(this_pp), *(rp));
      ++this_pp ;
      MPN_REDC_0 (this_pp, tp, mp, mip[0]);
    } else
#endif
    {
      mpn_mul_n (tp, this_pp, rp, n);
      this_pp += n;
#if WANT_REDC_2
      if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
	MPN_REDC_1 (this_pp, tp, mp, n, mip[0]);
      else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
	MPN_REDC_2 (this_pp, tp, mp, n, mip);
#else
      if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
	MPN_REDC_1 (this_pp, tp, mp, n, mip[0]);
#endif
      else
	mpn_redc_n (this_pp, tp, mp, n, mip);
    }

  expbits = getbits (ep, ebi, windowsize);
  if (ebi < windowsize)
    ebi = 0;
  else
    ebi -= windowsize;

  count_trailing_zeros (cnt, expbits);
  ebi += cnt;
  expbits >>= cnt;

  MPN_COPY (rp, pp + n * (expbits >> 1), n);

#define INNERLOOP							\
  while (ebi != 0)							\
    {									\
      while (getbit (ep, ebi) == 0)					\
	{								\
	  MPN_SQR (tp, rp, n);						\
	  MPN_REDUCE (rp, tp, mp, n, mip);				\
	  if (--ebi == 0)						\
	    goto done;							\
	}								\
									\
      /* The next bit of the exponent is 1.  Now extract the largest	\
	 block of bits <= windowsize, and such that the least		\
	 significant bit is 1.  */					\
									\
      expbits = getbits (ep, ebi, windowsize);				\
      this_windowsize = windowsize;					\
      if (ebi < windowsize)						\
	{								\
	  this_windowsize -= windowsize - ebi;				\
	  ebi = 0;							\
	}								\
      else								\
        ebi -= windowsize;						\
									\
      count_trailing_zeros (cnt, expbits);				\
      this_windowsize -= cnt;						\
      ebi += cnt;							\
      expbits >>= cnt;							\
									\
      do								\
	{								\
	  MPN_SQR (tp, rp, n);						\
	  MPN_REDUCE (rp, tp, mp, n, mip);				\
	}								\
      while (--this_windowsize != 0);					\
									\
      MPN_MUL_N (tp, rp, pp + n * (expbits >> 1), n);			\
      MPN_REDUCE (rp, tp, mp, n, mip);					\
    }


  if (n == 1)
    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		umul_ppmm((r)[1], *(r), *(a), *(b))
#define MPN_SQR(r,a,n)			umul_ppmm((r)[1], *(r), *(a), *(a))
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_0(rp, tp, mp, mip[0])
      INNERLOOP;
    }
  else
#if WANT_REDC_2
  if (REDC_1_TO_REDC_2_THRESHOLD < MUL_TOOM22_THRESHOLD)
    {
      if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
	{
	  if (REDC_1_TO_REDC_2_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	}
      else if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
	{
	  if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_2 (rp, tp, mp, n, mip)
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_2 (rp, tp, mp, n, mip)
	      INNERLOOP;
	    }
	}
      else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_2 (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
      else
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
    }
  else
    {
      if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
	{
	  if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	}
      else if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	  INNERLOOP;
	}
      else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_2 (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
      else
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
    }

#else  /* WANT_REDC_2 */

  if (REDC_1_TO_REDC_N_THRESHOLD < MUL_TOOM22_THRESHOLD)
    {
      if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
	{
	  if (REDC_1_TO_REDC_N_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	}
      else if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
	{
	  if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	      INNERLOOP;
	    }
	}
      else
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
    }
  else
    {
      if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
	{
	  if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
	      || BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	  else
	    {
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	      INNERLOOP;
	    }
	}
      else if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	MPN_REDC_1 (rp, tp, mp, n, mip[0])
	  INNERLOOP;
	}
      else
	{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n)		mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n)			mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip)	mpn_redc_n (rp, tp, mp, n, mip)
	  INNERLOOP;
	}
    }
#endif  /* WANT_REDC_2 */

 done:

  MPN_COPY (tp, rp, n);
  MPN_ZERO (tp + n, n);

#if WANT_REDC_2
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
    MPN_REDC_1 (rp, tp, mp, n, mip[0]);
  else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
    MPN_REDC_2 (rp, tp, mp, n, mip);
#else
  if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
    MPN_REDC_1 (rp, tp, mp, n, mip[0]);
#endif
  else
    mpn_redc_n (rp, tp, mp, n, mip);

  if (mpn_cmp (rp, mp, n) >= 0)
    mpn_sub_n (rp, rp, mp, n);

  TMP_FREE;
}

void
mpz_powm_flag (mpz_ptr r, mpz_srcptr b, mpz_srcptr e, mpz_srcptr m, int flag)
{
  mp_size_t n, nodd, ncnt;
  int cnt;
  mp_ptr rp, tp;
  mp_srcptr bp, ep, mp;
  mp_size_t rn, bn, es, en, itch;
  mpz_t new_b;			/* note: value lives long via 'b' */
  TMP_DECL;

  n = ABSIZ(m);
  if (UNLIKELY (n == 0))
    DIVIDE_BY_ZERO;

  mp = PTR(m);

  TMP_MARK;

  es = SIZ(e);
  if (UNLIKELY (es <= 0))
    {
      if (es == 0)
	{
	  /* b^0 mod m,  b is anything and m is non-zero.
	     Result is 1 mod m, i.e., 1 or 0 depending on if m = 1.  */
	  SIZ(r) = n != 1 || mp[0] != 1;
	  MPZ_NEWALLOC (r, 1)[0] = 1;
	  TMP_FREE;	/* we haven't really allocated anything here */
	  return;
	}
#if HANDLE_NEGATIVE_EXPONENT
      MPZ_TMP_INIT (new_b, n + 1);

      if (UNLIKELY (! mpz_invert (new_b, b, m)))
	DIVIDE_BY_ZERO;
      b = new_b;
      es = -es;
#else
      DIVIDE_BY_ZERO;
#endif
    }
  en = es;

  bn = ABSIZ(b);

  if (UNLIKELY (bn == 0))
    {
      SIZ(r) = 0;
      TMP_FREE;
      return;
    }

  ep = PTR(e);

  /* Handle (b^1 mod m) early, since mpn_pow* do not handle that case.  */
  if (UNLIKELY (en == 1 && ep[0] == 1))
    {
      rp = TMP_ALLOC_LIMBS (n);
      bp = PTR(b);
      if (bn >= n)
	{
	  mp_ptr qp = TMP_ALLOC_LIMBS (bn - n + 1);
	  mpn_tdiv_qr (qp, rp, 0L, bp, bn, mp, n);
	  rn = n;
	  MPN_NORMALIZE (rp, rn);

	  if (rn != 0 && SIZ(b) < 0)
	    {
	      mpn_sub (rp, mp, n, rp, rn);
	      rn = n;
	      MPN_NORMALIZE_NOT_ZERO (rp, rn);
	    }
	}
      else
	{
	  if (SIZ(b) < 0)
	    {
	      mpn_sub (rp, mp, n, bp, bn);
	      rn = n;
	      MPN_NORMALIZE_NOT_ZERO (rp, rn);
	    }
	  else
	    {
	      MPN_COPY (rp, bp, bn);
	      rn = bn;
	    }
	}
      goto ret;
    }

  /* Remove low zero limbs from M.  This loop will terminate for correctly
     represented mpz numbers.  */
  ncnt = 0;
  while (UNLIKELY (mp[0] == 0))
    {
      mp++;
      ncnt++;
    }
  nodd = n - ncnt;
  cnt = 0;
  if (mp[0] % 2 == 0)
    {
      mp_ptr newmp = TMP_ALLOC_LIMBS (nodd);
      count_trailing_zeros (cnt, mp[0]);
      mpn_rshift (newmp, mp, nodd, cnt);
      nodd -= newmp[nodd - 1] == 0;
      mp = newmp;
      ncnt++;
    }

  if (ncnt != 0)
    {
      /* We will call both mpn_powm and mpn_powlo.  */
      /* rp needs n, mpn_powlo needs 4n, the 2 mpn_binvert might need more */
      mp_size_t n_largest_binvert = MAX (ncnt, nodd);
      mp_size_t itch_binvert = mpn_binvert_itch (n_largest_binvert);
      itch = 3 * n + MAX (itch_binvert, 2 * n);
    }
  else
    {
      /* We will call just mpn_powm.  */
      mp_size_t itch_binvert = mpn_binvert_itch (nodd);
      itch = n + MAX (itch_binvert, 2 * n);
    }
  tp = TMP_ALLOC_LIMBS (itch);

  rp = tp;  tp += n;

  bp = PTR(b);
  mpn_powm_flag (rp, bp, bn, ep, en, mp, nodd, tp, flag);

  rn = n;

  if (ncnt != 0)
    {
      mp_ptr r2, xp, yp, odd_inv_2exp;
      unsigned long t;
      int bcnt;

      if (bn < ncnt)
	{
	  mp_ptr newbp = TMP_ALLOC_LIMBS (ncnt);
	  MPN_COPY (newbp, bp, bn);
	  MPN_ZERO (newbp + bn, ncnt - bn);
	  bp = newbp;
	}

      r2 = tp;

      if (bp[0] % 2 == 0)
	{
	  if (en > 1)
	    {
	      MPN_ZERO (r2, ncnt);
	      goto zero;
	    }

	  ASSERT (en == 1);
	  t = (ncnt - (cnt != 0)) * GMP_NUMB_BITS + cnt;

	  /* Count number of low zero bits in B, up to 3.  */
	  bcnt = (0x1213 >> ((bp[0] & 7) << 1)) & 0x3;
	  /* Note that ep[0] * bcnt might overflow, but that just results
	     in a missed optimization.  */
	  if (ep[0] * bcnt >= t)
	    {
	      MPN_ZERO (r2, ncnt);
	      goto zero;
	    }
	}

      mpn_powlo (r2, bp, ep, en, ncnt, tp + ncnt);

    zero:
      if (nodd < ncnt)
	{
	  mp_ptr newmp = TMP_ALLOC_LIMBS (ncnt);
	  MPN_COPY (newmp, mp, nodd);
	  MPN_ZERO (newmp + nodd, ncnt - nodd);
	  mp = newmp;
	}

      odd_inv_2exp = tp + n;
      mpn_binvert (odd_inv_2exp, mp, ncnt, tp + 2 * n);

      mpn_sub (r2, r2, ncnt, rp, nodd > ncnt ? ncnt : nodd);

      xp = tp + 2 * n;
      mpn_mullo_n (xp, odd_inv_2exp, r2, ncnt);

      if (cnt != 0)
	xp[ncnt - 1] &= (CNST_LIMB(1) << cnt) - 1;

      yp = tp;
      if (ncnt > nodd)
	mpn_mul (yp, xp, ncnt, mp, nodd);
      else
	mpn_mul (yp, mp, nodd, xp, ncnt);

      mpn_add (rp, yp, n, rp, nodd);

      ASSERT (nodd + ncnt >= n);
      ASSERT (nodd + ncnt <= n + 1);
    }

  MPN_NORMALIZE (rp, rn);

  if ((ep[0] & 1) && SIZ(b) < 0 && rn != 0)
    {
      mpn_sub (rp, PTR(m), n, rp, rn);
      rn = n;
      MPN_NORMALIZE (rp, rn);
    }

 ret:
  MPZ_NEWALLOC (r, rn);
  SIZ(r) = rn;
  MPN_COPY (PTR(r), rp, rn);

  TMP_FREE;
}

#define SPEED_ROUTINE_MPZ_POWM_EXPO(FLAG)				\
  {									\
    mpz_t     r, b, e, m;						\
    unsigned  i;							\
    double    t;							\
									\
    SPEED_RESTRICT_COND (s->size >= 1);					\
									\
    mpz_init (r);							\
    mpz_init_set_n (b, s->xp, s->size);					\
    mpz_init_set_n (e, s->xp_block, s->size);				\
    if (s->r > 1) {							\
      mpz_tdiv_r_2exp (e, e, s->size * GMP_NUMB_BITS / s->r);		\
    } else if (s->r == 1) {						\
      mpz_set_ui (e, 0);						\
      mpz_setbit (e, s->size * GMP_NUMB_BITS);				\
      mpz_sub_ui (e, e, 1);						\
    }									\
    mpz_setbit (e, s->size * GMP_NUMB_BITS - 1);			\
    mpz_init_set_n (m, s->yp, s->size);					\
    mpz_setbit (m, 0);	/* force m to odd */				\
									\
    speed_starttime ();							\
    i = s->reps;							\
    do									\
      mpz_powm_flag (r, b, e, m, FLAG);					\
    while (--i != 0);							\
    t = speed_endtime ();						\
									\
    mpz_clear (r);							\
    mpz_clear (b);							\
    mpz_clear (e);							\
    mpz_clear (m);							\
    return t;								\
  }

double
speed_mpz_powm_bitsize (struct speed_params *s)
{
  SPEED_ROUTINE_MPZ_POWM_EXPO (0);
}

double
speed_mpz_powm_popcount (struct speed_params *s)
{
  SPEED_ROUTINE_MPZ_POWM_EXPO (1);
}

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel