Ciao,
Il 2019-05-02 10:09 paul zimmermann ha scritto (on gmp-discuss):
That's why a few weeks ago on this same list I suggested:
"What about even sparser exponents? Should we use 2*popcount(n)
instead of
size_in_bits(n) to estimate the best window size? They should be
almost
equivalent on average."
[ https://gmplib.org/list-archives/gmp-discuss/2019-April/006310.html
]
I support the (clever) idea of using 2*popcount(exponent).
Could you benchmark this against the current version, on
dense and sparse exponents?
I tried, with the attached code (replace tune/speed-ext.c, then cd
tune;make speed-ext). It contains a copy of the current
mpn/generic/powm.c code (and mpz/powm.c) with an additional flag and a
branch:
if (flag != 0)
windowsize = win_size (mpn_popcount (ep, en) * 2 - 1);
else
windowsize = win_size (ebi);
This way we can compare both ways to estimate the window size.
Unfortunately, there are large fluctuations while measuring this
function, for each size I measure three _bs and three _pc (bitsize vs.
popcount).
I used the command-line
for i in 0 1 2 4 16 256; do tune/speed-ext -rs1-260 -f2 mpz_powm_bs.$i
mpz_powm_pc.$i mpz_powm_bs.$i mpz_powm_pc.$i mpz_powm_bs.$i
mpz_powm_pc.$i; done
Because of the "-r", the first column is the time used, the other
columns contain the ratio.
With r-flag=0 (random exponent) the speed are comparable. Half the times
the winner is _bs, half the times the winner is _pc:
overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
freq 3500.09 MHz
_bs.0 _pc.0 _bs.0 _pc.0 _bs.0 _pc.0
1 0.000000849 0.9973 0.9954 0.9968 #0.9929 0.9951
2 0.000002583 1.1107 0.9936 0.9959 0.9868 #0.9862
4 0.000013473 0.9951 0.9952 #0.9947 0.9961 0.9947
8 0.000070850 0.9983 0.9987 0.9978 0.9981 #0.9977
16 0.000481797 0.9624 0.9554 #0.9161 0.9602 0.9537
32 0.003307878 0.9979 0.9971 0.9946 0.9998 #0.9780
64 0.021502000 1.0013 0.9979 0.9972 #0.9960 0.9973
128 #0.126749000 1.0036 1.0104 1.0131 1.0059 1.0032
256 0.738998000 1.0001 0.9961 0.9995 #0.9953 1.0008
With r-flag=1 (exponent is set to binary 111...111), popcount may
suggest a larger window-size, but again no clear winner:
overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
freq 3500.09 MHz
_bs.1 _pc.1 _bs.1 _pc.1 _bs.1 _pc.1
1 0.000000878 0.9967 1.0014 #0.9921 1.0016 0.9978
2 0.000002970 1.0053 0.9846 0.9871 #0.9829 0.9949
4 0.000013400 0.9995 0.9977 0.9973 #0.9973 0.9990
8 0.000071776 0.9948 0.9989 0.9950 0.9998 #0.9947
16 0.000475079 0.9701 0.9993 #0.9605 0.9627 0.9691
32 0.003303542 #0.9968 1.0012 1.0083 1.0123 1.0168
64 0.021803000 1.0093 1.0117 0.9971 1.0008 #0.9928
128 #0.128225000 1.0047 1.0042 1.0129 1.0018 1.0083
256 0.746284000 1.0051 #0.9977 1.0006 0.9988 1.0058
With sparser exponents (with r-flag=2, half of the exponent is random
data, half is 0; with r-flag=4, one fourth of the exponent is random
data; and so on...), popcount suggests a smaller window-size, and seems
effective:
overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
freq 3500.09 MHz
_bs.4 _pc.4 _bs.4 _pc.4 _bs.4 _pc.4
1 0.000000765 0.9845 0.9753 0.9818 #0.9736 0.9842
2 0.000002285 0.9839 1.0016 0.9814 1.0022 #0.9810
4 0.000011773 0.9941 1.0026 #0.9820 1.0001 0.9988
8 0.000061283 #0.9884 0.9975 0.9889 1.9493 1.9487
16 0.000394527 1.0343 1.0565 #0.9918 0.9996 1.0378
32 0.002986944 0.9758 0.9840 #0.9663 0.9794 0.9669
64 #0.019095000 1.0025 1.0098 1.0096 1.0194 1.0104
128 0.116289000 0.9958 0.9983 0.9941 0.9994 #0.9933
256 0.681213000 0.9912 0.9948 #0.9902 1.0015 0.9905
overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
freq 3500.09 MHz
_bs.16 _pc.16 _bs.16 _pc.16 _bs.16 _pc.16
1 0.000000729 0.9875 0.9652 #0.9515 0.9536 0.9849
2 0.000002228 #0.9450 0.9953 0.9475 0.9948 0.9477
4 0.000011343 #0.9477 0.9952 0.9480 0.9943 0.9518
8 0.000058805 0.9784 0.9984 0.9760 0.9976 #0.9759
16 0.000381547 #0.9784 1.0186 0.9912 1.0007 0.9927
32 0.002858413 #0.9695 0.9948 0.9793 1.0029 0.9706
64 0.019123000 0.9794 0.9918 #0.9716 0.9963 0.9876
128 0.113480000 0.9892 0.9974 #0.9878 1.0043 0.9907
256 0.668013000 0.9845 0.9979 0.9848 0.9980 #0.9814
overhead 0.000000002 secs, precision 10000 units of 2.86e-10 secs, CPU
freq 3500.09 MHz
_bs.256 _pc.256 _bs.256 _pc.256 _bs.256 _pc.256
1 0.000000730 0.9886 0.9601 0.9521 0.9610 #0.9442
2 0.000002242 0.9355 0.9983 #0.9352 0.9946 0.9369
4 0.000011123 0.9416 1.0002 0.9401 0.9970 #0.9390
8 0.000057803 0.9670 1.0013 #0.9665 1.0005 0.9669
16 0.000378739 #0.9648 0.9993 0.9671 1.0213 0.9670
32 0.002830238 #0.9640 1.0034 0.9669 1.0006 0.9765
64 0.018543000 1.0168 1.0207 1.0020 1.0169 #0.9857
128 0.112874000 0.9867 1.0010 #0.9861 1.0119 0.9893
256 0.664346000 #0.9749 0.9956 0.9766 0.9942 0.9789
I mean, the gain is detectable only for very sparse exponents.
Ĝis,
m
/* An example of extending the speed program to measure routines not in GMP.
Copyright 1999, 2000, 2002, 2003, 2005, 2019 Free Software Foundation, Inc.
This file is part of the GNU MP Library.
The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of:
* the GNU General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.
The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
for more details.
You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library. If not,
see https://www.gnu.org/licenses/. */
#include "gmp-impl.h"
#include "longlong.h"
#define SPEED_EXTRA_PROTOS \
double speed_mpz_powm_bitsize (struct speed_params *s); \
double speed_mpz_powm_popcount (struct speed_params *s);
#define SPEED_EXTRA_ROUTINES \
{ "mpz_powm_bitzise", speed_mpz_powm_bitsize, FLAG_R_OPTIONAL}, \
{ "mpz_powm_popcount", speed_mpz_powm_popcount, FLAG_R_OPTIONAL},
#include "speed.c"
#undef MPN_REDC_0
#define MPN_REDC_0(rp, up, mp, invm) \
do { \
mp_limb_t p1, r0, u0, _dummy; \
u0 = *(up); \
umul_ppmm (p1, _dummy, *(mp), (u0 * (invm)) & GMP_NUMB_MASK); \
ASSERT (((u0 + _dummy) & GMP_NUMB_MASK) == 0); \
p1 += (u0 != 0); \
r0 = (up)[1] + p1; \
if (p1 > r0) \
r0 -= *(mp); \
*(rp) = r0; \
} while (0)
#undef MPN_REDC_1
#if HAVE_NATIVE_mpn_sbpi1_bdiv_r
#define MPN_REDC_1(rp, up, mp, n, invm) \
do { \
mp_limb_t cy; \
cy = mpn_sbpi1_bdiv_r (up, 2 * n, mp, n, invm); \
if (cy != 0) \
mpn_sub_n (rp, up + n, mp, n); \
else \
MPN_COPY (rp, up + n, n); \
} while (0)
#else
#define MPN_REDC_1(rp, up, mp, n, invm) \
do { \
mp_limb_t cy; \
cy = mpn_redc_1 (rp, up, mp, n, invm); \
if (cy != 0) \
mpn_sub_n (rp, rp, mp, n); \
} while (0)
#endif
#undef MPN_REDC_2
#define MPN_REDC_2(rp, up, mp, n, mip) \
do { \
mp_limb_t cy; \
cy = mpn_redc_2 (rp, up, mp, n, mip); \
if (cy != 0) \
mpn_sub_n (rp, rp, mp, n); \
} while (0)
#if HAVE_NATIVE_mpn_addmul_2 || HAVE_NATIVE_mpn_redc_2
#define WANT_REDC_2 1
#endif
#define getbit(p,bi) \
((p[(bi - 1) / GMP_LIMB_BITS] >> (bi - 1) % GMP_LIMB_BITS) & 1)
static inline mp_limb_t
getbits (const mp_limb_t *p, mp_bitcnt_t bi, int nbits)
{
int nbits_in_r;
mp_limb_t r;
mp_size_t i;
if (bi < nbits)
{
return p[0] & (((mp_limb_t) 1 << bi) - 1);
}
else
{
bi -= nbits; /* bit index of low bit to extract */
i = bi / GMP_NUMB_BITS; /* word index of low bit to extract */
bi %= GMP_NUMB_BITS; /* bit index in low word */
r = p[i] >> bi; /* extract (low) bits */
nbits_in_r = GMP_NUMB_BITS - bi; /* number of bits now in r */
if (nbits_in_r < nbits) /* did we get enough bits? */
r += p[i + 1] << nbits_in_r; /* prepend bits from higher word */
return r & (((mp_limb_t) 1 << nbits) - 1);
}
}
static inline int
win_size (mp_bitcnt_t eb)
{
int k;
static mp_bitcnt_t x[] = {0,7,25,81,241,673,1793,4609,11521,28161,~(mp_bitcnt_t)0};
for (k = 1; eb > x[k]; k++)
;
return k;
}
/* Convert U to REDC form, U_r = B^n * U mod M */
static void
redcify (mp_ptr rp, mp_srcptr up, mp_size_t un, mp_srcptr mp, mp_size_t n)
{
mp_ptr tp, qp;
TMP_DECL;
TMP_MARK;
TMP_ALLOC_LIMBS_2 (tp, un + n, qp, un + 1);
MPN_ZERO (tp, n);
MPN_COPY (tp + n, up, un);
mpn_tdiv_qr (qp, rp, 0L, tp, un + n, mp, n);
TMP_FREE;
}
/* rp[n-1..0] = bp[bn-1..0] ^ ep[en-1..0] mod mp[n-1..0]
Requires that mp[n-1..0] is odd.
Requires that ep[en-1..0] is > 1.
Uses scratch space at tp of MAX(mpn_binvert_itch(n),2n) limbs. */
static void
mpn_powm_flag (mp_ptr rp, mp_srcptr bp, mp_size_t bn,
mp_srcptr ep, mp_size_t en,
mp_srcptr mp, mp_size_t n, mp_ptr tp, int flag)
{
mp_limb_t ip[2], *mip;
int cnt;
mp_bitcnt_t ebi;
int windowsize, this_windowsize;
mp_limb_t expbits;
mp_ptr pp, this_pp;
long i;
TMP_DECL;
ASSERT (en > 1 || (en == 1 && ep[0] > 1));
ASSERT (n >= 1 && ((mp[0] & 1) != 0));
TMP_MARK;
MPN_SIZEINBASE_2EXP(ebi, ep, en, 1);
#if 0
if (bn < n)
{
/* Do the first few exponent bits without mod reductions,
until the result is greater than the mod argument. */
for (;;)
{
mpn_sqr (tp, this_pp, tn);
tn = tn * 2 - 1, tn += tp[tn] != 0;
if (getbit (ep, ebi) != 0)
mpn_mul (..., tp, tn, bp, bn);
ebi--;
}
}
#endif
if (flag != 0)
windowsize = win_size (mpn_popcount (ep, en) * 2 - 1);
else
windowsize = win_size (ebi);
/* printf("f=%i, ws=%i (%lu)\n", flag, windowsize, flag? mpn_popcount (ep, en) * 2 - 1 :ebi); */
#if WANT_REDC_2
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
{
mip = ip;
binvert_limb (mip[0], mp[0]);
mip[0] = -mip[0];
}
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
{
mip = ip;
mpn_binvert (mip, mp, 2, tp);
mip[0] = -mip[0]; mip[1] = ~mip[1];
}
#else
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
{
mip = ip;
binvert_limb (mip[0], mp[0]);
mip[0] = -mip[0];
}
#endif
else
{
mip = TMP_ALLOC_LIMBS (n);
mpn_binvert (mip, mp, n, tp);
}
pp = TMP_ALLOC_LIMBS (n << (windowsize - 1));
this_pp = pp;
redcify (this_pp, bp, bn, mp, n);
/* Store b^2 at rp. */
mpn_sqr (tp, this_pp, n);
#if 0
if (n == 1) {
MPN_REDC_0 (rp, tp, mp, mip[0]);
} else
#endif
#if WANT_REDC_2
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
MPN_REDC_1 (rp, tp, mp, n, mip[0]);
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
MPN_REDC_2 (rp, tp, mp, n, mip);
#else
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
MPN_REDC_1 (rp, tp, mp, n, mip[0]);
#endif
else
mpn_redc_n (rp, tp, mp, n, mip);
/* Precompute odd powers of b and put them in the temporary area at pp. */
for (i = (1 << (windowsize - 1)) - 1; i > 0; i--)
#if 1
if (n == 1) {
umul_ppmm((tp)[1], *(tp), *(this_pp), *(rp));
++this_pp ;
MPN_REDC_0 (this_pp, tp, mp, mip[0]);
} else
#endif
{
mpn_mul_n (tp, this_pp, rp, n);
this_pp += n;
#if WANT_REDC_2
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
MPN_REDC_1 (this_pp, tp, mp, n, mip[0]);
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
MPN_REDC_2 (this_pp, tp, mp, n, mip);
#else
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
MPN_REDC_1 (this_pp, tp, mp, n, mip[0]);
#endif
else
mpn_redc_n (this_pp, tp, mp, n, mip);
}
expbits = getbits (ep, ebi, windowsize);
if (ebi < windowsize)
ebi = 0;
else
ebi -= windowsize;
count_trailing_zeros (cnt, expbits);
ebi += cnt;
expbits >>= cnt;
MPN_COPY (rp, pp + n * (expbits >> 1), n);
#define INNERLOOP \
while (ebi != 0) \
{ \
while (getbit (ep, ebi) == 0) \
{ \
MPN_SQR (tp, rp, n); \
MPN_REDUCE (rp, tp, mp, n, mip); \
if (--ebi == 0) \
goto done; \
} \
\
/* The next bit of the exponent is 1. Now extract the largest \
block of bits <= windowsize, and such that the least \
significant bit is 1. */ \
\
expbits = getbits (ep, ebi, windowsize); \
this_windowsize = windowsize; \
if (ebi < windowsize) \
{ \
this_windowsize -= windowsize - ebi; \
ebi = 0; \
} \
else \
ebi -= windowsize; \
\
count_trailing_zeros (cnt, expbits); \
this_windowsize -= cnt; \
ebi += cnt; \
expbits >>= cnt; \
\
do \
{ \
MPN_SQR (tp, rp, n); \
MPN_REDUCE (rp, tp, mp, n, mip); \
} \
while (--this_windowsize != 0); \
\
MPN_MUL_N (tp, rp, pp + n * (expbits >> 1), n); \
MPN_REDUCE (rp, tp, mp, n, mip); \
}
if (n == 1)
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) umul_ppmm((r)[1], *(r), *(a), *(b))
#define MPN_SQR(r,a,n) umul_ppmm((r)[1], *(r), *(a), *(a))
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_0(rp, tp, mp, mip[0])
INNERLOOP;
}
else
#if WANT_REDC_2
if (REDC_1_TO_REDC_2_THRESHOLD < MUL_TOOM22_THRESHOLD)
{
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
{
if (REDC_1_TO_REDC_2_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
}
else if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
{
if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_2 (rp, tp, mp, n, mip)
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_2 (rp, tp, mp, n, mip)
INNERLOOP;
}
}
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_2 (rp, tp, mp, n, mip)
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
}
else
{
if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
{
if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
}
else if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_2 (rp, tp, mp, n, mip)
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
}
#else /* WANT_REDC_2 */
if (REDC_1_TO_REDC_N_THRESHOLD < MUL_TOOM22_THRESHOLD)
{
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
{
if (REDC_1_TO_REDC_N_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
}
else if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
{
if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
}
else
{
if (BELOW_THRESHOLD (n, MUL_TOOM22_THRESHOLD))
{
if (MUL_TOOM22_THRESHOLD < SQR_BASECASE_THRESHOLD
|| BELOW_THRESHOLD (n, SQR_BASECASE_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_mul_basecase (r,a,n,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_basecase (r,a,n,b,n)
#define MPN_SQR(r,a,n) mpn_sqr_basecase (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
}
else if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) MPN_REDC_1 (rp, tp, mp, n, mip[0])
INNERLOOP;
}
else
{
#undef MPN_MUL_N
#undef MPN_SQR
#undef MPN_REDUCE
#define MPN_MUL_N(r,a,b,n) mpn_mul_n (r,a,b,n)
#define MPN_SQR(r,a,n) mpn_sqr (r,a,n)
#define MPN_REDUCE(rp,tp,mp,n,mip) mpn_redc_n (rp, tp, mp, n, mip)
INNERLOOP;
}
}
#endif /* WANT_REDC_2 */
done:
MPN_COPY (tp, rp, n);
MPN_ZERO (tp + n, n);
#if WANT_REDC_2
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_2_THRESHOLD))
MPN_REDC_1 (rp, tp, mp, n, mip[0]);
else if (BELOW_THRESHOLD (n, REDC_2_TO_REDC_N_THRESHOLD))
MPN_REDC_2 (rp, tp, mp, n, mip);
#else
if (BELOW_THRESHOLD (n, REDC_1_TO_REDC_N_THRESHOLD))
MPN_REDC_1 (rp, tp, mp, n, mip[0]);
#endif
else
mpn_redc_n (rp, tp, mp, n, mip);
if (mpn_cmp (rp, mp, n) >= 0)
mpn_sub_n (rp, rp, mp, n);
TMP_FREE;
}
void
mpz_powm_flag (mpz_ptr r, mpz_srcptr b, mpz_srcptr e, mpz_srcptr m, int flag)
{
mp_size_t n, nodd, ncnt;
int cnt;
mp_ptr rp, tp;
mp_srcptr bp, ep, mp;
mp_size_t rn, bn, es, en, itch;
mpz_t new_b; /* note: value lives long via 'b' */
TMP_DECL;
n = ABSIZ(m);
if (UNLIKELY (n == 0))
DIVIDE_BY_ZERO;
mp = PTR(m);
TMP_MARK;
es = SIZ(e);
if (UNLIKELY (es <= 0))
{
if (es == 0)
{
/* b^0 mod m, b is anything and m is non-zero.
Result is 1 mod m, i.e., 1 or 0 depending on if m = 1. */
SIZ(r) = n != 1 || mp[0] != 1;
MPZ_NEWALLOC (r, 1)[0] = 1;
TMP_FREE; /* we haven't really allocated anything here */
return;
}
#if HANDLE_NEGATIVE_EXPONENT
MPZ_TMP_INIT (new_b, n + 1);
if (UNLIKELY (! mpz_invert (new_b, b, m)))
DIVIDE_BY_ZERO;
b = new_b;
es = -es;
#else
DIVIDE_BY_ZERO;
#endif
}
en = es;
bn = ABSIZ(b);
if (UNLIKELY (bn == 0))
{
SIZ(r) = 0;
TMP_FREE;
return;
}
ep = PTR(e);
/* Handle (b^1 mod m) early, since mpn_pow* do not handle that case. */
if (UNLIKELY (en == 1 && ep[0] == 1))
{
rp = TMP_ALLOC_LIMBS (n);
bp = PTR(b);
if (bn >= n)
{
mp_ptr qp = TMP_ALLOC_LIMBS (bn - n + 1);
mpn_tdiv_qr (qp, rp, 0L, bp, bn, mp, n);
rn = n;
MPN_NORMALIZE (rp, rn);
if (rn != 0 && SIZ(b) < 0)
{
mpn_sub (rp, mp, n, rp, rn);
rn = n;
MPN_NORMALIZE_NOT_ZERO (rp, rn);
}
}
else
{
if (SIZ(b) < 0)
{
mpn_sub (rp, mp, n, bp, bn);
rn = n;
MPN_NORMALIZE_NOT_ZERO (rp, rn);
}
else
{
MPN_COPY (rp, bp, bn);
rn = bn;
}
}
goto ret;
}
/* Remove low zero limbs from M. This loop will terminate for correctly
represented mpz numbers. */
ncnt = 0;
while (UNLIKELY (mp[0] == 0))
{
mp++;
ncnt++;
}
nodd = n - ncnt;
cnt = 0;
if (mp[0] % 2 == 0)
{
mp_ptr newmp = TMP_ALLOC_LIMBS (nodd);
count_trailing_zeros (cnt, mp[0]);
mpn_rshift (newmp, mp, nodd, cnt);
nodd -= newmp[nodd - 1] == 0;
mp = newmp;
ncnt++;
}
if (ncnt != 0)
{
/* We will call both mpn_powm and mpn_powlo. */
/* rp needs n, mpn_powlo needs 4n, the 2 mpn_binvert might need more */
mp_size_t n_largest_binvert = MAX (ncnt, nodd);
mp_size_t itch_binvert = mpn_binvert_itch (n_largest_binvert);
itch = 3 * n + MAX (itch_binvert, 2 * n);
}
else
{
/* We will call just mpn_powm. */
mp_size_t itch_binvert = mpn_binvert_itch (nodd);
itch = n + MAX (itch_binvert, 2 * n);
}
tp = TMP_ALLOC_LIMBS (itch);
rp = tp; tp += n;
bp = PTR(b);
mpn_powm_flag (rp, bp, bn, ep, en, mp, nodd, tp, flag);
rn = n;
if (ncnt != 0)
{
mp_ptr r2, xp, yp, odd_inv_2exp;
unsigned long t;
int bcnt;
if (bn < ncnt)
{
mp_ptr newbp = TMP_ALLOC_LIMBS (ncnt);
MPN_COPY (newbp, bp, bn);
MPN_ZERO (newbp + bn, ncnt - bn);
bp = newbp;
}
r2 = tp;
if (bp[0] % 2 == 0)
{
if (en > 1)
{
MPN_ZERO (r2, ncnt);
goto zero;
}
ASSERT (en == 1);
t = (ncnt - (cnt != 0)) * GMP_NUMB_BITS + cnt;
/* Count number of low zero bits in B, up to 3. */
bcnt = (0x1213 >> ((bp[0] & 7) << 1)) & 0x3;
/* Note that ep[0] * bcnt might overflow, but that just results
in a missed optimization. */
if (ep[0] * bcnt >= t)
{
MPN_ZERO (r2, ncnt);
goto zero;
}
}
mpn_powlo (r2, bp, ep, en, ncnt, tp + ncnt);
zero:
if (nodd < ncnt)
{
mp_ptr newmp = TMP_ALLOC_LIMBS (ncnt);
MPN_COPY (newmp, mp, nodd);
MPN_ZERO (newmp + nodd, ncnt - nodd);
mp = newmp;
}
odd_inv_2exp = tp + n;
mpn_binvert (odd_inv_2exp, mp, ncnt, tp + 2 * n);
mpn_sub (r2, r2, ncnt, rp, nodd > ncnt ? ncnt : nodd);
xp = tp + 2 * n;
mpn_mullo_n (xp, odd_inv_2exp, r2, ncnt);
if (cnt != 0)
xp[ncnt - 1] &= (CNST_LIMB(1) << cnt) - 1;
yp = tp;
if (ncnt > nodd)
mpn_mul (yp, xp, ncnt, mp, nodd);
else
mpn_mul (yp, mp, nodd, xp, ncnt);
mpn_add (rp, yp, n, rp, nodd);
ASSERT (nodd + ncnt >= n);
ASSERT (nodd + ncnt <= n + 1);
}
MPN_NORMALIZE (rp, rn);
if ((ep[0] & 1) && SIZ(b) < 0 && rn != 0)
{
mpn_sub (rp, PTR(m), n, rp, rn);
rn = n;
MPN_NORMALIZE (rp, rn);
}
ret:
MPZ_NEWALLOC (r, rn);
SIZ(r) = rn;
MPN_COPY (PTR(r), rp, rn);
TMP_FREE;
}
#define SPEED_ROUTINE_MPZ_POWM_EXPO(FLAG) \
{ \
mpz_t r, b, e, m; \
unsigned i; \
double t; \
\
SPEED_RESTRICT_COND (s->size >= 1); \
\
mpz_init (r); \
mpz_init_set_n (b, s->xp, s->size); \
mpz_init_set_n (e, s->xp_block, s->size); \
if (s->r > 1) { \
mpz_tdiv_r_2exp (e, e, s->size * GMP_NUMB_BITS / s->r); \
} else if (s->r == 1) { \
mpz_set_ui (e, 0); \
mpz_setbit (e, s->size * GMP_NUMB_BITS); \
mpz_sub_ui (e, e, 1); \
} \
mpz_setbit (e, s->size * GMP_NUMB_BITS - 1); \
mpz_init_set_n (m, s->yp, s->size); \
mpz_setbit (m, 0); /* force m to odd */ \
\
speed_starttime (); \
i = s->reps; \
do \
mpz_powm_flag (r, b, e, m, FLAG); \
while (--i != 0); \
t = speed_endtime (); \
\
mpz_clear (r); \
mpz_clear (b); \
mpz_clear (e); \
mpz_clear (m); \
return t; \
}
double
speed_mpz_powm_bitsize (struct speed_params *s)
{
SPEED_ROUTINE_MPZ_POWM_EXPO (0);
}
double
speed_mpz_powm_popcount (struct speed_params *s)
{
SPEED_ROUTINE_MPZ_POWM_EXPO (1);
}
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel