Re: Has there been historical work done to investigate small integer optimization?
as Niels said, I fear the cost of the tests might make the benefit vanish. But to be sure, the only way is to try to implement this idea inside GMP, which involves quite some work. Paul Zimmermann ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: What's a reasonable size ratio for toom32?
Dear Niels, > See below patch, to support mpn_mul/0.6. Example output: > > $ ./speed -c -r -s 10-50 -t 5 -p 100 mpn_mul mpn_mul/0.6 > overhead 5.00 cycles, precision 100 units of 1.25e-09 secs, CPU freq > 798.28 MHz > mpn_mul mpn_mul/0.6 > 10 215.48 #0.6295 > 15 443.16 #0.6066 > 20 784.50 #0.6138 > 251209.57 #0.5957 > 301490.59 #0.6934 > 351986.51 #0.6918 > 402547.99 #0.6981 > 453189.27 #0.6633 > 503827.87 #0.6734 > > What do you think? (Also deleted FLAG_RSIZE, which appeared unused). looks nice! > I think it's nicer to be able to specify it separately for each function > under test. indeed, can you try speed ... mpn_mul mpn_mul/0.9 mpn_mul/0.8 mpn_mul/0.7 mpn_mul/0.6 mpn_mul/0.5 ? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: What's a reasonable size ratio for toom32?
Dear Niels, ./speed -p 100 -c -s 10-200 -f1.1 mpn_mul.0.6 would be more readable, although the change in speed.h would be larger. Or maybe ./speed -p 100 -c -s 10-200 -f1.1 -r 0.6 mpn_mul ? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: What's a reasonable size ratio for toom32?
Hi Niels, > Not sure what would be the appropriate way to benchmark; could look at a > range of unbalancedness ratio, at one or a few fixed sizes relevant to > toom22 and toom32, or look at fixed ratio, perhaps 1/2, over a range of > sizes. I suggest taking fixed non-rational ratios, for example sqrt(2), sqrt(3), sqrt(5), and comparing the old and new code over the range of sizes where this code is used. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
stdio.h
Hi, the fine manual says: All declarations needed to use GMP are collected in the include file 'gmp.h'. It is designed to work with both C and C++ compilers. #include Note however that prototypes for GMP functions with 'FILE *' parameters are only provided if '' is included too. #include #include What is the reason why is not included by default? Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mul_fft, cleaning up some details of the code
Dear Marco, > Uhm, the last line of code just before that ones is: > >cc = mpn_sub_1 (r, r, m, cc) + 1; >/* add 1 to cc instead of rd since rd might overflow */ > > So that I'd say that cc is 1 or 2. > Then the case cc=2, m=n-1, r[m]=0, and rd=GMP_NUMB_MAX seems very very > unlikely, but possible. it seems you are right! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mul_fft, cleaning up some details of the code
Hi Marco, > Since you deeply know how this code works, I ask you one more question. > The last lines of the function mpn_fft_mul_2exp_modF (branch m < n) > contain: > >/* now subtract cc and rd from r[m..n] */ > >r[n] = -mpn_sub_1 (r + m, r + m, n - m, cc); >r[n] -= mpn_sub_1 (r + m, r + m, n - m, rd); >if (r[n] & GMP_LIMB_HIGHBIT) > r[n] = mpn_add_1 (r, r, n, CNST_LIMB(1)); > > This code assumes that there can only be a single borrow. Is it correct? > I'm going to change them into the following: > >r[n] = 2; >MPN_DECR_U (r + m, n - m + 1, cc); >MPN_DECR_U (r + m, n - m + 1, rd); >if (UNLIKELY ((r[n] -= 2) != 0)) > { >mp_limb_t cy = -r[n]; >/* cy should always be 1, but we did not check if the case > m=n-1, r[m]=0, cc+rd>GMP_NUMB_MAX+1, and then cy = 2, > is actually possible. */ >r[n] = 0; >MPN_INCR_U (r, n + 1, cy); > } > > If we really can assume that cc+rd <= GMP_NUMB_MAX+1, the code could be > simpler: > >r[n] = 1; >MPN_DECR_U (r + m, n - m + 1, cc); >MPN_DECR_U (r + m, n - m + 1, rd); >if (UNLIKELY (r[n] == 0)) > MPN_INCR_U (r, n + 1, 1); >else > r[n] = 0; yes there can only be a single borrow, since since cc is 0 or 1, if r[n] is non zero after r[n] = -mpn_sub_1 (r + m, r + m, n - m, cc), this can only happen for cc=1 and {r+m,n-m} = 000...000 before the mpn_sub_1 call, and {r+m,n-m} = 111...111 after, then the second mpn_sub_1 call cannot produce a borrow. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mul_fft, cleaning up some details of the code
Dear Marco, > Date: Sun, 06 Mar 2022 11:14:49 +0100 > From: Marco Bodrato > > Ciao, > > I looked into the code published by Samuel Vivien. > I realised that in mul_fft there are a lot of _add_1 and _sub_1. At > least some of them can be easily replaced by _INCR_U or _DECR_U... > > Specifically I'd focus into a suspicious piece of code, shared by both > our current code and Vivien's. > > The function mpn_mul_fft_decompose has a branch "if (dif > Kl)", that > ends with the following lines: > > if (cy >= 0) > cy = mpn_add_1 (tmp, tmp, Kl, cy); > else > cy = mpn_sub_1 (tmp, tmp, Kl, -cy); > > Can those lines be correct? Can the value in cy be used after this code > both if it contains the carry or if it contains the borrow of the > operation on tmp? thank you, I agree with your analysis. Computations in that part are done modulo B^Kl+1, thus any carry (resp. borrow) at tmp[Kl] should be subtracted (resp. added) at tmp[0]. Here are some comments added to the code (from GMP 6.2.1) and a fix: --- mul_fft.c.orig 2022-03-07 09:56:28.133700601 +0100 +++ mul_fft.c 2022-03-07 10:12:20.127270744 +0100 @@ -712,15 +712,28 @@ cy += mpn_sub (tmp, tmp, Kl, n, dif); else cy -= mpn_add (tmp, tmp, Kl, n, dif); + /* cy is the borrow at tmp[Kl], thus we should subtract + cy at tmp+Kl, or equivalently add cy at tmp, since + B^Kl = -1 */ if (cy >= 0) cy = mpn_add_1 (tmp, tmp, Kl, cy); + /* cy is now the carry at tmp[Kl] */ else + { cy = mpn_sub_1 (tmp, tmp, Kl, -cy); +/* cy is now the borrow at tmp[Kl] */ +if (cy) + cy = mpn_add_1 (tmp, tmp, Kl, cy); +/* cy is now the carry at tmp[Kl] */ + } } else /* dif <= Kl, i.e. nl <= 2 * Kl */ { cy = mpn_sub (tmp, n, Kl, n + Kl, dif); + /* cy is the borrow at tmp[Kl], thus we should add + cy at tmp[0] */ cy = mpn_add_1 (tmp, tmp, Kl, cy); + /* cy is now the carry at tmp[Kl] */ } tmp[Kl] = cy; nl = Kl + 1; > I believe this is an error, I'd change the last line into > cy = 1 - mpn_sub_1 (tmp, tmp, Kl, -cy - 1); > and I propose the attached patch changing this and some other _1 > functions. Yes this seems ok. Assume we are in the case cy < 0. Since cy is the borrow at tmp[Kl], we should add cy at tmp[0], thus (since cy < 0) subtract -cy (say b=-cy) at tmp[0]. Since b > 0 and B^Kl+1 = 0 it is equivalent to subtract b-1 at tmp[0] and add 1 at tmp[Kl]. > This is not really needed to solve a bug. > The comment before the mpn_mul_fft_decompose function says "We must have > nl <= 2*K*l", this means that we should never have ((dif = nl - Kl) > > Kl), and the code in that branch should never be used. > According to the coverage analisys, the code is not explored by the > tests: > https://gmplib.org/devel/lcov/shell/var/tmp/lcov/gmp/mpn/mul_fft.c.gcov#691 > The "bug" is never triggered. > > Maybe the branch could be used if someone re-enables mpn_mul_fft_full > and uses it for very unbalanced (more than 4x1) operands? yes it could be used in that case. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
longlong.h patch
Hi, a patch for longlong.h was submitted on the glibc-alpha list, to ease compilation with clang: https://sourceware.org/pipermail/libc-alpha/2021-October/131824.html It is apparently an old longlong.h (from 6.1.0), but maybe the patch still applies to the development version. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mul_fft
Hi, about https://gmplib.org/list-archives/gmp-devel/2021-June/005976.html, the second part of this mail is nonsense, since --enable-old-fft-full (which was introduced in revision 13152 on Dec 21 2009) has never been working as expected, since revision 13145 on Dec 20 2009 disabled it completely (thus with --enable-old-fft-full you still call mpn_nussbaumer_mul): #if 0 #define mpn_fft_mul mpn_mul_fft_full #else #define mpn_fft_mul mpn_nussbaumer_mul #endif Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mul_fft
Hi, with Samuel Vivien (in cc) we noticed two things related to the FFT code: 1) the use of mpn_add_n_sub_n is not activated by default in mul_fft.c. It might give a small speedup in some cases. 2) the previous code to multiply integers (mpn_mul_fft_full) which did two multiplications mod 2^N1+1 and 2^N2+1 has been replaced by a new code which computes 2^(2N)-1 for 2N ~ N1+N2, using the fact that 2^(2N)-1=(2^N+1)(2^(N/2)+1)(2^(N/4)+1)...(2^(N/2^k)+1)(2^(N/2^k)-1). However that new code is not always faster, for example on an Intel Core i5-4590 with gcc 10.2.1: Vanilla GMP 6.2.1: $ ./speed -s 100,200,500,1000 mpn_mul_n mpn_mul_n mpn_mul_n mpn_mul_n mpn_mul_n mpn_mul_n 1000.447985000 0.452133000 #0.447968000 200 #0.949739000 0.949996000 0.956713000 500 #2.669156000 2.678638000 2.677642000 1000 #5.540204000 5.547306000 5.547152000 With --enable-old-fft-full (without WANT_ADDSUB): 1000.44875 0.451685000 #0.447571000 200 #0.941488000 0.942339000 0.947603000 500 #2.668528000 2.675208000 2.669147000 10005.543246000 5.580915000 #5.538155000 Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: 2-adic Svoboda
Dear Torbjörn, > yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it > "Montgomery-Svoboda". The quotient selection becomes trivial, which means > one can reduce the latency between two mpn_addmul_1 calls. > > It really becomes a mul_basecase except that the first round there is > done with mul_1. If we replace that mul_1 by addmul_1, and > then called the resulting function like this, > > mul_basecase'(rp, up, un, rp, rn-un) > > then we will get the desired remainder at mul_basecase speed. a slight difference is that here the limb multiplier is not known in advance, but only at the beginning of each loop. But maybe it does not make any difference. I tried to implement Montgomery-Svoboda at the C level, but did not manage to beat the mpn_redc_x routines. I'm very interested to see your results! Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: 2-adic Svoboda
Dear Torbjörn, > IIRC, Svoboda's division trick for N/D is to find a small multiplier m > such that, for the division mN/(mD) we have mD = 10...XX... with the > high limb of mD being 1000...0. > > This idea works also for 2-adic division. Find m = D^(-1) mod \beta > where \beta is the lomb base. Then do mN/(mD) or mN mod (mD) with > 2-adic norm. Now mD will end in 0...0001, and the quotient limbs will > be the low limbs of the intermediate remainder. > > That should mean that we could use mul_basecase directly for > sbpi1_bdiv_r (or its sibling redc_1). > > Surely this is not a new observation. yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it "Montgomery-Svoboda". The quotient selection becomes trivial, which means one can reduce the latency between two mpn_addmul_1 calls. If you reduce k limbs at a time, by precomputing m = D^(-1) mod \beta^k, then you can use mpn_addmul_k at each step. But to reduce the last k limbs, you need to revert to classical (Montgomery) division, since mN has n+k limbs. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: strange install directory
thank you all for your answers. Maybe the solution would be to check for $prefix/lib64 (and $prefix/lib32) if we don't find libgmp in $prefix/lib. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
strange install directory
Hi, on the gcc118 machine from the gcc compile farm, with gmp-6.2.1, make install puts the libgmp.{a,so} files in $PREFIX/lib64 instead of $PREFIX/lib. Is there any reason for that? This non-standard install directory is a problem, since when configuring MPFR (for example) with --with-gmp=$PREFIX, the MPFR configuration fails, since it does not find the libgmp.{a,so} files. And I guess many software tools rely on this --with-gmp=$PREFIX mechanism. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_add_si (was: Re: mpz_prevprime)
Hi Niels, > >>> mpz_add_si (p, p, diff * (odds_in_composite_sieve-1)); > > > > We don't have such a function :-) > > Ooops. > > Maybe we should have? We do have both _ui and _si versions of other > basic mpz functions, including mpz_set, mpz_mul, and mpz_cmp. mpz_addmul_si and mpz_submul_si are missing too. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: GMP used during 3 and a half years to solve MIT's LCS35
Dear Marco, thank you for the benchmark, which shows a clear gain with the popcount method when the exponent is sparse, and no regression when it is not. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Faster integer remainder
> "Nelson H. F. Beebe" writes: > > > During a routine, and rather delayed, bibliography update, I found and > > read a recent paper that might stimulate rethinking multiple-precision > > integer remainder computations in gmp: > > > > Daniel Lemire and Owen Kaser and Nathan Kurz > > Faster remainder by direct computation: Applications to > > compilers and software libraries > > Software: Practice & Experience 49(6) 953--970 June 2019 > > https://doi.org/10.1002/spe.2689 > > > > A preprint PDF file is available at > > > > https://arxiv.org/pdf/1902.01961 after having a closer look, the main idea of that paper (Theorem 1) already appears in Bernstein's "scaled remainder tree" paper [1]. Indeed, the assumption of Theorem 1 says that c' = c/2^{N+L} is a good approximation of 1/d. Then if n = qd + r, the fractional part of c'n is close to r/d, and multiplying by d reveals r. More precisely, here is a 4-line proof of Theorem 1, assuming n = qd + r: (1) we have 1/d <= c' := c/2^{N+L} <= 1/d + 1/(d 2^N) (2) it follows q+r/d <= c'n <= q + r/d + n/(d 2^N) (3) thus r/d <= frac(c'n) <= r/d + n/(d 2^N) < (r+1)/d (4) then r <= frac(c'n)*d < r+1 Paul Zimmermann [1] cr.yp.to/arith/scaledmod-20040820.pdf ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: patch for speed
> How many "users" will want to use tune/speed and at the same time not be > able to figure out that we alias mpz_legendre to mpz_jacobi? That alias > might not be documented, but tune/speed is "more undocumented". note also that since mpz_legendre only accepts odd primes as second argument, its running time might differ from that of mpz_jacobi, even if it is an alias. Anyway if my contributed patch is not welcome, just ignore it! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: patch for speed
> I'd like to understand the benefits of ./speed mpz_legendre first, since > I don't really see any. Currently, it's an alias for mpz_jacobi, and I'm > not aware of any concrete plans to change that. the benefit for the user is to be able to measure the speed of mpz_legendre. Since the documentation does not say that mpz_legendre is an alias for mpz_jacobi, I believe it is legitimate. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: patch for speed
Niels, > Wouldn't that patch call mpz_legendre (a, p) with p not prime? Which is > invalid, but kind-of works *because* mpz_legendre is an alias for > mpz_jacobi. good point. Do you want a new patch which ensures p is an odd positive prime? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: patch for speed
> Adding a different symbol in the library for mpz_kronecker or > mpz_legendre, would be an incompatible change to the interface. Not likely > in the near future. ok, then what about the original subject (adding mpz_legendre to speed)? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: patch for speed
Dear Marco, maybe one day mpz_legendre and mpz_jacobi will differ. In that case the patch will be useful. Otherwise why two functions? Paul > Date: Mon, 9 Sep 2019 19:24:22 +0200 > From: "Marco Bodrato" > > Ciao Paul, > > Il Lun, 9 Settembre 2019 6:18 pm, paul zimmermann ha scritto: > > this patch adds mpz_legendre to the functions known by speed: > > Currently in gmp-h.in we have > #define mpz_legendre mpz_jacobi /* alias */ > > So that I do not expect any difference between mpz_legendre and mpz_jacobi. > > Do you? > > Ĝis, > m ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
patch for speed
Hi, this patch adds mpz_legendre to the functions known by speed: zimmerma@tomate:/tmp/gmp-6.1.2/tune$ ./speed -r -s 1-10 mpn_mul_n mpn_gcd mpz_legendre mpz_jacobi overhead 0.1 secs, precision 1 units of 3.04e-10 secs, CPU freq 3292.22 MHz mpn_mul_n mpn_gcd mpz_legendrempz_jacobi 1#0.5 18.7516 26.0471 26.0335 2#0.7 41.2102 50.0996 49.8808 3#0.00014 81.6871 87.5821 87.4862 4#0.00019 90.4106 97.1702 96.5991 5#0.00026 88.6899 95.3075 94.5296 6#0.00032 87.9981 92.8041 92.5853 7#0.00041 82.1337 86.7046 86.5697 8#0.00051 77.1496 81.5147 81.4188 9#0.00068 65.8028 69.6469 69.6326 10 #0.00076 66.2903 69.9241 69.7451 11 #0.00092 61.7560 65.2479 65.0786 12 #0.00107 58.3972 61.1965 60.9988 13 #0.00126 53.8224 56.9173 56.4985 14 #0.00142 51.7653 55.0378 54.8534 15 #0.00164 48.5111 51.1119 50.9965 16 #0.00185 45.7281 48.6286 48.5163 Paul --- speed.c.orig2019-09-09 18:01:39.453570648 +0200 +++ speed.c 2019-09-09 18:02:01.877937859 +0200 @@ -309,6 +309,7 @@ { "mpn_gcdext_lehmer", speed_mpn_gcdext_lehmer }, #endif { "mpz_jacobi",speed_mpz_jacobi }, + { "mpz_legendre", speed_mpz_legendre }, { "mpn_jacobi_base", speed_mpn_jacobi_base }, { "mpn_jacobi_base_1", speed_mpn_jacobi_base_1}, { "mpn_jacobi_base_2", speed_mpn_jacobi_base_2}, --- common.c.orig 2019-09-09 18:00:54.416833037 +0200 +++ common.c2019-09-09 18:01:19.381241921 +0200 @@ -1722,6 +1722,11 @@ SPEED_ROUTINE_MPZ_JACOBI (mpz_jacobi); } double +speed_mpz_legendre (struct speed_params *s) +{ + SPEED_ROUTINE_MPZ_JACOBI (mpz_legendre); +} +double speed_mpn_jacobi_base (struct speed_params *s) { SPEED_ROUTINE_MPN_JACBASE (mpn_jacobi_base); --- speed.h.orig2019-09-09 18:02:33.074448664 +0200 +++ speed.h 2019-09-09 18:02:51.742754302 +0200 @@ -393,6 +393,7 @@ double speed_mpz_init_clear (struct speed_params *); double speed_mpz_init_realloc_clear (struct speed_params *); double speed_mpz_jacobi (struct speed_params *); +double speed_mpz_legendre (struct speed_params *); double speed_mpz_lucnum_ui (struct speed_params *); double speed_mpz_lucnum2_ui (struct speed_params *); double speed_mpz_mod (struct speed_params *); ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpz_nextprime
Torbjörn, > With k-ary modexp, the number of multiplications is log2(n)/k, > while the number of squarings are log(n). Therefore a small b will not > make a huge difference. > > As Marco points implies, one may handle some special values like b = 2 > also in REDC coding and still not pay the price of a full > multiplication. with a small base, no need of k-ary modexp. In the REDC domain, instead of computing a'*b'/r mod n, where b' = r*b mod n is the REDC encoding of the base b, just compute a'*b mod n. Another trick is to use a base b such that b' is small, then you don't even need to modify the REDC code (assuming it recognizes small multipliers). Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Quotient sequence with symmetric signed quotients?
Dear Niels, > Is there any analysis of Euclid's algorithm, with different rounding of > the quotients? E.g., for the fibonacci inputs gcd(21, 13), we have a > standard quotient sequence of (1,1,1,1,1,1,1), taking 7 steps to get > from (21,13) to (1,0). > > If we instead try quotients rounded to nearest, we get > > (21,13) > (13, -5) > (-5, -2) > (-2, -1) > (-1, 0) > q = 2 q =-3 q = 2 q = 2 > > with 4 quotients rather than 7, and different from the standard continued > fraction expansion. the right person to ask is Brigitte Vallée: https://vallee.users.greyc.fr/Publications/index.html (Look for "nearest" or "odd" on this page.) Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New n log n algorithm for high-precision multiplication
Dear Niels, > From: ni...@lysator.liu.se (Niels Möller) > Date: Mon, 15 Apr 2019 14:02:28 +0200 > > paul zimmermann writes: > > > Schönhage-Strassen can be implemented with good cache locality: > > > > https://hal.inria.fr/inria-00126462 > > Thanks! As I read it, for large inputs, the top level transforms operate > on coefficients that fit in L2 cache but not L1 cache, and with several > passes over the data, depending on fft size. Is that right? not quite. In Section 2.2.3 (Bailey's 4-step transform) we describe a way to please both the L2 and L1 caches. Of course this work should be revisited with modern processors. > Then I think small-prime fft has potential for better locality, with few > complete passes of the data and all the heavy fft work operating on data > in the L1 cache. maybe, I'm curious to see a comparison with our code! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New n log n algorithm for high-precision multiplication
Dear Niels, > There's also unfinished work on small-prime FFT, motivated mainly to get > better cache locality (for huge numbers, field elements in > Schönhage-Strassen get so large that they exceed the L1 cache, slowing > down the supposedly very efficient field operations). Schönhage-Strassen can be implemented with good cache locality: https://hal.inria.fr/inria-00126462 Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Unused code in mpn_mul_fft_decompose?
Dear Mikhail, > From: Mikhail Hogrefe > Date: Mon, 4 Mar 2019 21:44:50 -0500 > > I was digging around in the GMP 6.1.2 source, and I noticed something in > mpn/generic/mul_fft.c. > > In mpn_mul_fft_decompose, there is an if statement checking 'dif > Kl', > which is equivalent to 'nl > 2 * K * l'. But there is a comment above the > function saying 'We must have nl <= 2*K*l.', suggesting that the first > branch of the if is never taken. > > Maybe you're already aware of this, but if not I wanted to let you know. > > Best, > Mikhail Hogrefe good remark. Indeed in the mul_fft.c file, it seems this code is dead. However if mpn_mul_fft_decompose is exported, for example if one wants to compute a product of two numbers modulo 2^n+1, where the two numbers have more than 2n bits, then this branch might be useful. Anyway this code is now obsolete (see [1] and [2]). Best regards, Paul Zimmermann [1] http://homepages.loria.fr/PZimmermann/papers/#fft [2] https://gmplib.org/devel/ "This will largely depend on ongoing work on FFT multiplication" ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Side channel silent karatsuba / mpn_addmul_2 karatsuba
> I got stuck with that code since there are so many ways of summing > things. would it make sense to write a tool that generates all possible ways? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mini-gmp
Hi, would it be possible to move the following from mini-gmp.c to mini-gmp.h? #define GMP_LIMB_BITS (sizeof(mp_limb_t) * CHAR_BIT) It would expose GMP_LIMB_BITS to applications using mini-gmp, without requiring them to redefine it. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Micro-GMP
> Supporting very small sizes, and odd sizes might not be easy. And > currently Paul reported failures for 32-bit limbs on 64-bit machines... > asl.h is more interesting for GMP than for mini-. after investigation, those failures were coming from the MPFR side. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Side channel silent karatsuba / mpn_addmul_2 karatsuba
Dear Marco, > I feel we should at least add an umulhi macro. > We have too many umul_ppmm(used, dummy, a, b) in the code now... I support that proposal. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Micro-GMP
Dear Marco, > A attach a possible implementation of mini-gmp that should support limbs > of "any" size. Can you test it with MPFR? Both correctness and speed... it passes all MPFR tests, both with uint16 and uint8. It is also faster than micro-gmp (I disabled GMP_CHECK_RANDOMIZE so that the tests are deterministic): micro-gmp mini-gmp patched tdivtcos tdiv tcos uint16 2.8s7m25s 2.2s 4m45s uint8 7.1s 21m40s 3.8s 7m26s Great, thank you! Paul PS: it would be nice to indicate in mini-gmp.h the places to modify to enable 16-bit or 8-bit limbs. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Micro-GMP
Dear Marco, > mpz_cmpabs_ui (const mpz_t u, unsigned long v) > { > int ulongsize = GMP_ULONG_BITS / GMP_LIMB_BITS; > mp_limb_t ulongrem = 0; > mp_size_t un = GMP_ABS (u->_mp_size); > > if (GMP_ULONG_BITS % GMP_LIMB_BITS != 0) > ulongrem = (ULONG_MAX >> GMP_LIMB_BITS * ulongsize) + 1; is GMP_ULONG_BITS % GMP_LIMB_BITS != 0 allowed in GMP and/or asl.h? > if (un > ulongsize && (u->_mp_d[ulongsize] >= ulongrem || un > ulongsize > + 1)) > return 1; > else > { > unsigned long uu = mpz_get_ui (u); > return GMP_CMP(uu, v); > } > } > > > When mp_limb_t is unsigned long, gcc -O2 compiles this code exactly as > if (GMP_ABS (u->_mp_size) > 1) > return 1; > else > GMP_CMP(mpz_get_ui (u), v) > [...] sounds good. > The same as the current code. And it gets optimised also for uint16 or uint8. > > >> You also changed mpn_invert_3by2. Maybe a handful of (mp_limb_t) casts > >> could be added to the original function to make it work... but your > >> rewrite is simpler and faster. Maybe the same could be done on > >> gmp_umul_ppmm? for speed? > > > > indeed, in several places I opted for simplicity, since speed was not > > (at first) a goal. But when running "make check" of MPFR with micro-gmp-8, > > I now realize speed is also important! > > A attach a possible implementation of mini-gmp that should support limbs > of "any" size. Can you test it with MPFR? Both correctness and speed... I'll do. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Micro-GMP
Dear Marco, > Date: Fri, 30 Nov 2018 07:43:40 +0100 > From: Marco Bodrato > > Ciao Paul! > > Il 2018-11-23 13:38 paul zimmermann ha scritto: > > I presented some work in progress on "Micro-GMP", a modified > > version of Mini-GMP which works with 16-bit or even 8-bit limbs: > > Great! > It is always interesting to see a further development based on some code > we wrote! > > > https://members.loria.fr/PZimmermann/talks/microgmp.pdf > > In your presentation you write that "27 functions from mini-gmp.c need > to be modified". > > Of course the code in mini-gmp exploits the fact that mp_limb_t is > defined as unsigned long, to keep the code as simple and as effective as > possible. If you need to define a different mp_limb_t type, some > modifications are obviously needed. > > But... are you sure you need to change that many? I mean: > - you changed mpz_div_qr_ui, did you really need to change also most of > the functions mpz_?div_*_ui originally implemented as a simple call to > mpz_div_qr_ui? most changes for the _ui functions (including mpz_div_qr_ui) consist in first converting the unsigned long argument into a mpz_t using mpz_set_ui, then calling the corresponding function without _ui. But indeed I don't remember why I changed the other mpz_?div_*_ui functions. Maybe because it was simpler to just convert using mpz_set_ui, and do not have to think at all. > - for similar reasons, was changing mpz_mul_si really needed? or > mpz_lcm_ui? it seems mpz_mul_si does not need to be changed. Same for mpz_lcm_ui. > I looked at the code and I'd suggest a couple of simpler > implementations: > > mpz_set_si (mpz_t r, signed long int x) > { >if (x >= 0) > mpz_set_ui (r, x); >else /* (x < 0) */ > { >mpz_set_ui (r, GMP_NEG_CAST (unsigned long int, x)); >mpz_neg (r, r); > } > } > > int > mpz_cmpabs_ui (const mpz_t u, unsigned long v) > { >mpz_t absu; >mpz_roinit_n (absu, u->_mp_d, GMP_ABS (u->_mp_size)); >return mpz_cmp_ui (absu, v); > } indeed this is simpler. Now there are only 18 that need to be modified. > You also changed mpn_invert_3by2. Maybe a handful of (mp_limb_t) casts > could be added to the original function to make it work... but your > rewrite is simpler and faster. Maybe the same could be done on > gmp_umul_ppmm? for speed? indeed, in several places I opted for simplicity, since speed was not (at first) a goal. But when running "make check" of MPFR with micro-gmp-8, I now realize speed is also important! > Unfortunately, writing a micro-gmp implementation outside of the > project, has the bad effect that you will have to rewrite it for any new > release. > Your micro- does not benefit of the improvement in mini- since the > release of GMP-6.1.2, and some of your modifications might be > superfluous now. > E.g. you changed in mpz_probab_prime_p the line: > >nm1->_mp_size = mpz_abs_sub_ui (nm1, n, 1); > > into > >mpz_abs (nm1, n); >mpz_sub_ui (nm1, nm1, 1); > > but the current code reads > https://gmplib.org/repo/gmp/file/17701/mini-gmp/mini-gmp.c#l3648 > >mpz_abs (nm1, n); >nm1->_mp_d[0] -= 1; > > which works for any limb size (n is odd, -=1 on the lowest limb is > always possible) and is faster than calling _sub_ui. > > Maybe we can find a strategy to release also micro- as a part of the > project, reducing code duplication. Would you like to cooperate on this? yes of course! Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Micro-GMP
Dear Torbjörn, > From: t...@gmplib.org (Torbjörn Granlund) > Date: Fri, 23 Nov 2018 18:03:31 +0100 > > You might be interested in GMP's asl.h, it implements artificially small > limbs ("asl"), down to 4 bits IIRC. The repo GMP doesn't complete its > tests with it currently. (A student of mine did something similar some > years previously, but I couldn't get paperwork in order for releasing > it.) quite interesting! Is there any documentation how to use it? > Some notes about our slides: > > First public GMP releaae wasn't 1.3.2 (sic) in 1993 but 1.1 in 1991. my source was ftp://ftp.gnu.org/gnu/gmp/, where the first version is 1.3.2. But indeed in https://members.loria.fr/PZimmermann/bignum/comp.ps.gz I used GMP 1.2 from December 1991. Where can I find the first public releases together with their release dates? > I believe Niels and Marco wrote Mini-GMP. > > Not only O(n^2) algorithms; addition is actually O(n) while modexp is > O(n^3)... > > The mpz_random2 function is long obsolete, mpz_rrandomb replaces it. thank you, I will prepare a revised version of my slides. Any other feedback is welcome! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Micro-GMP
Dear GMP developers, today at the iRRAM/MPFR/MPC workshop in Trier (http://irram.uni-trier.de/mpfr-mpc-irram-workshop-trier-2018/) I presented some work in progress on "Micro-GMP", a modified version of Mini-GMP which works with 16-bit or even 8-bit limbs: https://members.loria.fr/PZimmermann/talks/microgmp.pdf Any feedback is welcome, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpz_spsp pseudoprimes
Hi, Thomas Nicely has updated his web page to include the smallest mpz_spsp pseudoprimes in GMP 6.1.2: http://www.trnicely.net/misc/mpzspsp.html#GMP612 It would be nice to list the smallest mpz_spsp5. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New code for primality testing
Dear Marco, > > if you have already implemented BPSW, probaby the simplest is to make > > mpz_probab_prime_p use BPSW (removing the reps argument), and have another > > We did. The newly implemented test was the main reason for this thread. > > Currently we did not change the interface. BPSW is used as a first step in > mpz_probab_prime_p, and substitute the first "number" repetitions, then > the remaining iteration of MillerRabin are used. This is intended for > GMP-6.2. > > Do you have suggestions for that "number"? Then, I agree, for GMP-7 we > will have to redesign the interface... since no counter-example is known for BPSW, you could substitute all REPS iterations. On the other hand, you could leave all REPS iterations, and hope a counter-example is found by GMP (in that case, make sure to send the counter-example to the gmp-devel list). Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New code for primality testing
Ciao Marco, > Someone published a sequence of 7 bases to deterministically test > primality up to 2^64... > But someone also claim that BPSW has no false positives up to 2^64... > > If we can verify the source of the second claim, since we just implemented > BPSW, we should decide to return 2 (surely prime) for all "probable" > primes under this threshold. > > By the way, I agree. In the API a generic function can take an unspecified > reps parameter. On the other side, if we decide to expose a specific > implementation (like MillerRabin), we should give to the users the full > control on the parameters (the bases in this case). > > A proposal for the function prototype? That could fit the need of GMP-ECM? if you have already implemented BPSW, probaby the simplest is to make mpz_probab_prime_p use BPSW (removing the reps argument), and have another function mpz_millerrabin (mpz_t n, mpz_t base) and/or mpz_millerrabin_ui (mpz_t n, unsigned long base). As application, I was thinking more at CADO-NFS than GMP-ECM. In the cofactorization phase of the Number Field Sieve, we have billions of numbers of up to say 128 bits that we have to check for primality, before trying to factor them. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New code for primality testing
> > Some other free software projects already contain code for primality > > proving. E.g. Pari/GP [http://pari.math.u-bordeaux.fr/] implement both > > APR-CL and ECPP... I don't think GMP should duplicate that (good) work. > > mpz_aprcl uses mpz already and is on a similar track about license > https://sourceforge.net/projects/mpzaprcl/ > > Thank you for the pointers on these projects. > It's just that, personal feeling maybe, there is a something "missing" from > the absence of deterministic primality test function (whatever the perf, to > some extent). > But if it's too complex it's indeed a good reason. GMP-ECM contains an implementation of APRCL, written by David Cleaver. See aprtcle/mpz_aprcl.c. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: New code for primality testing
> And it could be specified that this function > int mpz_millerrabin (mpz_t n, int reps) > internally takes care of randstate boilerplate, like in current code. > Which adds to its added value and simplicity of use. > Repeated calls may not bring additional confidence, > but it's not important because the number of "calls" can already be set with > the reps parameter. in some contexts, we want to get a full primality test for moderate size numbers. For that purpose, it would be more useful to provide a list of (user chosen) bases to mpz_millerrabin, or at least that the bases chosen by a given 'reps' argument are documented. See https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Torbjörn, > I suppose a "naive" implementation scales the full dividend and the full > divisor by a factor which fits in a limb. scaling the dividend is not mandatory. You can first reduce the upper n-1 limbs of the original dividend using the scaled divisor, then reduce the last limb using the original divisor. This is especially interesting if only the remainder is needed. In fact, the multiplier has one limb + 1 bit, if the divisor is normalized, so k*B = \beta^{n+1} + B' with B' < \beta^n (see Algorithm 1.7 in [1]). > One could in theory save some > time by scaling just their O(log(Q)) most significant limb. I don't see how you achieve this, since for the submul part you need the full scaled divisor. > How likely does Svoboda overesimate a quotient limb? good question. Unfortunately the authors of [1] do not answer that question. If at some point of the loop in Algorithm 1.7 we have A = a_{n+j} \beta^{n+j} + R and B' = \beta^{n+1} + S, with 0 \leq R < \beta^{n+j} and 0 \leq S < \beta^n, then the next remainder will be T = A - a_{n+j} \beta^{j-1} B' = R - a_{n+j} \beta^{j-1} S. We thus have a_{n+j} \beta^{n+j-1} < T < \beta^{n+j}. Writing R = X \beta^{n+j}, S = Y \beta^n, a_{n+j} = Z \beta, with 0 <= X, Y, Z < 1, we get: T = (X - Y Z) \beta^{n+j}. If X, Y, Z are uniform in [0,1] and independent, then the probability that X - Y Z >= 0 is 3/4, thus Svoboda's algorithm will overestimate the quotient with probability 1/4 (and in that case, only one correction is needed). Of course you can use a 2-limb variant of Svoboda: multiply the divisor by 2 limbs + 1 bit, so that the scaled divisor is now k*B = \beta^{n+2} + B' with B' < \beta^n. Then the quotient will be overestimated with probability less than 1/\beta. Paul [1] https://members.loria.fr/PZimmermann/mca/mca-cup-0.5.9.pdf ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Torbjörn, > this looks quite interesting, I'm looking forward for cycle numbers. > (I guess you target the case where the divisor is not invariant, otherwise > you might do some preprocessing to speed up the partial quotient > approximation.) > > The divisor is invariant in some sense when the quotient is more than > one limb, right? yes it is. But I was thinking about many divisions with the same divisor (say 2n limbs divided by n limbs). In short we can divide preprocessing methods in two classes: 1) those with only O(1) preprocessing time, like the precomputation of an inverse of the 1 or 2 uppers limbs of the divisor 2) those with O(n) preprocessing time, like Svoboda division > Are you thinking of Svoboda division here? That's another possibility > for speeding up schoolbook division. yes, in Svoboda division the quotient selection is for free. But you still have to wait the end of the previous submul_1 (or submul_2) call to access the next quotient, whereas in mpn_mul_basecase all multipliers are known in advance. Does this makes a difference? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Torbjörn, this looks quite interesting, I'm looking forward for cycle numbers. (I guess you target the case where the divisor is not invariant, otherwise you might do some preprocessing to speed up the partial quotient approximation.) Paul > From: t...@gmplib.org (Torbjörn Granlund) > Date: Wed, 19 Sep 2018 18:18:29 +0200 > > This is a sketch of an idea which I have had for a long time, which we > actually might have discussed here before. The idea is to cut a large > O(nn-dn) time off of schoolbook division. > > Normally, the time taken by schoolbook division is latency(mpn_submul_1) > + 2*latency(mul) + C. Here C is the number of plain operations of > quotient limb approximation and mul means a scalar multiply used by the > same. > > Couldn't we make the quotient limb approximation in parallel with the > O(n^2) work? I think we can, but it requires that we run the submul_1 > partially backwards, and thereby produce the most significant partial > remainder limbs early. > > But that's impossible, right? The quotient limb approximation is, eh, > approximate after all! We cannot know the most significant partial > remainder limbs as early as required! No, we cannot know them for sure, > but we can almost always tell. And when we're wrong, we need to add > back and recompute. Below is a very rough implementation. (It surely is > not correct wrt indexes.) > > The hope is that this could run almost as as fast as mul_basecase. > There, and in this proposed algorithm, subsequent addmul_1/submul_1 can > actually overlap with the help of OoO execution. So hopefully on good > OoO pipelines this could not just hide the division limb approximation > latency, but actually hide much of the feed-in and wind-down latency of > submul_1! > > qnext = divappr (np[1], np[0], d1, d0); > for (...) > { > // Apply q to most significant partial remainder limbs > cy = submul_1 (np - 2, dp + dn - 2, qnext, 2); > > q = qnext; > // Now, with the most significant few limbs of the next partial remainder > // computed to some accuracy, compute q for the next iteration. This > might > // overestimate q worse than with the old algorithms. > qnext = divappr (np[1], np[0], d1, d0); > > // Apply q to all but the most significant partial remainder limbs. > cy = submul_1 (np - dn, dp, q, dn - 2); > > if (UNLIKELY (cy > np[0])) > { > np[0] -= cy; > // q was too large, qnext is garbage so recompute it! > // (We could perhaps recompute it more cleverly.) > ASSERT_CARRY(mpn_add_n (np - dn, np - dn, dp, dn)); > qnext = divappr (np[1], np[0], d1, d0); > } > else > np[0] -= cy; > np--; > } > > -- > Torbjörn > Please encrypt, key id 0xC8601622 > ___ > gmp-devel mailing list > gmp-devel@gmplib.org > https://gmplib.org/mailman/listinfo/gmp-devel ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mini-gmp test suite could be improved
Hi, I noticed the mini-gmp test suite does not seem to contain any test of mpz_add_ui with the unsigned long argument having more than 32 bits. Indeed with the following change "make check" still says "All 25 tests passed" (with the mini-gmp from gmp-6.1.2): void mpz_add_ui (mpz_t r, const mpz_t a, unsigned long b) { b = b % 0x; if (a->_mp_size >= 0) r->_mp_size = mpz_abs_add_ui (r, a, b); else r->_mp_size = -mpz_abs_sub_ui (r, a, b); } At least one test with ULONG_MAX would be welcome. Maybe the same holds for other ui/si functions. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Niels, > I've updated the paper (same location, > https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf). page 1: there is an extra ')' in the footnote page 3: can you detail the proof of R_{bignum} < D? If the algorithm would also take as input u_{-1} as in your 2011 paper, I am ok, but here I am not sure. In the proof of Lemma 1, in u_0 (\beta^3 - \beta D - D), the term \beta^3 - \beta D - D is negative when D >= \beta^2 - \beta - 1, thus you cannot simply bound it by bounding u_0 by \beta-1. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Niels, your idea works. If I am correct, we save a \beta/4 factor, which is enough as long as \beta>=4. Paul e.pdf Description: Adobe PDF document ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Niels, it works with r = (u_0 - q d_1 - p_1 - 1) \bmod \beta line 6 in all cases, assuming it works with -1 replaced by - [p_0 > 0]. We only need to check the case p_0 = 0. p_0 = 0 means that q d_0 is divisible by \beta, i.e., R' is multiple of \beta. Let still be the two low words from R'. We have r_0 = 0. We have r = r_1 - 1 in the algorithm, instead of r = r_1. For the first adjustment, the only case that differs from the original algorithm is when r_1 = q_0 thus r = q_0 - 1 and the first adjustment no longer applies. Since R' > q_0 \beta - \beta^2 [*], then necessarily R' = q_0 \beta. If the second adjustment does not apply, then r <= d_1 - 2, which means q_0 <= d_1 - 1, thus 0 <= R' <= (d_1-1) \beta and we are done. If the second adjustment applies, then r >= d_1 - 1, which means q_0 >= d_1, thus after the adjustment, R = q_0 \beta - d > -\beta. (For the upper bound, R < \beta^2 - d <= d.) [*] the proof of Theorem 3 from [4] says "The lower bounds ... and \tilde{r} > q_0 \beta - \beta^2 follow in the same way as in the proof of Theorem 2". For the second adjustment, assume r = d_1 - 2, in which case the second adjustment would have been made with r = r_1. If the first adjustment did not apply, then we have r_1 - 1 = d_1 - 2 < q_0, thus r_1 = d_1 - 1 < q_0 + 1. Thus r_1 <= q_0. But since R' > q_0 \beta - \beta^2, then R' = = which is ok. If the first adjustment did apply, then we have r = r_1 + d_1 mod \beta, which with r = d_1 - 2 implies r_1 = \beta - 2. If R' < 0, then R' = r_1 \beta - \beta^2 = -2 \beta which is ok. If R' >= 0, then R' = \beta^2 - 2 \beta. But this is not possible with the upper bound R' < max(\beta^2 - d, q_0 \beta) - \beta, since \beta^2 - d <= \beta^2/2, and since the first adjustment did apply, r >= q_0 thus r_1 - 1 >= q_0, thus q_0 <= \beta - 3. Surely this can be simplified... Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Niels, > > page 1: the division instruction is now much faster than before on modern > > processors > > According to https://gmplib.org/~tege/x86-timing.pdf, they're still an > order of magnitute slower than multiplication. E.g 86 vs 3 cycles on > Intel skylake. And in addition, multiplication is piplined, division > likely isn't. > > Do you have any suggestion for better wording than "vastly slower" ? maybe give the number of cycles and refer to x86-timing.pdf? > > page 3, line 5 of Section 3: -> > > I think is right. yes sorry the extra 0 should have been removed (same in the algorithm) > > page 4, line 27: for the upper bound, I agree you can subtract the maximal > > value of the u_0 term in the proof of Theorem 3 from [4], but this gives > > \beta-1, not \beta. > > This is a bit subtle, maybe I should explicitly repeat the proof form > [4] with needed modifications. My argument is that the proof in [4] is > of the form > > \tilde r = ... + u0 < ... + \beta <= c = max (\beta^2 - D, q_0 \beta) > > But if u_0 < \beta is the only thing that lets us conclude that \tilde r > < c, with strict inequality, then there's an off-by-one error here. > > Now there are multiple bounds involved; to get \tilde r = c we'd also > need something like u_1, u_0 and q_0 maximal and d_0 = 0. if you need to save \beta with respect to the proof of [4], yes maybe you need to repeat that proof to explain how you save the extra +1. > > I wonder whether we can get rid of that term, or always subtract -1. > > In the latter case (assuming the proof with the p_0 > 0 extra term is > > correct), > > the only problem would be when p_0 = 0. > > I have looked at that, so far without success. If I remember correctly, > always subtracting -1 can be treated in the analysis as extending the > single-word range for r_0 to 0 <= r_0 <= \beta. But that breaks the > analysis of the R' < 0 case. We have > > {r_1, r_0} >= q_0 \beta > > We want to conclude r_1 >= q_0, but that no longer holds if we allow r_0 > = \beta. To make it work, we'd need to streighten the lower bound > > R' >= q_0 \beta - \beta^2 > > Hmm... but from another look at the proof from [4], seems to give us > strict inequality > > R' > q_0 \beta - \beta^2 > > with no extra effort. So maybe we can do that. I'll have a look. The exhaustive search up to \beta=2^7 says that always subtracting 1 seems to work, and the lower bound even improves to -2\beta < R' instead of -2\beta <= R'. Paul PS: an extra typo in the new version: page 5, lines 4 and 5, d_o should be d_0 ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Dear Niels, > From: ni...@lysator.liu.se (Niels Möller) > Date: Sun, 05 Aug 2018 09:59:46 +0200 > > ni...@lysator.liu.se (Niels Möller) writes: > > > static inline mp_limb_t > > divappr_2 (mp_limb_t n2, mp_limb_t n1, > >mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv) > > { > > mp_limb_t q1, q0, r1, t1, t0, mask; > > > > umul_ppmm (q1, q0, n2, dinv); > > add_ss (q1, q0, q1, q0, n2, n1); > > > > q1++; /* Can't overflow */ > > umul_ppmm (t1, t0, q1, d0); /* t0 isn't used */ > > r1 = n1 - d1 * q1 - t1; > > To be able to prove correctness, I'm afraid this needs to be > > r1 = n1 - d1 * q1 - t1 - (t0 > 0); > > > mask = - (mp_limb_t) (r1 >= q0); > > q1 += mask; > > r1 += mask & (d1 + 1); > > q1 += (r1 >= d1 - 1); > > > > return q1; > > } > > I'm attaching a draft paper with the analysis. As for performance, I've > measurered a few percent speedup on shell. > > Regards, > /Niels > > [2:application/pdf Show Save:schoolbook-divappr.pdf (200kB)] quite interesting! I have a few comments: page 1: the division instruction is now much faster than before on modern processors page 2: sigificant -> significant page 3, line 2 of Section 3, remove "The output" page 3, line 5 of Section 3: -> line 6: correesponds -> corresponds page 3, The algorithm: -> step 3: u_2 -> u_1, u_1 -> u_0 page 4, line 5: d_1 - d_0 -> d_0 - d_1 page 4, line 7: inputes -> inputs page 4, line 13: the the -> the page 4, line 20: u_1 -> u_0 page 4, line 21: The last terms -> term page 4, line 27: for the upper bound, I agree you can subtract the maximal value of the u_0 term in the proof of Theorem 3 from [4], but this gives \beta-1, not \beta. page 4: "it follows that r_1 + d_1 + 1 > \beta". I do not agree, take for example u_1 = u_0 = 0, d_1 = 128, d_0 = 1 with \beta = 256. You get r_1 = 127, and r_1 + d_1 + 1 = 256 = \beta. But r_1 + d_1 + 1 >= \beta is enough for the rest of the proof. page 4, line -2: on line 8 -> on line 9 page 5, line 6: \beta r_1 + d_0 -> \beta r'_1 + d_0 For the case R' < 0, you did not prove the lower bound for r'_1 <= d_1 - 2, and did not prove the upper bound for r'_1 >= d_1 - 1. Case R' >= 0: if the -\beta term is replaced by -(\beta-1), then all I can get is r_1 <= \beta - d_1 - 1. Remark: on the other hand, an exhaustive search with \beta=2^k with 1 <= k <= 7 did not find any error, even without the p_0 > 0 extra term. I wonder whether we can get rid of that term, or always subtract -1. In the latter case (assuming the proof with the p_0 > 0 extra term is correct), the only problem would be when p_0 = 0. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: divappr interface
Niels, > From: ni...@lysator.liu.se (Niels Möller) > Date: Fri, 27 Apr 2018 22:28:39 +0200 > > ni...@lysator.liu.se (Niels Möller) writes: > > > Once we reach qn = dn - 1, keep looping to produce quotient limbs, but > > also discard one limb of dp in each interation, until we in the final > > iteration have qn = 1, qn+2 = 3 numerator limbs, and qn+1 = 2 divisor > > limbs, i.e., a single udiv_qr_3by2 (where we might consider skipping the > > adjustment steps). I haven't done the error analysis, but I would expect > > that errors are similar to the current code. > > In the current mpn_sbpi1_divappr_q, there's a curious flag (or mask) > variable used in this loop. It's initially all ones, cleared if we ever > fail to cancel the top limb of the current partial remainder. When the > flag is cleared, remaining quotient limbs are set to GMP_NUMB_MAX. > > Torbjörn, Paul, do you remember the analysis behind this? (Code is since > 2009...). > > I would guess it might happen that even if higher quotient limbs are all > correct, we might get non-zero high limb in > > partial remainder - GMP_NUMB_MAX * truncated D > > If we set the rest of the quotient limbs to GMP_NUMB_MAX, then the > quotient should be large enough thanks to the low end divisor limbs > which we ignored in the truncation. > > Regards, > /Niels I don't believe I have written that code. Anyway I don't quite understand the code from sbpi1_divappr_q.c (say in GMP 6.1.2): * when flag becomes 0, the condition UNLIKELY (n1 >= (d1 & flag)) is always true, thus we always take that branch. Is that wanted? * when n1 != cy on line 129, I guess that q = GMP_NUMB_MASK was too large, thus we should decrease q and add back the divisor. But the code in lines 133-134 is never executed (https://gmplib.org/devel/lcov/shell/var/tmp/lcov/gmp/mpn/sbpi1_divappr_q.c.gcov.html) * idem for the code in lines 177-178 Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Niels, > From: ni...@lysator.liu.se (Niels Möller) > Date: Sat, 28 Apr 2018 13:27:36 +0200 > [...] > Inputs are a two limb normlized divisor, {d1, d0}, d1 >= B/2, and two > numerator words, {n2, n1}, which must be less than {d1, d0}. [...] if you use udiv_qr_3by2 to divide (say) a 6-limb number by a 3-limb number, then in the schoolbook loop you might have {n2, n1} = {d1, d0}, in the case where n0 is smaller than the next divisor limb d[-1]. If it is the responsibility of the caller to ensure {n2, n1} < {d1, d0}, then every caller of udiv_qr_3by2 must deal with that special case. In my opinion, it would be simpler to allow {n2, n1} = {d1, d0}, and return q=GMP_NUMB_MAX in that case. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: udiv_qr_3by2 vs divappr
Niels, > static inline mp_limb_t > divappr_2 (mp_limb_t n2, mp_limb_t n1, >mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv) > This needs more analysis. [...] you might be interested by Algorithms DivApprox1 and DivApprox2 from https://hal.inria.fr/hal-01502326. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: dead code in div_q.c?
Niels, I fully agree with your analysis. Small comments below. > From: ni...@lysator.liu.se (Niels Möller) > Date: Thu, 26 Apr 2018 22:12:09 +0200 > > paul zimmermann <paul.zimmerm...@inria.fr> writes: > > > In the second case (lines 267-276) this is less obvious since you use an > > approximate division (divappr), and I don't know what is the maximal allowed > > difference between divappr and div. If divappr always return a quotient > > less or equal to the true quotient, then qh cannot be zero in this case. > > But if qh can be larger than the true quotient, even by one bit, then it > > is possible that qh is 1, in the case {new_np, new_nn} = 0111...111 and > > {new_dp, dn} = 1000...000. > > You're right again... I thought we had more margin. > > According to comments, quotient from sb and dc divappr is correct or > one too large, while quotient from mu divappr can be at most 4 too > large. > > And the limit case, N = B^n/2 - 1, D = B^k/2 is > >floor ((B^n/2 - 1) / (B^k/2)) = floor(B^{n-k} - B^{-k}/2) > = B^{n-k} - 1 > > So even we only a single unit of error, we could get qh == 1. > > I wonder what actual error can be produced by the divappr functions in > the corner cases that error could spill over into qh? > > In the cy != 0 case, it looks like we we make a balanced divappr call, > (2qn+2) / (qn+1). Lets put k = qn+1. For divappr we then have > > N <= B^{2k}/2 - 1, > D >= B^k/2, lets say D = B/2 + e, e >= 0 D = B/2 + e should read D = B^k/2 + e > To have any possibility to get qh > 0 with from the approximation error, > we must have Q >= B^k - 4, where Q denotes the correct quotient. So > assume that is the case. Now, > > D <= N/Q <= (B^{2k}/2 - 1) / (B^k - 4) > > which implies > > e <= N/Q - B^k/2 <= (2 B^k - 1) / (B^k - 4) < 3 > > So we need to consider only D = B^k/2 + {0, 1, 2}. Then, any inverse not > taking the very least significant D limb into account (including mu inverse) > must be *all* ones, right? yes if you compute the inverse as floor((B^(2k-2)-1)/floor(D/B)) > It would be good to either prove that qh > 0 can't happen, or add a test > that will exercise the code handling that case. yes more generally it would be nice to know for the divappr functions, if the exact quotient fits on qn limbs (i.e., the exact qh is 0), can the "approximate" qh be 1? > > beware that in the case cy <> 0, one divides {new_np, nn+1} by {new_dp, dn} > > thus filling {qp, nn-dn+1}. > > Thanks for pointing this out. Now I understand better. > > Regards, > /Niels > > -- > Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. > Internet email is subject to wholesale government surveillance. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: C99 and GMP
> > quite interesting. Why is gmp/mpn not tested in the head coverage? > > It is tested. It appears as /var/tmp/lcov/gmp/mpn because it is a set of > symlinks created at build time. sorry I missed that. I see some of the files are not tested at all (add_err3_n.c for example), and some have a low coverage (div_qr_1.c for example). Are there any plans to improve that? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: dead code in div_q.c?
Niels, > From: ni...@lysator.liu.se (Niels Möller) > Cc: gmp-devel@gmplib.org, raphael.rieu-he...@lri.fr > Date: Wed, 25 Apr 2018 22:22:41 +0200 > User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (berkeley-unix) > > ni...@lysator.liu.se (Niels Möller) writes: > > > paul zimmermann <paul.zimmerm...@inria.fr> writes: > > > >> together with Raphaël Rieu-Hleft (in cc), we believe we have found some > >> dead code in > >> mpn/generic/div_q.c around lines 173-182: > >> > >> else if (UNLIKELY (qh != 0)) > >> { > >> /* This happens only when the quotient is close to B^n and > >> > >> mpn_*_divappr_q returned B^n. */ > > > > I think your right. This comment is wrong in two ways: First, divappr is > > not called (directly) for this branch of code, and second, the quotient > > can be at most close to B^n/2. > > I'm trying the appended patch. Besides the analysis, I've ran > > while GMP_CHECK_RANDOMIZE=0 ./tests/mpn/t-div ; do : ; done > > for at least half an hour. So I think we can drop the code. for the first case qh != 0 (lines 174-183) we are sure this cannot happen, since cnt < GMP_NUMB_BITS, thus new_dp[new_nn-1] has its most significant bit 0, and since new_dp[dn-1] has its most significant set, we cannot have a carry in the division. In the second case (lines 267-276) this is less obvious since you use an approximate division (divappr), and I don't know what is the maximal allowed difference between divappr and div. If divappr always return a quotient less or equal to the true quotient, then qh cannot be zero in this case. But if qh can be larger than the true quotient, even by one bit, then it is possible that qh is 1, in the case {new_np, new_nn} = 0111...111 and {new_dp, dn} = 1000...000. > Question is, > how much of comments and/or ASSERT are useful? And should we make the > assignments > > qp[qn - 1] = qh; > > and > > tp[qn] = qh; > > unconditional? As far as I see, the areas are always large enough. beware that in the case cy <> 0, one divides {new_np, nn+1} by {new_dp, dn} thus filling {qp, nn-dn+1}. If you write qp[qn-1]=qh afterwards, you will erase the high limb of the quotient. (By the way, this is another way to prove that qh=0 in this case, since we know the quotient fits on nn-dn+1 limbs.) If the division routines already check for zero high limbs of the dividend, then it would be simpler to have new_nn = nn + 1 in all cases, and we would not need to write qp[qn-1]. Paul > But it's a unclear to me if and why it's sometimes permissible to omit > writing the top limbs? Reads of the top limbs look unconditional. > > I've tried adding > > --- a/tests/mpn/t-div.c Wed Apr 25 07:38:14 2018 +0200 > +++ b/tests/mpn/t-div.c Wed Apr 25 22:19:17 2018 +0200 > @@ -445,6 +445,7 @@ main (int argc, char **argv) > alloc = itch + 1; > } > scratch[itch] = ran; > + MPN_COPY (qp, junkp, nn - dn + 1); > mpn_div_q (qp, np, nn, dup, dn, scratch); > ASSERT_ALWAYS (ran == scratch[itch]); > ASSERT_ALWAYS (qp[-1] == qran0); ASSERT_ALWAYS (qp[nn - dn + 1] == > qran1); > > to the test, to ensure that limbs aren't correct thanks to the previous > call. Tests still pass. So I'm a bit puzzled. > > Regards, > /Niels > > *** /tmp/extdiff.0aBNfO/gmp.765c2c27523b/mpn/generic/div_q.c 2018-04-25 > 21:55:51.457769871 +0200 > --- /home/nisse/hack/gmp/mpn/generic/div_q.c 2018-04-25 21:18:01.516501993 > +0200 > *** mpn_div_q (mp_ptr qp, > *** 171,184 > if (cy == 0) > qp[qn - 1] = qh; > ! else if (UNLIKELY (qh != 0)) > ! { > ! /* This happens only when the quotient is close to B^n and > ! mpn_*_divappr_q returned B^n. */ > ! mp_size_t i, n; > ! n = new_nn - dn; > ! for (i = 0; i < n; i++) > ! qp[i] = GMP_NUMB_MAX; > ! qh = 0; /* currently ignored */ > ! } > } > else /* divisor is already normalised */ > --- 171,176 > if (cy == 0) > qp[qn - 1] = qh; > ! else > ! ASSERT_ALWAYS (qh == 0); > } > else /* divisor is already normalised */ > *** mpn_div_q (mp_ptr qp, > *** 263,276 > if (cy == 0) > tp[qn] = qh; > ! else if (UNLIKELY (qh != 0)) > ! { > ! /* This happens only when the quotient is close to B^n and > ! mpn_*_divappr_q returned B^n. */ > ! mp_size_t i, n; > ! n = new_nn - (qn + 1)
Re: Comparison of multiple-precision floating-point software
Dear Torbjörn, > I am surprised that there are non-marginal differences for larger > operations. Don't we all use mpn? Bookkeeping should be the only > difference. since 2004 we did implement in MPFR "short" product/square/division, which compute an approximation of the upper n limbs of a nxn product or square, or of the quotient of a 2n/n division, following this paper: @article{Mulders00, author = {T. Mulders}, title ={On short Multiplications and Divisions}, journal = "AAECC", year = 2000, volume = 11, number = 1, pages = "69--88" } If such algorithms would be available in GMP, of course we would use them. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Comparison of multiple-precision floating-point software
Dear GMP developers, I have updated my comparison of multiple-precision floating-point software. The old page was comparing MPF from GMP 5.0.2 with (among others) MPFR 3.1.2: http://www.mpfr.org/mpfr-3.1.2/timings.html Here, MPF was only faster than MPFR for 100d mul and sqr, and 1000d div. The new comparison is available at (temporary web page): https://members.loria.fr/PZimmermann/timings.html Now MPF is faster than MPFR for all 100d operations, for 1000d and 1d div. You have done a great work in GMP 6! You should be able to reproduce the timings since they were done on gcc67, and the source code is available. Your feedback is welcome. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_mul is embarrassingly slow
Dear Torbjörn, > What do you think about this stopgap change? I would entirely drop all the squaring-related stuff from mpn_mul: the user/developer should call mpn_sqr instead (see my previous mail). Then the code would become: if (BELOW_THRESHOLD (vn, MUL_TOOM22_THRESHOLD)) { mpn_mul_basecase (prodp, up, un, vp, vn); return prodp[un + vn - 1]; /* historic */ } if (un == vn) return mpn_mul_n (prodp, up, vp, un); ... (I believe it is vn and not un that should be compared to MUL_TOOM22_THRESHOLD.) Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_mul is embarrassingly slow
> It is surely silly that we don't have any mpn call for when it is known > that the multiplier and multiplicand are distinct. but now that mpn_sqr is in the GMP interface (since GMP 6 I guess), why check for {up,un} = {vp,vn} in mpn_mul? Shouldn't the user or the GMP developer call mpn_sqr directly? In "make check" of GMP 6.1.2, the "square" branch of mpn_mul is taken only in 4 tests: mpf/t-sqrt, mpf/t-sqrt_ui, mpf/reuse and mpf/t-pow_ui. It seems some of those cases are because there is no mpf_sqr function, and mpf_mul calls mpn_mul without checking for squares. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
MPN_FILL vs MPN_ZERO
Hi, I just discovered the internal macro MPN_FILL. I understand MPN_ZERO(p,n) is implemented as "if (n) MPN_FILL(p, n, 0)". Then in case MPN_FILL is implemented using memset, since memset also checks for the case n=0, MPN_ZERO(p,n) will perform two tests for n > 0. Why not directly call MPN_FILL? Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
unused function
Dear all, in GMP 6.1.2, the mpn_bdiv_q_1 function in mpn/generic/bdiv_q_1.c is used nowhere in the library, and is not exported, thus it could be removed. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
dead code in div_q.c?
Hi, together with Raphaël Rieu-Hleft (in cc), we believe we have found some dead code in mpn/generic/div_q.c around lines 173-182: else if (UNLIKELY (qh != 0)) { /* This happens only when the quotient is close to B^n and mpn_*_divappr_q returned B^n. */ mp_size_t i, n; n = new_nn - dn; for (i = 0; i < n; i++) qp[i] = GMP_NUMB_MAX; qh = 0; /* currently ignored */ } Indeed, in the case cy <> 0, since cnt < GMP_NUMB_BITS, the most significant limb new_np[nn] of the dividend has at least one 0 leading bit, and thus is smaller than new_dp[dn-1] (which is normalized). Therefore qh is always 0 (no approximate division is performed here). Raphaël and Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_mul is embarrassingly slow
Niels, > Such an assembly routine would need access to the threshold between > basecase and generic, which in the case of fat builds isn't a compile > time constant. but you could determine at compile time a lower bound for fat builds, no? Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Lazy mpz allocation
> Only 0 can have lazy allocation, and I think we document that it isn't > legal to put 0 on the denominator. where is this documented? In mpfr_set_q we use the fact that the user can set q to 1/0 for example to represent +Inf. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: How large are arbitrarily large integers? (Was Re: Bitfield subtlety)
> For some reason, I also limited the choices so that either 8 or 16 bits > are added to the _mp_size field. That might give a small speedup in > some cases, perhaps, but the result is that only one balanced > possibility remain: > > b = 2 > manstissa bits = 10 > exponent bits = 6 > > Let's abbreviate this as (2,10,6). > > This gives an allocation range of almost 2^73 limbs, which is of course > silly, but (2,11,5) would result in just the range 2^42 limbs. another possibility is the "unum" type, where some bits are used to define how many bits are used for the significand and exponent: https://en.m.wikipedia.org/wiki/Unum_(number_format) In the limit of unum, you could use only 1 extra bit, say for a 31-bit type (thus with 30 remaining bits): 1) if 0, the 30 bits are interpreted as a size between 0 and 2^30-1 2) if 1, the 30 bits are split into say 24 significand bits and 6 exponent bits, which would be enough up to 2^87 with an overhead of less than 1e-6 Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: How large are arbitrarily large integers? (Was Re: Bitfield subtlety)
> I think one relevant question is on how large numbers we expect to be > able to complete any computation with worse complexity than O(n) in > reasonable time. But when we integrate fft multiplication with decent > locality, the limit will be cpu power, not available RAM, I'd hope. not sure. With GMP 17, we'll have multi-thread multiplication, then memory will be the main limitation. Paul Zimmermann ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Miller-Rabin
Niels, > int > mpz_miller_rabin_uis (mpz_srcptr n, size_t count, unsigned long *bases); this interface would be nice for me. I don't think a bignum base is necessary, unless you use REDC in powm, in which case (for n having at least 2 limbs) it might be faster to use a base b such that b*R mod n is small, where R is the REDC multiplier. > Out of curiosity, are any similar results known when using bases given > by Euler's polynomial, > > a_j = j^2 + j + 41, > > with the curious property that j_40 = 41^2 is the first composite in the > sequence? a quick test shows that n=21 already requires 3 bases, like 764941, 1004653... Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Miller-Rabin
Hi, it would be nice if there would be a GMP function implementing the Miller-Rabin test, to have finer control than with mpz_probab_prime_p. This would allow to get a real primality test for small enough input. For example on https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test we can read: if n < 4,759,123,141, it is enough to test a = 2, 7, and 61; thus the GMP user who wants to check inputs less than 4759123141 could write : int my_isprime (mpz_t p) { assert (mpz_cmp_ui (p, 4759123141UL) < 0); return mpz_miller_rabin (p, 2) && mpz_miller_rabin (p, 7) && mpz_miller_rabin (p, 61); } Alternatively, if you don't want to add a new function, you could allow negative REPS argument for mpz_probab_prime_p: * for example REPS=-1 would mean checking base 2 (n < 2,047) * REPS=-2 would mean checking bases 31 and 73 (n < 9080191) * REPS=-3 would mean checking bases 2, 7, and 61 (n < 4759123141) * ... Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp
Dear Niels, > If we add a version number to mini-gmp, what should it be? The only > "releases" of mini-gmp are gmp releases, should we adopt that, and then > have no well-defined version number for mini-gmp between gmp releases? it could then be the corresponding GMP release version, for example you could have in mini-gmp.h: #define __GMP_IS_MINI #define __GNU_MP_VERSION6 #define __GNU_MP_VERSION_MINOR 1 #define __GNU_MP_VERSION_PATCHLEVEL 2 then applications could know the version of mini-gmp is that shipped with GMP 6.1.2. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp
Marc, > I am curious why you are trying to support mini-gmp in mpfr at all. As far > as I understand, the goal of mini-gmp is that a user can take a copy of > those 2 files, stick them in his project, and get a self-contained > program. Unless you provide a similar mini-mpfr, your user is going to > have to install the mpfr dependency anyway, his project is not > self-contained, so he might as well install the real GMP. mpfr users might have their own interest in using mini-gmp, but my own interest is for mpfr development. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp
Dear Niels, > We could add some documented macro for this purpse, but I'm not sure I > understand why you need it. for example, we check the version of GMP is at least 5.0.0: #if !__MPFR_GMP(5,0,0) # error "GMP 5.0.0 or newer is required" #endif where the __MPFR_GMP macro uses __GNU_MP_VERSION, ... Since mini-gmp does not provide any version macro, we can't check the version of mini-gmp is good enough. > And then I have a couple of #if NETTLE_USE_MINI_GMP spread out in the > rest of the code. same in MPFR: $ grep MPFR_USE_MINI_GMP src/*.[ch] tests/*.[ch] | wc -l 47 > > and finally mpz_dump. > > Hmm. It seems this function is declared in gmp.h, but not mentioned in > the manual, and with a discouraging comment at the top of mpz/dump.c. > > Any reason you can't switch to always use mpz_out_str instead? in fact it seems we no longer need this function in MPFR. Thanks! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mini-gmp
Dear GMP developers, I see that you have added some functions in the mini-gmp interface, with respect to the last time I tried it with MPFR, for example mpn_neg, mpn_com and mpn_sqrtrem. Thanks! However, we are still missing a proper (documented) way to distinguish mini-gmp from plain gmp. We could use the __MINI_GMP_H__ macro but it is not documented. A macro like GMP_IS_MINI would be great, maybe with another macro that would help us to identify the corresponding version of GMP (the corresponding plain GMP macro would be GMP_IS_GREAT of course :-) And I notice that very few of the division functions from mini-gmp correspond to those of the GMP documentation. Thus to compile MPFR with mini-gmp we had to re-implement mpn_divrem_1, mpn_divrem and mpn_tdiv_qr. We can contribute them to mini-gmp if you want. The other functions we had to re-implement are some random functions (using srand48/lrand48, thus generating different random sequences, but for MPFR this is fine), and finally mpz_dump. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Revisiting gcd and gcdext for small operands
> > Without detail knowledge of the current implementation, these numbers > > suggest to me that we could speed small operand gcd and gcdext. But I > > might be wrong. not clear to me. For one word basically you can perform a multiply in a few cycles, whereas for a gcd you may need up to mp_bits_per_limb operations. If you find an algorithm that computes a gcd of two words in a few cycles, it would be a major improvement! Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Generic get_d_2exp failures
Niels, > It would be nice if we could find a portable way to add two floating > point values without rounding up. see http://perso.univ-perp.fr/langlois/slides/imacs05_sl.pdf, slide 10, algorithms "Two Sum" and "Fast Two Sum" in the literature. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: Replacing "redc"
quite interesting. Is the API of the new interface fixed? I'd like to try it with GMP-ECM, where we currently use the redc_* internal functions. For example on my machine Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz, running at 3499.804Mhz: $ ./bench_mulredc -v ... Time in microseconds per call, size=8 (iter=1807889): mpn_mul_n = 0.055 mpn_sqr= 0.042 mpn_redc_1 = 0.069 * mpn_redc_2 = 0.077 mpn_redc_n = disabled mulredc= 0.124 mul+redc_1 = 0.122 * mul+redc_2 = 0.135 mul+redc_n = disabled mulredc= 0.124 sqr+redc_1 = 0.108 * sqr+redc_2 = 0.122 sqr+redc_n = disabled mulredc1 = 0.013 Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sqrtrem{1,2} - patch for pure C implem
Dear Marco, > Finally we can totally skip the first approximation and directly compute > > i' = T1[h(a)] + l(a) * T2[h(a)] > > using a piece-wise linear function. yes, we noticed with Niels that we can replace most of the arithmetic in the first Newton iteration by another table lookup. Not sure l(a) is better than using the upper (say) 16 bits of a, in case you want a 16-bit approximation i'. To compute T1 and T2 you have two choices: 1) either try to minimize the maximal error after the first Newton iteration (i.e., on i') 2) or try to minimize the maximal error after k iterations. See the article by Jean-Michel Muller and Peter Kornerup: https://hal.archives-ouvertes.fr/inria-00071899/document Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
patch for speed to measure the basic mpf functions
Dear GMP developers, below is a patch (against GMP 6.1.1) that allows to measure the basic mpf functions using speed (with -s in bits): $ ./speed -p10 -c -s 113 mpf_add mpf_sub mpf_mul mpf_div mpf_sqrt overhead 3.56 cycles, precision 10 units of 1.25e-09 secs, CPU freq 800.13 MHz mpf_add mpf_sub mpf_mul mpf_div mpf_sqrt 113#45.00 48.36 46.30145.90352.58 I contribute this patch to GMP, but most probably you can improve it. Best regards, Paul --- common.c.orig 2016-09-01 15:48:21.378349300 +0200 +++ common.c2016-09-01 16:34:24.035003662 +0200 @@ -1998,6 +1998,152 @@ mpf_clear (f)); } +double +speed_mpf_add (struct speed_params *s) +{ + mpf_t w, x, y; + unsigned i; + doublet; + + mpf_init2 (w, s->size); + mpf_init2 (x, s->size); + mpf_init2 (y, s->size); + + mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_add (w, x, y); + + speed_starttime (); + i = s->reps; + do +{ + mpf_add (w, x, y); +} + while (--i != 0); + t = speed_endtime (); + + mpf_clear (w); + mpf_clear (x); + mpf_clear (y); + return t; +} + +double +speed_mpf_sub (struct speed_params *s) +{ + mpf_t w, x, y; + unsigned i; + doublet; + + mpf_init2 (w, s->size); + mpf_init2 (x, s->size); + mpf_init2 (y, s->size); + + mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_sub (w, x, y); + + speed_starttime (); + i = s->reps; + do +{ + mpf_sub (w, x, y); +} + while (--i != 0); + t = speed_endtime (); + + mpf_clear (w); + mpf_clear (x); + mpf_clear (y); + return t; +} + +double +speed_mpf_mul (struct speed_params *s) +{ + mpf_t w, x, y; + unsigned i; + doublet; + + mpf_init2 (w, s->size); + mpf_init2 (x, s->size); + mpf_init2 (y, s->size); + + mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_mul (w, x, y); + + speed_starttime (); + i = s->reps; + do +{ + mpf_mul (w, x, y); +} + while (--i != 0); + t = speed_endtime (); + + mpf_clear (w); + mpf_clear (x); + mpf_clear (y); + return t; +} + +double +speed_mpf_div (struct speed_params *s) +{ + mpf_t w, x, y; + unsigned i; + doublet; + + mpf_init2 (w, s->size); + mpf_init2 (x, s->size); + mpf_init2 (y, s->size); + + mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_div (w, x, y); + + speed_starttime (); + i = s->reps; + do +{ + mpf_div (w, x, y); +} + while (--i != 0); + t = speed_endtime (); + + mpf_clear (w); + mpf_clear (x); + mpf_clear (y); + return t; +} + +double +speed_mpf_sqrt (struct speed_params *s) +{ + mpf_t w, x; + unsigned i; + doublet; + + mpf_init2 (w, s->size); + mpf_init2 (x, s->size); + + mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0); + mpf_sqrt (w, x); + + speed_starttime (); + i = s->reps; + do +{ + mpf_sqrt (w, x); +} + while (--i != 0); + t = speed_endtime (); + + mpf_clear (w); + mpf_clear (x); + return t; +} /* Compare this to mpn_add_n to see how much overhead mpz_add adds. Note that repeatedly calling mpz_add with the same data gives branch prediction --- speed.c.orig2016-09-01 15:48:04.174345004 +0200 +++ speed.c 2016-09-01 15:40:37.746232291 +0200 @@ -509,6 +509,11 @@ { "mpz_init_clear", speed_mpz_init_clear }, { "mpq_init_clear", speed_mpq_init_clear }, { "mpf_init_clear", speed_mpf_init_clear }, + { "mpf_add", speed_mpf_add }, + { "mpf_sub", speed_mpf_sub }, + { "mpf_mul", speed_mpf_mul }, + { "mpf_div", speed_mpf_div }, + { "mpf_sqrt", speed_mpf_sqrt }, { "mpz_init_realloc_clear", speed_mpz_init_realloc_clear }, { "umul_ppmm", speed_umul_ppmm, FLAG_R_OPTIONAL }, --- speed.h.orig2016-09-01 15:48:10.370346551 +0200 +++ speed.h 2016-09-01 15:37:47.686188699 +0200 @@ -153,6 +153,11 @@ double speed_binvert_limb_arith (struct speed_params *); double speed_mpf_init_clear (struct speed_params *); +double speed_mpf_add (struct speed_params *); +double speed_mpf_sub (struct speed_params *); +double speed_mpf_mul (struct speed_params *); +double speed_mpf_div (struct speed_params *); +double speed_mpf_sqrt (struct speed_params *); double speed_mpn_add_n (struct speed_params *); double speed_mpn_add_1 (struct speed_params *); ___ gmp-devel mailing list
mpn_sqrtrem2
Hi, the attached file contains a new version (mpn_sqrtrem2_new) of mpn_sqrtrem2 for 64-bit computers. On my computer i5-4590 it runs in about 65 cycles whereas the gmp-6.1.1 version runs in about 96 cycles. It uses an auxiliary routine (mpn_rsqrtrem1) which computes a 41-bit approximation of 1/sqrt(a). Before I try to adapt those two routines to 32-bit limbs, I'd like someone to review my code (comments are included). Paul /* mpn_sqrtrem -- square root and remainder Contributed to the GNU project by Paul Zimmermann (most code), Torbjorn Granlund (mpn_sqrtrem1) and Marco Bodrato (mpn_dc_sqrt). THE FUNCTIONS IN THIS FILE EXCEPT mpn_sqrtrem ARE INTERNAL WITH A MUTABLE INTERFACE. IT IS ONLY SAFE TO REACH THEM THROUGH DOCUMENTED INTERFACES. IN FACT, IT IS ALMOST GUARANTEED THAT THEY WILL CHANGE OR DISAPPEAR IN A FUTURE GMP RELEASE. Copyright 1999-2002, 2004, 2005, 2008, 2010, 2012, 2015 Free Software Foundation, Inc. This file is part of the GNU MP Library. The GNU MP Library is free software; you can redistribute it and/or modify it under the terms of either: * the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version. or * the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. or both in parallel, as here. The GNU MP Library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with the GNU MP Library. If not, see https://www.gnu.org/licenses/. */ /* See "Karatsuba Square Root", reference in gmp.texi. */ #include #include #include "gmp.h" #include "gmp-impl.h" #include "longlong.h" #define USE_DIVAPPR_Q 1 #define TRACE(x) static const unsigned char invsqrttab[384] = /* The common 0x100 was removed */ { 0xff,0xfd,0xfb,0xf9,0xf7,0xf5,0xf3,0xf2, /* sqrt(1/80)..sqrt(1/87) */ 0xf0,0xee,0xec,0xea,0xe9,0xe7,0xe5,0xe4, /* sqrt(1/88)..sqrt(1/8f) */ 0xe2,0xe0,0xdf,0xdd,0xdb,0xda,0xd8,0xd7, /* sqrt(1/90)..sqrt(1/97) */ 0xd5,0xd4,0xd2,0xd1,0xcf,0xce,0xcc,0xcb, /* sqrt(1/98)..sqrt(1/9f) */ 0xc9,0xc8,0xc6,0xc5,0xc4,0xc2,0xc1,0xc0, /* sqrt(1/a0)..sqrt(1/a7) */ 0xbe,0xbd,0xbc,0xba,0xb9,0xb8,0xb7,0xb5, /* sqrt(1/a8)..sqrt(1/af) */ 0xb4,0xb3,0xb2,0xb0,0xaf,0xae,0xad,0xac, /* sqrt(1/b0)..sqrt(1/b7) */ 0xaa,0xa9,0xa8,0xa7,0xa6,0xa5,0xa4,0xa3, /* sqrt(1/b8)..sqrt(1/bf) */ 0xa2,0xa0,0x9f,0x9e,0x9d,0x9c,0x9b,0x9a, /* sqrt(1/c0)..sqrt(1/c7) */ 0x99,0x98,0x97,0x96,0x95,0x94,0x93,0x92, /* sqrt(1/c8)..sqrt(1/cf) */ 0x91,0x90,0x8f,0x8e,0x8d,0x8c,0x8c,0x8b, /* sqrt(1/d0)..sqrt(1/d7) */ 0x8a,0x89,0x88,0x87,0x86,0x85,0x84,0x83, /* sqrt(1/d8)..sqrt(1/df) */ 0x83,0x82,0x81,0x80,0x7f,0x7e,0x7e,0x7d, /* sqrt(1/e0)..sqrt(1/e7) */ 0x7c,0x7b,0x7a,0x79,0x79,0x78,0x77,0x76, /* sqrt(1/e8)..sqrt(1/ef) */ 0x76,0x75,0x74,0x73,0x72,0x72,0x71,0x70, /* sqrt(1/f0)..sqrt(1/f7) */ 0x6f,0x6f,0x6e,0x6d,0x6d,0x6c,0x6b,0x6a, /* sqrt(1/f8)..sqrt(1/ff) */ 0x6a,0x69,0x68,0x68,0x67,0x66,0x66,0x65, /* sqrt(1/100)..sqrt(1/107) */ 0x64,0x64,0x63,0x62,0x62,0x61,0x60,0x60, /* sqrt(1/108)..sqrt(1/10f) */ 0x5f,0x5e,0x5e,0x5d,0x5c,0x5c,0x5b,0x5a, /* sqrt(1/110)..sqrt(1/117) */ 0x5a,0x59,0x59,0x58,0x57,0x57,0x56,0x56, /* sqrt(1/118)..sqrt(1/11f) */ 0x55,0x54,0x54,0x53,0x53,0x52,0x52,0x51, /* sqrt(1/120)..sqrt(1/127) */ 0x50,0x50,0x4f,0x4f,0x4e,0x4e,0x4d,0x4d, /* sqrt(1/128)..sqrt(1/12f) */ 0x4c,0x4b,0x4b,0x4a,0x4a,0x49,0x49,0x48, /* sqrt(1/130)..sqrt(1/137) */ 0x48,0x47,0x47,0x46,0x46,0x45,0x45,0x44, /* sqrt(1/138)..sqrt(1/13f) */ 0x44,0x43,0x43,0x42,0x42,0x41,0x41,0x40, /* sqrt(1/140)..sqrt(1/147) */ 0x40,0x3f,0x3f,0x3e,0x3e,0x3d,0x3d,0x3c, /* sqrt(1/148)..sqrt(1/14f) */ 0x3c,0x3b,0x3b,0x3a,0x3a,0x39,0x39,0x39, /* sqrt(1/150)..sqrt(1/157) */ 0x38,0x38,0x37,0x37,0x36,0x36,0x35,0x35, /* sqrt(1/158)..sqrt(1/15f) */ 0x35,0x34,0x34,0x33,0x33,0x32,0x32,0x32, /* sqrt(1/160)..sqrt(1/167) */ 0x31,0x31,0x30,0x30,0x2f,0x2f,0x2f,0x2e, /* sqrt(1/168)..sqrt(1/16f) */ 0x2e,0x2d,0x2d,0x2d,0x2c,0x2c,0x2b,0x2b, /* sqrt(1/170)..sqrt(1/177) */ 0x2b,0x2a,0x2a,0x29,0x29,0x29,0x28,0x28, /* sqrt(1/178)..sqrt(1/17f) */ 0x27,0x27,0x27,0x26,0x26,0x26,0x25,0x25, /* sqrt(1/180)..sqrt(1/187) */ 0x24,0x24,0x24,0x23,0x23,0x23,0x22,0x22, /* sqrt(1/188)..sqrt(1/18f) */ 0x21,0x21,0x21,0x20,0x20,0x20,0x1f,0x1f, /* sqrt(1/190)..sqrt(1/197) */ 0x1f,0x1e,0x1e,0x1e,0x1d,0x1d,0x1d,0x1c, /* sqrt(1/198)..sqrt(1/19f) */ 0x1c,0x1b,0x1b,0x1b,0x1a,0x1a,0x1a,0x19, /* sqrt(1/1a0)..sqrt(1/1a7) */ 0x19,0x19,0x18,0x18,0x18,0x18,0x17,0x17, /* sqrt(1/1a8)..sqrt(1/1af) */ 0x17,0x1
Re: mpn_invert_limb
Niels, > > would it be possible to add mpn_invert_limb in the public interface? > > It would be very useful for GNU MPFR. > > Makes sense to me, it's a useful and well-defined operations, and > discussed at length in various papers. So should be quite easy to > document. great! Please keep me informed when it happens, so that we can update MPFR accordingly. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_sqrtrem1
Torbjörn, > I am not proud of the mpn_sqrtrem1 code. It was made with some analysis > and lots of testing. The many undocumented magic constants are ugly. > Perhaps I should never have checked in this code. ah ok. I have an alternate implementation which is almost as efficient, but although it passes make check with both GMP and MPFR, I have not yet proven it is correct :-( > Furthermore, for a real performance win one would need to rewrite > mpn_sqrtrem2. I know how to improve mpn_sqrtrem2. For a 64-bit computer, start from a 32-bit approximation, say x, of 1/sqrt(a), which is already done by mpn_sqrtrem1. Then use Karp-Markstein's trick for the square root: y = approx(a*x) # 32-bit approximation of sqrt(a) t = a - y^2 return y + x/2*t # 64-bit approximation of sqrt(a) I will try to implement that. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpn_sqrtrem1
Hi, in my current effort of optimizing MPFR for small precisions (1-2 limbs), I am currently reviewing mpn_sqrtrem1() and mpn_sqrtrem2(), since we need such functions to speed up mpfr_sqrt. In mpn_sqrtrem1, I wonder what the 0x3 constant is used for. By exhaustive search on all possible values of a1 (there are only 3/4*2^33) on a 64-bit computer, the maximal error on the 16-bit approximation x0 is the same without this constant, more precisely -0.23 <= x0/2^32 - (a0/2^64)^(-1/2) <= 0. The worst error is even smaller (in absolute value) without the 0x3 constant, since with - 0x3 it is attained for a1=2248147012 (error -0.231867), and without 0x3 it is attained for a1=2248147013 (error -0.231863). Also "make check" passes without the 0x3 constant. If the 0x3 constant is really needed, it would be nice to add an example that breaks make check if we remove it. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
__udiv_qrnnd_c
Hello, with the patch below, I get a small speedup on __udiv_qrnnd_c. Best regards, Paul Zimmermann --- longlong.h.orig 2016-07-18 09:47:05.832658464 +0200 +++ longlong.h 2016-07-18 09:57:21.832816211 +0200 @@ -2064,7 +2064,7 @@ /* Define this unconditionally, so it can be used for debugging. */ #define __udiv_qrnnd_c(q, r, n1, n0, d) \ do { \ -UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m;\ +UWtype __d1, __d0, __q1, __q0, __r1, __r0; \ \ ASSERT ((d) != 0); \ ASSERT ((n1) < (d)); \ @@ -2074,29 +2074,25 @@ \ __q1 = (n1) / __d1; \ __r1 = (n1) - __q1 * __d1; \ -__m = __q1 * __d0; \ -__r1 = __r1 * __ll_B | __ll_highpart (n0); \ -if (__r1 < __m)\ +__r0 = __r1 * __ll_B | __ll_highpart (n0); \ +__r1 = __r0 - __q1 * __d0; \ +if (__r1 > __r0) /* borrow when subtracting q1*d0 */\ { \ __q1--, __r1 += (d);\ - if (__r1 >= (d)) /* i.e. we didn't get carry when adding to __r1 */\ - if (__r1 < __m) \ - __q1--, __r1 += (d);\ + if (__r1 > (d)) /* no carry when adding d */\ + __q1--, __r1 += (d); \ } \ -__r1 -= __m; \ \ __q0 = __r1 / __d1; \ __r0 = __r1 - __q0 * __d1; \ -__m = __q0 * __d0; \ -__r0 = __r0 * __ll_B | __ll_lowpart (n0); \ -if (__r0 < __m)\ +__r1 = __r0 * __ll_B | __ll_lowpart (n0); \ +__r0 = __r1 - __q0 * __d0; \ +if (__r0 > __r1) /* borrow when subtracting q0*d0 */\ { \ __q0--, __r0 += (d);\ - if (__r0 >= (d))\ - if (__r0 < __m) \ - __q0--, __r0 += (d);\ + if (__r0 > (d)) /* no carry when adding d */\ + __q0--, __r0 += (d); \ } \ -__r0 -= __m; \ \ (q) = __q1 * __ll_B | __q0; \ (r) = __r0; \ ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpn_invert_limb
Dear GMP developers, would it be possible to add mpn_invert_limb in the public interface? It would be very useful for GNU MPFR. Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: getrusage vs clock
> since version 2.18 of the glibc, the clock() function is much more precise. > For example on my machine it has a resolution of 1 micro-seconds whereas > getrusage() has a resolution of 4ms only, as demonstrated by the program > below: > > I believe we already use clock_gettime(2) directly on GNU/Linux, which > is the underlying syscall that clock(3) uses in newer GNU libc versions. > > Does that not happen in your tests? I don't know how to check which function is used. With gmp-6.1.0 on my workstation I get: zimmerma@tomate:/tmp/gmp-6.1.0/tune$ ./speed ... CPU cycle counter, supplemented by microsecond getrusage() Gnuplot home page http://www.gnuplot.info/ Quickplot home page http://quickplot.sourceforge.net/ which lets me think that getrusage() is still used. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
getrusage vs clock
Dear GMP developers, since version 2.18 of the glibc, the clock() function is much more precise. For example on my machine it has a resolution of 1 micro-seconds whereas getrusage() has a resolution of 4ms only, as demonstrated by the program below: zimmerma@tomate:~/try$ ./a.out GNU libc version: 2.21 GNU libc release: stable loops=3512013 clock=1000662 getrusage=100 nb_clock=100 nb_getrusage=250 resolution clock = 1.00us resolution getrusage = 4000.00us I therefore suggest to replace getrusage() by clock() in speed/tune. Paul #include #include #include #include #include unsigned long time_clock () { return clock (); } unsigned long time_getrusage () { struct rusage rus; getrusage (0, ); return rus.ru_utime.tv_sec * 100 + rus.ru_utime.tv_usec + rus.ru_stime.tv_sec * 100 + rus.ru_stime.tv_usec; } int main() { unsigned long t_clock, t_getrusage; unsigned long last_clock, last_getrusage; unsigned long nb_clock = 0, nb_getrusage = 0, loops = 0; printf("GNU libc version: %s\n", gnu_get_libc_version ()); printf("GNU libc release: %s\n", gnu_get_libc_release ()); last_clock = time_clock (); last_getrusage = time_getrusage(); while (nb_clock < 100 && nb_getrusage < 100) { t_clock = time_clock (); nb_clock += t_clock != last_clock; last_clock = t_clock; t_getrusage = time_getrusage (); nb_getrusage += t_getrusage != last_getrusage; last_getrusage = t_getrusage; loops ++; } printf ("loops=%lu clock=%lu getrusage=%lu nb_clock=%lu nb_getrusage=%lu\n", loops, time_clock (), time_getrusage (), nb_clock, nb_getrusage); printf ("resolution clock = %.2fus\n", (double) time_clock () / nb_clock); printf ("resolution getrusage = %.2fus\n", (double) time_getrusage () / nb_getrusage); } ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
mpz_ndiv_qr wanted
Dear GMP developers, after having written ten times (if not hundred) a function that computes an integer quotient and remainder with rounding to nearest, I believe this could be a nice addition to GMP, and useful to other users too. I propose: -- Function: void mpz_ndiv_qr (mpz_t Q, mpz_t R, const mpz_t N, const mpz_t D) Divide N by D, forming a quotient Q and/or remainder R such that N = Q*D + R. The quotient Q is rounded to nearest, i.e., either |R| < D/2, or if |R| = D/2, then Q is even (can occur only for even D). and of course the corresponding mpz_ndiv_q and mpz_ndiv_r functions. (The tie-breaking rule for the case |R| = D/2 comes from IEEE 754.) Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: speed of mpn_sqrtrem vs mpn_rootrem
Ciao Marco, Date: Thu, 30 Jul 2015 17:21:40 +0200 From: Marco Bodrato bodr...@mail.dm.unipi.it Ciao Paul, Il Mer, 22 Gennaio 2014 1:00 pm, Zimmermann Paul ha scritto: tarte% ./speed -s 1000,3000,1 -r mpn_sqrtrem mpn_rootrem.2 overhead 0.2 secs, precision 1 units of 3.53e-10 secs, CPU freq 2833.00 MHz mpn_sqrtrem mpn_rootrem.2 1000 0.000153504 #0.9454 3000 0.000816813 #0.9187 1 0.004606970 #0.9623 There are some more experiment that we can try, but the anomaly you spotted should be healed now: $ tune/speed -s 1000,3000,1 -r mpn_sqrt mpn_root.2 mpn_sqrtrem mpn_rootrem.2 overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU freq 3500.08 MHz mpn_sqrtmpn_root.2 mpn_sqrtrem mpn_rootrem.2 1000 #0.515161.38661.46432.1827 3000 #0.0002954371.31681.38551.8478 1#0.0018503521.24481.31491.7012 You are welcome to test and/or give further suggestions Best regards, m good to see GMP is improving! I confirm the above on my machine with the latest and greatest development version: zimmerma@tarte:/tmp/gmp/tune$ ./speed -s 1000,3000,1 -r mpn_sqrt mpn_root.2 mpn_sqrtrem mpn_rootrem.2 overhead 0.2 secs, precision 1 units of 3.53e-10 secs, CPU freq 2833.00 MHz mpn_sqrtmpn_root.2 mpn_sqrtrem mpn_rootrem.2 1000 #0.0001021051.37361.44982.1121 3000 #0.0005781141.26801.36151.7848 1#0.0034251731.22281.28681.6352 Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
3-prime FFT
on https://hal.archives-ouvertes.fr/hal-01022383, page 27, table 16, the 3-prime FFT implemented in Mathemagix is faster than GMP for 2^23 to 2^25 bits. Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: 3-prime FFT
Hi Torbjörn, The new GMP code beats GMP too, in particular when SSFFT's coefficients approach the L1d cache size. I'm looking forward to the new GMP code then... Best regards, Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_zero_p [Was: fast inversion]
Dear Marco, I don't see mpn_zero_p in the API of the current stable version 6.0.0 (according to gmplib.org). In which version will it be available? It will be available from the next release. great! Looking forward for the next release... Paul ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mini-gmp
Dear Niels, From: ni...@lysator.liu.se (Niels Möller) Date: Mon, 15 Jun 2015 09:57:39 +0200 Zimmermann Paul paul.zimmerm...@inria.fr writes: I tried to compile MPFR with mini-gmp. I needed to do a few changes: Replying to an old mail. How did this work out? How do you deal with gmp vs mini-gmp in public mpfr header files and the mpfr ABI? Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. as far as I remember, it worked at the end. See the doc/mini-gmp file in the MPFR svn repo (reproduced below). This was with GMP 6.0.0. I did not test it with more recent GMP snapshot. Maybe some changes in mini-gmp.c and mini-gmp.h are no longer necessary. Cf MPFR_USE_MINI_GMP for your last question. Best regards, Paul How to compile GNU MPFR with mini-gmp = (this was tested with MPFR r9149 and GMP 6.0.0 on x86_64 GNU/Linux machine) 1) extract the GMP tarball in say /tmp/gmp-6.0.0 go into /tmp/gmp-6.0.0/mini-gmp add the following line in mini-gmp.c (say at line 53): char gmp_version[] = 6.0.0; gcc -O2 -g -fPIC -c mini-gmp.c ar r libgmp.a mini-gmp.o 2) create a GMP install directory in say /tmp mkdir /tmp/include mkdir /tmp/lib cp libgmp.a /tmp/lib cp mini-gmp.h /tmp/include/gmp.h 3) do the following changes in /tmp/include/gmp.h: $ diff gmp.h.orig gmp.h 29a30,35 #define __GNU_MP_VERSION5 #define __GNU_MP_VERSION_MINOR 1 #define __GNU_MP_VERSION_PATCHLEVEL 3 extern char gmp_version[]; 34a41,46 #endif /* random generation functions */ #ifndef gmp_randstate_t typedef long int __gmp_randstate_struct; typedef __gmp_randstate_struct gmp_randstate_t[1]; 4) extract the MPFR tarball in say /tmp/mpfr-3.1.2 ./configure --with-gmp=/tmp --enable-mini-gmp Note: to use this version of the MPFR library, you need to define the MPFR_USE_MINI_GMP macro before including mpfr.h (alternatively, you can modify mpfr.h to define this macro at the beginning). ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel
Re: mpn_redc_n
Niels, good. Please keep me informed of your progress. This is very interesting for GMP-ECM. I am ready to beta-test the interface and/or the code. Paul From: ni...@lysator.liu.se (Niels Möller) Date: Tue, 31 Mar 2015 09:10:34 +0200 t...@gmplib.org writes: Niels and Marco might have some comments; they were involved in the design around this, My guess is as follows: The convention for redc_1 and redc_2 was changed for mpn_sec_powm, to get the final conditional subtraction moved out of the redc functions. redc_n didn't get that treatment, because it isn't used by mpn_powm_sec. To make progress, I'd like to see 1. Merge of my bdiv changes from a year or two ago, to align conventions between bdiv and redc. 2. Replace redc calls by bdiv calls (and write or rewrite any needed assembly code), and sort out any performance regressions. 3. Make the bdiv functions public. Regards, /Niels -- Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26. Internet email is subject to wholesale government surveillance. ___ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel