Re: Has there been historical work done to investigate small integer optimization?

2024-02-12 Thread Paul Zimmermann
as Niels said, I fear the cost of the tests might make the benefit vanish.
But to be sure, the only way is to try to implement this idea inside GMP,
which involves quite some work.

Paul Zimmermann
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: What's a reasonable size ratio for toom32?

2023-09-18 Thread Paul Zimmermann
   Dear Niels,

> See below patch, to support mpn_mul/0.6. Example output:
> 
>   $ ./speed -c -r -s 10-50 -t 5 -p 100 mpn_mul mpn_mul/0.6 
>   overhead 5.00 cycles, precision 100 units of 1.25e-09 secs, CPU freq 
> 798.28 MHz
> mpn_mul   mpn_mul/0.6
>   10 215.48   #0.6295
>   15 443.16   #0.6066
>   20 784.50   #0.6138
>   251209.57   #0.5957
>   301490.59   #0.6934
>   351986.51   #0.6918
>   402547.99   #0.6981
>   453189.27   #0.6633
>   503827.87   #0.6734
> 
> What do you think? (Also deleted FLAG_RSIZE, which appeared unused).

looks nice!

> I think it's nicer to be able to specify it separately for each function
> under test.

indeed, can you try speed ... mpn_mul mpn_mul/0.9 mpn_mul/0.8 mpn_mul/0.7 
mpn_mul/0.6 mpn_mul/0.5 ?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: What's a reasonable size ratio for toom32?

2023-08-23 Thread Paul Zimmermann
   Dear Niels,

./speed -p 100 -c -s 10-200 -f1.1 mpn_mul.0.6 would be more readable,
although the change in speed.h would be larger.

Or maybe ./speed -p 100 -c -s 10-200 -f1.1 -r 0.6 mpn_mul ?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: What's a reasonable size ratio for toom32?

2023-07-25 Thread Paul Zimmermann
   Hi Niels,

> Not sure what would be the appropriate way to benchmark; could look at a
> range of unbalancedness ratio, at one or a few fixed sizes relevant to
> toom22 and toom32, or look at fixed ratio, perhaps 1/2, over a range of
> sizes.

I suggest taking fixed non-rational ratios, for example sqrt(2), sqrt(3),
sqrt(5), and comparing the old and new code over the range of sizes where
this code is used.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


stdio.h

2022-08-22 Thread Paul Zimmermann
   Hi,

the fine manual says:

   All declarations needed to use GMP are collected in the include file
   'gmp.h'.  It is designed to work with both C and C++ compilers.

#include 

  Note however that prototypes for GMP functions with 'FILE *'
   parameters are only provided if '' is included too.

#include 
#include 

What is the reason why  is not included by default?

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mul_fft, cleaning up some details of the code

2022-03-08 Thread Paul Zimmermann
   Dear Marco,

> Uhm, the last line of code just before that ones is:
> 
>cc = mpn_sub_1 (r, r, m, cc) + 1;
>/* add 1 to cc instead of rd since rd might overflow */
> 
> So that I'd say that cc is 1 or 2.
> Then the case cc=2, m=n-1, r[m]=0, and rd=GMP_NUMB_MAX seems very very 
> unlikely, but possible.

it seems you are right!

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mul_fft, cleaning up some details of the code

2022-03-08 Thread Paul Zimmermann
   Hi Marco,

> Since you deeply know how this code works, I ask you one more question.
> The last lines of the function mpn_fft_mul_2exp_modF (branch m < n) 
> contain:
> 
>/* now subtract cc and rd from r[m..n] */
> 
>r[n] = -mpn_sub_1 (r + m, r + m, n - m, cc);
>r[n] -= mpn_sub_1 (r + m, r + m, n - m, rd);
>if (r[n] & GMP_LIMB_HIGHBIT)
>  r[n] = mpn_add_1 (r, r, n, CNST_LIMB(1));
> 
> This code assumes that there can only be a single borrow. Is it correct?
> I'm going to change them into the following:
> 
>r[n] = 2;
>MPN_DECR_U (r + m, n - m + 1, cc);
>MPN_DECR_U (r + m, n - m + 1, rd);
>if (UNLIKELY ((r[n] -= 2) != 0))
>  {
>mp_limb_t cy = -r[n];
>/* cy should always be 1, but we did not check if the case
>   m=n-1, r[m]=0, cc+rd>GMP_NUMB_MAX+1, and then cy = 2,
>   is actually possible. */
>r[n] = 0;
>MPN_INCR_U (r, n + 1, cy);
>  }
> 
> If we really can assume that cc+rd <= GMP_NUMB_MAX+1, the code could be 
> simpler:
> 
>r[n] = 1;
>MPN_DECR_U (r + m, n - m + 1, cc);
>MPN_DECR_U (r + m, n - m + 1, rd);
>if (UNLIKELY (r[n] == 0))
>  MPN_INCR_U (r, n + 1, 1);
>else
>  r[n] = 0;

yes there can only be a single borrow, since since cc is 0 or 1,
if r[n] is non zero after r[n] = -mpn_sub_1 (r + m, r + m, n - m, cc),
this can only happen for cc=1 and {r+m,n-m} = 000...000 before the
mpn_sub_1 call, and {r+m,n-m} = 111...111 after, then the second
mpn_sub_1 call cannot produce a borrow.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mul_fft, cleaning up some details of the code

2022-03-07 Thread Paul Zimmermann
   Dear Marco,

> Date: Sun, 06 Mar 2022 11:14:49 +0100
> From: Marco Bodrato 
> 
> Ciao,
> 
> I looked into the code published by Samuel Vivien.
> I realised that in mul_fft there are a lot of _add_1 and _sub_1. At 
> least some of them can be easily replaced by _INCR_U or _DECR_U...
> 
> Specifically I'd focus into a suspicious piece of code, shared by both 
> our current code and Vivien's.
> 
> The function mpn_mul_fft_decompose has a branch "if (dif > Kl)", that
> ends with the following lines:
> 
> if (cy >= 0)
>   cy = mpn_add_1 (tmp, tmp, Kl, cy);
> else
>   cy = mpn_sub_1 (tmp, tmp, Kl, -cy);
> 
> Can those lines be correct? Can the value in cy be used after this code 
> both if it contains the carry or if it contains the borrow of the 
> operation on tmp?

thank you, I agree with your analysis. Computations in that part are done
modulo B^Kl+1, thus any carry (resp. borrow) at tmp[Kl] should be subtracted
(resp. added) at tmp[0]. Here are some comments added to the code (from
GMP 6.2.1) and a fix:

--- mul_fft.c.orig  2022-03-07 09:56:28.133700601 +0100
+++ mul_fft.c   2022-03-07 10:12:20.127270744 +0100
@@ -712,15 +712,28 @@
cy += mpn_sub (tmp, tmp, Kl, n, dif);
  else
cy -= mpn_add (tmp, tmp, Kl, n, dif);
+  /* cy is the borrow at tmp[Kl], thus we should subtract
+ cy at tmp+Kl, or equivalently add cy at tmp, since
+ B^Kl = -1 */
  if (cy >= 0)
cy = mpn_add_1 (tmp, tmp, Kl, cy);
+  /* cy is now the carry at tmp[Kl] */
  else
+  {
cy = mpn_sub_1 (tmp, tmp, Kl, -cy);
+/* cy is now the borrow at tmp[Kl] */
+if (cy)
+  cy = mpn_add_1 (tmp, tmp, Kl, cy);
+/* cy is now the carry at tmp[Kl] */
+  }
}
   else /* dif <= Kl, i.e. nl <= 2 * Kl */
{
  cy = mpn_sub (tmp, n, Kl, n + Kl, dif);
+  /* cy is the borrow at tmp[Kl], thus we should add
+ cy at tmp[0] */
  cy = mpn_add_1 (tmp, tmp, Kl, cy);
+  /* cy is now the carry at tmp[Kl] */
}
   tmp[Kl] = cy;
   nl = Kl + 1;

> I believe this is an error, I'd change the last line into
>   cy = 1 - mpn_sub_1 (tmp, tmp, Kl, -cy - 1);
> and I propose the attached patch changing this and some other _1 
> functions.

Yes this seems ok. Assume we are in the case cy < 0.
Since cy is the borrow at tmp[Kl], we should add cy at tmp[0],
thus (since cy < 0) subtract -cy (say b=-cy) at tmp[0].
Since b > 0 and B^Kl+1 = 0 it is equivalent to subtract b-1 at
tmp[0] and add 1 at tmp[Kl].

> This is not really needed to solve a bug.
> The comment before the mpn_mul_fft_decompose function says "We must have 
> nl <= 2*K*l", this means that we should never have ((dif = nl - Kl) > 
> Kl), and the code in that branch should never be used.
> According to the coverage analisys, the code is not explored by the 
> tests:
> https://gmplib.org/devel/lcov/shell/var/tmp/lcov/gmp/mpn/mul_fft.c.gcov#691
> The "bug" is never triggered.
> 
> Maybe the branch could be used if someone re-enables mpn_mul_fft_full 
> and uses it for very unbalanced (more than 4x1) operands?

yes it could be used in that case.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


longlong.h patch

2021-10-11 Thread Paul Zimmermann
   Hi,

a patch for longlong.h was submitted on the glibc-alpha list,
to ease compilation with clang:

https://sourceware.org/pipermail/libc-alpha/2021-October/131824.html

It is apparently an old longlong.h (from 6.1.0), but maybe the patch
still applies to the development version.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mul_fft

2021-06-30 Thread Paul Zimmermann
   Hi,

about https://gmplib.org/list-archives/gmp-devel/2021-June/005976.html,
the second part of this mail is nonsense, since --enable-old-fft-full
(which was introduced in revision 13152 on Dec 21 2009) has never been
working as expected, since revision 13145 on Dec 20 2009 disabled it
completely (thus with --enable-old-fft-full you still call
mpn_nussbaumer_mul):

#if 0
#define mpn_fft_mul mpn_mul_fft_full
#else
#define mpn_fft_mul mpn_nussbaumer_mul
#endif

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mul_fft

2021-06-16 Thread Paul Zimmermann
   Hi,

with Samuel Vivien (in cc) we noticed two things related to the FFT code:

1) the use of mpn_add_n_sub_n is not activated by default in mul_fft.c.
   It might give a small speedup in some cases.

2) the previous code to multiply integers (mpn_mul_fft_full) which did
   two multiplications mod 2^N1+1 and 2^N2+1 has been replaced by a new
   code which computes 2^(2N)-1 for 2N ~ N1+N2, using the fact that
   2^(2N)-1=(2^N+1)(2^(N/2)+1)(2^(N/4)+1)...(2^(N/2^k)+1)(2^(N/2^k)-1).
   However that new code is not always faster, for example on an Intel
   Core i5-4590 with gcc 10.2.1:

Vanilla GMP 6.2.1:
$ ./speed -s 100,200,500,1000 mpn_mul_n mpn_mul_n mpn_mul_n
mpn_mul_n mpn_mul_n mpn_mul_n
1000.447985000   0.452133000  #0.447968000
200   #0.949739000   0.949996000   0.956713000
500   #2.669156000   2.678638000   2.677642000
1000   #5.540204000   5.547306000   5.547152000

With --enable-old-fft-full (without WANT_ADDSUB):
1000.44875   0.451685000  #0.447571000
200   #0.941488000   0.942339000   0.947603000
500   #2.668528000   2.675208000   2.669147000
10005.543246000   5.580915000  #5.538155000

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 2-adic Svoboda

2021-05-03 Thread Paul Zimmermann
   Dear Torbjörn,

>   yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
>   "Montgomery-Svoboda". The quotient selection becomes trivial, which means
>   one can reduce the latency between two mpn_addmul_1 calls.
> 
> It really becomes a mul_basecase except that the first round there is
> done with mul_1.  If we replace that mul_1 by addmul_1, and
> then called the resulting function like this,
> 
> mul_basecase'(rp, up, un, rp, rn-un)
> 
> then we will get the desired remainder at mul_basecase speed.

a slight difference is that here the limb multiplier is not known in advance,
but only at the beginning of each loop. But maybe it does not make any
difference.

I tried to implement Montgomery-Svoboda at the C level, but did not manage
to beat the mpn_redc_x routines. I'm very interested to see your results!

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 2-adic Svoboda

2021-05-02 Thread Paul Zimmermann
   Dear Torbjörn,

> IIRC, Svoboda's division trick for N/D is to find a small multiplier m
> such that, for the division mN/(mD) we have mD = 10...XX... with the
> high limb of mD being 1000...0.
> 
> This idea works also for 2-adic division.  Find m = D^(-1) mod \beta
> where \beta is the lomb base.  Then do mN/(mD) or mN mod (mD) with
> 2-adic norm.  Now mD will end in 0...0001, and the quotient limbs will
> be the low limbs of the intermediate remainder.
> 
> That should mean that we could use mul_basecase directly for
> sbpi1_bdiv_r (or its sibling redc_1).
> 
> Surely this is not a new observation.

yes, see Section 2.4.2 of Modern Computer Arithmetic, where we call it
"Montgomery-Svoboda". The quotient selection becomes trivial, which means
one can reduce the latency between two mpn_addmul_1 calls.

If you reduce k limbs at a time, by precomputing m = D^(-1) mod \beta^k,
then you can use mpn_addmul_k at each step. But to reduce the last k limbs,
you need to revert to classical (Montgomery) division, since mN has n+k limbs.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: strange install directory

2021-02-26 Thread Paul Zimmermann
thank you all for your answers. Maybe the solution would be to check for
$prefix/lib64 (and $prefix/lib32) if we don't find libgmp in $prefix/lib.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


strange install directory

2021-02-25 Thread Paul Zimmermann
   Hi,

on the gcc118 machine from the gcc compile farm, with gmp-6.2.1, make install
puts the libgmp.{a,so} files in $PREFIX/lib64 instead of $PREFIX/lib. Is there
any reason for that?

This non-standard install directory is a problem, since when configuring MPFR
(for example) with --with-gmp=$PREFIX, the MPFR configuration fails, since it
does not find the libgmp.{a,so} files. And I guess many software tools rely
on this --with-gmp=$PREFIX mechanism.

Paul

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpz_add_si (was: Re: mpz_prevprime)

2020-10-16 Thread Paul Zimmermann
   Hi Niels,

> >>>   mpz_add_si (p, p, diff * (odds_in_composite_sieve-1));
> >
> > We don't have such a function :-)
> 
> Ooops. 
> 
> Maybe we should have? We do have both _ui and _si versions of other
> basic mpz functions, including mpz_set, mpz_mul, and mpz_cmp.

mpz_addmul_si and mpz_submul_si are missing too.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: GMP used during 3 and a half years to solve MIT's LCS35

2019-12-16 Thread paul zimmermann
   Dear Marco,

thank you for the benchmark, which shows a clear gain with the popcount
method when the exponent is sparse, and no regression when it is not.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Faster integer remainder

2019-10-15 Thread paul zimmermann
> "Nelson H. F. Beebe"  writes:
> 
> > During a routine, and rather delayed, bibliography update, I found and
> > read a recent paper that might stimulate rethinking multiple-precision
> > integer remainder computations in gmp:
> >
> > Daniel Lemire and Owen Kaser and Nathan Kurz
> > Faster remainder by direct computation: Applications to
> > compilers and software libraries
> > Software: Practice & Experience 49(6) 953--970 June 2019
> > https://doi.org/10.1002/spe.2689
> >
> > A preprint PDF file is available at
> >
> > https://arxiv.org/pdf/1902.01961

after having a closer look, the main idea of that paper (Theorem 1) already
appears in Bernstein's "scaled remainder tree" paper [1].

Indeed, the assumption of Theorem 1 says that c' = c/2^{N+L} is a good
approximation of 1/d. Then if n = qd + r, the fractional part of c'n is
close to r/d, and multiplying by d reveals r.

More precisely, here is a 4-line proof of Theorem 1, assuming n = qd + r:

(1) we have 1/d <= c' := c/2^{N+L} <= 1/d + 1/(d 2^N)
(2) it follows  q+r/d <= c'n <= q + r/d + n/(d 2^N)
(3) thus r/d <= frac(c'n) <= r/d + n/(d 2^N) < (r+1)/d
(4) then r <= frac(c'n)*d < r+1

Paul Zimmermann

[1] cr.yp.to/arith/scaledmod-20040820.pdf
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: patch for speed

2019-09-10 Thread paul zimmermann
> How many "users" will want to use tune/speed and at the same time not be
> able to figure out that we alias mpz_legendre to mpz_jacobi?  That alias
> might not be documented, but tune/speed is "more undocumented".

note also that since mpz_legendre only accepts odd primes as second argument,
its running time might differ from that of mpz_jacobi, even if it is an alias.
Anyway if my contributed patch is not welcome, just ignore it!

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: patch for speed

2019-09-10 Thread paul zimmermann
> I'd like to understand the benefits of ./speed mpz_legendre first, since
> I don't really see any. Currently, it's an alias for mpz_jacobi, and I'm
> not aware of any concrete plans to change that.

the benefit for the user is to be able to measure the speed of mpz_legendre.
Since the documentation does not say that mpz_legendre is an alias for
mpz_jacobi, I believe it is legitimate.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: patch for speed

2019-09-10 Thread paul zimmermann
   Niels,

> Wouldn't that patch call mpz_legendre (a, p) with p not prime? Which is
> invalid, but kind-of works *because* mpz_legendre is an alias for
> mpz_jacobi.

good point. Do you want a new patch which ensures p is an odd positive prime?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: patch for speed

2019-09-10 Thread paul zimmermann
> Adding a different symbol in the library for mpz_kronecker or
> mpz_legendre, would be an incompatible change to the interface. Not likely
> in the near future.

ok, then what about the original subject (adding mpz_legendre to speed)?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: patch for speed

2019-09-09 Thread paul zimmermann
   Dear Marco,

maybe one day mpz_legendre and mpz_jacobi will differ. In that case the patch
will be useful. Otherwise why two functions?

Paul

> Date: Mon, 9 Sep 2019 19:24:22 +0200
> From: "Marco Bodrato" 
> 
> Ciao Paul,
> 
> Il Lun, 9 Settembre 2019 6:18 pm, paul zimmermann ha scritto:
> > this patch adds mpz_legendre to the functions known by speed:
> 
> Currently in gmp-h.in we have
> #define mpz_legendre mpz_jacobi  /* alias */
> 
> So that I do not expect any difference between mpz_legendre and mpz_jacobi.
> 
> Do you?
> 
> Ĝis,
> m

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


patch for speed

2019-09-09 Thread paul zimmermann
   Hi,

this patch adds mpz_legendre to the functions known by speed:

zimmerma@tomate:/tmp/gmp-6.1.2/tune$ ./speed -r -s 1-10 mpn_mul_n mpn_gcd 
mpz_legendre mpz_jacobi
overhead 0.1 secs, precision 1 units of 3.04e-10 secs, CPU freq 
3292.22 MHz
mpn_mul_n   mpn_gcd  mpz_legendrempz_jacobi
1#0.5   18.7516   26.0471   26.0335
2#0.7   41.2102   50.0996   49.8808
3#0.00014   81.6871   87.5821   87.4862
4#0.00019   90.4106   97.1702   96.5991
5#0.00026   88.6899   95.3075   94.5296
6#0.00032   87.9981   92.8041   92.5853
7#0.00041   82.1337   86.7046   86.5697
8#0.00051   77.1496   81.5147   81.4188
9#0.00068   65.8028   69.6469   69.6326
10   #0.00076   66.2903   69.9241   69.7451
11   #0.00092   61.7560   65.2479   65.0786
12   #0.00107   58.3972   61.1965   60.9988
13   #0.00126   53.8224   56.9173   56.4985
14   #0.00142   51.7653   55.0378   54.8534
15   #0.00164   48.5111   51.1119   50.9965
16   #0.00185   45.7281   48.6286   48.5163

Paul

--- speed.c.orig2019-09-09 18:01:39.453570648 +0200
+++ speed.c 2019-09-09 18:02:01.877937859 +0200
@@ -309,6 +309,7 @@
   { "mpn_gcdext_lehmer", speed_mpn_gcdext_lehmer },
 #endif
   { "mpz_jacobi",speed_mpz_jacobi   },
+  { "mpz_legendre",  speed_mpz_legendre },
   { "mpn_jacobi_base",   speed_mpn_jacobi_base  },
   { "mpn_jacobi_base_1", speed_mpn_jacobi_base_1},
   { "mpn_jacobi_base_2", speed_mpn_jacobi_base_2},
--- common.c.orig   2019-09-09 18:00:54.416833037 +0200
+++ common.c2019-09-09 18:01:19.381241921 +0200
@@ -1722,6 +1722,11 @@
   SPEED_ROUTINE_MPZ_JACOBI (mpz_jacobi);
 }
 double
+speed_mpz_legendre (struct speed_params *s)
+{
+  SPEED_ROUTINE_MPZ_JACOBI (mpz_legendre);
+}
+double
 speed_mpn_jacobi_base (struct speed_params *s)
 {
   SPEED_ROUTINE_MPN_JACBASE (mpn_jacobi_base);
--- speed.h.orig2019-09-09 18:02:33.074448664 +0200
+++ speed.h 2019-09-09 18:02:51.742754302 +0200
@@ -393,6 +393,7 @@
 double speed_mpz_init_clear (struct speed_params *);
 double speed_mpz_init_realloc_clear (struct speed_params *);
 double speed_mpz_jacobi (struct speed_params *);
+double speed_mpz_legendre (struct speed_params *);
 double speed_mpz_lucnum_ui (struct speed_params *);
 double speed_mpz_lucnum2_ui (struct speed_params *);
 double speed_mpz_mod (struct speed_params *);
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpz_nextprime

2019-09-04 Thread paul zimmermann
   Torbjörn,

> With k-ary modexp, the number of multiplications is log2(n)/k,
> while the number of squarings are log(n).  Therefore a small b will not
> make a huge difference.
>
> As Marco points implies, one may handle some special values like b = 2
> also in REDC coding and still not pay the price of a full
> multiplication.

with a small base, no need of k-ary modexp. In the REDC domain, instead of
computing a'*b'/r mod n, where  b' = r*b mod n is the REDC encoding of the
base b, just compute a'*b mod n.

Another trick is to use a base b such that b' is small, then you don't even
need to modify the REDC code (assuming it recognizes small multipliers).

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Quotient sequence with symmetric signed quotients?

2019-04-23 Thread paul zimmermann
   Dear Niels,

> Is there any analysis of Euclid's algorithm, with different rounding of
> the quotients? E.g., for the fibonacci inputs gcd(21, 13), we have a
> standard quotient sequence of (1,1,1,1,1,1,1), taking 7 steps to get
> from (21,13) to (1,0).
> 
> If we instead try quotients rounded to nearest, we get
> 
> (21,13) >  (13, -5) > (-5, -2) > (-2, -1) > (-1, 0)
> q = 2   q =-3  q = 2  q = 2
> 
> with 4 quotients rather than 7, and different from the standard continued
> fraction expansion.

the right person to ask is Brigitte Vallée:

https://vallee.users.greyc.fr/Publications/index.html

(Look for "nearest" or "odd" on this page.)

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New n log n algorithm for high-precision multiplication

2019-04-15 Thread paul zimmermann
   Dear Niels,

> From: ni...@lysator.liu.se (Niels Möller)
> Date: Mon, 15 Apr 2019 14:02:28 +0200
> 
> paul zimmermann  writes:
> 
> > Schönhage-Strassen can be implemented with good cache locality:
> >
> > https://hal.inria.fr/inria-00126462
> 
> Thanks! As I read it, for large inputs, the top level transforms operate
> on coefficients that fit in L2 cache but not L1 cache, and with several
> passes over the data, depending on fft size. Is that right?

not quite. In Section 2.2.3 (Bailey's 4-step transform) we describe a way
to please both the L2 and L1 caches. Of course this work should be revisited
with modern processors.

> Then I think small-prime fft has potential for better locality, with few
> complete passes of the data and all the heavy fft work operating on data
> in the L1 cache.

maybe, I'm curious to see a comparison with our code!

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New n log n algorithm for high-precision multiplication

2019-04-15 Thread paul zimmermann
   Dear Niels,

> There's also unfinished work on small-prime FFT, motivated mainly to get
> better cache locality (for huge numbers, field elements in
> Schönhage-Strassen get so large that they exceed the L1 cache, slowing
> down the supposedly very efficient field operations).

Schönhage-Strassen can be implemented with good cache locality:

https://hal.inria.fr/inria-00126462

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Unused code in mpn_mul_fft_decompose?

2019-03-05 Thread paul zimmermann
   Dear Mikhail,

> From: Mikhail Hogrefe 
> Date: Mon, 4 Mar 2019 21:44:50 -0500
> 
> I was digging around in the GMP 6.1.2 source, and I noticed something in
> mpn/generic/mul_fft.c.
> 
> In mpn_mul_fft_decompose, there is an if statement checking 'dif > Kl',
> which is equivalent to 'nl > 2 * K * l'. But there is a comment above the
> function saying 'We must have nl <= 2*K*l.', suggesting that the first
> branch of the if is never taken.
> 
> Maybe you're already aware of this, but if not I wanted to let you know.
> 
> Best,
> Mikhail Hogrefe

good remark. Indeed in the mul_fft.c file, it seems this code is dead.
However if mpn_mul_fft_decompose is exported, for example if one wants to
compute a product of two numbers modulo 2^n+1, where the two numbers have
more than 2n bits, then this branch might be useful.

Anyway this code is now obsolete (see [1] and [2]).

Best regards,
Paul Zimmermann

[1] http://homepages.loria.fr/PZimmermann/papers/#fft

[2] https://gmplib.org/devel/ "This will largely depend on ongoing work on FFT
multiplication"
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Side channel silent karatsuba / mpn_addmul_2 karatsuba

2018-12-17 Thread paul zimmermann
> I got stuck with that code since there are so many ways of summing
> things.

would it make sense to write a tool that generates all possible ways?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mini-gmp

2018-12-14 Thread paul zimmermann
   Hi,

would it be possible to move the following from mini-gmp.c to mini-gmp.h?

#define GMP_LIMB_BITS (sizeof(mp_limb_t) * CHAR_BIT)

It would expose GMP_LIMB_BITS to applications using mini-gmp, without requiring
them to redefine it.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Micro-GMP

2018-12-14 Thread paul zimmermann
> Supporting very small sizes, and odd sizes might not be easy. And
> currently Paul reported failures for 32-bit limbs on 64-bit machines...
> asl.h is more interesting for GMP than for mini-.

after investigation, those failures were coming from the MPFR side.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Side channel silent karatsuba / mpn_addmul_2 karatsuba

2018-12-14 Thread paul zimmermann
   Dear Marco,

> I feel we should at least add an umulhi macro.
> We have too many umul_ppmm(used, dummy, a, b) in the code now...

I support that proposal.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Micro-GMP

2018-12-10 Thread paul zimmermann
   Dear Marco,

> A attach a possible implementation of mini-gmp that should support limbs
> of "any" size. Can you test it with MPFR? Both correctness and speed...

it passes all MPFR tests, both with uint16 and uint8. It is also faster than
micro-gmp (I disabled GMP_CHECK_RANDOMIZE so that the tests are deterministic):

micro-gmp   mini-gmp patched
   tdivtcos  tdiv   tcos
uint16 2.8s7m25s 2.2s   4m45s
uint8  7.1s   21m40s 3.8s   7m26s

Great, thank you!
Paul

PS: it would be nice to indicate in mini-gmp.h the places to modify to enable
16-bit or 8-bit limbs.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Micro-GMP

2018-12-10 Thread paul zimmermann
   Dear Marco,

> mpz_cmpabs_ui (const mpz_t u, unsigned long v)
> {
>   int ulongsize = GMP_ULONG_BITS / GMP_LIMB_BITS;
>   mp_limb_t ulongrem = 0;
>   mp_size_t un = GMP_ABS (u->_mp_size);
> 
>   if (GMP_ULONG_BITS % GMP_LIMB_BITS != 0)
> ulongrem = (ULONG_MAX >> GMP_LIMB_BITS * ulongsize) + 1;

is GMP_ULONG_BITS % GMP_LIMB_BITS != 0 allowed in GMP and/or asl.h?

>   if (un > ulongsize && (u->_mp_d[ulongsize] >= ulongrem || un > ulongsize
> + 1))
> return 1;
>   else
> {
>   unsigned long uu = mpz_get_ui (u);
>   return GMP_CMP(uu, v);
> }
> }
> 
> 
> When mp_limb_t is unsigned long, gcc -O2 compiles this code exactly as
>   if (GMP_ABS (u->_mp_size) > 1)
> return 1;
>   else
> GMP_CMP(mpz_get_ui (u), v)
> [...]

sounds good.

> The same as the current code. And it gets optimised also for uint16 or uint8.
> 
> >> You also changed mpn_invert_3by2. Maybe a handful of (mp_limb_t) casts
> >> could be added to the original function to make it work... but your
> >> rewrite is simpler and faster. Maybe the same could be done on
> >> gmp_umul_ppmm? for speed?
> >
> > indeed, in several places I opted for simplicity, since speed was not
> > (at first) a goal. But when running "make check" of MPFR with micro-gmp-8,
> > I now realize speed is also important!
> 
> A attach a possible implementation of mini-gmp that should support limbs
> of "any" size. Can you test it with MPFR? Both correctness and speed...

I'll do.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Micro-GMP

2018-12-03 Thread paul zimmermann
   Dear Marco,

> Date: Fri, 30 Nov 2018 07:43:40 +0100
> From: Marco Bodrato 
> 
> Ciao Paul!
> 
> Il 2018-11-23 13:38 paul zimmermann ha scritto:
> > I presented some work in progress on "Micro-GMP", a modified
> > version of Mini-GMP which works with 16-bit or even 8-bit limbs:
> 
> Great!
> It is always interesting to see a further development based on some code 
> we wrote!
> 
> > https://members.loria.fr/PZimmermann/talks/microgmp.pdf
> 
> In your presentation you write that "27 functions from mini-gmp.c need 
> to be modified".
> 
> Of course the code in mini-gmp exploits the fact that mp_limb_t is 
> defined as unsigned long, to keep the code as simple and as effective as 
> possible. If you need to define a different mp_limb_t type, some 
> modifications are obviously needed.
> 
> But... are you sure you need to change that many? I mean:
>   - you changed mpz_div_qr_ui, did you really need to change also most of 
> the functions mpz_?div_*_ui originally implemented as a simple call to 
> mpz_div_qr_ui?

most changes for the _ui functions (including mpz_div_qr_ui) consist in first
converting the unsigned long argument into a mpz_t using mpz_set_ui, then
calling the corresponding function without _ui.

But indeed I don't remember why I changed the other mpz_?div_*_ui functions.
Maybe because it was simpler to just convert using mpz_set_ui, and do not have
to think at all.

>   - for similar reasons, was changing mpz_mul_si really needed? or 
> mpz_lcm_ui?

it seems mpz_mul_si does not need to be changed. Same for mpz_lcm_ui.

> I looked at the code and I'd suggest a couple of simpler 
> implementations:
> 
> mpz_set_si (mpz_t r, signed long int x)
> {
>if (x >= 0)
>  mpz_set_ui (r, x);
>else /* (x < 0) */
>  {
>mpz_set_ui (r, GMP_NEG_CAST (unsigned long int, x));
>mpz_neg (r, r);
>  }
> }
>
> int
> mpz_cmpabs_ui (const mpz_t u, unsigned long v)
> {
>mpz_t absu;
>mpz_roinit_n (absu, u->_mp_d, GMP_ABS (u->_mp_size));
>return mpz_cmp_ui (absu, v);
> }

indeed this is simpler.

Now there are only 18 that need to be modified.

> You also changed mpn_invert_3by2. Maybe a handful of (mp_limb_t) casts 
> could be added to the original function to make it work... but your 
> rewrite is simpler and faster. Maybe the same could be done on 
> gmp_umul_ppmm? for speed?

indeed, in several places I opted for simplicity, since speed was not
(at first) a goal. But when running "make check" of MPFR with micro-gmp-8,
I now realize speed is also important!

> Unfortunately, writing a micro-gmp implementation outside of the 
> project, has the bad effect that you will have to rewrite it for any new 
> release.
> Your micro- does not benefit of the improvement in mini- since the 
> release of GMP-6.1.2, and some of your modifications might be 
> superfluous now.
> E.g. you changed in mpz_probab_prime_p the line:
> 
>nm1->_mp_size = mpz_abs_sub_ui (nm1, n, 1);
> 
> into
> 
>mpz_abs (nm1, n);
>mpz_sub_ui (nm1, nm1, 1);
> 
> but the current code reads 
> https://gmplib.org/repo/gmp/file/17701/mini-gmp/mini-gmp.c#l3648
> 
>mpz_abs (nm1, n);
>nm1->_mp_d[0] -= 1;
> 
> which works for any limb size (n is odd, -=1 on the lowest limb is 
> always possible) and is faster than calling _sub_ui.
> 
> Maybe we can find a strategy to release also micro- as a part of the 
> project, reducing code duplication. Would you like to cooperate on this?

yes of course!

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Micro-GMP

2018-12-03 Thread paul zimmermann
   Dear Torbjörn,

> From: t...@gmplib.org (Torbjörn Granlund)
> Date: Fri, 23 Nov 2018 18:03:31 +0100
> 
> You might be interested in GMP's asl.h, it implements artificially small
> limbs ("asl"), down to 4 bits IIRC.  The repo GMP doesn't complete its
> tests with it currently.  (A student of mine did something similar some
> years previously, but I couldn't get paperwork in order for releasing
> it.)

quite interesting! Is there any documentation how to use it?

> Some notes about our slides:
> 
> First public GMP releaae wasn't 1.3.2 (sic) in 1993 but 1.1 in 1991.

my source was ftp://ftp.gnu.org/gnu/gmp/, where the first version is 1.3.2.
But indeed in https://members.loria.fr/PZimmermann/bignum/comp.ps.gz I used
GMP 1.2 from December 1991. Where can I find the first public releases
together with their release dates?

> I believe Niels and Marco wrote Mini-GMP.
>
> Not only O(n^2) algorithms; addition is actually O(n) while modexp is
> O(n^3)...
> 
> The mpz_random2 function is long obsolete, mpz_rrandomb replaces it.

thank you, I will prepare a revised version of my slides.

Any other feedback is welcome!

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Micro-GMP

2018-11-23 Thread paul zimmermann
   Dear GMP developers,

today at the iRRAM/MPFR/MPC workshop in Trier
(http://irram.uni-trier.de/mpfr-mpc-irram-workshop-trier-2018/)
I presented some work in progress on "Micro-GMP", a modified
version of Mini-GMP which works with 16-bit or even 8-bit limbs:

https://members.loria.fr/PZimmermann/talks/microgmp.pdf

Any feedback is welcome,
Paul

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mpz_spsp pseudoprimes

2018-11-23 Thread paul zimmermann
   Hi,

Thomas Nicely has updated his web page to include the smallest mpz_spsp
pseudoprimes in GMP 6.1.2:

http://www.trnicely.net/misc/mpzspsp.html#GMP612

It would be nice to list the smallest mpz_spsp5.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New code for primality testing

2018-11-22 Thread paul zimmermann
   Dear Marco,

> > if you have already implemented BPSW, probaby the simplest is to make
> > mpz_probab_prime_p use BPSW (removing the reps argument), and have another
> 
> We did. The newly implemented test was the main reason for this thread.
> 
> Currently we did not change the interface. BPSW is used as a first step in
> mpz_probab_prime_p, and substitute the first "number" repetitions, then
> the remaining iteration of MillerRabin are used. This is intended for
> GMP-6.2.
> 
> Do you have suggestions for that "number"? Then, I agree, for GMP-7 we
> will have to redesign the interface...

since no counter-example is known for BPSW, you could substitute all REPS
iterations.

On the other hand, you could leave all REPS iterations, and hope a
counter-example is found by GMP (in that case, make sure to send the
counter-example to the gmp-devel list).

Paul

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New code for primality testing

2018-11-21 Thread paul zimmermann
   Ciao Marco,

> Someone published a sequence of 7 bases to deterministically test
> primality up to 2^64...
> But someone also claim that BPSW has no false positives up to 2^64...
> 
> If we can verify the source of the second claim, since we just implemented
> BPSW, we should decide to return 2 (surely prime) for all "probable"
> primes under this threshold.
> 
> By the way, I agree. In the API a generic function can take an unspecified
> reps parameter. On the other side, if we decide to expose a specific
> implementation (like MillerRabin), we should give to the users the full
> control on the parameters (the bases in this case).
> 
> A proposal for the function prototype? That could fit the need of GMP-ECM?

if you have already implemented BPSW, probaby the simplest is to make
mpz_probab_prime_p use BPSW (removing the reps argument), and have another
function mpz_millerrabin (mpz_t n, mpz_t base) and/or
mpz_millerrabin_ui (mpz_t n, unsigned long base).

As application, I was thinking more at CADO-NFS than GMP-ECM. In the
cofactorization phase of the Number Field Sieve, we have billions of
numbers of up to say 128 bits that we have to check for primality,
before trying to factor them.

Paul



___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New code for primality testing

2018-11-21 Thread paul zimmermann
> > Some other free software projects already contain code for primality
> > proving. E.g. Pari/GP [http://pari.math.u-bordeaux.fr/] implement both
> > APR-CL and ECPP... I don't think GMP should duplicate that (good) work.
> 
> mpz_aprcl uses mpz already and is on a similar track about license
> https://sourceforge.net/projects/mpzaprcl/
> 
> Thank you for the pointers on these projects.
> It's just that, personal feeling maybe, there is a something "missing" from
> the absence of deterministic primality test function (whatever the perf, to 
> some extent).
> But if it's too complex it's indeed a good reason.

GMP-ECM contains an implementation of APRCL, written by David Cleaver.
See aprtcle/mpz_aprcl.c.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: New code for primality testing

2018-11-21 Thread paul zimmermann
> And it could be specified that this function
>   int mpz_millerrabin (mpz_t n, int reps)
> internally takes care of randstate boilerplate, like in current code.
> Which adds to its added value and simplicity of use.
> Repeated calls may not bring additional confidence,
> but it's not important because the number of "calls" can already be set with 
> the reps parameter.

in some contexts, we want to get a full primality test for moderate size
numbers. For that purpose, it would be more useful to provide a list of
(user chosen) bases to mpz_millerrabin, or at least that the bases chosen
by a given 'reps' argument are documented.

See https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-09-21 Thread paul zimmermann
   Torbjörn,

> I suppose a "naive" implementation scales the full dividend and the full
> divisor by a factor which fits in a limb.

scaling the dividend is not mandatory. You can first reduce the upper n-1
limbs of the original dividend using the scaled divisor, then reduce the
last limb using the original divisor. This is especially interesting if only
the remainder is needed.

In fact, the multiplier has one limb + 1 bit, if the divisor is normalized,
so k*B = \beta^{n+1} + B' with B' < \beta^n (see Algorithm 1.7 in [1]).

> One could in theory save some
> time by scaling just their O(log(Q)) most significant limb.

I don't see how you achieve this, since for the submul part you need the full
scaled divisor.

> How likely does Svoboda overesimate a quotient limb?

good question. Unfortunately the authors of [1] do not answer that question.
If at some point of the loop in Algorithm 1.7 we have
A = a_{n+j} \beta^{n+j} + R and B' = \beta^{n+1} + S, with
0 \leq R < \beta^{n+j} and 0 \leq S < \beta^n, then the next remainder
will be T = A - a_{n+j} \beta^{j-1} B' = R - a_{n+j} \beta^{j-1} S.
We thus have a_{n+j} \beta^{n+j-1} < T < \beta^{n+j}. Writing
R = X \beta^{n+j}, S = Y \beta^n, a_{n+j} = Z \beta, with 0 <= X, Y, Z < 1,
we get: T = (X  - Y Z) \beta^{n+j}. If X, Y, Z are uniform in [0,1] and
independent, then the probability that X  - Y Z >= 0 is 3/4, thus Svoboda's
algorithm will overestimate the quotient with probability 1/4 (and in that
case, only one correction is needed).

Of course you can use a 2-limb variant of Svoboda: multiply the divisor by
2 limbs + 1 bit, so that the scaled divisor is now k*B = \beta^{n+2} + B'
with B' < \beta^n. Then the quotient will be overestimated with probability
less than 1/\beta.

Paul

[1] https://members.loria.fr/PZimmermann/mca/mca-cup-0.5.9.pdf
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-09-20 Thread paul zimmermann
   Torbjörn,

>   this looks quite interesting, I'm looking forward for cycle numbers.
>   (I guess you target the case where the divisor is not invariant, otherwise
>   you might do some preprocessing to speed up the partial quotient
>   approximation.)
> 
> The divisor is invariant in some sense when the quotient is more than
> one limb, right?

yes it is. But I was thinking about many divisions with the same divisor
(say 2n limbs divided by n limbs). In short we can divide preprocessing
methods in two classes:

1) those with only O(1) preprocessing time, like the precomputation of an
   inverse of the 1 or 2 uppers limbs of the divisor

2) those with O(n) preprocessing time, like Svoboda division

> Are you thinking of Svoboda division here?  That's another possibility
> for speeding up schoolbook division.

yes, in Svoboda division the quotient selection is for free. But you still
have to wait the end of the previous submul_1 (or submul_2) call to access
the next quotient, whereas in mpn_mul_basecase all multipliers are known in
advance. Does this makes a difference?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-09-20 Thread paul zimmermann
   Dear Torbjörn,

this looks quite interesting, I'm looking forward for cycle numbers.
(I guess you target the case where the divisor is not invariant, otherwise
you might do some preprocessing to speed up the partial quotient
approximation.)

Paul

> From: t...@gmplib.org (Torbjörn Granlund)
> Date: Wed, 19 Sep 2018 18:18:29 +0200
> 
> This is a sketch of an idea which I have had for a long time, which we
> actually might have discussed here before.  The idea is to cut a large
> O(nn-dn) time off of schoolbook division.
> 
> Normally, the time taken by schoolbook division is latency(mpn_submul_1)
> + 2*latency(mul) + C.  Here C is the number of plain operations of
> quotient limb approximation and mul means a scalar multiply used by the
> same.
> 
> Couldn't we make the quotient limb approximation in parallel with the
> O(n^2) work?  I think we can, but it requires that we run the submul_1
> partially backwards, and thereby produce the most significant partial
> remainder limbs early.
> 
> But that's impossible, right?  The quotient limb approximation is, eh,
> approximate after all!  We cannot know the most significant partial
> remainder limbs as early as required!  No, we cannot know them for sure,
> but we can almost always tell.  And when we're wrong, we need to add
> back and recompute.  Below is a very rough implementation.  (It surely is
> not correct wrt indexes.)
> 
> The hope is that this could run almost as as fast as mul_basecase.
> There, and in this proposed algorithm, subsequent addmul_1/submul_1 can
> actually overlap with the help of OoO execution.  So hopefully on good
> OoO pipelines this could not just hide the division limb approximation
> latency, but actually hide much of the feed-in and wind-down latency of
> submul_1!
> 
> qnext = divappr (np[1], np[0], d1, d0);
> for (...)
>   {
> // Apply q to most significant partial remainder limbs
> cy = submul_1 (np - 2, dp + dn - 2, qnext, 2);
> 
> q = qnext;
> // Now, with the most significant few limbs of the next partial remainder
> // computed to some accuracy, compute q for the next iteration.  This 
> might
> // overestimate q worse than with the old algorithms.
> qnext = divappr (np[1], np[0], d1, d0);
> 
> // Apply q to all but the most significant partial remainder limbs.
> cy = submul_1 (np - dn, dp, q, dn - 2);
> 
> if (UNLIKELY (cy > np[0]))
>   {
> np[0] -= cy;
>   // q was too large, qnext is garbage so recompute it!
>   // (We could perhaps recompute it more cleverly.)
>   ASSERT_CARRY(mpn_add_n (np - dn, np - dn, dp, dn));
>   qnext = divappr (np[1], np[0], d1, d0);
>   }
> else
>   np[0] -= cy;
> np--;
>   }
> 
> -- 
> Torbjörn
> Please encrypt, key id 0xC8601622
> ___
> gmp-devel mailing list
> gmp-devel@gmplib.org
> https://gmplib.org/mailman/listinfo/gmp-devel
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mini-gmp test suite could be improved

2018-09-05 Thread paul zimmermann
   Hi,

I noticed the mini-gmp test suite does not seem to contain any test of
mpz_add_ui with the unsigned long argument having more than 32 bits.
Indeed with the following change "make check" still says "All 25 tests passed"
(with the mini-gmp from gmp-6.1.2):

void
mpz_add_ui (mpz_t r, const mpz_t a, unsigned long b)
{
  b = b % 0x;
  if (a->_mp_size >= 0)
r->_mp_size = mpz_abs_add_ui (r, a, b);
  else
r->_mp_size = -mpz_abs_sub_ui (r, a, b);
}

At least one test with ULONG_MAX would be welcome.

Maybe the same holds for other ui/si functions.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-08-30 Thread paul zimmermann
   Dear Niels,

> I've updated the paper (same location,
> https://www.lysator.liu.se/~nisse/misc/schoolbook-divappr.pdf).

page 1: there is an extra ')' in the footnote

page 3: can you detail the proof of R_{bignum} < D? If the algorithm would
also take as input u_{-1} as in your 2011 paper, I am ok, but here I am not
sure.

In the proof of Lemma 1, in u_0 (\beta^3 - \beta D - D), the term
\beta^3 - \beta D - D is negative when D >= \beta^2 - \beta - 1, thus
you cannot simply bound it by bounding u_0 by \beta-1.

Paul


___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-08-29 Thread paul zimmermann
   Dear Niels,

your idea works. If I am correct, we save a \beta/4 factor, which is enough
as long as \beta>=4.

Paul

e.pdf
Description: Adobe PDF document
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-08-28 Thread paul zimmermann
   Dear Niels,

it works with r = (u_0 - q d_1 - p_1 - 1) \bmod \beta line 6 in all
cases, assuming it works with -1 replaced by - [p_0 > 0].

We only need to check the case p_0 = 0.

p_0 = 0 means that q d_0 is divisible by \beta, i.e., R' is multiple of \beta.

Let still  be the two low words from R'. We have r_0 = 0.

We have r = r_1 - 1 in the algorithm, instead of r = r_1.

For the first adjustment, the only case that differs from the original
algorithm is when r_1 = q_0 thus r = q_0 - 1 and the first adjustment
no longer applies. Since R' > q_0 \beta - \beta^2 [*], then necessarily
R' = q_0 \beta.

If the second adjustment does not apply, then r <= d_1 - 2, which means
q_0 <= d_1 - 1, thus 0 <= R' <= (d_1-1) \beta and we are done.

If the second adjustment applies, then r >= d_1 - 1, which means
q_0 >= d_1, thus after the adjustment, R = q_0 \beta - d > -\beta.
(For the upper bound, R < \beta^2 - d <= d.)

[*] the proof of Theorem 3 from [4] says "The lower bounds ... and
\tilde{r} > q_0 \beta - \beta^2 follow in the same way as in the proof of
Theorem 2".

For the second adjustment, assume r = d_1 - 2, in which case the second
adjustment would have been made with r = r_1.

If the first adjustment did not apply, then we have r_1 - 1 = d_1 - 2 < q_0,
thus r_1 = d_1 - 1 < q_0 + 1. Thus r_1 <= q_0. But since
R' > q_0 \beta - \beta^2, then R' =  =  which is ok.

If the first adjustment did apply, then we have r = r_1 + d_1 mod \beta,
which with r = d_1 - 2 implies r_1 = \beta - 2. If R' < 0, then
R' = r_1 \beta - \beta^2 = -2 \beta which is ok. If R' >= 0, then
R' = \beta^2 - 2 \beta. But this is not possible with the upper bound
R' < max(\beta^2 - d, q_0 \beta) - \beta, since \beta^2 - d <= \beta^2/2,
and since the first adjustment did apply, r >= q_0 thus r_1 - 1 >= q_0,
thus q_0 <= \beta - 3.

Surely this can be simplified...

Paul

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-08-28 Thread paul zimmermann
   Dear Niels,

> > page 1: the division instruction is now much faster than before on modern
> >  processors
> 
> According to https://gmplib.org/~tege/x86-timing.pdf, they're still an
> order of magnitute slower than multiplication. E.g 86 vs 3 cycles on
> Intel skylake. And in addition, multiplication is piplined, division
> likely isn't.
> 
> Do you have any suggestion for better wording than "vastly slower" ?

maybe give the number of cycles and refer to x86-timing.pdf?

> > page 3, line 5 of Section 3:  -> 
> 
> I think  is right.

yes sorry the extra 0 should have been removed (same in the algorithm)

> > page 4, line 27: for the upper bound, I agree you can subtract the maximal
> > value of the u_0 term in the proof of Theorem 3 from [4], but this gives
> > \beta-1, not \beta.
> 
> This is a bit subtle, maybe I should explicitly repeat the proof form
> [4] with needed modifications. My argument is that the proof in [4] is
> of the form
> 
>   \tilde r = ... + u0 < ... + \beta <= c = max (\beta^2 - D, q_0 \beta)
> 
> But if u_0 < \beta is the only thing that lets us conclude that \tilde r
> < c, with strict inequality, then there's an off-by-one error here.
> 
> Now there are multiple bounds involved; to get \tilde r = c we'd also
> need something like u_1, u_0 and q_0 maximal and d_0 = 0.

if you need to save \beta with respect to the proof of [4], yes maybe you need
to repeat that proof to explain how you save the extra +1.

> > I wonder whether we can get rid of that term, or always subtract -1.
> > In the latter case (assuming the proof with the p_0 > 0 extra term is 
> > correct),
> > the only problem would be when p_0 = 0.
> 
> I have looked at that, so far without success. If I remember correctly,
> always subtracting -1 can be treated in the analysis as extending the
> single-word range for r_0 to 0 <= r_0 <= \beta. But that breaks the
> analysis of the  R' < 0 case. We have
> 
>   {r_1, r_0}  >=  q_0 \beta
> 
> We want to conclude r_1 >= q_0, but that no longer holds if we allow r_0
> = \beta. To make it work, we'd need to streighten the lower bound 
> 
>   R' >= q_0 \beta - \beta^2
> 
> Hmm... but from another look at the proof from [4], seems to give us
> strict inequality
> 
>   R' > q_0 \beta - \beta^2
> 
> with no extra effort. So maybe we can do that.

I'll have a look. The exhaustive search up to \beta=2^7 says that always
subtracting 1 seems to work, and the lower bound even improves to -2\beta < R'
instead of -2\beta <= R'.

Paul

PS: an extra typo in the new version: page 5, lines 4 and 5, d_o should be d_0
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-08-27 Thread paul zimmermann
   Dear Niels,

> From: ni...@lysator.liu.se (Niels Möller)
> Date: Sun, 05 Aug 2018 09:59:46 +0200
> 
> ni...@lysator.liu.se (Niels Möller) writes:
> 
> > static inline mp_limb_t
> > divappr_2 (mp_limb_t n2, mp_limb_t n1,
> >mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
> > {
> >   mp_limb_t q1, q0, r1, t1, t0, mask;
> >
> >   umul_ppmm (q1, q0, n2, dinv);
> >   add_ss (q1, q0, q1, q0, n2, n1);
> >
> >   q1++;  /* Can't overflow */
> >   umul_ppmm (t1, t0, q1, d0);  /* t0 isn't used */
> >   r1 = n1 - d1 * q1 - t1;
> 
> To be able to prove correctness, I'm afraid this needs to be
> 
>   r1 = n1 - d1 * q1 - t1 - (t0 > 0);
> 
> >   mask = - (mp_limb_t) (r1 >= q0);
> >   q1 += mask;
> >   r1 += mask & (d1 + 1);
> >   q1 += (r1 >= d1 - 1);
> >
> >   return q1;
> > }
> 
> I'm attaching a draft paper with the analysis. As for performance, I've
> measurered a few percent speedup on shell.
> 
> Regards,
> /Niels
> 
> [2:application/pdf Show Save:schoolbook-divappr.pdf (200kB)]

quite interesting! I have a few comments:

page 1: the division instruction is now much faster than before on modern
 processors

page 2: sigificant -> significant

page 3, line 2 of Section 3, remove "The output"

page 3, line 5 of Section 3:  -> 
line 6: correesponds -> corresponds

page 3, The algorithm:  -> 
step 3: u_2 -> u_1, u_1 -> u_0

page 4, line 5: d_1 - d_0 -> d_0 - d_1

page 4, line 7: inputes -> inputs

page 4, line 13: the the -> the

page 4, line 20: u_1 -> u_0

page 4, line 21: The last terms -> term

page 4, line 27: for the upper bound, I agree you can subtract the maximal
value of the u_0 term in the proof of Theorem 3 from [4], but this gives
\beta-1, not \beta.

page 4: "it follows that r_1 + d_1 + 1 > \beta". I do not agree, take for
example u_1 = u_0 = 0, d_1 = 128, d_0 = 1 with \beta = 256. You get r_1 = 127,
and r_1 + d_1 + 1 = 256 = \beta. But r_1 + d_1 + 1 >= \beta is enough for the
rest of the proof.

page 4, line -2: on line 8 -> on line 9

page 5, line 6: \beta r_1 + d_0 -> \beta r'_1 + d_0

For the case R' < 0, you did not prove the lower bound for r'_1 <= d_1 - 2,
and did not prove the upper bound for r'_1 >= d_1 - 1.

Case R' >= 0: if the -\beta term is replaced by -(\beta-1), then all I can get
is r_1 <= \beta - d_1 - 1.

Remark: on the other hand, an exhaustive search with \beta=2^k with
1 <= k <= 7 did not find any error, even without the p_0 > 0 extra term.
I wonder whether we can get rid of that term, or always subtract -1.
In the latter case (assuming the proof with the p_0 > 0 extra term is correct),
the only problem would be when p_0 = 0.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: divappr interface

2018-05-09 Thread paul zimmermann
   Niels,

> From: ni...@lysator.liu.se (Niels Möller)
> Date: Fri, 27 Apr 2018 22:28:39 +0200
> 
> ni...@lysator.liu.se (Niels Möller) writes:
> 
> > Once we reach qn = dn - 1, keep looping to produce quotient limbs, but
> > also discard one limb of dp in each interation, until we in the final
> > iteration have qn = 1, qn+2 = 3 numerator limbs, and qn+1 = 2 divisor
> > limbs, i.e., a single udiv_qr_3by2 (where we might consider skipping the
> > adjustment steps). I haven't done the error analysis, but I would expect
> > that errors are similar to the current code.
> 
> In the current mpn_sbpi1_divappr_q, there's a curious flag (or mask)
> variable used in this loop. It's initially all ones, cleared if we ever
> fail to cancel the top limb of the current partial remainder. When the
> flag is cleared, remaining quotient limbs are set to GMP_NUMB_MAX.
> 
> Torbjörn, Paul, do you remember the analysis behind this? (Code is since
> 2009...).
> 
> I would guess it might happen that even if higher quotient limbs are all
> correct, we might get non-zero high limb in
> 
>   partial remainder - GMP_NUMB_MAX * truncated D 
> 
> If we set the rest of the quotient limbs to GMP_NUMB_MAX, then the
> quotient should be large enough thanks to the low end divisor limbs
> which we ignored in the truncation.
> 
> Regards,
> /Niels

I don't believe I have written that code. Anyway I don't quite understand
the code from sbpi1_divappr_q.c (say in GMP 6.1.2):

* when flag becomes 0, the condition UNLIKELY (n1 >= (d1 & flag)) is always
  true, thus we always take that branch. Is that wanted?

* when n1 != cy on line 129, I guess that q = GMP_NUMB_MASK was too large,
  thus we should decrease q and add back the divisor. But the code in lines
  133-134 is never executed 
(https://gmplib.org/devel/lcov/shell/var/tmp/lcov/gmp/mpn/sbpi1_divappr_q.c.gcov.html)

* idem for the code in lines 177-178

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-05-09 Thread paul zimmermann
   Niels,

> From: ni...@lysator.liu.se (Niels Möller)
> Date: Sat, 28 Apr 2018 13:27:36 +0200
> [...]
> Inputs are a two limb normlized divisor, {d1, d0}, d1 >= B/2, and two
> numerator words, {n2, n1}, which must be less than {d1, d0}. [...]

if you use udiv_qr_3by2 to divide (say) a 6-limb number by a 3-limb number,
then in the schoolbook loop you might have {n2, n1} = {d1, d0}, in the case
where n0 is smaller than the next divisor limb d[-1].

If it is the responsibility of the caller to ensure {n2, n1} < {d1, d0},
then every caller of udiv_qr_3by2 must deal with that special case.

In my opinion, it would be simpler to allow {n2, n1} = {d1, d0}, and
return q=GMP_NUMB_MAX in that case.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: udiv_qr_3by2 vs divappr

2018-04-29 Thread paul zimmermann
   Niels,

> static inline mp_limb_t
> divappr_2 (mp_limb_t n2, mp_limb_t n1,
>mp_limb_t d1, mp_limb_t d0, mp_limb_t dinv)
> This needs more analysis. [...]

you might be interested by Algorithms DivApprox1 and DivApprox2
from https://hal.inria.fr/hal-01502326.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: dead code in div_q.c?

2018-04-27 Thread paul zimmermann
   Niels,

I fully agree with your analysis. Small comments below.

> From: ni...@lysator.liu.se (Niels Möller)
> Date: Thu, 26 Apr 2018 22:12:09 +0200
> 
> paul zimmermann <paul.zimmerm...@inria.fr> writes:
> 
> > In the second case (lines 267-276) this is less obvious since you use an
> > approximate division (divappr), and I don't know what is the maximal allowed
> > difference between divappr and div. If divappr always return a quotient
> > less or equal to the true quotient, then qh cannot be zero in this case.
> > But if qh can be larger than the true quotient, even by one bit, then it
> > is possible that qh is 1, in the case {new_np, new_nn} = 0111...111 and
> > {new_dp, dn} = 1000...000.
> 
> You're right again... I thought we had more margin.
> 
> According to comments, quotient from sb and dc divappr is correct or
> one too large, while quotient from mu divappr can be at most 4 too
> large.
> 
> And the limit case, N = B^n/2 - 1, D = B^k/2 is 
> 
>floor ((B^n/2 - 1) / (B^k/2)) = floor(B^{n-k} - B^{-k}/2) 
>  = B^{n-k} - 1 
> 
> So even we only a single unit of error, we could get qh == 1.
> 
> I wonder what actual error can be produced by the divappr functions in
> the corner cases that error could spill over into qh?
> 
> In the cy != 0 case, it looks like we we make a balanced divappr call,
> (2qn+2) / (qn+1). Lets put k = qn+1. For divappr we then have
> 
>   N <= B^{2k}/2 - 1, 
>   D >= B^k/2, lets say D = B/2 + e, e >= 0

D = B/2 + e should read D = B^k/2 + e

> To have any possibility to get qh > 0 with from the approximation error,
> we must have Q >= B^k - 4, where Q denotes the correct quotient. So
> assume that is the case. Now,
> 
>   D <= N/Q <= (B^{2k}/2 - 1) / (B^k - 4)
> 
> which implies
> 
>   e <= N/Q - B^k/2 <= (2 B^k - 1) / (B^k - 4) < 3
> 
> So we need to consider only D = B^k/2 + {0, 1, 2}. Then, any inverse not
> taking the very least significant D limb into account (including mu inverse)
> must be *all* ones, right?

yes if you compute the inverse as floor((B^(2k-2)-1)/floor(D/B))

> It would be good to either prove that qh > 0 can't happen, or add a test
> that will exercise the code handling that case.

yes more generally it would be nice to know for the divappr functions, if the
exact quotient fits on qn limbs (i.e., the exact qh is 0), can the
"approximate" qh be 1?

> > beware that in the case cy <> 0, one divides {new_np, nn+1} by {new_dp, dn}
> > thus filling {qp, nn-dn+1}. 
> 
> Thanks for pointing this out. Now I understand better.
> 
> Regards,
> /Niels
> 
> -- 
> Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
> Internet email is subject to wholesale government surveillance.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: C99 and GMP

2018-04-27 Thread paul zimmermann
> > quite interesting. Why is gmp/mpn not tested in the head coverage?
> 
> It is tested. It appears as /var/tmp/lcov/gmp/mpn because it is a set of 
> symlinks created at build time.

sorry I missed that. I see some of the files are not tested at all
(add_err3_n.c for example), and some have a low coverage (div_qr_1.c
for example). Are there any plans to improve that?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: dead code in div_q.c?

2018-04-26 Thread paul zimmermann
   Niels,

> From: ni...@lysator.liu.se (Niels Möller)
> Cc: gmp-devel@gmplib.org,  raphael.rieu-he...@lri.fr
> Date: Wed, 25 Apr 2018 22:22:41 +0200
> User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/25.3 (berkeley-unix)
> 
> ni...@lysator.liu.se (Niels Möller) writes:
> 
> > paul zimmermann <paul.zimmerm...@inria.fr> writes:
> >
> >> together with Raphaël Rieu-Hleft (in cc), we believe we have found some 
> >> dead code in
> >> mpn/generic/div_q.c around lines 173-182:
> >>
> >>   else if (UNLIKELY (qh != 0))
> >> {
> >>   /* This happens only when the quotient is close to B^n and   
> >>  
> >>  mpn_*_divappr_q returned B^n.  */
> >
> > I think your right. This comment is wrong in two ways: First, divappr is
> > not called (directly) for this branch of code, and second, the quotient
> > can be at most close to B^n/2.
> 
> I'm trying the appended patch. Besides the analysis, I've ran 
> 
>   while GMP_CHECK_RANDOMIZE=0 ./tests/mpn/t-div ; do : ; done
> 
> for at least half an hour. So I think we can drop the code. 

for the first case qh != 0 (lines 174-183) we are sure this cannot happen,
since cnt < GMP_NUMB_BITS, thus new_dp[new_nn-1] has its most significant
bit 0, and since new_dp[dn-1] has its most significant set, we cannot have
a carry in the division.

In the second case (lines 267-276) this is less obvious since you use an
approximate division (divappr), and I don't know what is the maximal allowed
difference between divappr and div. If divappr always return a quotient
less or equal to the true quotient, then qh cannot be zero in this case.
But if qh can be larger than the true quotient, even by one bit, then it
is possible that qh is 1, in the case {new_np, new_nn} = 0111...111 and
{new_dp, dn} = 1000...000.

> Question is,
> how much of comments and/or ASSERT are useful? And should we make the
> assignments
> 
>   qp[qn - 1] = qh;
> 
> and 
> 
>   tp[qn] = qh;
> 
> unconditional? As far as I see, the areas are always large enough.

beware that in the case cy <> 0, one divides {new_np, nn+1} by {new_dp, dn}
thus filling {qp, nn-dn+1}. If you write qp[qn-1]=qh afterwards, you will
erase the high limb of the quotient. (By the way, this is another way to
prove that qh=0 in this case, since we know the quotient fits on nn-dn+1
limbs.)

If the division routines already check for zero high limbs of the dividend,
then it would be simpler to have new_nn = nn + 1 in all cases, and we would
not need to write qp[qn-1].

Paul

> But it's a unclear to me if and why it's sometimes permissible to omit
> writing the top limbs? Reads of the top limbs look unconditional. 
> 
> I've tried adding 
> 
> --- a/tests/mpn/t-div.c   Wed Apr 25 07:38:14 2018 +0200
> +++ b/tests/mpn/t-div.c   Wed Apr 25 22:19:17 2018 +0200
> @@ -445,6 +445,7 @@ main (int argc, char **argv)
> alloc = itch + 1;
>   }
> scratch[itch] = ran;
> +   MPN_COPY (qp, junkp, nn - dn + 1);
> mpn_div_q (qp, np, nn, dup, dn, scratch);
> ASSERT_ALWAYS (ran == scratch[itch]);
> ASSERT_ALWAYS (qp[-1] == qran0);  ASSERT_ALWAYS (qp[nn - dn + 1] == 
> qran1);
> 
> to the test, to ensure that limbs aren't correct thanks to the previous
> call. Tests still pass. So I'm a bit puzzled.
> 
> Regards,
> /Niels
> 
> *** /tmp/extdiff.0aBNfO/gmp.765c2c27523b/mpn/generic/div_q.c  2018-04-25 
> 21:55:51.457769871 +0200
> --- /home/nisse/hack/gmp/mpn/generic/div_q.c  2018-04-25 21:18:01.516501993 
> +0200
> *** mpn_div_q (mp_ptr qp,
> *** 171,184 
> if (cy == 0)
>   qp[qn - 1] = qh;
> !   else if (UNLIKELY (qh != 0))
> ! {
> !   /* This happens only when the quotient is close to B^n and
> !  mpn_*_divappr_q returned B^n.  */
> !   mp_size_t i, n;
> !   n = new_nn - dn;
> !   for (i = 0; i < n; i++)
> ! qp[i] = GMP_NUMB_MAX;
> !   qh = 0;   /* currently ignored */
> ! }
>   }
> else  /* divisor is already normalised */
> --- 171,176 
> if (cy == 0)
>   qp[qn - 1] = qh;
> !   else
> ! ASSERT_ALWAYS (qh == 0);
>   }
> else  /* divisor is already normalised */
> *** mpn_div_q (mp_ptr qp,
> *** 263,276 
> if (cy == 0)
>   tp[qn] = qh;
> !   else if (UNLIKELY (qh != 0))
> ! {
> !   /* This happens only when the quotient is close to B^n and
> !  mpn_*_divappr_q returned B^n.  */
> !   mp_size_t i, n;
> !   n = new_nn - (qn + 1)

Re: Comparison of multiple-precision floating-point software

2018-04-25 Thread paul zimmermann
   Dear Torbjörn,

> I am surprised that there are non-marginal differences for larger
> operations.  Don't we all use mpn?  Bookkeeping should be the only
> difference.

since 2004 we did implement in MPFR "short" product/square/division,
which compute an approximation of the upper n limbs of a nxn product
or square, or of the quotient of a 2n/n division, following this paper:

@article{Mulders00,
  author =   {T. Mulders},
  title ={On short Multiplications and Divisions},
  journal = "AAECC",
  year = 2000,
  volume = 11,
  number = 1,
  pages = "69--88"
}

If such algorithms would be available in GMP, of course we would use them.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Comparison of multiple-precision floating-point software

2018-04-25 Thread paul zimmermann
   Dear GMP developers,

I have updated my comparison of multiple-precision floating-point software.
The old page was comparing MPF from GMP 5.0.2 with (among others) MPFR 3.1.2:

http://www.mpfr.org/mpfr-3.1.2/timings.html

Here, MPF was only faster than MPFR for 100d mul and sqr, and 1000d div.

The new comparison is available at (temporary web page):

https://members.loria.fr/PZimmermann/timings.html

Now MPF is faster than MPFR for all 100d operations, for 1000d and 1d
div. You have done a great work in GMP 6!

You should be able to reproduce the timings since they were done on gcc67,
and the source code is available.

Your feedback is welcome.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_mul is embarrassingly slow

2018-04-24 Thread paul zimmermann
   Dear Torbjörn,

> What do you think about this stopgap change?

I would entirely drop all the squaring-related stuff from mpn_mul:
the user/developer should call mpn_sqr instead (see my previous mail).

Then the code would become:

   if (BELOW_THRESHOLD (vn, MUL_TOOM22_THRESHOLD))
 {
   mpn_mul_basecase (prodp, up, un, vp, vn);
   return prodp[un + vn - 1];   /* historic */
 }

   if (un == vn)
   return mpn_mul_n (prodp, up, vp, un);

   ...

(I believe it is vn and not un that should be compared to MUL_TOOM22_THRESHOLD.)

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_mul is embarrassingly slow

2018-04-24 Thread paul zimmermann
> It is surely silly that we don't have any mpn call for when it is known
> that the multiplier and multiplicand are distinct.

but now that mpn_sqr is in the GMP interface (since GMP 6 I guess), why
check for {up,un} = {vp,vn} in mpn_mul? Shouldn't the user or the GMP
developer call mpn_sqr directly?

In "make check" of GMP 6.1.2, the "square" branch of mpn_mul is taken only
in 4 tests: mpf/t-sqrt, mpf/t-sqrt_ui, mpf/reuse and mpf/t-pow_ui. It seems
some of those cases are because there is no mpf_sqr function, and mpf_mul
calls mpn_mul without checking for squares.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


MPN_FILL vs MPN_ZERO

2018-04-24 Thread paul zimmermann
   Hi,

I just discovered the internal macro MPN_FILL.

I understand MPN_ZERO(p,n) is implemented as "if (n) MPN_FILL(p, n, 0)".

Then in case MPN_FILL is implemented using memset, since memset also checks
for the case n=0, MPN_ZERO(p,n) will perform two tests for n > 0. Why not
directly call MPN_FILL?

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


unused function

2018-04-21 Thread paul zimmermann
   Dear all,

in GMP 6.1.2, the mpn_bdiv_q_1 function in mpn/generic/bdiv_q_1.c is used
nowhere in the library, and is not exported, thus it could be removed.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


dead code in div_q.c?

2018-04-20 Thread paul zimmermann
   Hi,

together with Raphaël Rieu-Hleft (in cc), we believe we have found some dead 
code in
mpn/generic/div_q.c around lines 173-182:

  else if (UNLIKELY (qh != 0))
{
  /* This happens only when the quotient is close to B^n and
 mpn_*_divappr_q returned B^n.  */
  mp_size_t i, n;
  n = new_nn - dn;
  for (i = 0; i < n; i++)
qp[i] = GMP_NUMB_MAX;
  qh = 0;   /* currently ignored */
}

Indeed, in the case cy <> 0, since cnt < GMP_NUMB_BITS, the most significant 
limb
new_np[nn] of the dividend has at least one 0 leading bit, and thus is smaller
than new_dp[dn-1] (which is normalized). Therefore qh is always 0 (no 
approximate
division is performed here).

Raphaël and Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_mul is embarrassingly slow

2018-04-20 Thread paul zimmermann
   Niels,

> Such an assembly routine would need access to the threshold between
> basecase and generic, which in the case of fat builds isn't a compile
> time constant. 

but you could determine at compile time a lower bound for fat builds, no?

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Lazy mpz allocation

2018-04-20 Thread paul zimmermann
> Only 0 can have lazy allocation, and I think we document that it isn't 
> legal to put 0 on the denominator.

where is this documented? In mpfr_set_q we use the fact that the user can set q 
to 1/0
for example to represent +Inf.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: How large are arbitrarily large integers? (Was Re: Bitfield subtlety)

2018-03-30 Thread paul zimmermann
> For some reason, I also limited the choices so that either 8 or 16 bits
> are added to the _mp_size field.  That might give a small speedup in
> some cases, perhaps, but the result is that only one balanced
> possibility  remain:
> 
> b = 2
> manstissa bits = 10
> exponent bits = 6
> 
> Let's abbreviate this as (2,10,6).
> 
> This gives an allocation range of almost 2^73 limbs, which is of course
> silly, but (2,11,5) would result in just the range 2^42 limbs.

another possibility is the "unum" type, where some bits are used to define
how many bits are used for the significand and exponent:

https://en.m.wikipedia.org/wiki/Unum_(number_format)

In the limit of unum, you could use only 1 extra bit, say for a 31-bit type
(thus with 30 remaining bits):

1) if 0, the 30 bits are interpreted as a size between 0 and 2^30-1
2) if 1, the 30 bits are split into say 24 significand bits and 6
   exponent bits, which would be enough up to 2^87 with an overhead of
   less than 1e-6

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: How large are arbitrarily large integers? (Was Re: Bitfield subtlety)

2018-03-29 Thread paul zimmermann
> I think one relevant question is on how large numbers we expect to be
> able to complete any computation with worse complexity than O(n) in
> reasonable time. But when we integrate fft multiplication with decent
> locality, the limit will be cpu power, not available RAM, I'd hope.

not sure. With GMP 17, we'll have multi-thread multiplication,
then memory will be the main limitation.

Paul Zimmermann
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Miller-Rabin

2018-03-27 Thread paul zimmermann
   Niels,

>   int
>   mpz_miller_rabin_uis (mpz_srcptr n, size_t count, unsigned long *bases);

this interface would be nice for me. I don't think a bignum base is necessary,
unless you use REDC in powm, in which case (for n having at least 2 limbs) it
might be faster to use a base b such that b*R mod n is small, where R is the 
REDC
multiplier.

> Out of curiosity, are any similar results known when using bases given
> by Euler's polynomial,
> 
>   a_j = j^2 + j + 41, 
> 
> with the curious property that j_40 = 41^2 is the first composite in the 
> sequence?

a quick test shows that n=21 already requires 3 bases, like 764941, 1004653...

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Miller-Rabin

2018-03-26 Thread paul zimmermann
   Hi,

it would be nice if there would be a GMP function implementing the Miller-Rabin 
test,
to have finer control than with mpz_probab_prime_p.

This would allow to get a real primality test for small enough input.

For example on 
https://en.wikipedia.org/wiki/Miller%E2%80%93Rabin_primality_test we
can read:

if n < 4,759,123,141, it is enough to test a = 2, 7, and 61;

thus the GMP user who wants to check inputs less than 4759123141 could write :

int
my_isprime (mpz_t p)
{
   assert (mpz_cmp_ui (p, 4759123141UL) < 0);
   return mpz_miller_rabin (p, 2) && mpz_miller_rabin (p, 7) && 
mpz_miller_rabin (p, 61);
}

Alternatively, if you don't want to add a new function, you could allow 
negative REPS
argument for mpz_probab_prime_p:

* for example REPS=-1 would mean checking base 2 (n < 2,047)
* REPS=-2 would mean checking bases 31 and 73 (n < 9080191)
* REPS=-3 would mean checking bases 2, 7, and 61 (n < 4759123141)
* ...

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mini-gmp

2017-12-08 Thread paul zimmermann
   Dear Niels,

> If we add a version number to mini-gmp, what should it be? The only
> "releases" of mini-gmp are gmp releases, should we adopt that, and then
> have no well-defined version number for mini-gmp between gmp releases?

it could then be the corresponding GMP release version, for example you could
have in mini-gmp.h:

#define __GMP_IS_MINI
#define __GNU_MP_VERSION6
#define __GNU_MP_VERSION_MINOR  1
#define __GNU_MP_VERSION_PATCHLEVEL 2

then applications could know the version of mini-gmp is that shipped with
GMP 6.1.2.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mini-gmp

2017-12-07 Thread paul zimmermann
   Marc,

> I am curious why you are trying to support mini-gmp in mpfr at all. As far 
> as I understand, the goal of mini-gmp is that a user can take a copy of 
> those 2 files, stick them in his project, and get a self-contained 
> program. Unless you provide a similar mini-mpfr, your user is going to 
> have to install the mpfr dependency anyway, his project is not 
> self-contained, so he might as well install the real GMP.

mpfr users might have their own interest in using mini-gmp, but my own
interest is for mpfr development.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mini-gmp

2017-12-06 Thread paul zimmermann
   Dear Niels,

> We could add some documented macro for this purpse, but I'm not sure I
> understand why you need it.

for example, we check the version of GMP is at least 5.0.0:

#if !__MPFR_GMP(5,0,0)
# error "GMP 5.0.0 or newer is required"
#endif

where the __MPFR_GMP macro uses __GNU_MP_VERSION, ...

Since mini-gmp does not provide any version macro, we can't check
the version of mini-gmp is good enough.

> And then I have a couple of #if NETTLE_USE_MINI_GMP spread out in the
> rest of the code.

same in MPFR:

$ grep MPFR_USE_MINI_GMP src/*.[ch] tests/*.[ch] | wc -l
47

> > and finally mpz_dump.
> 
> Hmm. It seems this function is declared in gmp.h, but not mentioned in
> the manual, and with a discouraging comment at the top of mpz/dump.c.
> 
> Any reason you can't switch to always use mpz_out_str instead?

in fact it seems we no longer need this function in MPFR.

Thanks!
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mini-gmp

2017-12-06 Thread paul zimmermann
   Dear GMP developers,

I see that you have added some functions in the mini-gmp interface, with
respect to the last time I tried it with MPFR, for example mpn_neg, mpn_com
and mpn_sqrtrem. Thanks!

However, we are still missing a proper (documented) way to distinguish
mini-gmp from plain gmp. We could use the __MINI_GMP_H__ macro but it is not
documented. A macro like GMP_IS_MINI would be great, maybe with another
macro that would help us to identify the corresponding version of GMP
(the corresponding plain GMP macro would be GMP_IS_GREAT of course :-)

And I notice that very few of the division functions from mini-gmp correspond
to those of the GMP documentation. Thus to compile MPFR with mini-gmp we had
to re-implement mpn_divrem_1, mpn_divrem and mpn_tdiv_qr. We can contribute
them to mini-gmp if you want.

The other functions we had to re-implement are some random functions
(using srand48/lrand48, thus generating different random sequences, but
for MPFR this is fine), and finally mpz_dump.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Revisiting gcd and gcdext for small operands

2017-12-01 Thread paul zimmermann
> > Without detail knowledge of the current implementation, these numbers
> > suggest to me that we could speed small operand gcd and gcdext.  But I
> > might be wrong.

not clear to me. For one word basically you can perform a multiply in a few
cycles, whereas for a gcd you may need up to mp_bits_per_limb operations.

If you find an algorithm that computes a gcd of two words in a few cycles,
it would be a major improvement!

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Generic get_d_2exp failures

2017-10-28 Thread paul zimmermann
   Niels,

> It would be nice if we could find a portable way to add two floating
> point values without rounding up.

see http://perso.univ-perp.fr/langlois/slides/imacs05_sl.pdf,
slide 10, algorithms "Two Sum" and "Fast Two Sum" in the literature.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: Replacing "redc"

2017-07-20 Thread paul zimmermann
quite interesting. Is the API of the new interface fixed? I'd like to try it
with GMP-ECM, where we currently use the redc_* internal functions. For example
on my machine Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz, running at 3499.804Mhz:

$ ./bench_mulredc -v
...
Time in microseconds per call, size=8 (iter=1807889):
mpn_mul_n  = 0.055
mpn_sqr= 0.042
mpn_redc_1 = 0.069 *
mpn_redc_2 = 0.077
mpn_redc_n = disabled

mulredc= 0.124
mul+redc_1 = 0.122 *
mul+redc_2 = 0.135
mul+redc_n = disabled

mulredc= 0.124
sqr+redc_1 = 0.108 *
sqr+redc_2 = 0.122
sqr+redc_n = disabled
mulredc1   = 0.013

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_sqrtrem{1,2} - patch for pure C implem

2017-06-09 Thread paul zimmermann
   Dear Marco,

> Finally we can totally skip the first approximation and directly compute
> 
> i' = T1[h(a)] + l(a) * T2[h(a)]
> 
> using a piece-wise linear function.

yes, we noticed with Niels that we can replace most of the arithmetic in the
first Newton iteration by another table lookup.

Not sure l(a) is better than using the upper (say) 16 bits of a, in case you
want a 16-bit approximation i'.

To compute T1 and T2 you have two choices:

1) either try to minimize the maximal error after the first Newton iteration
   (i.e., on i')

2) or try to minimize the maximal error after k iterations. See the article
   by Jean-Michel Muller and Peter Kornerup:
   https://hal.archives-ouvertes.fr/inria-00071899/document

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


patch for speed to measure the basic mpf functions

2016-09-01 Thread paul zimmermann
   Dear GMP developers,

below is a patch (against GMP 6.1.1) that allows to measure the basic mpf 
functions
using speed (with -s in bits):

$ ./speed -p10 -c -s 113 mpf_add mpf_sub mpf_mul mpf_div mpf_sqrt
overhead 3.56 cycles, precision 10 units of 1.25e-09 secs, CPU freq 
800.13 MHz
  mpf_add   mpf_sub   mpf_mul   mpf_div  mpf_sqrt
113#45.00 48.36 46.30145.90352.58

I contribute this patch to GMP, but most probably you can improve it.

Best regards,
Paul

--- common.c.orig   2016-09-01 15:48:21.378349300 +0200
+++ common.c2016-09-01 16:34:24.035003662 +0200
@@ -1998,6 +1998,152 @@
mpf_clear (f));
 }
 
+double
+speed_mpf_add (struct speed_params *s)
+{
+  mpf_t w, x, y;
+  unsigned  i;
+  doublet;
+
+  mpf_init2 (w, s->size);
+  mpf_init2 (x, s->size);
+  mpf_init2 (y, s->size);
+
+  mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_add (w, x, y);
+
+  speed_starttime ();
+  i = s->reps;
+  do
+{
+  mpf_add (w, x, y);
+}
+  while (--i != 0);
+  t = speed_endtime ();
+
+  mpf_clear (w);
+  mpf_clear (x);
+  mpf_clear (y);
+  return t;
+}
+
+double
+speed_mpf_sub (struct speed_params *s)
+{
+  mpf_t w, x, y;
+  unsigned  i;
+  doublet;
+
+  mpf_init2 (w, s->size);
+  mpf_init2 (x, s->size);
+  mpf_init2 (y, s->size);
+
+  mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_sub (w, x, y);
+
+  speed_starttime ();
+  i = s->reps;
+  do
+{
+  mpf_sub (w, x, y);
+}
+  while (--i != 0);
+  t = speed_endtime ();
+
+  mpf_clear (w);
+  mpf_clear (x);
+  mpf_clear (y);
+  return t;
+}
+
+double
+speed_mpf_mul (struct speed_params *s)
+{
+  mpf_t w, x, y;
+  unsigned  i;
+  doublet;
+
+  mpf_init2 (w, s->size);
+  mpf_init2 (x, s->size);
+  mpf_init2 (y, s->size);
+
+  mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_mul (w, x, y);
+
+  speed_starttime ();
+  i = s->reps;
+  do
+{
+  mpf_mul (w, x, y);
+}
+  while (--i != 0);
+  t = speed_endtime ();
+
+  mpf_clear (w);
+  mpf_clear (x);
+  mpf_clear (y);
+  return t;
+}
+
+double
+speed_mpf_div (struct speed_params *s)
+{
+  mpf_t w, x, y;
+  unsigned  i;
+  doublet;
+
+  mpf_init2 (w, s->size);
+  mpf_init2 (x, s->size);
+  mpf_init2 (y, s->size);
+
+  mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_random2 (y, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_div (w, x, y);
+
+  speed_starttime ();
+  i = s->reps;
+  do
+{
+  mpf_div (w, x, y);
+}
+  while (--i != 0);
+  t = speed_endtime ();
+
+  mpf_clear (w);
+  mpf_clear (x);
+  mpf_clear (y);
+  return t;
+}
+
+double
+speed_mpf_sqrt (struct speed_params *s)
+{
+  mpf_t w, x;
+  unsigned  i;
+  doublet;
+
+  mpf_init2 (w, s->size);
+  mpf_init2 (x, s->size);
+
+  mpf_random2 (x, 1 + (s->size - 1) / GMP_NUMB_BITS, 0);
+  mpf_sqrt (w, x);
+
+  speed_starttime ();
+  i = s->reps;
+  do
+{
+  mpf_sqrt (w, x);
+}
+  while (--i != 0);
+  t = speed_endtime ();
+
+  mpf_clear (w);
+  mpf_clear (x);
+  return t;
+}
 
 /* Compare this to mpn_add_n to see how much overhead mpz_add adds.  Note
that repeatedly calling mpz_add with the same data gives branch prediction
--- speed.c.orig2016-09-01 15:48:04.174345004 +0200
+++ speed.c 2016-09-01 15:40:37.746232291 +0200
@@ -509,6 +509,11 @@
   { "mpz_init_clear",   speed_mpz_init_clear   },
   { "mpq_init_clear",   speed_mpq_init_clear   },
   { "mpf_init_clear",   speed_mpf_init_clear   },
+  { "mpf_add",  speed_mpf_add  },
+  { "mpf_sub",  speed_mpf_sub  },
+  { "mpf_mul",  speed_mpf_mul  },
+  { "mpf_div",  speed_mpf_div  },
+  { "mpf_sqrt", speed_mpf_sqrt },
   { "mpz_init_realloc_clear",   speed_mpz_init_realloc_clear   },
 
   { "umul_ppmm", speed_umul_ppmm, FLAG_R_OPTIONAL },
--- speed.h.orig2016-09-01 15:48:10.370346551 +0200
+++ speed.h 2016-09-01 15:37:47.686188699 +0200
@@ -153,6 +153,11 @@
 double speed_binvert_limb_arith (struct speed_params *);
 
 double speed_mpf_init_clear (struct speed_params *);
+double speed_mpf_add (struct speed_params *);
+double speed_mpf_sub (struct speed_params *);
+double speed_mpf_mul (struct speed_params *);
+double speed_mpf_div (struct speed_params *);
+double speed_mpf_sqrt (struct speed_params *);
 
 double speed_mpn_add_n (struct speed_params *);
 double speed_mpn_add_1 (struct speed_params *);
___
gmp-devel mailing list

mpn_sqrtrem2

2016-07-20 Thread paul zimmermann
   Hi,

the attached file contains a new version (mpn_sqrtrem2_new) of mpn_sqrtrem2
for 64-bit computers. On my computer i5-4590 it runs in about 65 cycles
whereas the gmp-6.1.1 version runs in about 96 cycles.

It uses an auxiliary routine (mpn_rsqrtrem1) which computes a 41-bit
approximation of 1/sqrt(a).

Before I try to adapt those two routines to 32-bit limbs, I'd like someone
to review my code (comments are included).

Paul
/* mpn_sqrtrem -- square root and remainder

   Contributed to the GNU project by Paul Zimmermann (most code),
   Torbjorn Granlund (mpn_sqrtrem1) and Marco Bodrato (mpn_dc_sqrt).

   THE FUNCTIONS IN THIS FILE EXCEPT mpn_sqrtrem ARE INTERNAL WITH A
   MUTABLE INTERFACE.  IT IS ONLY SAFE TO REACH THEM THROUGH DOCUMENTED
   INTERFACES.  IN FACT, IT IS ALMOST GUARANTEED THAT THEY WILL CHANGE OR
   DISAPPEAR IN A FUTURE GMP RELEASE.

Copyright 1999-2002, 2004, 2005, 2008, 2010, 2012, 2015 Free Software
Foundation, Inc.

This file is part of the GNU MP Library.

The GNU MP Library is free software; you can redistribute it and/or modify
it under the terms of either:

  * the GNU Lesser General Public License as published by the Free
Software Foundation; either version 3 of the License, or (at your
option) any later version.

or

  * the GNU General Public License as published by the Free Software
Foundation; either version 2 of the License, or (at your option) any
later version.

or both in parallel, as here.

The GNU MP Library is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received copies of the GNU General Public License and the
GNU Lesser General Public License along with the GNU MP Library.  If not,
see https://www.gnu.org/licenses/.  */


/* See "Karatsuba Square Root", reference in gmp.texi.  */


#include 
#include 

#include "gmp.h"
#include "gmp-impl.h"
#include "longlong.h"
#define USE_DIVAPPR_Q 1
#define TRACE(x)

static const unsigned char invsqrttab[384] = /* The common 0x100 was removed */
{
  0xff,0xfd,0xfb,0xf9,0xf7,0xf5,0xf3,0xf2, /* sqrt(1/80)..sqrt(1/87) */
  0xf0,0xee,0xec,0xea,0xe9,0xe7,0xe5,0xe4, /* sqrt(1/88)..sqrt(1/8f) */
  0xe2,0xe0,0xdf,0xdd,0xdb,0xda,0xd8,0xd7, /* sqrt(1/90)..sqrt(1/97) */
  0xd5,0xd4,0xd2,0xd1,0xcf,0xce,0xcc,0xcb, /* sqrt(1/98)..sqrt(1/9f) */
  0xc9,0xc8,0xc6,0xc5,0xc4,0xc2,0xc1,0xc0, /* sqrt(1/a0)..sqrt(1/a7) */
  0xbe,0xbd,0xbc,0xba,0xb9,0xb8,0xb7,0xb5, /* sqrt(1/a8)..sqrt(1/af) */
  0xb4,0xb3,0xb2,0xb0,0xaf,0xae,0xad,0xac, /* sqrt(1/b0)..sqrt(1/b7) */
  0xaa,0xa9,0xa8,0xa7,0xa6,0xa5,0xa4,0xa3, /* sqrt(1/b8)..sqrt(1/bf) */
  0xa2,0xa0,0x9f,0x9e,0x9d,0x9c,0x9b,0x9a, /* sqrt(1/c0)..sqrt(1/c7) */
  0x99,0x98,0x97,0x96,0x95,0x94,0x93,0x92, /* sqrt(1/c8)..sqrt(1/cf) */
  0x91,0x90,0x8f,0x8e,0x8d,0x8c,0x8c,0x8b, /* sqrt(1/d0)..sqrt(1/d7) */
  0x8a,0x89,0x88,0x87,0x86,0x85,0x84,0x83, /* sqrt(1/d8)..sqrt(1/df) */
  0x83,0x82,0x81,0x80,0x7f,0x7e,0x7e,0x7d, /* sqrt(1/e0)..sqrt(1/e7) */
  0x7c,0x7b,0x7a,0x79,0x79,0x78,0x77,0x76, /* sqrt(1/e8)..sqrt(1/ef) */
  0x76,0x75,0x74,0x73,0x72,0x72,0x71,0x70, /* sqrt(1/f0)..sqrt(1/f7) */
  0x6f,0x6f,0x6e,0x6d,0x6d,0x6c,0x6b,0x6a, /* sqrt(1/f8)..sqrt(1/ff) */
  0x6a,0x69,0x68,0x68,0x67,0x66,0x66,0x65, /* sqrt(1/100)..sqrt(1/107) */
  0x64,0x64,0x63,0x62,0x62,0x61,0x60,0x60, /* sqrt(1/108)..sqrt(1/10f) */
  0x5f,0x5e,0x5e,0x5d,0x5c,0x5c,0x5b,0x5a, /* sqrt(1/110)..sqrt(1/117) */
  0x5a,0x59,0x59,0x58,0x57,0x57,0x56,0x56, /* sqrt(1/118)..sqrt(1/11f) */
  0x55,0x54,0x54,0x53,0x53,0x52,0x52,0x51, /* sqrt(1/120)..sqrt(1/127) */
  0x50,0x50,0x4f,0x4f,0x4e,0x4e,0x4d,0x4d, /* sqrt(1/128)..sqrt(1/12f) */
  0x4c,0x4b,0x4b,0x4a,0x4a,0x49,0x49,0x48, /* sqrt(1/130)..sqrt(1/137) */
  0x48,0x47,0x47,0x46,0x46,0x45,0x45,0x44, /* sqrt(1/138)..sqrt(1/13f) */
  0x44,0x43,0x43,0x42,0x42,0x41,0x41,0x40, /* sqrt(1/140)..sqrt(1/147) */
  0x40,0x3f,0x3f,0x3e,0x3e,0x3d,0x3d,0x3c, /* sqrt(1/148)..sqrt(1/14f) */
  0x3c,0x3b,0x3b,0x3a,0x3a,0x39,0x39,0x39, /* sqrt(1/150)..sqrt(1/157) */
  0x38,0x38,0x37,0x37,0x36,0x36,0x35,0x35, /* sqrt(1/158)..sqrt(1/15f) */
  0x35,0x34,0x34,0x33,0x33,0x32,0x32,0x32, /* sqrt(1/160)..sqrt(1/167) */
  0x31,0x31,0x30,0x30,0x2f,0x2f,0x2f,0x2e, /* sqrt(1/168)..sqrt(1/16f) */
  0x2e,0x2d,0x2d,0x2d,0x2c,0x2c,0x2b,0x2b, /* sqrt(1/170)..sqrt(1/177) */
  0x2b,0x2a,0x2a,0x29,0x29,0x29,0x28,0x28, /* sqrt(1/178)..sqrt(1/17f) */
  0x27,0x27,0x27,0x26,0x26,0x26,0x25,0x25, /* sqrt(1/180)..sqrt(1/187) */
  0x24,0x24,0x24,0x23,0x23,0x23,0x22,0x22, /* sqrt(1/188)..sqrt(1/18f) */
  0x21,0x21,0x21,0x20,0x20,0x20,0x1f,0x1f, /* sqrt(1/190)..sqrt(1/197) */
  0x1f,0x1e,0x1e,0x1e,0x1d,0x1d,0x1d,0x1c, /* sqrt(1/198)..sqrt(1/19f) */
  0x1c,0x1b,0x1b,0x1b,0x1a,0x1a,0x1a,0x19, /* sqrt(1/1a0)..sqrt(1/1a7) */
  0x19,0x19,0x18,0x18,0x18,0x18,0x17,0x17, /* sqrt(1/1a8)..sqrt(1/1af) */
  0x17,0x1

Re: mpn_invert_limb

2016-07-20 Thread paul zimmermann
   Niels,

> > would it be possible to add mpn_invert_limb in the public interface?
> > It would be very useful for GNU MPFR.
> 
> Makes sense to me, it's a useful and well-defined operations, and
> discussed at length in various papers. So should be quite easy to
> document.

great!

Please keep me informed when it happens, so that we can update MPFR accordingly.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_sqrtrem1

2016-07-19 Thread paul zimmermann
   Torbjörn,

> I am not proud of the mpn_sqrtrem1 code.  It was made with some analysis
> and lots of testing.  The many undocumented magic constants are ugly.
> Perhaps I should never have checked in this code.

ah ok. I have an alternate implementation which is almost as efficient,
but although it passes make check with both GMP and MPFR, I have not yet
proven it is correct :-(

> Furthermore, for a real performance win one would need to rewrite
> mpn_sqrtrem2.

I know how to improve mpn_sqrtrem2. For a 64-bit computer, start from a 32-bit
approximation, say x, of 1/sqrt(a), which is already done by mpn_sqrtrem1. Then
use Karp-Markstein's trick for the square root:

y = approx(a*x) # 32-bit approximation of sqrt(a)
t = a - y^2 
return y + x/2*t # 64-bit approximation of sqrt(a)

I will try to implement that.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mpn_sqrtrem1

2016-07-19 Thread paul zimmermann
   Hi,

in my current effort of optimizing MPFR for small precisions (1-2 limbs),
I am currently reviewing mpn_sqrtrem1() and mpn_sqrtrem2(), since we need
such functions to speed up mpfr_sqrt.

In mpn_sqrtrem1, I wonder what the 0x3 constant is used for. By exhaustive
search on all possible values of a1 (there are only 3/4*2^33) on a 64-bit 
computer,
the maximal error on the 16-bit approximation x0 is the same without this 
constant,
more precisely -0.23 <= x0/2^32 - (a0/2^64)^(-1/2) <= 0.

The worst error is even smaller (in absolute value) without the 0x3 
constant,
since with - 0x3 it is attained for a1=2248147012 (error -0.231867), and
without 0x3 it is attained for a1=2248147013 (error -0.231863).

Also "make check" passes without the 0x3 constant. If the 0x3 constant 
is
really needed, it would be nice to add an example that breaks make check if we
remove it.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


__udiv_qrnnd_c

2016-07-18 Thread paul zimmermann
   Hello,

with the patch below, I get a small speedup on __udiv_qrnnd_c.

Best regards,
Paul Zimmermann

--- longlong.h.orig 2016-07-18 09:47:05.832658464 +0200
+++ longlong.h  2016-07-18 09:57:21.832816211 +0200
@@ -2064,7 +2064,7 @@
 /* Define this unconditionally, so it can be used for debugging.  */
 #define __udiv_qrnnd_c(q, r, n1, n0, d) \
   do { \
-UWtype __d1, __d0, __q1, __q0, __r1, __r0, __m;\
+UWtype __d1, __d0, __q1, __q0, __r1, __r0;  \
\
 ASSERT ((d) != 0); \
 ASSERT ((n1) < (d));   \
@@ -2074,29 +2074,25 @@
\
 __q1 = (n1) / __d1;
\
 __r1 = (n1) - __q1 * __d1; \
-__m = __q1 * __d0; \
-__r1 = __r1 * __ll_B | __ll_highpart (n0); \
-if (__r1 < __m)\
+__r0 = __r1 * __ll_B | __ll_highpart (n0); \
+__r1 = __r0 - __q1 * __d0;  \
+if (__r1 > __r0) /* borrow when subtracting q1*d0 */\
   {
\
__q1--, __r1 += (d);\
-   if (__r1 >= (d)) /* i.e. we didn't get carry when adding to __r1 */\
- if (__r1 < __m)   \
-   __q1--, __r1 += (d);\
+   if (__r1 > (d)) /* no carry when adding d */\
+  __q1--, __r1 += (d);  \
   }
\
-__r1 -= __m;   \
\
 __q0 = __r1 / __d1;
\
 __r0 = __r1  - __q0 * __d1;
\
-__m = __q0 * __d0; \
-__r0 = __r0 * __ll_B | __ll_lowpart (n0);  \
-if (__r0 < __m)\
+__r1 = __r0 * __ll_B | __ll_lowpart (n0);  \
+__r0 = __r1 - __q0 * __d0;  \
+if (__r0 > __r1) /* borrow when subtracting q0*d0 */\
   {
\
__q0--, __r0 += (d);\
-   if (__r0 >= (d))\
- if (__r0 < __m)   \
-   __q0--, __r0 += (d);\
+   if (__r0 > (d)) /* no carry when adding d */\
+  __q0--, __r0 += (d);  \
   }
\
-__r0 -= __m;   \
\
 (q) = __q1 * __ll_B | __q0;
\
 (r) = __r0;
\
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mpn_invert_limb

2016-07-15 Thread paul zimmermann
   Dear GMP developers,

would it be possible to add mpn_invert_limb in the public interface?
It would be very useful for GNU MPFR.

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: getrusage vs clock

2016-01-21 Thread paul zimmermann
>   since version 2.18 of the glibc, the clock() function is much more precise.
>   For example on my machine it has a resolution of 1 micro-seconds whereas
>   getrusage() has a resolution of 4ms only, as demonstrated by the program
>   below:
>   
> I believe we already use clock_gettime(2) directly on GNU/Linux, which
> is the underlying syscall that clock(3) uses in newer GNU libc versions.
> 
> Does that not happen in your tests?

I don't know how to check which function is used. With gmp-6.1.0 on my
workstation I get:

zimmerma@tomate:/tmp/gmp-6.1.0/tune$ ./speed
...
CPU cycle counter, supplemented by microsecond getrusage()
Gnuplot home page http://www.gnuplot.info/
Quickplot home page http://quickplot.sourceforge.net/

which lets me think that getrusage() is still used.

Paul

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


getrusage vs clock

2016-01-18 Thread paul zimmermann
   Dear GMP developers,

since version 2.18 of the glibc, the clock() function is much more precise.
For example on my machine it has a resolution of 1 micro-seconds whereas
getrusage() has a resolution of 4ms only, as demonstrated by the program
below:

zimmerma@tomate:~/try$ ./a.out 
GNU libc version: 2.21
GNU libc release: stable
loops=3512013 clock=1000662 getrusage=100 nb_clock=100 nb_getrusage=250
resolution clock = 1.00us
resolution getrusage = 4000.00us

I therefore suggest to replace getrusage() by clock() in speed/tune.

Paul

#include 
#include 
#include 
#include 
#include 

unsigned long
time_clock ()
{
  return clock ();
}

unsigned long
time_getrusage ()
{
  struct rusage rus;
  getrusage (0, );
  return rus.ru_utime.tv_sec * 100 + rus.ru_utime.tv_usec
+ rus.ru_stime.tv_sec * 100 + rus.ru_stime.tv_usec;
}

int
main()
{
  unsigned long t_clock, t_getrusage;
  unsigned long last_clock, last_getrusage;
  unsigned long nb_clock = 0, nb_getrusage = 0, loops = 0;

  printf("GNU libc version: %s\n", gnu_get_libc_version ());
  printf("GNU libc release: %s\n", gnu_get_libc_release ());
  last_clock = time_clock ();
  last_getrusage = time_getrusage();
  while (nb_clock < 100 && nb_getrusage < 100)
{
  t_clock = time_clock ();
  nb_clock += t_clock != last_clock;
  last_clock = t_clock;

  t_getrusage = time_getrusage ();
  nb_getrusage += t_getrusage != last_getrusage;
  last_getrusage = t_getrusage;

  loops ++;
}
  printf ("loops=%lu clock=%lu getrusage=%lu nb_clock=%lu nb_getrusage=%lu\n",
  loops, time_clock (), time_getrusage (), nb_clock, nb_getrusage);
  printf ("resolution clock = %.2fus\n", (double) time_clock () / nb_clock);
  printf ("resolution getrusage = %.2fus\n", (double) time_getrusage () / 
nb_getrusage);
}


___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


mpz_ndiv_qr wanted

2015-09-18 Thread paul zimmermann
   Dear GMP developers,

after having written ten times (if not hundred) a function that computes an
integer quotient and remainder with rounding to nearest, I believe this could
be a nice addition to GMP, and useful to other users too.

I propose:

 -- Function: void mpz_ndiv_qr (mpz_t Q, mpz_t R, const mpz_t N, const mpz_t D)

Divide N by D, forming a quotient Q and/or remainder R such that
N = Q*D + R. The quotient Q is rounded to nearest, i.e., either
|R| < D/2, or if |R| = D/2, then Q is even (can occur only for even D).

and of course the corresponding mpz_ndiv_q and mpz_ndiv_r functions.
(The tie-breaking rule for the case |R| = D/2 comes from IEEE 754.)

Best regards,
Paul



___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: speed of mpn_sqrtrem vs mpn_rootrem

2015-07-31 Thread paul zimmermann
   Ciao Marco,

 Date: Thu, 30 Jul 2015 17:21:40 +0200
 From: Marco Bodrato bodr...@mail.dm.unipi.it
 
 Ciao Paul,
 
 Il Mer, 22 Gennaio 2014 1:00 pm, Zimmermann Paul ha scritto:
  tarte% ./speed -s 1000,3000,1 -r mpn_sqrtrem mpn_rootrem.2
  overhead 0.2 secs, precision 1 units of 3.53e-10 secs,
  CPU freq 2833.00 MHz
mpn_sqrtrem mpn_rootrem.2
  1000  0.000153504   #0.9454
  3000  0.000816813   #0.9187
  1 0.004606970   #0.9623
 
 There are some more experiment that we can try, but the anomaly you
 spotted should be healed now:
 $ tune/speed -s 1000,3000,1 -r mpn_sqrt mpn_root.2 mpn_sqrtrem
 mpn_rootrem.2
 overhead 0.2 secs, precision 1 units of 2.86e-10 secs, CPU
 freq 3500.08 MHz
  mpn_sqrtmpn_root.2   mpn_sqrtrem mpn_rootrem.2
 1000 #0.515161.38661.46432.1827
 3000 #0.0002954371.31681.38551.8478
 1#0.0018503521.24481.31491.7012
 
 You are welcome to test and/or give further suggestions
 
 Best regards,
 m

good to see GMP is improving! I confirm the above on my machine with the latest
and greatest development version:

zimmerma@tarte:/tmp/gmp/tune$ ./speed -s 1000,3000,1 -r mpn_sqrt mpn_root.2 
mpn_sqrtrem mpn_rootrem.2
overhead 0.2 secs, precision 1 units of 3.53e-10 secs, CPU freq 
2833.00 MHz
 mpn_sqrtmpn_root.2   mpn_sqrtrem mpn_rootrem.2
1000 #0.0001021051.37361.44982.1121
3000 #0.0005781141.26801.36151.7848
1#0.0034251731.22281.28681.6352

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


3-prime FFT

2015-07-16 Thread paul zimmermann
on https://hal.archives-ouvertes.fr/hal-01022383, page 27, table 16, the
3-prime FFT implemented in Mathemagix is faster than GMP for 2^23 to 2^25 bits.

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: 3-prime FFT

2015-07-16 Thread paul zimmermann
   Hi Torbjörn,

 The new GMP code beats GMP too, in particular when SSFFT's
 coefficients approach the L1d cache size.

I'm looking forward to the new GMP code then...

Best regards,
Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_zero_p [Was: fast inversion]

2015-06-23 Thread paul zimmermann
   Dear Marco,

  I don't see mpn_zero_p in the API of the current stable version 6.0.0
  (according to gmplib.org). In which version will it be available?
 
 It will be available from the next release.

great! Looking forward for the next release...

Paul
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mini-gmp

2015-06-15 Thread paul zimmermann
   Dear Niels,

 From: ni...@lysator.liu.se (Niels Möller)
 Date: Mon, 15 Jun 2015 09:57:39 +0200
 
 Zimmermann Paul paul.zimmerm...@inria.fr writes:
 
  I tried to compile MPFR with mini-gmp. I needed to do a few changes:
 
 Replying to an old mail. How did this work out? How do you deal with gmp
 vs mini-gmp in public mpfr header files and the mpfr ABI?
 
 Regards,
 /Niels
 
 -- 
 Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
 Internet email is subject to wholesale government surveillance.

as far as I remember, it worked at the end. See the doc/mini-gmp file in the
MPFR svn repo (reproduced below). This was with GMP 6.0.0. I did not test it
with more recent GMP snapshot. Maybe some changes in mini-gmp.c and mini-gmp.h
are no longer necessary. Cf MPFR_USE_MINI_GMP for your last question.

Best regards,
Paul

How to compile GNU MPFR with mini-gmp
=

(this was tested with MPFR r9149 and GMP 6.0.0 on x86_64 GNU/Linux machine)

1) extract the GMP tarball in say /tmp/gmp-6.0.0
   go into /tmp/gmp-6.0.0/mini-gmp
   add the following line in mini-gmp.c (say at line 53):
   char gmp_version[] = 6.0.0;
   gcc -O2 -g -fPIC -c mini-gmp.c
   ar r libgmp.a mini-gmp.o

2) create a GMP install directory in say /tmp
   mkdir /tmp/include
   mkdir /tmp/lib
   cp libgmp.a /tmp/lib
   cp mini-gmp.h /tmp/include/gmp.h

3) do the following changes in /tmp/include/gmp.h:

$ diff gmp.h.orig gmp.h
29a30,35
 #define __GNU_MP_VERSION5
 #define __GNU_MP_VERSION_MINOR  1
 #define __GNU_MP_VERSION_PATCHLEVEL 3

 extern char gmp_version[];

34a41,46
 #endif

 /* random generation functions */
 #ifndef gmp_randstate_t
 typedef long int __gmp_randstate_struct;
 typedef __gmp_randstate_struct gmp_randstate_t[1];

4) extract the MPFR tarball in say /tmp/mpfr-3.1.2
   ./configure --with-gmp=/tmp --enable-mini-gmp

Note: to use this version of the MPFR library, you need to define
the MPFR_USE_MINI_GMP macro before including mpfr.h (alternatively,
you can modify mpfr.h to define this macro at the beginning).

___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel


Re: mpn_redc_n

2015-03-31 Thread paul zimmermann
   Niels,

good. Please keep me informed of your progress. This is very interesting for
GMP-ECM. I am ready to beta-test the interface and/or the code.

Paul

 From: ni...@lysator.liu.se (Niels Möller)
 Date: Tue, 31 Mar 2015 09:10:34 +0200
 
 t...@gmplib.org writes:
 
  Niels and Marco might have some comments; they were involved in the
  design around this,
 
 My guess is as follows: The convention for redc_1 and redc_2 was changed
 for mpn_sec_powm, to get the final conditional subtraction moved out of the
 redc functions. redc_n didn't get that treatment, because it isn't used
 by mpn_powm_sec.
 
 To make progress, I'd like to see
 
 1. Merge of my bdiv changes from a year or two ago, to align conventions
between bdiv and redc.
 
 2. Replace redc calls by bdiv calls (and write or rewrite any needed
assembly code), and sort out any performance regressions.
 
 3. Make the bdiv functions public.
 
 Regards,
 /Niels
 
 -- 
 Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
 Internet email is subject to wholesale government surveillance.
___
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel