I didn't see your proposed patch on the March 21st email, we use different
tricks which might be possible to combine.

You suggestion to avoiding miller_rabin by using trial division is where
the core of the speedup for both of our code is coming from.
We both use int instead of mpz_t to avoid the overhead of mpz_cdiv_ui which
I believe is an important speedup but not the most important.
(I'm don't think your ulong patch is needed but I can measure at a later
point)

My code doesn't use a sieve (and instead checks each number for divisors
similar to the old code) because the average gap for small numbers is small
300000 / primepi(300000) = 11
Because the measured threshold for this was much larger (~80 million), it
might actually make sense to use a sieve after some threshold
e.g. checking ~1000 primes 11 times is more expensive than marking off
multiple numbers for the smaller primes.

If you think that's a 1.5-3x speedup for inputs 16-30bits is important I
can try to dynamically generate a longer list of primes
My hesitation with dynamically generation primes is that mpz_nextprime only
takes ~ 4e-6*  seconds for 30 bit inputs on my computer
and it will take non-trivial work to sieve all the primes up to 1X,000.

* tune/speed's output which I believe has the units seconds and can be read
directly but that's an assumption.

My suggestion is:
* Commit my code; it's simple and get's a huge speedup (over both the
existing and your code) for <16 bit inputs
unfortunately this can't be tuned above the static list size ** 2

If you want to try with a dynamical list (e.g. it's important to get a
1.5-3x speedup on 16-30 bit inputs)
* Commit my code with a new tunable THRESHOLD_TRAIL_DIV (I'd guess
X,000,000) and possibly extending the list to 300 primes
* I'll modify your code to dynamically generate primes up to X and we'll
tune a 2nd THRESHOLD_TRAIL_DIV_SIEVE (I'd guess ~150,000,000)


On Mon, Mar 23, 2020 at 2:38 PM Marco Bodrato <bodr...@mail.dm.unipi.it>
wrote:

> Ciao,
>
> Il 2020-03-22 10:00 Seth Troisi ha scritto:
> > I measure the regression as ~5-15% for input < ~10,000


> And no regression for larger input? Good.
>

> > I have a proposed patch which uses int, and trial div up to some
> > threshold this appears to be 5x faster.
>
> A couple of question arises.
> 5x faster than what? Than the patch I proposed or than the old/current
> code?
>
Faster than the old code and your code
==> time_nextprime_pre.data <==
10       2.153700000e-06
11       2.143721467e-06
12       2.132065283e-06
13       2.143785751e-06
14       2.132306615e-06
15       2.161254031e-06
16       2.234000000e-06
17       2.291227474e-06

==> time_nextprime_seth.data <==
10       5.253385820e-07
11       4.980276023e-07
12       4.826503269e-07
13       4.860404419e-07
14       4.833486686e-07
15       5.412016226e-07
16       6.028669974e-07
17       6.123738235e-07

==> time_nextprime_marco.data <==
10       1.626300000e-06
11       1.580771135e-06
12       1.623004920e-06
13       1.572217889e-06
14       1.647112356e-06
15       1.636840921e-06
16       1.742080000e-06
17       1.966933016e-06

>
> Which one is the good idea? using int or avoiding the sieve?
>
See above

> Using unsigned long to compute the modulus is obviously faster than many
> calls to mpz_cdiv_ui. I tried a patch that surely is not as fast as
> yours, but should speed-up all numbers that fits in a single unsigned
> long. The speed-up seems small, I'm not sure it's worth.
>
> > The cut off point as I measure it is ~80,000,000 but requires
> > increasing the primegap_small list to ~1100 entries
> >
> > I'm not sure what you think about a list of that length.
>
> Well, the cut off point is surely different on different architectures.
> It should be tuned.
> Moreover: a large list ... can't it be generated on the fly? If we want
> a larger table, in my opinion it should be an improvement for all the
> functions that need primes (_primorial, _bin, _fac...).

I'm not sure I'm up to that task right now. see above for my suggestions

>


> > It might also be useful to commit tune_nextprime_small which
> > dynamically selects a higher reps count for
> > mpz_nextprime input and helps increase precision.
>
> Uhm...
> j = 200000 / s->size;
>
> This means that
> for size=10 you compute 20000 primes: all primes between 2^10 and 2^18.
> for size=11: all primes between 2^11 and 2^18.
> for size=12... and so on up to 2^18.
>
> This is not exactly representative of the speed of nextprime on operands
> around 2^10, 2^11... and so on.
>
> This is a great point, I modified the code so It only changes the outer
loop count.
I included a screenshot showing how much this reduces variance over
multiple runs.

>
> PS: a comment in the current code says
>   /* Larger threshold is faster but takes ~O(n/ln(n)) memory.
> To be honest, the array sieve needs n/24 bytes, and the array
> primegap_tmp uses primepi(n) ~= n/ln(n) bytes. And asymptotically the
> sum of both is O(n).
>
You are correct, let's change the comment by which I mean you :)


> Ĝis,
> m
>
Again thanks for the great comments.
diff -r a7836288d2c0 tests/mpz/t-nextprime.c
--- a/tests/mpz/t-nextprime.c	Fri Mar 20 16:19:07 2020 +0100
+++ b/tests/mpz/t-nextprime.c	Mon Mar 23 17:00:59 2020 -0700
@@ -151,7 +151,10 @@
       mpz_nextprime (next_p, x);
       refmpz_nextprime (ref_next_p, x);
       if (mpz_cmp (next_p, ref_next_p) != 0)
-	abort ();
+        {
+          gmp_printf ("Ref mismatch %Zd => %Zd vs %Zd\n", x, ref_next_p, next_p);
+	  abort ();
+        }
     }
 
   mpz_clear (bs);
diff -r a7836288d2c0 tune/speed.h
--- a/tune/speed.h	Fri Mar 20 16:19:07 2020 +0100
+++ b/tune/speed.h	Mon Mar 23 17:00:59 2020 -0700
@@ -3214,7 +3214,7 @@
 /* Calculate nextprime(n) for random n of s->size bits (not limbs). */
 #define SPEED_ROUTINE_MPZ_NEXTPRIME(function)				\
   {									\
-    unsigned  i, j;							\
+    unsigned  i, j, m;							\
     mpz_t     wp, n;							\
     double    t;							\
 									\
@@ -3228,8 +3228,10 @@
     mpz_setbit (n, s->size - 1);					\
     mpz_clrbit (n, s->size - 2);					\
 									\
+    /* Increase time_divisor significantly for small numbers */		\
+    m = s->size <= 250 ? 500/s->size : 1;				\
+    i = m * s->reps;							\
     speed_starttime ();							\
-    i = s->reps;							\
     do									\
       {									\
         /* nextprime timing is variable, so average over many calls */	\
@@ -3248,7 +3250,7 @@
     mpz_clear (wp);							\
     mpz_clear (n);							\
 									\
-    s->time_divisor = SPEED_BLOCK_SIZE;					\
+    s->time_divisor = m * SPEED_BLOCK_SIZE;				\
     return t;								\
   }
 
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to