I didn't see your proposed patch on the March 21st email, we use different tricks which might be possible to combine.
You suggestion to avoiding miller_rabin by using trial division is where the core of the speedup for both of our code is coming from. We both use int instead of mpz_t to avoid the overhead of mpz_cdiv_ui which I believe is an important speedup but not the most important. (I'm don't think your ulong patch is needed but I can measure at a later point) My code doesn't use a sieve (and instead checks each number for divisors similar to the old code) because the average gap for small numbers is small 300000 / primepi(300000) = 11 Because the measured threshold for this was much larger (~80 million), it might actually make sense to use a sieve after some threshold e.g. checking ~1000 primes 11 times is more expensive than marking off multiple numbers for the smaller primes. If you think that's a 1.5-3x speedup for inputs 16-30bits is important I can try to dynamically generate a longer list of primes My hesitation with dynamically generation primes is that mpz_nextprime only takes ~ 4e-6* seconds for 30 bit inputs on my computer and it will take non-trivial work to sieve all the primes up to 1X,000. * tune/speed's output which I believe has the units seconds and can be read directly but that's an assumption. My suggestion is: * Commit my code; it's simple and get's a huge speedup (over both the existing and your code) for <16 bit inputs unfortunately this can't be tuned above the static list size ** 2 If you want to try with a dynamical list (e.g. it's important to get a 1.5-3x speedup on 16-30 bit inputs) * Commit my code with a new tunable THRESHOLD_TRAIL_DIV (I'd guess X,000,000) and possibly extending the list to 300 primes * I'll modify your code to dynamically generate primes up to X and we'll tune a 2nd THRESHOLD_TRAIL_DIV_SIEVE (I'd guess ~150,000,000) On Mon, Mar 23, 2020 at 2:38 PM Marco Bodrato <bodr...@mail.dm.unipi.it> wrote: > Ciao, > > Il 2020-03-22 10:00 Seth Troisi ha scritto: > > I measure the regression as ~5-15% for input < ~10,000 > And no regression for larger input? Good. > > > I have a proposed patch which uses int, and trial div up to some > > threshold this appears to be 5x faster. > > A couple of question arises. > 5x faster than what? Than the patch I proposed or than the old/current > code? > Faster than the old code and your code ==> time_nextprime_pre.data <== 10 2.153700000e-06 11 2.143721467e-06 12 2.132065283e-06 13 2.143785751e-06 14 2.132306615e-06 15 2.161254031e-06 16 2.234000000e-06 17 2.291227474e-06 ==> time_nextprime_seth.data <== 10 5.253385820e-07 11 4.980276023e-07 12 4.826503269e-07 13 4.860404419e-07 14 4.833486686e-07 15 5.412016226e-07 16 6.028669974e-07 17 6.123738235e-07 ==> time_nextprime_marco.data <== 10 1.626300000e-06 11 1.580771135e-06 12 1.623004920e-06 13 1.572217889e-06 14 1.647112356e-06 15 1.636840921e-06 16 1.742080000e-06 17 1.966933016e-06 > > Which one is the good idea? using int or avoiding the sieve? > See above > Using unsigned long to compute the modulus is obviously faster than many > calls to mpz_cdiv_ui. I tried a patch that surely is not as fast as > yours, but should speed-up all numbers that fits in a single unsigned > long. The speed-up seems small, I'm not sure it's worth. > > > The cut off point as I measure it is ~80,000,000 but requires > > increasing the primegap_small list to ~1100 entries > > > > I'm not sure what you think about a list of that length. > > Well, the cut off point is surely different on different architectures. > It should be tuned. > Moreover: a large list ... can't it be generated on the fly? If we want > a larger table, in my opinion it should be an improvement for all the > functions that need primes (_primorial, _bin, _fac...). I'm not sure I'm up to that task right now. see above for my suggestions > > > It might also be useful to commit tune_nextprime_small which > > dynamically selects a higher reps count for > > mpz_nextprime input and helps increase precision. > > Uhm... > j = 200000 / s->size; > > This means that > for size=10 you compute 20000 primes: all primes between 2^10 and 2^18. > for size=11: all primes between 2^11 and 2^18. > for size=12... and so on up to 2^18. > > This is not exactly representative of the speed of nextprime on operands > around 2^10, 2^11... and so on. > > This is a great point, I modified the code so It only changes the outer loop count. I included a screenshot showing how much this reduces variance over multiple runs. > > PS: a comment in the current code says > /* Larger threshold is faster but takes ~O(n/ln(n)) memory. > To be honest, the array sieve needs n/24 bytes, and the array > primegap_tmp uses primepi(n) ~= n/ln(n) bytes. And asymptotically the > sum of both is O(n). > You are correct, let's change the comment by which I mean you :) > Ĝis, > m > Again thanks for the great comments.
diff -r a7836288d2c0 tests/mpz/t-nextprime.c --- a/tests/mpz/t-nextprime.c Fri Mar 20 16:19:07 2020 +0100 +++ b/tests/mpz/t-nextprime.c Mon Mar 23 17:00:59 2020 -0700 @@ -151,7 +151,10 @@ mpz_nextprime (next_p, x); refmpz_nextprime (ref_next_p, x); if (mpz_cmp (next_p, ref_next_p) != 0) - abort (); + { + gmp_printf ("Ref mismatch %Zd => %Zd vs %Zd\n", x, ref_next_p, next_p); + abort (); + } } mpz_clear (bs); diff -r a7836288d2c0 tune/speed.h --- a/tune/speed.h Fri Mar 20 16:19:07 2020 +0100 +++ b/tune/speed.h Mon Mar 23 17:00:59 2020 -0700 @@ -3214,7 +3214,7 @@ /* Calculate nextprime(n) for random n of s->size bits (not limbs). */ #define SPEED_ROUTINE_MPZ_NEXTPRIME(function) \ { \ - unsigned i, j; \ + unsigned i, j, m; \ mpz_t wp, n; \ double t; \ \ @@ -3228,8 +3228,10 @@ mpz_setbit (n, s->size - 1); \ mpz_clrbit (n, s->size - 2); \ \ + /* Increase time_divisor significantly for small numbers */ \ + m = s->size <= 250 ? 500/s->size : 1; \ + i = m * s->reps; \ speed_starttime (); \ - i = s->reps; \ do \ { \ /* nextprime timing is variable, so average over many calls */ \ @@ -3248,7 +3250,7 @@ mpz_clear (wp); \ mpz_clear (n); \ \ - s->time_divisor = SPEED_BLOCK_SIZE; \ + s->time_divisor = m * SPEED_BLOCK_SIZE; \ return t; \ }
_______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel