ni...@lysator.liu.se (Niels Möller) writes: Maybe it makes sense to check this in now, deleting the old div1 and div2? If it turns out to be a regression (is that easy to see from nightly tests?), we can consider putting it back under some #if DIV1_METHOD ..., or replace with some other method, q-by-table-lookup, or do it bitwise, a <-- a - 2^k b. The current div1 and div2 seem unlikely to be the right way to do it.
I don't know of any nightly test which would flag that. As a matter of fact, we don't have any performance regression tests. We should. It is not easy to do. I don't think we even measure it, except perhaps indirectly for determining some _THRESHOLD. We might want to teach tuneup to choose between small quotient division primitives., like we do for Jacobi. On shell: Before: ./tune/speed -c -p 100000 -s3-5 mpn_gcd overhead 5.92 cycles, precision 100000 units of 2.86e-10 secs, CPU freq 3500.09 MHz mpn_gcd 3 4328.73 4 6317.43 5 8320.82 After: mpn_gcd 3 3768.16 4 5509.65 5 7182.58 So for n = 3, cycles per input bit is reduced from 22.5 to 19.5 (on the broadwell machine, around 16). How does that compare to mpn_gcd_33, I'd imagine it's considerably faster? C AMD bd4 8.6 C AMD zn1 7.3 C AMD zn2 6.6 C Intel HWL 10.3 C Intel BWL 8.0 C Intel SKL 7.6 I only wrote one variant, and that variant uses shlx and shrx, which only the CPUs above support. The system 'shell' does not run it (but ashell does!). Ivy bridge, which 'shell' uses, is slightly slower than hwl (haswell). We can estinate that gcd_33 is around 1.75x faster than the new and improved mpn_gcd. I believe using / on these processors is not the right approach and that some table based division or my previously suggested division (which handles q <= 7 with some compares and masking) could shave off another 25%... -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel