Thanks for the fast and helpful reply!

I see, I definitely need to read up on CPU pipelines. I also tested one of your automated scripts for measuring cycles per limb for a variety of functions, and it checks out.

Anyway, with regard to the performance of multiplication: I did manage to write some half-hardcoded routines that outperform mpn_mul_basecase quite a bit on Apple M1 (only tested on the Mac Mini on cfarm). They are basically of the form

        mpn_mul_N(mp_ptr, mp_srcptr, mp_size_t, mp_srcptr)

for N in 1, 2, ..., 15. I recall that this translated very well into Toom-Cook territory: with Toom22 built on these underlying routines, the cutoff against GMP's Toom33 is at ~480 limbs, which seems pretty impressive(?). For instance, with N = 8 it is asymptotically 80% faster than mpn_mul_basecase on M1. They do, however, take up a lot of code, as each case has to be hand-coded, so I suppose they would not fit into GMP.
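
As a rough illustration of the interface only, and not the tuned code discussed above, a portable C sketch for N = 2 might look like the following. The name sketch_mul_2, the limb typedefs, and the reading of the arguments as (rp, ap, n, bp) with bp holding exactly 2 limbs are all assumptions made for this example; it also relies on the unsigned __int128 extension of GCC and Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for GMP's limb types, so the sketch is self-contained. */
typedef uint64_t limb_t;
typedef unsigned __int128 dlimb_t; /* GCC/Clang extension */

/* Assumed semantics: {rp, n+2} = {ap, n} * {bp, 2}.
   rp must not overlap ap or bp. */
void sketch_mul_2(limb_t *rp, const limb_t *ap, size_t n, const limb_t *bp)
{
    limb_t b0 = bp[0], b1 = bp[1];
    limb_t c0 = 0, c1 = 0; /* two-limb carry window */

    for (size_t i = 0; i < n; i++) {
        /* Low column: ap[i]*b0 + c0 cannot overflow 128 bits. */
        dlimb_t p0 = (dlimb_t)ap[i] * b0 + c0;
        /* High column: ap[i]*b1 + c1 + high half of p0 also fits. */
        dlimb_t p1 = (dlimb_t)ap[i] * b1 + c1 + (limb_t)(p0 >> 64);

        rp[i] = (limb_t)p0;       /* finished result limb */
        c0 = (limb_t)p1;          /* carry into the next position */
        c1 = (limb_t)(p1 >> 64);  /* carry into the position after that */
    }
    rp[n] = c0;
    rp[n + 1] = c1;
}
```

With the N limbs of the second operand held in local variables, each limb of the first operand is loaded once and used for N multiplications; that is the general idea behind fixing N, though a real implementation would be tuned per N and per CPU.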

Anyway, thanks for your reply!

Best,
Albin
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
