Re: mpn_mul is embarrassingly slow

Marc Glisse Fri, 20 Apr 2018 02:36:58 -0700

On Fri, 20 Apr 2018, Vincent Lefevre wrote:

On 2018-04-20 04:14:15 +0200, Fredrik Johansson wrote:

For operands with 1-4 limbs, that is; on my machine, mpn_mul takes up to
twice as long as mpn_mul_basecase, and inline assembly for 1x1, 2x1 or 2x2
multiplication is even faster. The problem is that there are three function
calls (mpn_mul -> mpn_mul_n -> mpn_mul_basecase) + branches between the
user code and GMP's lightning fast assembly code.
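
For illustration, here is a minimal sketch of what inline 1x1 and 2x2 multiplication can look like in C rather than assembly. It is not GMP code: limb_t, dlimb_t, mul_1x1 and mul_2x2 are hypothetical names, and the sketch assumes a 64-bit target and a compiler that provides unsigned __int128 (GCC or Clang).

```c
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* rp[0..1] = a * b */
static inline void mul_1x1(limb_t *rp, limb_t a, limb_t b)
{
    dlimb_t p = (dlimb_t)a * b;
    rp[0] = (limb_t)p;
    rp[1] = (limb_t)(p >> 64);
}

/* rp[0..3] = {ap[0], ap[1]} * {bp[0], bp[1]}, least significant limb first.
   rp must not overlap ap or bp. */
static inline void mul_2x2(limb_t *rp, const limb_t *ap, const limb_t *bp)
{
    dlimb_t t;

    /* First row: ap * bp[0] */
    t = (dlimb_t)ap[0] * bp[0];
    rp[0] = (limb_t)t;
    t = (dlimb_t)ap[1] * bp[0] + (limb_t)(t >> 64);
    rp[1] = (limb_t)t;
    rp[2] = (limb_t)(t >> 64);

    /* Second row: add ap * bp[1], shifted up by one limb.
       Each sum is at most (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1,
       so it cannot overflow dlimb_t. */
    t = (dlimb_t)ap[0] * bp[1] + rp[1];
    rp[1] = (limb_t)t;
    t = (dlimb_t)ap[1] * bp[1] + rp[2] + (limb_t)(t >> 64);
    rp[2] = (limb_t)t;
    rp[3] = (limb_t)(t >> 64);
}
```

Because both functions are static inline and contain no branches on the operand size, the compiler can place them directly in the caller, which is the property the three-call chain above lacks.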


I was reminded of this old issue when seeing this new paper on arXiv:
https://arxiv.org/abs/1804.07236. Here, the author benchmarked a C++
implementation of bignum arithmetic against mpn_mul for small operand sizes
and came to the conclusion that the former approach performs better than
hand-optimized assembly (one wishes that compilers really were that clever
about bignum code by now!).

Some advanced GMP users (including myself) know about the issue and simply
avoid mpn_mul for performance-critical code with short operands. The most
convenient solution is to call mpn_mul_basecase directly instead of
mpn_mul. Unfortunately, mpn_mul_basecase is not public, so this is a bit
iffy to rely on. One feature request would be to simply make
mpn_mul_basecase / mpn_sqr_basecase public.

[...]

I'm wondering... With the current GMP code, does LTO help to avoid
such issues?

mpn_mul and mpn_mul_n are too large to be completely inlined (unless
that's the only place where they are used, which could happen in a
microtest, but doesn't seem realistic in an application). What could
happen is partial inlining of the first test of each. Maybe using LTO+PGO
(profile-guided optimization)? Still, I am not particularly optimistic.
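
To make the idea of partial inlining concrete, here is a hand-written analogue: a small inline wrapper performs only the first size test and handles the trivial case in the caller, while everything else goes to an out-of-line routine. This is a sketch under assumptions, not GMP code and not what a compiler would necessarily produce; mul_general and mul_small_first are hypothetical names, and unsigned __int128 and __attribute__((noinline)) are GCC/Clang extensions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for a 64-bit limb */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* Out-of-line general path: plain schoolbook multiplication, standing in
   for a large library routine. Writes an + bn limbs to rp, which must not
   overlap ap or bp. */
__attribute__((noinline))
void mul_general(limb_t *rp, const limb_t *ap, size_t an,
                 const limb_t *bp, size_t bn)
{
    for (size_t i = 0; i < an + bn; i++)
        rp[i] = 0;

    for (size_t j = 0; j < bn; j++) {
        limb_t carry = 0;
        for (size_t i = 0; i < an; i++) {
            /* product + accumulated limb + carry fits in 128 bits */
            dlimb_t t = (dlimb_t)ap[i] * bp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t)t;
            carry = (limb_t)(t >> 64);
        }
        rp[an + j] = carry;
    }
}

/* Inline wrapper: only the first test lives in the caller. */
static inline void mul_small_first(limb_t *rp, const limb_t *ap, size_t an,
                                   const limb_t *bp, size_t bn)
{
    if (an == 1 && bn == 1) {
        dlimb_t t = (dlimb_t)ap[0] * bp[0];
        rp[0] = (limb_t)t;
        rp[1] = (limb_t)(t >> 64);
        return;
    }
    mul_general(rp, ap, an, bp, bn);
}
```

Whether a compiler performs an equivalent split on its own, with or without LTO and PGO, depends on its heuristics; writing the wrapper by hand only shows the shape of the result.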


--
Marc Glisse
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
