On Sun, 11 Feb 2024, Stefan Koch wrote:

There is a fair amount of recent activity in the C++ world to optimize
for speed by minimizing processor cache misses.  One example is the
"short string optimization" in the std::string class: for short
strings, the data is stored in the string object itself, without
dynamic memory allocation.  This has two advantages: it avoids a call
to the dynamic memory allocator, and the string data sits in the same
location as the metadata, so it needs only one cache fetch.  Both the
allocation and the extra cache fetch are potentially slow.

When I was looking at the implementation, the _mp_d member of
__mpz_struct is 8 bytes, which is also the size of a single limb.  (I
suspect this is common on recent systems, but it is probably not
universal.)  My thought was that if _mp_alloc is one, then only one
limb is needed.  In that case, the space of _mp_d could hold the limb
itself instead of a pointer to the limb.  That would eliminate dynamic
memory allocation for all integers below 2^64.

I was coming at this from a computer algebra system standpoint.  I
think (but I really don't know) that much of the arithmetic done in
such systems involves integers below 2^64, so those systems might be
much faster if GMP had a small integer optimization.

This approach adds a check to every access of the limb data, but it
saves a potentially uncached memory access for small integers.  It may
turn out that the small delay is hardly noticeable for large numbers,
while the optimization makes a large difference for small numbers.

A second thought is to perform the arithmetic with inline functions or
macros when the sources and the result fit in a single limb.  That
would eliminate the function call entirely for small numbers.  It would
make the code longer and add overhead when the integers are not small,
but it is not clear how much of an impact that has for larger numbers.
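As a sketch of that second idea, using only the public GMP interface
(small_opt_add is a hypothetical wrapper, not something in GMP, and the
helper calls here are themselves still function calls, so this only
illustrates the logic):

#include <gmp.h>

/* Hypothetical inline fast path: when both operands are non-negative
   single-limb values and the sum does not carry, write the result
   limb directly; otherwise fall back to the general mpz_add. */
static inline void
small_opt_add (mpz_ptr r, mpz_srcptr a, mpz_srcptr b)
{
  if (mpz_size (a) <= 1 && mpz_size (b) <= 1
      && mpz_sgn (a) >= 0 && mpz_sgn (b) >= 0)
    {
      mp_limb_t al = mpz_getlimbn (a, 0);  /* 0 if a is zero */
      mp_limb_t bl = mpz_getlimbn (b, 0);
      mp_limb_t s  = al + bl;
      if (s >= al)                         /* no carry out of one limb */
        {
          mp_limb_t *rp = mpz_limbs_write (r, 1);
          rp[0] = s;
          mpz_limbs_finish (r, s != 0);    /* size is 0 or 1 */
          return;
        }
    }
  mpz_add (r, a, b);                       /* general path */
}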

Anyway, I know you guys take performance very seriously, so I was
wondering whether you have experience with this:
1. Is there real potential for performance improvements in CAS systems
if GMP has a small integer optimization?
2. Have you looked at an approach like this in the past?  And if so,
what did you find?
3. If you think this is worth pursuing, do you have any advice?

Several software packages already wrap GMP using this kind of strategy: a "number" is either a 63-bit integer or a pointer to a GMP big integer (determined by one bit), and operations on 63-bit integers are inlined.
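For instance, something of this shape (names invented here; just a
sketch, assuming a 64-bit platform and GCC/Clang builtins):

#include <stdint.h>
#include <stdlib.h>
#include <gmp.h>

/* One possible encoding: an odd word is an inline 63-bit integer
   (value shifted left by one, low bit set); an even word is a pointer
   to a heap-allocated mpz_t. */
typedef uintptr_t number;

static inline int      num_is_small (number n) { return (int) (n & 1); }
static inline intptr_t num_small_val (number n)
{ return (intptr_t) n >> 1; }               /* assumes arithmetic shift */
static inline number   num_from_small (intptr_t v)
{ return ((uintptr_t) v << 1) | 1; }
static inline mpz_ptr  num_big (number n) { return (mpz_ptr) n; }

static inline number
num_add (number a, number b)
{
  if (num_is_small (a) && num_is_small (b))
    {
      intptr_t r;
      if (!__builtin_add_overflow (num_small_val (a),
                                   num_small_val (b), &r)
          && r >= -((intptr_t) 1 << 62) && r < ((intptr_t) 1 << 62))
        return num_from_small (r);          /* inline, no GMP call */
    }
  /* Otherwise promote to mpz_t and use mpz_add; omitted here. */
  abort ();
}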

What is efficient depends on your use case. If you want to use the same integer type everywhere, you should definitely optimize for the small case. If you use separate types for small vs big, and big integers are always big, then you should optimize for the big case.

There is also an intermediate size where numbers occupy just a few limbs; there, allocation can still dominate computation, and a small-string-like optimization can help significantly (see Boost.Multiprecision's cpp_int). Again, a one-size-fits-all approach seems optimistic.

--
Marc Glisse
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel