Thanks for the further explanation, Niels!
> For an assembly loop, one can find out from properties of the
> processor what cycle counts are implied by these three limits. It's
> often possible (but tedious) to tweak scheduling to get an actual
> speed pretty close to the limit. And it aids optimization to
> understand which one is the performance bottleneck.
[snip]
> I would expect the speed of such a hard-coded function to be limited
> by multiplier throughput (O(N^2)); it should be possible to arrange
> the order you add up the N^2 terms so that your carry chain
> corresponds to the size of the product (O(N)).
Yes, sorry: my benchmark was wrong, so it is only ~20% faster
asymptotically. Sorry for the noise.
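
For reference, one possible reading of the ordering suggested above is
column-wise (product-scanning) accumulation: all partial products that
land in the same output limb are summed in a small fixed-width
accumulator, and a carry is handed to the next column only once per
column. This is a minimal sketch under assumptions, not code from GMP:
it uses 32-bit limbs for portability, and the name mul_columns is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a GMP function):
   r[0..2n-1] = a[0..n-1] * b[0..n-1], limbs least significant first.
   Assumes 32-bit limbs and n >= 1; r must not overlap a or b. */
void mul_columns(uint32_t *r, const uint32_t *a, const uint32_t *b, size_t n)
{
    uint64_t acc_lo = 0;  /* low 64 bits of the running column sum */
    uint32_t acc_hi = 0;  /* overflow count above those 64 bits */

    assert(n >= 1);

    /* Column k collects every product a[i] * b[j] with i + j == k. */
    for (size_t k = 0; k + 1 < 2 * n; k++) {
        size_t i_min = (k < n) ? 0 : k - n + 1;
        size_t i_max = (k < n) ? k : n - 1;

        for (size_t i = i_min; i <= i_max; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            acc_lo += p;
            acc_hi += (acc_lo < p);  /* carry out of the 64-bit add */
        }

        /* Emit one limb, then shift the accumulator down by one limb.
           This hand-off happens once per column, 2n - 1 times in total. */
        r[k] = (uint32_t)acc_lo;
        acc_lo = (acc_lo >> 32) | ((uint64_t)acc_hi << 32);
        acc_hi = 0;
    }

    /* What is left after the last column is the top limb. */
    r[2 * n - 1] = (uint32_t)acc_lo;
}
```

The N^2 multiplies are all still there; only the carry hand-off between
output limbs is reduced to one per column. The adds inside a column
still carry within the three-word accumulator, so how close this gets
to the multiplier-throughput limit depends on the processor.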
Best,
Albin