Thanks for the further explanation, Niels!

> For an assembly loop, one can find out from properties of the
> processor what cycle counts are implied by these three limits. It's
> often possible (but tedious) to tweak scheduling to get an actual
> speed pretty close to the limit. And it aids optimization to
> understand which one is the performance bottleneck.

[snip]

> I would expect the speed of such a hard-coded function to be limited
> by multiplier throughput (O(N^2)); it should be possible to arrange
> the order you add up the N^2 terms so that your carry chain
> corresponds to the size of the product (O(N)).
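
For readers following along, here is a minimal sketch of the column-wise
ordering described in the quote above (often called product scanning). It is
not GMP code and not the function that was benchmarked: the name mul_columns
and the choice of 32-bit limbs with a 64-bit accumulator are assumptions made
for illustration. All N^2 limb products are still computed, but they are
summed one output column at a time, so only one carry is propagated per
result limb, i.e. 2N-1 carry steps in total.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only.
   Computes rp[0..2n-1] = ap[0..n-1] * bp[0..n-1], least significant limb first. */
void mul_columns(uint32_t *rp, const uint32_t *ap, const uint32_t *bp, size_t n)
{
    uint64_t acc_lo = 0; /* low 64 bits of the running column sum */
    uint64_t acc_hi = 0; /* overflow count above those 64 bits */

    assert(n > 0);

    for (size_t k = 0; k + 1 < 2 * n; k++) {
        /* All index pairs (i, j) with i + j == k and both below n. */
        size_t i_lo = (k < n) ? 0 : k - (n - 1);
        size_t i_hi = (k < n) ? k : n - 1;

        for (size_t i = i_lo; i <= i_hi; i++) {
            uint64_t p = (uint64_t)ap[i] * bp[k - i];
            acc_lo += p;
            if (acc_lo < p) /* the 64-bit addition wrapped around */
                acc_hi++;
        }

        /* Emit one result limb, then shift the accumulator down by one limb.
           This is the only carry step for column k. */
        rp[k] = (uint32_t)acc_lo;
        acc_lo = (acc_lo >> 32) | (acc_hi << 32);
        acc_hi = 0;
    }

    /* The full product fits in 2n limbs, so the remaining carry is one limb. */
    rp[2 * n - 1] = (uint32_t)acc_lo;
}
```

In a hard-coded fixed-size routine the two loops would be fully unrolled, which
leaves the multiplications as the dominant cost.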

Yeah, sorry, my benchmark was wrong; the hard-coded function is only ~20% faster asymptotically. Sorry for the noise.

Best,
Albin
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
