ni...@lysator.liu.se (Niels Möller) writes:

> I think add_n_sub_n was originally motivated by improved locality
> (which could apply at different levels of the memory hierarchy).
> But maybe we could get close to twice the speed using newer
> instructions with multiple carry flags (I guess that's what
> powerpc64/mode64/p9/add_n_sub_n.asm is doing)? We could probably do
> something similar on x86_64 with adox and adcx (if corresponding
> subtract instructions are missing, on-the-fly negation should be
> fairly efficient).

Yes, p9/add_n_sub_n.asm takes advantage of the new addex instructions,
which use alternative carry bits.

The POWER processors have had, since POWER4, an unusual pipeline where
every instruction has a latency of 2t cycles for some integer t; plain
integer operations have a latency of 2.  (It is really as if the CPUs
ran at half the impressive advertised clock frequency with twice the
assumed dispatch width.)  Because of this, add_n and sub_n cannot be
made to run at less than 2 c/l, since the carry recurrence makes each
limb depend on the previous one through a 2-cycle add.  And
add_n_sub_n comes close, at 2.25 c/l on P9.

Unfortunately, x86 is not as easy.  The subtract-with-borrow
instructions clobber all flags, making the new adcx/adox pretty
pointless there.  And no, there are no non-clobbering negation (or
one's complement) instructions, so one needs to save and restore the
carry.  There will also be many register-to-register copy
instructions, because of x86's habit of overwriting source operands.
(Such copies are essentially free on AMD processors, as they only
manipulate the register rename map.)

arm64 is another example of an ISA with just one carry bit.  For Apple
M1, as you can see at https://gmplib.org/devel/asm, we have an
add_n_sub_n which runs at just 1.3 c/l.  (Almost all GMP loops,
including add_n and sub_n, run at exactly 1 c/l there, amazingly.)

--
Torbjörn
Please encrypt, key id 0xC8601622
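
For reference, a minimal C sketch of what an add_n_sub_n operation
computes: the sum and the difference of the same two n-limb operands
in a single pass, which is where the locality benefit mentioned above
comes from.  This is illustrative only; the real routines are
hand-written assembly, and the function name, signature, and
return-value packing below are simplifications rather than GMP's
internal interface.

#include <gmp.h>

/* Reference (slow) version: sp[] = ap[] + bp[] and dp[] = ap[] - bp[]
   over n limbs, touching each input limb once.  Returns the final
   carry in bit 1 and the final borrow in bit 0 (one possible
   convention, chosen here only for illustration).  */
static mp_limb_t
ref_add_n_sub_n (mp_ptr sp, mp_ptr dp, mp_srcptr ap, mp_srcptr bp,
                 mp_size_t n)
{
  mp_limb_t cy = 0, bw = 0;
  mp_size_t i;

  for (i = 0; i < n; i++)
    {
      mp_limb_t a = ap[i], b = bp[i];

      /* Sum limb, propagating the incoming carry.  */
      mp_limb_t s = a + b;
      mp_limb_t c1 = s < a;        /* carry out of a + b */
      mp_limb_t s2 = s + cy;
      mp_limb_t c2 = s2 < s;       /* carry out of adding old carry */
      sp[i] = s2;
      cy = c1 | c2;                /* at most one of c1, c2 is set */

      /* Difference limb, propagating the incoming borrow.  */
      mp_limb_t d = a - b;
      mp_limb_t b1 = a < b;        /* borrow out of a - b */
      mp_limb_t d2 = d - bw;
      mp_limb_t b2 = d < bw;       /* borrow out of subtracting old borrow */
      dp[i] = d2;
      bw = b1 | b2;                /* at most one of b1, b2 is set */
    }
  return (cy << 1) | bw;
}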
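
On the x86_64 point: the on-the-fly negation Niels mentions rests on
the two's complement identity a - b = a + ~b + 1, which turns a borrow
chain into an ordinary carry chain, so both halves of an add_n_sub_n
could in principle stay on add-with-carry style instructions
(adcx/adox).  A C sketch of just the subtract-as-add part
(illustrative only; a real x86_64 loop would keep the two carries in
CPU flags via adcx/adox and would run into the flag-handling and
operand-overwriting constraints described above):

#include <gmp.h>

/* dp[] = ap[] - bp[] over n limbs, computed entirely with additions:
   each limb is a + ~b + carry_in, where the initial carry_in of 1 is
   the "+1" of two's complement.  The running carry is the complement
   of the usual borrow, so the final borrow is 1 - cy.  */
static mp_limb_t
sub_n_via_add (mp_ptr dp, mp_srcptr ap, mp_srcptr bp, mp_size_t n)
{
  mp_limb_t cy = 1;                /* complemented borrow, starts at 1 */
  mp_size_t i;

  for (i = 0; i < n; i++)
    {
      mp_limb_t a = ap[i], nb = ~bp[i];
      mp_limb_t s = a + nb;
      mp_limb_t c1 = s < a;        /* carry out of a + ~b */
      mp_limb_t s2 = s + cy;
      mp_limb_t c2 = s2 < s;       /* carry out of adding old carry */
      dp[i] = s2;
      cy = c1 | c2;
    }
  return 1 - cy;                   /* final borrow */
}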