ni...@lysator.liu.se (Niels Möller) writes:

> Maybe we should have some macrology for that? Or do all relevant
> processors and compilers support efficient cmov these days? I'm
> sticking to masking expressions for now.
Let's not trust results from compiler-generated code for these things.
The mixture of inline asm and plain code is hard for compilers to deal
with, and very subtle things can make a huge cycle count difference.
For conclusive results, asm is needed, unfortunately. (That's not
always the case; Marco and I have played with AVX3/AVX512 lately using
both asm and C with intrinsics, and C behaved well there, but that was
for non-arithmetic loops.)

So what about cmov's performance? Intel fixed its latency for their
high-end cores with Broadwell, which is about 6 years ago. Their
low-power cores still have 2-cycle latency, though. AMD's cores have
always had 1-cycle latency and good throughput.

> Worries about side-channel leakage of cmov isn't so relevant for
> these particular functions, since the use of MPN_INCR_U is a data
> dependent loop anyway.

Granted.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel