I think you should delay writing through QP to avoid an adc to a memory location, and have just one plain write through QP per iteration.

The dec UN and the branch might run faster if placed adjacent to each other, as many CPUs fuse such a pair into a single micro-op.

Your cycle numbers should probably be multiplied by a factor of ("turbo" frequency) / (nominal frequency), as 7.x c/l seems faster than anything we have ever measured.

-- 
Torbjörn
Please encrypt, key id 0xC8601622