I think you should delay writing through QP to avoid adc to a memory
location, and have just one plain write through QP per iteration.

The dec UN and the branch might run faster if placed adjacent to each
other, as many CPUs fuse such a pair into a single micro-op.

Your cycle numbers should probably be multiplied by a factor

  ("turbo" frequency) / (nominal frequency)

as 7.x c/l seems faster than we ever measured.

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel