Albin Ahlbäck <albin.ahlb...@gmail.com> writes: I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am having trouble understanding how the cycles per limb number was calculated.
It was not calculated. It was measured. But that asm code was written with understanding of the M1 pipeline. Here, 1.25 c/l was actually what was expected. As I understand it, the cycles per limb number represents the loop(s) in any routine. Looking at the main loop, it seems like it should scale at 10 cycles per loop (of which 2 cycles are lost due to latency from loading x4, I believe), for which it treats four limbs from `up' at a time. However, the given number is 1.25 which is half the size of my calculated 10 / 4. I cannot follow your reasoning here, but as it seems incorrect from measurements, I am not too worried. :-) The performance limit to 1.25 c/l (5 cycles per loop) comes from these lines: ADDSUB x8, x8, CY C W0 carry-in ADDSUBC x4, x4, x9 C W1 ADDSUBC x5, x5, x10 C W2 ADDSUBC x6, x6, x11 C W2 csinc CY, x7, x7, COND C W3 carry-out These 5 instructions form a circular dependency, and each instruction has a one cycle latency, so 5 cycles in total. There are other parts of the loop with also sets performance limits, e.g., the mul/umulh has a throughput of 2 per cycle, so even with the above circular dependency removed, the loop would need 4 cycles. I hope this reply helps. Modern CPUs have complex pipelines, with lots of buffering to cate for dependencies and very wide exeecution in the absence of dependencies. -- Torbjörn Please encrypt, key id 0xC8601622 _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel