Albin Ahlbäck <albin.ahlb...@gmail.com> writes:

  I am looking at Torbjörn's `aorsmul_1.asm' for Apple M1, and I am
  having trouble understanding how the cycles per limb number was
  calculated.

It was not calculated.  It was measured.

But that asm code was written with understanding of the M1 pipeline.
Here, 1.25 c/l was actually what was expected.

  As I understand it, the cycles per limb number represents the loop(s)
  in any routine. Looking at the main loop, it seems like it should
  scale at 10 cycles per loop (of which 2 cycles are lost due to latency
  from loading x4, I believe), for which it treats four limbs from `up'
  at a time. However, the given number is 1.25 which is half the size of
  my calculated 10 / 4.

I cannot follow your reasoning here, but as it seems incorrect from
measurements, I am not too worried.  :-)

The performance limit to 1.25 c/l (5 cycles per loop) comes from these
lines:

        ADDSUB  x8, x8, CY              C W0    carry-in
        ADDSUBC x4, x4, x9              C W1
        ADDSUBC x5, x5, x10             C W2
        ADDSUBC x6, x6, x11             C W2
        csinc   CY, x7, x7, COND        C W3    carry-out

These 5 instructions form a circular dependency, and each instruction
has a one cycle latency, so 5 cycles in total.

There are other parts of the loop with also sets performance limits,
e.g., the mul/umulh has a throughput of 2 per cycle, so even with the
above circular dependency removed, the loop would need 4 cycles.

I hope this reply helps.  Modern CPUs have complex pipelines, with lots
of buffering to cate for dependencies and very wide exeecution in the
absence of dependencies.


-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to