ni...@lysator.liu.se (Niels Möller) writes:

  Here's a sketch of a loop that should work for both addaddmul_1msb0 and
  addsubmul_1msb0:

  L(top):
        mov     (ap, n, 8), %rdx
        mulx    %r8, alo, hi
        adox    ahi, alo
        mov     hi, ahi                 C 2-way unroll.
        adox    zero, ahi               C Clears O
        
        mov     (bp, n, 8), %rdx
        mulx    %r9, blo, hi
        adox    bhi, blo
        mov     hi, bhi
        adox    zero, bhi               C clears O

        adc     blo, alo                C Or sbb, for addsubmul_1msb0
        mov     alo, (rp, n, 8)
        inc     n
        jnz     L(top)

  L(done):
        adc     bhi, ahi                C No carry out, thanks to msb0
        mov     ahi, %rax               C Return value

Neat!
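
For reference, here is a minimal C model of what the loop above is meant to
compute, assuming addaddmul_1msb0 means {rp,n} = {ap,n}*u0 + {bp,n}*v0 with
u0 in r8 and v0 in r9, both below 2^63. The names limb_t and
ref_addaddmul_1msb0 are made up for illustration, and unsigned __int128 is a
GCC/Clang extension. It is only a sketch for checking results, not the
assembly's instruction schedule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for mp_limb_t */

/* {rp,n} = {ap,n}*u0 + {bp,n}*v0, returning the high limb.
   Assumes u0, v0 < 2^63 (the "msb0" condition). */
static limb_t ref_addaddmul_1msb0(limb_t *rp, const limb_t *ap,
                                  const limb_t *bp, size_t n,
                                  limb_t u0, limb_t v0)
{
    assert((u0 >> 63) == 0 && (v0 >> 63) == 0);
    limb_t ahi = 0, bhi = 0;  /* pending high halves, like ahi/bhi in the loop */
    unsigned carry = 0;       /* models the C flag of the adc chain */
    for (size_t i = 0; i < n; i++) {
        /* With msb0 multipliers, product + pending high half fits 128 bits. */
        unsigned __int128 pa = (unsigned __int128)ap[i] * u0 + ahi;
        unsigned __int128 pb = (unsigned __int128)bp[i] * v0 + bhi;
        limb_t alo = (limb_t)pa;
        limb_t blo = (limb_t)pb;
        ahi = (limb_t)(pa >> 64);
        bhi = (limb_t)(pb >> 64);
        unsigned __int128 s = (unsigned __int128)alo + blo + carry;
        rp[i] = (limb_t)s;
        carry = (unsigned)(s >> 64);
    }
    /* ahi and bhi are each at most 2^63-1, so this sum cannot wrap. */
    return ahi + bhi + carry;
}
```

The final addition is where msb0 matters: each pending high half stays below
2^63, so ahi + bhi + carry fits in one limb.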

Some unrolling would save several instructions:

Keep r8 (aka u0) in rdx across k ways of unrolling, and supply the ap[...]
limbs to mulx directly as memory operands.  Then accumulate the products
with an adox chain, which reduces the need for the "adox zero" instructions.
Same for r9/v0.
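
One way to read this, modelled in portable C for a single operand and a fixed
4-way unroll: do all the multiplies against the same multiplier first, then
add each low half to the previous high half in one carry chain, so a carry is
folded into a high limb only once per block. The names limb_t and mul_block4
are hypothetical, unsigned __int128 is a GCC/Clang extension, and the real
code would of course use mulx and adox rather than 128-bit arithmetic.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for mp_limb_t */

/* One 4-way unrolled block: {rp,4} = {ap,4}*u0 + cin, returning the
   high limb to pass as cin to the next block. */
static limb_t mul_block4(limb_t *rp, const limb_t *ap, limb_t u0, limb_t cin)
{
    limb_t lo[4], hi[4];

    /* Four multiplies by the same multiplier; in the assembly u0 would
       stay in rdx and ap[i] would be a memory operand of mulx. */
    for (int i = 0; i < 4; i++) {
        unsigned __int128 p = (unsigned __int128)ap[i] * u0;
        lo[i] = (limb_t)p;
        hi[i] = (limb_t)(p >> 64);
    }

    /* A single carry chain: lo[i] + hi[i-1] + carry. */
    unsigned carry = 0;
    limb_t prev = cin;
    for (int i = 0; i < 4; i++) {
        unsigned __int128 s = (unsigned __int128)lo[i] + prev + carry;
        rp[i] = (limb_t)s;
        carry = (unsigned)(s >> 64);
        prev = hi[i];
    }

    /* The only place a carry is folded into a high limb. A high half of a
       64x64 product is at most 2^64-2, so adding the carry cannot wrap. */
    return prev + carry;
}
```

In the two-operand loop the same pattern would be applied to the bp/v0
products as well, with the two chains kept on separate carry flags.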

  (BTW, do I get the operand order right for mulx? I'm confused by the docs,
  which use the generally different Intel conventions).

Your use looks right.

  Now, question is if it can beat mul_1 + addmul_1. I don't know.

It surely has potential.
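
For comparison, the two-pass baseline that the fused loop would have to beat
can be sketched in C as below. ref_mul_1 and ref_addmul_1 are hypothetical
portable models of what mpn_mul_1 and mpn_addmul_1 compute, not GMP's actual
code, and unsigned __int128 is a GCC/Clang extension. The sketch shows only
the semantics and says nothing about speed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical stand-in for mp_limb_t */

/* {rp,n} = {ap,n} * u0, returning the high limb. */
static limb_t ref_mul_1(limb_t *rp, const limb_t *ap, size_t n, limb_t u0)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)ap[i] * u0 + cy;
        rp[i] = (limb_t)p;
        cy = (limb_t)(p >> 64);
    }
    return cy;
}

/* {rp,n} += {bp,n} * v0, returning the high limb. */
static limb_t ref_addmul_1(limb_t *rp, const limb_t *bp, size_t n, limb_t v0)
{
    limb_t cy = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)bp[i] * v0 + rp[i] + cy;
        rp[i] = (limb_t)p;
        cy = (limb_t)(p >> 64);
    }
    return cy;
}

/* Two passes over rp: first ap*u0, then add bp*v0.
   Assumes u0, v0 < 2^63 so the two carry limbs sum without wrapping,
   and that rp does not overlap bp (rp is written before bp is read). */
static limb_t two_pass_addaddmul(limb_t *rp, const limb_t *ap,
                                 const limb_t *bp, size_t n,
                                 limb_t u0, limb_t v0)
{
    limb_t cy = ref_mul_1(rp, ap, n, u0);
    cy += ref_addmul_1(rp, bp, n, v0);
    return cy;
}
```

The fused loop reads ap and bp and writes rp once each, while this version
writes rp in the first pass and then reads and rewrites it in the second.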

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
