ni...@lysator.liu.se (Niels Möller) writes:

> If we have adox/adcx, use same strategy as suggested for
> addaddmul_1msb0, but subtract rather than add in the chain with long
> lived carry.

Here's a sketch of a loop that should work for both addaddmul_1msb0 and
addsubmul_1msb0:

L(top):
        mov     (ap, n, 8), %rdx
        mulx    %r8, alo, hi
        adox    ahi, alo
        mov     hi, ahi                 C 2-way unroll.
        adox    zero, ahi               C Clears O
        
        mov     (bp, n, 8), %rdx
        mulx    %r9, blo, hi
        adox    bhi, blo
        mov     hi, bhi
        adox    zero, bhi               C clears O

        adc     blo, alo                C Or sbb, for addsubmul_1msb0
        mov     alo, (rp, n, 8)
        inc     n
        jnz     L(top)

L(done):
        adc     bhi, ahi                C No carry out, thanks to msb0
        mov     ahi, %rax               C Return value

(BTW, did I get the operand order right for mulx? I'm confused by the
docs, which use the Intel conventions, where operand order generally
differs from the AT&T syntax used here.)
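
For reference, here is what the loop is meant to compute, written as
plain C. This is just a sketch (untested, names are mine); I'm assuming
64-bit limbs, a compiler with unsigned __int128, and the same argument
conventions as the existing addaddmul_1msb0:

#include <gmp.h>

/* {rp,n} = {ap,n} * u + {bp,n} * v, returning the high limb.
   Requires msb0 operands (u, v < 2^63) so that the 128-bit
   accumulator cannot overflow. */
mp_limb_t
ref_addaddmul_1msb0 (mp_ptr rp, mp_srcptr ap, mp_srcptr bp,
                     mp_size_t n, mp_limb_t u, mp_limb_t v)
{
  mp_limb_t cy = 0;
  mp_size_t i;

  for (i = 0; i < n; i++)
    {
      unsigned __int128 acc = (unsigned __int128) ap[i] * u
                            + (unsigned __int128) bp[i] * v + cy;
      rp[i] = (mp_limb_t) acc;               C low half to rp
      cy = (mp_limb_t) (acc >> 64);          C high half carries on
    }
  return cy;                                 C the %rax value above
}

The addsub variant would subtract the bp[i] * v product instead, with
borrows propagating analogously.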

Note that in this form, I think we could allow full-limb inputs (%r8,
%r9), except that we would then get a final carry and would need to
return a 65-bit value.
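
To spell out the bound: with msb0 operands, u, v < 2^63, each high limb
is at most 2^63 - 1, so ahi + bhi + carry <= 2^64 - 1 and the final adc
cannot carry out. With full limbs, each high limb can be as large as
2^64 - 2, so that sum can reach 2^65 - 3, hence the 65-bit value.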

For the addadd case, this could be simplified by adding ahi and bhi
together early (since there can be no overflow), eliminating a few of
the adox instructions. 

Now the question is whether it can beat mul_1 + addmul_1. I don't know.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.