Torbjörn Granlund <t...@gmplib.org> writes:

> On 2nd thought, the bookkeeping inc will clear the O flag and leave the C
> flag alone.

And will cause an interesting failure if one can ever afford enough RAM
to use an input size larger than 2^63 limbs ;-)

> So, except for a typo in addressing your code loooks
> correct.

Attaching a version that actually passes some tests (I should commit the
unit tests, but not today). The loop is only 2-way, and there are three
spare registers:

L(top):
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c1
        adox    c0, l0
        adox    hi, c1
        adc     l0, l1
        mov     l1, (rp, n, 8)

        inc     n
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c0
        adox    c1, l0
        adox    hi, c0
        adc     l0, l1
        mov     l1, (rp, n, 8)
        inc     n

        jnz     L(top)

I've eliminated the zero adds by adding together the high half of the
products earlier (thanks to the "msb0" condition, there's no carry out
from the second adox in the pairs). I think the critical recurrency
involves the alternating carry limbs c0, c1, and the OF flag which is
live only between the adjacent adox instructions.

Benchmarking on my laptop (AMD ryzen), I get 

$ ./tune/speed -p 10000000 -C mpn_mul_1.0x9999999999999999 
mpn_addmul_1.9999999999999999 mpn_addaddmul_1msb0 -s 1-200 -f 1.3
overhead 4.34 cycles, precision 10000000 units of 7.16e-10 secs, CPU freq 
1397.19 MHz
        mpn_mul_1.0x9999999999999999 mpn_addmul_1.9999999999999999 
mpn_addaddmul_1msb0
1             #6.0772        6.5963        6.2874
2             #3.4713        3.9056        4.6860
3             #2.6382        2.6919        4.0909
4              2.4400       #2.2825        4.1684
5             #2.4537        2.5213        3.7127
6             #2.2549        2.5762        3.5821
7             #2.1174        2.3741        3.7313
9             #2.0589        2.1572        3.3585
11            #1.8486        2.1634        3.1464
14            #1.7917        1.9840        3.2297
18            #1.6255        1.9356        2.9995
23            #1.5938        1.8239        2.9059
29            #1.5380        1.7372        2.8521
37            #1.5505        1.7278        2.8299
48            #1.4754        1.7155        2.8755
62            #1.4706        1.7699        2.7680
80            #1.4662        1.7399        2.7284
104           #1.4628        1.7610        2.7784
135           #1.4634        1.8041        2.8026
175           #1.4553        1.8036        2.7376

so about 20% faster than mul_1 + addmul_1. Maybe instruction issue is a
bottleneck? 

Cycle numbers are getting bit fuzzy, so let's scale all numbers by some
fudge factor to get 2 c/l for addmul_1, we then have

               1.6           2.0           3.0

The 3 cycles/limb means 6 cycles for the two-way loop of 19
instructions, or 3.1 instructions per cycle. To reach the ideal of 2
cycles/limb (critical path limit, one iteration of this 2-way loop in 4
cycles), one would need to issue almost 5 instructions per cycle, and
that's not very realistic, right?

A few instructions can be shaved off by going to k-way for larger k, and
moving the invariant u0/v0 to %rdx, as you suggested. Is that the way to
go?

This structure doesn't work as is for addsubmul_1msb0, one would need to
either organize it differently, or negate on the fly (but adding 4 not
instructions into the loop sounds rather expensive, if speed is
already limited by the number of instruction).

Ah, and one final question: Where should mulx-code go? I tried putting
this in x86_64/mulx/adx/, but it wasn't picked up automagically by
configure, I had to link it in manually.
 
Regards,
/Niels

dnl  AMD64 mpn_addsubmul_1msb0, R = Au - Bv, u,v < 2^63.

dnl  Copyright 2021 Free Software Foundation, Inc.

dnl  This file is part of the GNU MP Library.
dnl
dnl  The GNU MP Library is free software; you can redistribute it and/or modify
dnl  it under the terms of either:
dnl
dnl    * the GNU Lesser General Public License as published by the Free
dnl      Software Foundation; either version 3 of the License, or (at your
dnl      option) any later version.
dnl
dnl  or
dnl
dnl    * the GNU General Public License as published by the Free Software
dnl      Foundation; either version 2 of the License, or (at your option) any
dnl      later version.
dnl
dnl  or both in parallel, as here.
dnl
dnl  The GNU MP Library is distributed in the hope that it will be useful, but
dnl  WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
dnl  or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
dnl  for more details.
dnl
dnl  You should have received copies of the GNU General Public License and the
dnl  GNU Lesser General Public License along with the GNU MP Library.  If not,
dnl  see https://www.gnu.org/licenses/.

include(`../config.m4')

C INPUT PARAMETERS
define(`rp',    `%rdi')
define(`ap',    `%rsi')
define(`bp_param', `%rdx')
define(`n',     `%rcx')
define(`u0',    `%r8')
define(`v0',    `%r9')

define(`bp', `%rbp')

define(`c0', `%rax')    C c0 and c1 alternate as carry limbs
define(`c1', `%rbx')    C c0 is also the return value.
define(`l0', `%r10')
define(`l1', `%r11')
define(`hi', `%r12')

undefine(`mulx', `adcx', `adox')

ASM_START()
        TEXT
        ALIGN(16)
PROLOGUE(mpn_addaddmul_1msb0)

        push    %rbx
        push    %rbp
        push    %r12

        lea     (ap,n,8), ap
        lea     (bp_param,n,8), bp
        lea     (rp,n,8), rp
        neg     n

        xor     c0, c0          C Also clears CF and OF
        test    $1, n
        jz      L(top)

        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c0
        adox    hi, c0          C No carry out
        adc     l0, l1
        mov     l1, (rp, n, 8)

        C This and below uses of inc instruction clears OF, except if
        C the increment is from 2^63-1 to 2^63. It seems fairly safe
        C to assume that limb size is < 2^63, and then overflow can't
        C happen.
        inc     n
        jz      L(end)

L(top):
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c1
        adox    c0, l0
        adox    hi, c1
        adc     l0, l1
        mov     l1, (rp, n, 8)

        inc     n
        mov     (ap, n, 8), %rdx
        mulx    u0, l0, hi
        mov     (bp, n, 8), %rdx
        mulx    v0, l1, c0
        adox    c1, l0
        adox    hi, c0
        adc     l0, l1
        mov     l1, (rp, n, 8)
        inc     n

        jnz     L(top)

L(end):
        adc     $0, c0

        pop     %r12
        pop     %rbp
        pop     %rbx
        ret
EPILOGUE()
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel

Reply via email to