Torbjörn Granlund <t...@gmplib.org> writes: > On 2nd thought, the bookkeeping inc will clear the O flag and leave the C > flag alone.
And will cause an interesting failure if one can ever afford enough RAM to use an input size larger than 2^63 limbs ;-) > So, except for a typo in addressing your code loooks > correct. Attaching a version that actually passes some tests (I should commit the unit tests, but not today). The loop is only 2-way, and there are three spare registers: L(top): mov (ap, n, 8), %rdx mulx u0, l0, hi mov (bp, n, 8), %rdx mulx v0, l1, c1 adox c0, l0 adox hi, c1 adc l0, l1 mov l1, (rp, n, 8) inc n mov (ap, n, 8), %rdx mulx u0, l0, hi mov (bp, n, 8), %rdx mulx v0, l1, c0 adox c1, l0 adox hi, c0 adc l0, l1 mov l1, (rp, n, 8) inc n jnz L(top) I've eliminated the zero adds by adding together the high half of the products earlier (thanks to the "msb0" condition, there's no carry out from the second adox in the pairs). I think the critical recurrency involves the alternating carry limbs c0, c1, and the OF flag which is live only between the adjacent adox instructions. Benchmarking on my laptop (AMD ryzen), I get $ ./tune/speed -p 10000000 -C mpn_mul_1.0x9999999999999999 mpn_addmul_1.9999999999999999 mpn_addaddmul_1msb0 -s 1-200 -f 1.3 overhead 4.34 cycles, precision 10000000 units of 7.16e-10 secs, CPU freq 1397.19 MHz mpn_mul_1.0x9999999999999999 mpn_addmul_1.9999999999999999 mpn_addaddmul_1msb0 1 #6.0772 6.5963 6.2874 2 #3.4713 3.9056 4.6860 3 #2.6382 2.6919 4.0909 4 2.4400 #2.2825 4.1684 5 #2.4537 2.5213 3.7127 6 #2.2549 2.5762 3.5821 7 #2.1174 2.3741 3.7313 9 #2.0589 2.1572 3.3585 11 #1.8486 2.1634 3.1464 14 #1.7917 1.9840 3.2297 18 #1.6255 1.9356 2.9995 23 #1.5938 1.8239 2.9059 29 #1.5380 1.7372 2.8521 37 #1.5505 1.7278 2.8299 48 #1.4754 1.7155 2.8755 62 #1.4706 1.7699 2.7680 80 #1.4662 1.7399 2.7284 104 #1.4628 1.7610 2.7784 135 #1.4634 1.8041 2.8026 175 #1.4553 1.8036 2.7376 so about 20% faster than mul_1 + addmul_1. Maybe instruction issue is a bottleneck? Cycle numbers are getting bit fuzzy, so let's scale all numbers by some fudge factor to get 2 c/l for addmul_1, we then have 1.6 2.0 3.0 The 3 cycles/limb means 6 cycles for the two-way loop of 19 instructions, or 3.1 instructions per cycle. To reach the ideal of 2 cycles/limb (critical path limit, one iteration of this 2-way loop in 4 cycles), one would need to issue almost 5 instructions per cycle, and that's not very realistic, right? A few instructions can be shaved off by going to k-way for larger k, and moving the invariant u0/v0 to %rdx, as you suggested. Is that the way to go? This structure doesn't work as is for addsubmul_1msb0, one would need to either organize it differently, or negate on the fly (but adding 4 not instructions into the loop sounds rather expensive, if speed is already limited by the number of instruction). Ah, and one final question: Where should mulx-code go? I tried putting this in x86_64/mulx/adx/, but it wasn't picked up automagically by configure, I had to link it in manually. Regards, /Niels
dnl AMD64 mpn_addsubmul_1msb0, R = Au - Bv, u,v < 2^63. dnl Copyright 2021 Free Software Foundation, Inc. dnl This file is part of the GNU MP Library. dnl dnl The GNU MP Library is free software; you can redistribute it and/or modify dnl it under the terms of either: dnl dnl * the GNU Lesser General Public License as published by the Free dnl Software Foundation; either version 3 of the License, or (at your dnl option) any later version. dnl dnl or dnl dnl * the GNU General Public License as published by the Free Software dnl Foundation; either version 2 of the License, or (at your option) any dnl later version. dnl dnl or both in parallel, as here. dnl dnl The GNU MP Library is distributed in the hope that it will be useful, but dnl WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY dnl or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License dnl for more details. dnl dnl You should have received copies of the GNU General Public License and the dnl GNU Lesser General Public License along with the GNU MP Library. If not, dnl see https://www.gnu.org/licenses/. include(`../config.m4') C INPUT PARAMETERS define(`rp', `%rdi') define(`ap', `%rsi') define(`bp_param', `%rdx') define(`n', `%rcx') define(`u0', `%r8') define(`v0', `%r9') define(`bp', `%rbp') define(`c0', `%rax') C c0 and c1 alternate as carry limbs define(`c1', `%rbx') C c0 is also the return value. define(`l0', `%r10') define(`l1', `%r11') define(`hi', `%r12') undefine(`mulx', `adcx', `adox') ASM_START() TEXT ALIGN(16) PROLOGUE(mpn_addaddmul_1msb0) push %rbx push %rbp push %r12 lea (ap,n,8), ap lea (bp_param,n,8), bp lea (rp,n,8), rp neg n xor c0, c0 C Also clears CF and OF test $1, n jz L(top) mov (ap, n, 8), %rdx mulx u0, l0, hi mov (bp, n, 8), %rdx mulx v0, l1, c0 adox hi, c0 C No carry out adc l0, l1 mov l1, (rp, n, 8) C This and below uses of inc instruction clears OF, except if C the increment is from 2^63-1 to 2^63. It seems fairly safe C to assume that limb size is < 2^63, and then overflow can't C happen. inc n jz L(end) L(top): mov (ap, n, 8), %rdx mulx u0, l0, hi mov (bp, n, 8), %rdx mulx v0, l1, c1 adox c0, l0 adox hi, c1 adc l0, l1 mov l1, (rp, n, 8) inc n mov (ap, n, 8), %rdx mulx u0, l0, hi mov (bp, n, 8), %rdx mulx v0, l1, c0 adox c1, l0 adox hi, c0 adc l0, l1 mov l1, (rp, n, 8) inc n jnz L(top) L(end): adc $0, c0 pop %r12 pop %rbp pop %rbx ret EPILOGUE()
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.
_______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel