Ciao, Il Dom, 25 Agosto 2019 2:28 am, Torbjörn Granlund ha scritto: > Now we have a nice set of x86_64 gcd_22. The code is not as well tuned > as the gcd_11 code, but it runs somewhat fast.
So if I suggest to reorder some instructions in the loop, you will not upset :-) If we can change cmovc-s then there are a couple of instructions that can be removed from the "unlikely" branch in gcd_22: *** /tmp/extdiff.1GZGxB/gmp.746b5528f6a5/mpn/x86_64/core2/gcd_22.asm 2019-08-25 10:32:39.000000000 +0200 --- /home/bodrato/gmp/mpn/x86_64/core2/gcd_22.asm 2019-08-26 23:23:30.321470451 +0200 *************** *** 94,108 **** bsf t0, cnt sub v0, u0 sbb v1, u1 - L(bck): cmovc t0, u0 C u = |u - v| cmovc t1, u1 C u = |u - v| cmovc s0, v0 C v = min(u,v) cmovc s1, v1 C v = min(u,v) shrd R8(cnt), u1, u0 shr R8(cnt), u1 mov v1, t1 --- 94,108 ---- bsf t0, cnt sub v0, u0 sbb v1, u1 cmovc t1, u1 C u = |u - v| cmovc s0, v0 C v = min(u,v) + L(bck): cmovc t0, u0 C u = |u - v| cmovc s1, v1 C v = min(u,v) shrd R8(cnt), u1, u0 shr R8(cnt), u1 mov v1, t1 *************** *** 118,131 **** C 1. If v1 - u1 = 0, then gcd is u = v. C 2. Else compute gcd_21({v1,v0}, |u1-v1|) mov v1, t0 sub u1, t0 je L(end) ! xor t1, t1 ! mov u0, s0 mov u1, s1 bsf t0, cnt mov u1, u0 xor u1, u1 sub v1, u0 jmp L(bck) --- 118,131 ---- C 1. If v1 - u1 = 0, then gcd is u = v. C 2. Else compute gcd_21({v1,v0}, |u1-v1|) mov v1, t0 sub u1, t0 je L(end) ! C xor t1, t1 ! C mov u0, s0 mov u1, s1 bsf t0, cnt mov u1, u0 xor u1, u1 sub v1, u0 jmp L(bck) The same applies to other x86_64 gcd_22... > I haven't explored the table based variant which gives 3 bits of > progress per iteration. It might make the new code obsolete for > machines with fast multiply. I'm not sure it's easy to handle the case with u+v that overflows... Ĝis, m -- http://bodrato.it/papers/ _______________________________________________ gmp-devel mailing list gmp-devel@gmplib.org https://gmplib.org/mailman/listinfo/gmp-devel