ni...@lysator.liu.se (Niels Möller) writes:

> I doubt sign multiply would be that useful. An instruction to do
> signed 64b x unsigned 64b --> signed 128b would be nice, but I don't
> know of any arch having that.
Unsigned and signed widening multiplication can each be done in terms
of the other. There is code in longlong.h for doing that:

  /* If we lack umul_ppmm but have smul_ppmm, define umul_ppmm in terms
     of smul_ppmm.  */
  #if !defined (umul_ppmm) && defined (smul_ppmm)
  #define umul_ppmm(w1, w0, u, v)                                 \
    do {                                                          \
      UWtype __w1;                                                \
      UWtype __xm0 = (u), __xm1 = (v);                            \
      smul_ppmm (__w1, w0, __xm0, __xm1);                         \
      (w1) = __w1 + (-(__xm0 >> (W_TYPE_SIZE - 1)) & __xm1)       \
	   + (-(__xm1 >> (W_TYPE_SIZE - 1)) & __xm0);             \
    } while (0)
  #endif

I think signed x unsigned is about half that hard.  :-)

> Just suggesting that we may get slightly less latency with
>
>   a = up[0] - m*vp[0];
>   count_trailing_zeros (cnt, a);
>   cy = mpn_addmul_1 (up, vp, n, m);
>
> Maybe not so useful on its own, but with assembly there are further
> opportunities, like doing rshift on-the-fly.

> This addmul_1-only, positive-only table is a strange gcd animal. It
> will never yield U = 0. Individual limbs are not unlikely to become
> zero.

Ooops. Then one definitely needs a different stop condition than
comparing something to zero. I tried the stop condition U = 0 first,
and it seemed to run really slowly.  :-)

> If you're willing to have both count_leading_zeros and
> count_trailing_zeros in the loop, maybe it's simpler and more
> efficient with a table-based binary Euclid?
>
>   CND_SWAP
>   count_leading_zeros (cnt, up[n-1] | vp[n-1]);
>   q = tab[...high bits...];
>   if (q odd)
>     u <-- u - q v
>   else
>     u <-- (q+1) u - v
>   count_trailing_zeros (cnt, up[0]);  /* At least one */
>   u >>= cnt;

I don't follow the u assignment in the else clause. And even*odd - odd
would seem to become odd. Perhaps you meant u <-- u - (v - q)v?

But a table-based Euclid would be interesting! On many machines,
count_*ing_zeros are cheap. On AMD Ryzen, _leading_ has the same
latency and throughput as plain addition (latency = 1, throughput
4/cycle), while _trailing_ has twice the latency and half the
throughput.
On Intel the latency is worse, 3 cycles for both trailing and leading,
with a throughput of 1/cycle, but that's still not bad. One might
unroll Euclid 2x and only recompute _leading_ for the affected
operand.

Would you give your table-based Euclid a try? Please use
~tege/gcd/test-gcd_33.c for testing, replacing the _33 call with an _n
call. (I might make a special test_gcd_n.c at some point.)

> It seems like an mpn_rsbmul_1 would be a useful primitive for
> several variants.

Yep. I never even considered that in GMP until now.

> Sounds like the opposite of binary Euclid.

Perhaps!

--
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel