Marius Hillenbrand <mhil...@linux.ibm.com> writes:

  Done, completed successfully for both mpn_addmul_1 and mpn_submul_1.

Good!

  I measure ~40% reduction in cycles per limb on z14 and ~60% reduction on
  z15, for both addmul_1 and submul_1 (smaller gains for small n <10 but
  no regressions), yet ...

Great speedup!

Smaller gains for small loop trip counts are expected.

  > 2. Is the measured performance close to what you would hope for, given
  > some hard pipeline limits?  If not, would you be willing to try to
  > improve things?

  ... I assumed that I hit max multiplication throughput, yet I just
  learned that there is potential to increase that further. I'll accept
  that challenge and keep working on the patch :)

Good!

  I went through the process before and already have copyright
  assignments in place for multiple GNU projects, including GMP, so we
  should be good from that side.

Ah, that will simplify things.

Your addmul_1 and submul_1 will be great contributions.  Please do a
mul_1 too, as that should be very straightforward by removing some
insns from addmul_1.

If you really want to make these CPUs perform near their maximum
capability, consider writing mul_basecase, sqr_basecase, and perhaps
mpn_sbpi1_bdiv_r too.  The last one might not be too clear, but it is
really very similar to mul_basecase, only that the inner-loop invariant
multiplier comes from a Hensel limb quotient.

A fast sqr_basecase is tricky, and GMP uses a plethora of algorithms for
it.  The most recent saves a lot of cycles and looks like below in C.
In asm, one would inline addmul_1, of course.  And for both mul_basecase
and in particular sqr_basecase, one should implement straight line code
for small operands (this also tends to simplify the main code as it
doesn't need to consider small operands).  For sqr_basecase, one would
exit the outer loop long before un becomes zero, and fall into straight
line code.

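As an aside on the mul_1 remark (the sqr_basecase code referred to above follows after this sketch): the relationship between the two routines can be seen in plain C. This is only a reference sketch, not GMP's code. The names limb_t, dlimb_t, ref_addmul_1 and ref_mul_1 are hypothetical, and it assumes 64-bit limbs without nail bits and a compiler that provides unsigned __int128 (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension: double-limb value */

/* rp[0..n-1] += up[0..n-1] * v; returns the carry-out limb. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* (2^64-1)^2 + 2*(2^64-1) = 2^128-1, so this sum cannot overflow. */
        dlimb_t t = (dlimb_t)up[i] * v + rp[i] + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}

/* rp[0..n-1] = up[0..n-1] * v; returns the carry-out limb.
   Same loop as above, minus the load and add of rp[i]. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t cy = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + cy;
        rp[i] = (limb_t)t;
        cy = (limb_t)(t >> 64);
    }
    return cy;
}
```

In this sketch, calling ref_addmul_1 on a zeroed destination gives the same limbs and carry as ref_mul_1, which is the sense in which mul_1 falls out of addmul_1 by dropping instructions.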
void
mpn_sqr_basecase (mp_ptr rp, mp_srcptr up, mp_size_t un)
{
  mp_limb_t u0;
  mp_limb_t cin;

  u0 = up[0];
  umul_ppmm (cin, rp[0], u0, u0);
  ++rp;

  if (--un) {
    u0 = u0 << 1;
    up += 1;

    rp[un] = mpn_mul_1c (rp, up, un, u0, cin);

    // error: up[0]:63 x up[(n-1)...1], analogous after each iteration below

    do {
      u0 = up[0];
      mp_limb_t ci = -(up[-1] >> (GMP_NUMB_BITS-1)) & u0; // correction term
      mp_limb_t x0 = rp[1] + ci;
      mp_limb_t c0 = x0 < ci;
      mp_limb_t hi, lo;

      umul_ppmm (hi, lo, u0, u0);
      mp_limb_t x1 = x0 + lo;
      mp_limb_t c1 = x1 < lo;
      cin = hi + c0 + c1;
      ASSERT (cin >= hi); // hi + c0 + c1 must not wrap around
      rp[1] = x1;
      rp += 2;

      if (--un == 0) break;
      u0 = (up[-1] >> (GMP_NUMB_BITS-1)) + (u0 << 1);
      up += 1;

      rp[un] = mpn_addmul_1c (rp, up, un, u0, cin);
    } while (1);
  }

  rp[0] = cin;
}

-- 
Torbjörn
Please encrypt, key id 0xC8601622
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel
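
For readers who want to experiment with the routine above outside a GMP build, the following is a minimal self-contained sketch of the same structure. It is not GMP's implementation: sk_mul_1c and sk_addmul_1c are hypothetical plain-C stand-ins for mpn_mul_1c and mpn_addmul_1c, the umul_ppmm macro is replaced by a 128-bit multiply, and the names limb_t, dlimb_t and sk_sqr_basecase are made up for illustration. It assumes 64-bit limbs without nail bits and a compiler with unsigned __int128 (GCC or Clang).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* hypothetical stand-in for mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension */
#define LIMB_BITS 64

/* rp[0..n-1] = up[0..n-1] * v + cin; returns the carry-out limb. */
static limb_t sk_mul_1c(limb_t *rp, const limb_t *up, size_t n,
                        limb_t v, limb_t cin)
{
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + cin;
        rp[i] = (limb_t)t;
        cin = (limb_t)(t >> LIMB_BITS);
    }
    return cin;
}

/* rp[0..n-1] += up[0..n-1] * v + cin; returns the carry-out limb. */
static limb_t sk_addmul_1c(limb_t *rp, const limb_t *up, size_t n,
                           limb_t v, limb_t cin)
{
    for (size_t i = 0; i < n; i++) {
        dlimb_t t = (dlimb_t)up[i] * v + rp[i] + cin;
        rp[i] = (limb_t)t;
        cin = (limb_t)(t >> LIMB_BITS);
    }
    return cin;
}

/* rp[0..2*un-1] = up[0..un-1]^2, for un >= 1. */
void sk_sqr_basecase(limb_t *rp, const limb_t *up, size_t un)
{
    dlimb_t sq = (dlimb_t)up[0] * up[0];
    limb_t u0, cin;

    rp[0] = (limb_t)sq;
    cin = (limb_t)(sq >> LIMB_BITS);
    ++rp;

    if (--un) {
        u0 = up[0] << 1;            /* doubled multiplier; top bit is lost */
        up += 1;
        rp[un] = sk_mul_1c(rp, up, un, u0, cin);

        for (;;) {
            limb_t top = up[-1] >> (LIMB_BITS - 1); /* bit lost by the shift */
            u0 = up[0];
            limb_t ci = -top & u0;  /* correction term for the lost bit */
            limb_t x0 = rp[1] + ci;
            limb_t c0 = x0 < ci;

            sq = (dlimb_t)u0 * u0;  /* diagonal term up[i]^2 */
            limb_t lo = (limb_t)sq;
            limb_t hi = (limb_t)(sq >> LIMB_BITS);
            limb_t x1 = x0 + lo;
            limb_t c1 = x1 < lo;
            cin = hi + c0 + c1;
            assert(cin >= hi);      /* the carry sum does not wrap around */
            rp[1] = x1;
            rp += 2;

            if (--un == 0)
                break;
            u0 = top + (u0 << 1);   /* fold the lost bit into the next multiplier */
            up += 1;
            rp[un] = sk_addmul_1c(rp, up, un, u0, cin);
        }
    }
    rp[0] = cin;
}
```

The caller supplies a destination of 2*un limbs. Squaring operands whose limbs are all ones is a useful first check, since every shifted-out top bit is set and both the correction term and the folded-in bit are exercised on each pass.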