Hi,

Based on your feedback on my previous patches, I rewrote addmul_1/mul_1 and added implementations of addmul_2/mul_2 and mul_basecase. They are still based on 64x64->128 multiplication into GPR pairs, with the products accumulated 128 bits at a time in vector registers.
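
For readers who do not have the patches at hand, here is a minimal portable sketch of what addmul_1 computes. It is not the code from the patches: it uses the GCC/Clang extension unsigned __int128 in place of GPR pairs, inline assembly and vector registers, it is not unrolled, and the names limb_t and sketch_addmul_1 are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for GMP's mp_limb_t on a 64-bit target. */
typedef uint64_t limb_t;

/* Illustrative only: rp[0..n-1] += up[0..n-1] * v, returns the carry-out limb.
   Assumes a compiler that provides unsigned __int128 (GCC, Clang). */
static limb_t sketch_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64->128 product. Adding two more 64-bit values cannot overflow:
           (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1. */
        unsigned __int128 acc = (unsigned __int128)up[i] * v;
        acc += rp[i];
        acc += carry;
        rp[i] = (limb_t)acc;           /* low 64 bits go to the result */
        carry = (limb_t)(acc >> 64);   /* high 64 bits carry into the next limb */
    }
    return carry;
}
```

The bound in the comment is why a single 128-bit accumulator per limb is enough here; no carry can be lost inside one iteration.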

The code passes "make check", of course, and I have run "try" for ~72 hours for each of the functions (on top of countless iterations of the relevant individual test cases in tests/devel). GMPbench.base.multiply improves by about 50% on z15, and the overall GMPbench score improves by ~35%. The patches do not include new tuneup parameters yet.

All the implementations are in C, with enough inline assembly to result in decent code. mul_basecase #includes and inlines the (add)mul functions to avoid calls and unnecessary branches. All the (add)mul_1/2 functions are 4x unrolled over the first operand (i.e., 4 multiplications per iteration in addmul_1, 8 in addmul_2). mul_basecase branches on (un % 4) only once, on entry, to select the correct loop prologue, so the addmul bodies do not need to repeat that branch.

The accumulation structure in addmul_2 is maybe a little unexpected. The idea is to prefer 128-bit adds without carry over full adds with carry-in and carry-out wherever possible, because the latter require two instructions per sum and have instruction grouping limitations. For the moderate limb counts that are relevant for mul_basecase, the resulting code performs better than code that strictly uses adds with carry-in/out.

Regards,
Marius

_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel