David Miller <da...@davemloft.net> writes:

  From: Torbjorn Granlund <t...@gmplib.org>
  Date: Fri, 04 Jan 2013 14:54:15 +0100

  > Did you add umulxhi use in your patch from a few days ago?

  Yes I did use mulx/umulxhi (both T3 and T4 have umulxhi) and yes the
  multiplies do pipeline on T4 (it doesn't on T3), and it gets about
  4 cycles per limb in a two-way unrolled loop in mul_1.

  addmul_1 gets about 6.5 cycles per limb.

Could you please try my mul-only loop to determine the throughput?

It is a 2-issue pipeline, right?  So the two extra instructions for
addmul_1 compared to mul_1, if both are deeply unrolled, should allow
for a differential of 1 + epsilon cycles.  With two-way unrolling, we
will get 3 extra instructions per way (5 or 6 in total per loop).  This
still does not explain the slowdown from 4 c/l to 6.5 c/l.

-- 
Torbjörn
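
For reference, the issue-slot arithmetic behind the estimate above can be sketched as follows. This is only a rough model, not a measurement: it assumes the pipeline really is 2-issue (which the message asks rather than asserts) and that the extra addmul_1 instructions cost nothing beyond their issue slots. The function name and constants are made up for illustration; the 4 c/l and 6.5 c/l figures are the ones quoted above.

```python
def issue_bound_extra_cycles(extra_insns_per_limb, issue_width=2):
    """Extra cycles per limb if issue slots were the only limit.

    issue_width=2 is an assumption taken from the question in the message.
    """
    return extra_insns_per_limb / issue_width


MUL_1_CL = 4.0     # reported mul_1 speed, cycles per limb
ADDMUL_1_CL = 6.5  # reported addmul_1 speed, cycles per limb

# Deeply unrolled: 2 extra instructions per limb.
deep = issue_bound_extra_cycles(2)

# Two-way unrolled: about 3 extra instructions per limb.
two_way = issue_bound_extra_cycles(3)

# What addmul_1 would cost under this model, and what is left over.
predicted = MUL_1_CL + two_way
unexplained = ADDMUL_1_CL - predicted

print(f"deep unrolling:    +{deep} c/l")
print(f"two-way unrolling: +{two_way} c/l, predicted {predicted} c/l")
print(f"unexplained gap:   {unexplained} c/l")
```

Under these assumptions the two-way unrolled addmul_1 would be expected at roughly 5.5 c/l, so about one cycle per limb of the reported 6.5 c/l is not accounted for by issue slots alone.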