David Miller <da...@davemloft.net> writes:

Thanks for your help, the following works. I'll work on unrolling and scheduling it.

PROLOGUE(mpn_sub_nc)
        ba,pt   %xcc, L(ent)
        xor     cy, 1, cy
EPILOGUE()
PROLOGUE(mpn_sub_n)
        mov     1, cy
L(ent): cmp     %g0, cy
L(top): ldx     [up+0], %o4
        add     up, 8, up
        ldx     [vp+0], %o5
        add     vp, 8, vp
        add     rp, 8, rp
        add     n, -1, n
        xnor    %o5, %g0, %o5
        addxccc %o4, %o5, %g3
        brgz    n, L(top)
        stx     %g3, [rp-8]
        clr     %o0
        retl
        movcc   %xcc, 1, %o0
EPILOGUE()

Since we are working with a throughput-constrained pipeline, we should really use as few insns as possible.
There are 6 operation insns, and it seems hard to use fewer than 5 bookkeeping insns. With k-way unrolling we should then get to max(3,(6k+5)/(2k)) cycles/limb.

For small k, we could put the pointers at the end of their operands, then use a combined index and loop counter -n...0. This would give max(3,(7k+1)/(2k)) cycles/limb.

(The max(3...) handles the load/store bandwidth limit. It has no limiting effect for sub_n, but it does for add_n.)

Estimated cycles/limb for k-way unrolling:

sub_n:
   k   method 1   method 2
   1     5.5        4.0
   2     4.2        3.8
   3     3.8        3.7
   4     3.6        3.6
   5     3.5        3.6
   6     3.4        3.6
   7     3.4        3.6
   8     3.3        3.6
  oo     3.0        3.5

add_n:
   k   method 1   method 2
   1     5.0        3.5
   2     3.8        3.2
   3     3.3        3.2
   4     3.1        3.1
   5     3.0        3.1
   6     3.0        3.1
   7     3.0        3.1
   8     3.0        3.1
  oo     3.0        3.0

For add_n, I recommend either method 1 with 4-way unrolling, or method 2 with 2-way unrolling. For sub_n we should use at least 4-way unrolling.

-- 
Torbjörn
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
http://gmplib.org/mailman/listinfo/gmp-devel
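For anyone who wants to try other unrolling factors, the estimates in the tables can be recomputed with a small C sketch like the one below. It is only an illustration of the two formulas in the message. The names cycles_per_limb and print_cycle_table are made up for this example. The add_n insn counts (5 per limb for method 1, 6 per limb for method 2) are an assumption inferred from the add_n table, since the message gives explicit counts only for sub_n. The divisor 2k is read here as 2 insns issued per cycle, which is also an interpretation rather than something the message states.

```c
#include <stdio.h>

/* Estimated cycles/limb for k-way unrolling.
     ops_per_limb: insns that scale with the unrolling factor k
     fixed:        bookkeeping insns paid once per loop iteration
   Assumes 2 insns per cycle (the 2k divisor in the formulas) and a
   floor of 3 cycles/limb from the load/store bandwidth limit.  */
static double
cycles_per_limb (int ops_per_limb, int fixed, int k)
{
  double c = (double) (ops_per_limb * k + fixed) / (2.0 * k);
  return c < 3.0 ? 3.0 : c;
}

/* Print both methods for sub_n and add_n, k = 1..8.
   sub_n: method 1 = (6k+5)/(2k), method 2 = (7k+1)/(2k)  [from the message]
   add_n: method 1 = (5k+5)/(2k), method 2 = (6k+1)/(2k)  [assumed]  */
static void
print_cycle_table (void)
{
  int k;
  printf (" k  sub_n/m1  sub_n/m2  add_n/m1  add_n/m2\n");
  for (k = 1; k <= 8; k++)
    printf ("%2d  %8.2f  %8.2f  %8.2f  %8.2f\n", k,
            cycles_per_limb (6, 5, k), cycles_per_limb (7, 1, k),
            cycles_per_limb (5, 5, k), cycles_per_limb (6, 1, k));
}
```

Calling print_cycle_table() from a main() prints the values to two decimals. The tables in the message show the same estimates to one decimal.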