David,

First, mul_1, renamed again, with the file names now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, there are two simple things to try, indicated in a comment.
sparct34-mul_1-3c.asm
sparct34-mul_1-6c.asm
sparct34-mul_1-8c.asm
This is probably the same code as before for mul_2 and addmul_2. I intend to check it in now. We really ought to trim the 0.25 c/l at some point; it is a 7% speedup after all.
sparct34-aormul_2.asm
The rest are division asm functions, updated to avoid constants in the impldep insns. I suppose these are ready for check-in once you have time to test them on real sand. Timing them is enough of a correctness test; I have made sure they run properly. (Incidentally, not every function here is supported by tests/devel/try.c.) I expect all of these, except mod_1_4, to suffer from the huge mul delay.
sparct34-mode1o.asm
sparct34-mod_1_4.asm
sparct34-bdiv_dbm1c.asm
sparct34-dive_1.asm
sparct34-invert_limb.asm
Could you also please time the current copyi and copyd? And then gcd_1, using 'tune/speed -CD -s32-64 -t32 mpn_gcd_1'?

So what remains to be done for T4? And which ones would you want to work on? I'd suggest this priority order:

1. Write new addmul_1, aiming at 4.25 c/l. It is like mul_1 plus one ldx,addxccc pair per limb, and one carry-propagating addxc per iteration. I'd suggest 4-way unrolling with single pointers; 2-way should strain OoO too much to run well. We could reach ceil((7*k+5)/2)/k cycles/limb for k-way unrolling, so 8-way would be 10% faster than 4-way. Feed-in for 8-way would require either a jump table or a binary search.

2. Write new submul_1, aiming at 4.75 c/l, using 4-way unrolling. We'd reach ceil((8*k+5)/2)/k cycles/limb here, or 4.375 c/l for 8-way.

3. Write new add_n, aiming at 3 c/l using 2-way unrolling and the multi-pointer trick. The code would have just 10 insns in the loop, and be cache-port rather than decode/issue constrained.

4. Write new sub_n, aiming at 3 c/l using 2-way unrolling and the multi-pointer trick. The code would have 12 insns in the loop, and hit both the cache port and issue bandwidth.

5. Write the various addlsh, sublsh, rshadd, rshsub functions. Again, 2-way unrolling should be adequate in most cases. An exception is addlsh1_n, which should be 4-way, using two chains of addxccc, making heavy use of carry flag register renaming, ideally reaching 3.25 c/l. We could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l. All other functions in this group would need sllx,srlx,or for shifting, adding 1 c/l to the add_n speed (since that was not issue-constrained...) and 1.5 c/l to the sub_n speed.

6. Write addmul_k, k > 2. At some point, we can go to 2-way unrolling without losing speed (perhaps already for k=2, with some accumulation rewriting). At some point, surely no later than for k=8, we could skip unrolling altogether. We could gain 50% general speedup with this approach.

7. Write mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2... These would inline addmul_1, addmul_2, and whatever larger addmul_k we've come up with. Use "overlapped software pipelining".

8. Anything else missing from the T4 column at gmplib.org/devel/asm.html.

-- 
Torbjörn
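For checking the unrolling numbers in items 1 and 2, the bound ceil((a*k+5)/2)/k can be evaluated with a few lines of C. This is only a sketch of the arithmetic quoted above: the function name cycles_per_limb and the parameter a (7 for addmul_1, 8 for submul_1) are made up for illustration and are not part of GMP.

```c
#include <assert.h>

/* Hypothetical helper, not part of GMP.
   Evaluates the bound ceil((a*k + 5) / 2) / k from the list above, where
   a is the per-limb coefficient (7 for addmul_1, 8 for submul_1) and
   k is the unrolling factor.  Returns cycles per limb. */
static double cycles_per_limb(int a, int k)
{
    assert(a > 0 && k > 0);

    int insns = a * k + 5;          /* numerator of the bound */
    int cycles = (insns + 1) / 2;   /* ceil(insns / 2) for positive insns */

    return (double) cycles / k;
}
```

With a = 7 this gives 4.25 for k = 4 and 3.875 for k = 8, which is the roughly 10% gain mentioned for addmul_1; with a = 8 it gives 4.75 and 4.375 for submul_1.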