David,

First mul_1, renamed again, now with the load scheduling encoded in the
file names.  Only the 6c variant is new.  Please time it.  If it doesn't
run at 3 c/l, there are two simple things to try, indicated in a comment.

Attachment: sparct34-mul_1-3c.asm

Attachment: sparct34-mul_1-6c.asm

Attachment: sparct34-mul_1-8c.asm

This is probably the same code as before for mul_2 and addmul_2.  I
intend to check it in now.  We really ought to trim off the 0.25 c/l at
some point; it is a 7% speedup, after all.

Attachment: sparct34-aormul_2.asm

The rest are division asm functions, updated to avoid constants in the
impldep insns.  I suppose these are ready for check-in once you have
time to test them on real sand.  It is enough to time them as a
correctness test; I have made sure they run properly.  (Incidentally,
not every function here is supported by tests/devel/try.c.)

I expect all of these, except mod_1_4, to suffer from the huge mul
delay.

Attachment: sparct34-mode1o.asm

Attachment: sparct34-mod_1_4.asm

Attachment: sparct34-bdiv_dbm1c.asm

Attachment: sparct34-dive_1.asm

Attachment: sparct34-invert_limb.asm

Could you also please time the current copyi and copyd?
And then gcd_1, using 'tune/speed -CD -s32-64 -t32 mpn_gcd_1'?

So what remains to be done for T4?  And which ones would you want to
work on?

I'd suggest this priority order:

1. Write new addmul_1, aiming at 4.25 c/l.  It is like mul_1 plus one
   ldx,addxccc pair per limb, and one carry-propagating addxc per
   iteration (a C sketch of the recurrence follows this list).  I'd
   suggest 4-way unrolling with single pointers; 2-way would strain OoO
   too much to run well.  We could reach ceil((7*k+5)/2)/k cycles/limb
   for k-way unrolling, so 8-way would be 10% faster than 4-way (the
   arithmetic is worked out after the list).  Feed-in for 8-way would
   require either a jump table or a binary search.

2. Write new submul_1, aiming at 4.75 c/l, using 4-way unrolling.  We'd
   reach ceil((8*k+5)/2)/k cycles/limb here, or 4.375 c/l for 8-way.

3. Write new add_n, aiming at 3 c/l using 2-way unrolling and the
   multi-pointer trick (a C model of the carry chain follows the list).
   The loop would have just 10 insns, and be cache-port rather than
   decode/issue constrained.

4. Write new sub_n, aiming at 3 c/l using 2-way unrolling and the
   multi-pointer trick.  The loop would have 12 insns, and would hit
   both the cache-port and issue-bandwidth limits.

5. Write the various addlsh, sublsh, rshadd, rshsub functions.  Again,
   2-way unrolling should be adequate in most cases.  An exception is
   addlsh1_n, which should be 4-way, using two chains of addxccc and
   making heavy use of carry-flag register renaming, ideally reaching
   3.25 c/l (a C model of the two carry chains follows the list).  We
   could make analogous sublsh1_n and rsblsh1_n, hitting 3.75 c/l.  All
   other functions in this group would need sllx,srlx,or for shifting,
   adding 1 c/l to the add_n speed (since that was not
   issue-constrained) and 1.5 c/l to the sub_n speed.

6. Write addmul_k, k > 2.  At some point we can go to 2-way unrolling
   without losing speed (perhaps already for k=2, with some accumulation
   rewriting).  At some point, surely no later than for k=8, we could
   skip unrolling altogether.  We could gain a 50% general speedup with
   this approach.

7. Write mul_basecase, sqr_basecase, mullo_basecase, redc_1, redc_2...
   These would inline addmul_1, addmul_2, and whatever larger addmul_k
   we've come up with.  Use "overlapped software pipelining"; a rough
   skeleton of the row structure follows the list.

8. Anything else missing from the T4 column at gmplib.org/devel/asm.html.
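
To spell out the arithmetic behind items 1 and 2: ceil((7*k+5)/2)/k
gives ceil(33/2)/4 = 17/4 = 4.25 c/l for k=4 and ceil(61/2)/8 = 31/8 =
3.875 c/l for k=8, whence the 10% figure.  The submul_1 formula
ceil((8*k+5)/2)/k likewise gives 19/4 = 4.75 c/l for k=4 and 35/8 =
4.375 c/l for k=8.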
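
For item 1, here is a minimal C model of the addmul_1 recurrence, for
reference against the asm.  It is only a sketch: ref_addmul_1 is a
made-up name, it assumes a gcc-style unsigned __int128 for the
64x64->128 product, and it is not unrolled (unrolling changes how the
carry chain is scheduled, not the recurrence).

#include <stdint.h>

typedef uint64_t mp_limb_t;

mp_limb_t
ref_addmul_1 (mp_limb_t *rp, const mp_limb_t *up, long n, mp_limb_t v)
{
  mp_limb_t cy = 0;                     /* carry limb between iterations */
  for (long i = 0; i < n; i++)
    {
      unsigned __int128 p = (unsigned __int128) up[i] * v; /* mulx, umulxhi */
      mp_limb_t lo = (mp_limb_t) p;
      mp_limb_t hi = (mp_limb_t) (p >> 64);
      lo += cy;                         /* carry-in from previous limb */
      hi += lo < cy;
      mp_limb_t r = rp[i] + lo;         /* the ldx,addxccc pair per limb */
      hi += r < lo;
      rp[i] = r;
      cy = hi;                          /* the carry-propagating addxc */
    }
  return cy;
}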
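
The add_n of item 3 is the same kind of thing; the 2-way unrolling and
the multi-pointer trick are addressing and scheduling details that do
not show up in C.  This sketch reuses the mp_limb_t typedef from above.

mp_limb_t
ref_add_n (mp_limb_t *rp, const mp_limb_t *up, const mp_limb_t *vp, long n)
{
  mp_limb_t cy = 0;                     /* lives in the carry flag in asm */
  for (long i = 0; i < n; i++)
    {
      mp_limb_t t = up[i] + vp[i];
      mp_limb_t c1 = t < up[i];         /* carry out of up[i] + vp[i] */
      mp_limb_t s = t + cy;
      cy = c1 | (s < t);                /* one addxccc does all of this */
      rp[i] = s;
    }
  return cy;
}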
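
For the addlsh1_n of item 5, the point is that rp[] = up[] + 2*vp[]
contains two independent carry chains, the bit moving between v limbs
and the addition carry, which is what the two addxccc chains exploit.
A C model, again with made-up names:

mp_limb_t
ref_addlsh1_n (mp_limb_t *rp, const mp_limb_t *up, const mp_limb_t *vp,
               long n)
{
  mp_limb_t vbit = 0;                   /* chain 1: bit between v limbs */
  mp_limb_t cy = 0;                     /* chain 2: addition carry */
  for (long i = 0; i < n; i++)
    {
      mp_limb_t v2 = (vp[i] << 1) | vbit;  /* in asm, vp[i]+vp[i] w/ carry */
      vbit = vp[i] >> 63;
      mp_limb_t t = up[i] + v2;
      mp_limb_t c1 = t < up[i];
      mp_limb_t s = t + cy;
      cy = c1 | (s < t);
      rp[i] = s;
    }
  return vbit + cy;                     /* total carry out is 0, 1, or 2 */
}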
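
Finally, the row structure that the basecase functions of item 7 would
inline, reusing ref_addmul_1 from above.  This is just the skeleton:
real code would do row zero with mul_1 rather than the memset, and the
overlapped software pipelining (fusing each row's wind-down with the
next row's wind-up) happens at the asm level and cannot be expressed
here.

#include <string.h>

void
ref_mul_basecase (mp_limb_t *rp, const mp_limb_t *up, long un,
                  const mp_limb_t *vp, long vn)  /* un >= vn >= 1 */
{
  memset (rp, 0, un * sizeof (mp_limb_t));
  for (long i = 0; i < vn; i++)
    rp[un + i] = ref_addmul_1 (rp + i, up, un, vp[i]);
}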

-- 
Torbjörn
