Presently, for a 20x20 mul_basecase, we have these cycle counts:

  2082  gmp-4.2.4
  1461  mpir trunk
  1153  mpir k8-branch
  1103  pipeline
  1033  perfect

"pipeline" is my fully pipelined version of the mpir k8-branch code, and 
"perfect" assumes that every slot can be filled with a macro-op. I can 
perhaps trim another 10 cycles off the time, but it seems we have to leave 
some slots unfilled: 13 cycles from the branch mispredict, 27 cycles from a 
suboptimal outer-loop schedule, and 20 from first-iteration startup. The 
branch misprediction is unavoidable, the suboptimal outer-loop schedule is 
the best I can get, and the first-iteration startup is a mystery!
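
As a quick sanity check on the numbers above, here is a small Python sketch of the cycle accounting. It assumes (my reading, not something stated outright) that the gap between "pipeline" and "perfect" is meant to be the three unfilled-slot costs plus the 10 cycles that might still be trimmed; the variable names are just for illustration.

```python
# Cycle counts quoted in this post for a 20x20 mul_basecase.
pipeline = 1103
perfect = 1033

# Unfilled-slot costs as listed in the post (cycles).
unfilled = {
    "branch mispredict": 13,
    "suboptimal outer-loop schedule": 27,
    "first-iteration startup": 20,
}

# "I can perhaps trim another 10 cycles off the time"
trimmable = 10

# Assumption: gap = unfilled slots + cycles that can still be trimmed.
gap = pipeline - perfect
accounted = sum(unfilled.values()) + trimmable

print("gap between pipeline and perfect:", gap)
print("accounted for by the listed items:", accounted)
```

Under that reading the listed items cover the whole gap, so nothing else appears to be hiding in the schedule.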

Note: the percentage speedup from k8-branch to "pipeline" is better for 
smaller n; we get about 10% for an 8x8 mul_basecase.
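
For comparison with that 10% figure, the 20x20 percentages can be worked out from the cycle counts in this post. This is only a sketch: the helper name time_saved is made up here, and "speedup" is taken to mean the fraction of cycles saved, which may not be exactly the measure used for the 8x8 number.

```python
# Cycle counts quoted in this post for a 20x20 mul_basecase.
cycles = {
    "gmp-4.2.4": 2082,
    "mpir trunk": 1461,
    "mpir k8-branch": 1153,
    "pipeline": 1103,
    "perfect": 1033,
}

def time_saved(old, new):
    """Fraction of cycles saved going from old to new (hypothetical helper)."""
    return (old - new) / old

# Saving of each version relative to the one listed before it.
names = list(cycles)
for prev, cur in zip(names, names[1:]):
    saved = time_saved(cycles[prev], cycles[cur])
    print(f"{prev} -> {cur}: {100 * saved:.1f}% fewer cycles")

# The step this post is about: k8-branch to "pipeline" at 20x20.
k8_to_pipeline = time_saved(cycles["mpir k8-branch"], cycles["pipeline"])
```

At 20x20 the k8-branch to "pipeline" step saves a bit over 4% of the cycles, which fits the remark that the gain is larger for small n.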

It appears the only way to get more speed is to reduce the number of 
instructions, i.e. unroll the addmul_1 part further, from 4-way to 8-way, to 
reduce the loop-control overhead. It's not a lot, but we can take better 
advantage of it in mul_basecase than in addmul_1.
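
To give a rough feel for how much loop control is at stake, here is a Python model of a schoolbook mul_basecase built from addmul_1-style rows, plus a crude count of inner-loop branches for a given unroll factor. This is only an illustration, not the assembler code: the function names are made up, the first row is treated like the others (real code would typically use a mul_1 there), and leftover limbs are simply rounded up to one more loop pass rather than handled by feed-in code.

```python
import math

BASE = 2**64  # one limb is assumed to be 64 bits

def to_limbs(x, n):
    """Split a non-negative integer into n limbs, least significant first."""
    return [(x >> (64 * i)) & (BASE - 1) for i in range(n)]

def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))

def mul_basecase(a, b):
    """Schoolbook product of two limb lists, one addmul_1-style row per limb of b."""
    r = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        carry = 0
        for i, ai in enumerate(a):  # this inner loop is the addmul_1 part
            t = r[i + j] + ai * bj + carry
            r[i + j] = t % BASE
            carry = t // BASE
        r[j + len(a)] = carry  # top limb of this row, not written by earlier rows
    return r

def inner_loop_branches(n, unroll):
    """Crude model: n rows, each needing ceil(n / unroll) passes of the unrolled loop."""
    return n * math.ceil(n / unroll)

for unroll in (4, 8):
    print(f"20x20, {unroll}-way unroll:", inner_loop_branches(20, unroll), "loop branches")
```

In this model a 20x20 product goes from 100 inner-loop branches at 4-way to 60 at 8-way, which only shows the direction of the saving; the real cycle gain depends on how the loop-control instructions fit into the schedule.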

A large multiplication will be broken down into mul_basecase-sized pieces, 
but they will have sizes from NxN to 2Nx2N, where 2N is the Karatsuba 
threshold. So any addmul_1 used in the basecase doesn't need loop control 
for half the time.

Or, assuming it can be scheduled, an inf-way unroll with a jump in, and 
some magic for the feed-in.

