Presently, for a 20x20 mul_basecase we have cycle counts of

  2082  gmp-4.2.4
  1461  mpir trunk
  1153  mpir k8-branch
  1103  pipeline
  1033  perfect
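One way to read these figures is to normalise them by the amount of work done. The sketch below assumes only that an n x n basecase performs n*n limb multiplications; the helper names (`cycles_per_limb_product`, `percent_saved`, `print_timings`) are made up for illustration and are not part of MPIR or GMP.

```c
#include <assert.h>
#include <stdio.h>

/* Cycle counts for a 20x20 mul_basecase, copied from the figures above. */
struct timing { const char *name; unsigned cycles; };

static const struct timing timings[] = {
    {"gmp-4.2.4",      2082},
    {"mpir trunk",     1461},
    {"mpir k8-branch", 1153},
    {"pipeline",       1103},
    {"perfect",        1033},
};

/* Average cycles per limb product, assuming an n x n basecase
 * performs n*n limb multiplications. (Hypothetical helper.) */
double cycles_per_limb_product(unsigned cycles, unsigned n)
{
    assert(n > 0);
    return (double)cycles / ((double)n * (double)n);
}

/* Percentage of cycles saved when going from old_cycles to new_cycles. */
double percent_saved(unsigned old_cycles, unsigned new_cycles)
{
    assert(old_cycles > 0);
    return 100.0 * ((double)old_cycles - (double)new_cycles)
                 / (double)old_cycles;
}

/* Print one row per build, relative to the first entry (gmp-4.2.4). */
void print_timings(unsigned n)
{
    size_t count = sizeof timings / sizeof timings[0];
    for (size_t i = 0; i < count; i++) {
        printf("%-15s %5u cycles  %.4f cycles/limb product  %5.1f%% saved\n",
               timings[i].name, timings[i].cycles,
               cycles_per_limb_product(timings[i].cycles, n),
               percent_saved(timings[0].cycles, timings[i].cycles));
    }
}
```

Calling `print_timings(20)` prints one row per build. For instance, 1103 cycles over 400 limb products is about 2.76 cycles per product for "pipeline", against about 2.58 for "perfect".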
"pipeline" is my fully pipelined version of mpir k8-branch and "perfect" is assuming that every slot can be filled with a macro op. I can perhaps trim another 10 cycles off the time, but it seems we have to have some unfilled slots. ie 13 cycles from branch mispredict, 27 cycles from suboptimal outer loop schedule, 20 from first iteration startup.The branch misprediction is unavoidible , the suboptimal outer loop schedule is the best I can get , and the first iteration startup is a mystery! Note: the % speedup from k8-branch to "pipeline" is better for smaller n , we get about 10% for 8x8 mul_basecase It appears the only way to get more speed is to reduce the number of instructions , ie unroll the addmul_1 part more , from 4-way to 8-way to reduce the loop control. It's not alot but we can take advantage of it better in mul_basecase than in addmul_1 A large multiplication will be broken down into mul_basecase sized pieces , but they will have sizes NxN to 2Nx2N where 2N is the kara-threshold. So any addmul_1 used in the basecase doesn't need loop control for 1/2 the time. Or assuming it can be scheduled then an inf-way unroll with a jump in , and some magic for the feed-in. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to mpir-devel@googlegroups.com To unsubscribe from this group, send email to mpir-devel+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---