On Monday 19 January 2009 19:32:39 ja...@njkfrudils.plus.com wrote:
> Presently for a 20x20 mul_basecase we have cycle counts of
>
>   2082  gmp-4.2.4
>   1461  mpir trunk
>   1153  mpir k8-branch
>   1103  pipeline
>   1033  perfect
>
> "pipeline" is my fully pipelined version of mpir k8-branch, and "perfect"
> assumes that every slot can be filled with a macro-op. I can perhaps trim
> another 10 cycles off the time, but it seems we have to have some unfilled
> slots, i.e. 13 cycles from the branch mispredict, 27 cycles from the
> suboptimal outer loop schedule, and 20 from first-iteration startup. The
> branch misprediction is unavoidable, the suboptimal outer loop schedule is
> the best I can get, and the first-iteration startup is a mystery!
>
> Note: the % speedup from k8-branch to "pipeline" is better for smaller n;
> we get about 10% for an 8x8 mul_basecase.
>
> It appears the only way to get more speed is to reduce the number of
> instructions, i.e. unroll the addmul_1 part more, from 4-way to 8-way, to
> reduce the loop control. It's not a lot, but we can take advantage of it
> better in mul_basecase than in addmul_1.
>
> A large multiplication will be broken down into mul_basecase-sized pieces,
> but they will have sizes NxN to 2Nx2N, where 2N is the kara-threshold. So
> any addmul_1 used in the basecase doesn't need loop control for 1/2 the
> time.
>
> Or, assuming it can be scheduled, an inf-way unroll with a jump in, and
> some magic for the feed-in.
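For reference, the row-wise scheme discussed in the quoted message can be modelled roughly as below. This is only a sketch in Python, not the k8 assembly: it assumes 64-bit limbs stored least significant first, and the names `to_limbs`, `from_limbs` and `mul_basecase_rows` are made up for illustration. The inner loop of `addmul_1` is the part that the 4-way or 8-way unrolling would apply to.

```python
LIMB_BITS = 64                      # assumed limb size
MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    return [(n >> (LIMB_BITS * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def addmul_1(r, offset, a, b_limb):
    """r[offset : offset+len(a)] += a * b_limb; return the carry-out limb."""
    carry = 0
    for i, a_limb in enumerate(a):
        # limb*limb + limb + limb never exceeds two limbs
        t = a_limb * b_limb + r[offset + i] + carry
        r[offset + i] = t & MASK
        carry = t >> LIMB_BITS
    return carry

def mul_basecase_rows(a, b):
    """Row-wise basecase: one addmul_1 pass over `a` per limb of `b`."""
    r = [0] * (len(a) + len(b))
    for j, b_limb in enumerate(b):
        # r[j + len(a)] has not been written yet, so the carry can be stored directly
        r[j + len(a)] = addmul_1(r, j, a, b_limb)
    return r
```

Every row reads and writes the whole partial result, which is where the per-row loop control and the outer loop schedule come from.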
Another way is to break the mul_basecase rectangle into columns rather than
rows. Here we get a much better basic block of

    mov (src1),AX
    mul (src2)
    add AX,t1
    adc DX,t2
    adc 0,t3

which is 6 macro-ops, hopefully leading to 2c/l, to give

    mov (src1),AX
    mul (src2)
    add AX,t1
    adc DX,t2
    adc 0,t3
    mov 1(src1),AX
    mul -1(src2)
    add AX,t1
    adc DX,t2
    adc 0,t3
    mov 2(src1),AX
    mul -2(src2)
    add AX,t1
    adc DX,t2
    adc 0,t3
    mov 3(src1),AX
    mul -3(src2)
    add AX,t1
    adc DX,t2
    adc 0,t3
    etc

But even with a 12-way unroll and 9 temp vars the best I could get is
2.6c/l :(
There is very little freedom in the above code to rearrange it.
I've not got any code which beats 2.5c/l yet :(

Jason
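The column-wise scheme can be modelled the same way. Again this is only a Python sketch under assumptions: 64-bit limbs, least significant first, with the hypothetical names `to_limbs`, `from_limbs` and `mul_basecase_columns`. The three accumulator words are named t1, t2, t3 after the temporaries in the basic block, and the comments mark which instruction each step stands for. It says nothing about scheduling or cycle counts, only about what the arithmetic does.

```python
LIMB_BITS = 64                      # assumed limb size
MASK = (1 << LIMB_BITS) - 1

def to_limbs(n, count):
    """Split a non-negative integer into `count` limbs, least significant first."""
    return [(n >> (LIMB_BITS * i)) & MASK for i in range(count)]

def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))

def mul_basecase_columns(a, b):
    """Column-wise basecase: sum all a[i]*b[k-i] for one result limb at a time."""
    n, m = len(a), len(b)
    r = [0] * (n + m)
    t1 = t2 = t3 = 0                # three-word column accumulator
    for k in range(n + m - 1):
        # src1 walks up through a while src2 walks down through b
        for i in range(max(0, k - m + 1), min(k, n - 1) + 1):
            p = a[i] * b[k - i]     # mov (src1),AX ; mul (src2)
            lo, hi = p & MASK, p >> LIMB_BITS
            s = t1 + lo             # add AX,t1
            t1, c = s & MASK, s >> LIMB_BITS
            s = t2 + hi + c         # adc DX,t2
            t2, c = s & MASK, s >> LIMB_BITS
            t3 = (t3 + c) & MASK    # adc 0,t3
        r[k] = t1                   # column finished: emit the low word
        t1, t2, t3 = t2, t3, 0      # and shift the accumulator down
    r[n + m - 1] = t1
    return r
```

Each result limb is written exactly once, so the partial result never has to be re-read, but every product needs the full add/adc/adc chain into the accumulator, which is the dependency that leaves so little room to rearrange the code.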