[mpir-devel] Re: Some thoughts on mul basecase for AMD

jason Sun, 25 Jan 2009 06:48:43 -0800

On Monday 19 January 2009 19:32:39 ja...@njkfrudils.plus.com wrote:
> Presently for a 20x20 mul_basecase we have cycle counts of
> 2082 gmp-4.2.4
> 1461 mpir trunk
> 1153 mpir k8-branch
> 1103 pipeline
> 1033 perfect
>
> "pipeline" is my fully pipelined version of mpir k8-branch and "perfect" is
> assuming that every slot can be filled with a macro op. I can perhaps trim
> another 10 cycles off the time, but it seems we have to have some unfilled
> slots.
> i.e. 13 cycles from the branch mispredict, 27 cycles from the suboptimal
> outer loop schedule, and 20 from first-iteration startup. The branch
> misprediction is unavoidable, the suboptimal outer loop schedule is the
> best I can get, and the first-iteration startup is a mystery!
>
> Note: the % speedup from k8-branch to "pipeline" is better for smaller n;
> we get about 10% for an 8x8 mul_basecase.
>
> It appears the only way to get more speed is to reduce the number of
> instructions, i.e. unroll the addmul_1 part more, from 4-way to 8-way, to
> reduce the loop control. It's not a lot, but we can take advantage of it
> better in mul_basecase than in addmul_1.
>
> A large multiplication will be broken down into mul_basecase-sized pieces,
> but they will have sizes from NxN to 2Nx2N, where 2N is the Karatsuba
> threshold. So any addmul_1 used in the basecase doesn't need loop control
> for 1/2 the time.
>
> Or, assuming it can be scheduled, an inf-way unroll with a jump in, and
> some magic for the feed-in.
>


Another way is to break the mul_basecase rectangle into columns rather than
rows. Here we get a much better basic block of

mov (src1),AX
mul (src2)
add AX,t1
adc DX,t2
adc 0,t3

which is 6 macro-ops, hopefully leading to 2c/l.

Unrolling this along a column gives
mov (src1),AX
mul (src2)
add AX,t1
adc DX,t2
adc 0,t3
mov 1(src1),AX
mul -1(src2)
add AX,t1
adc DX,t2
adc 0,t3
mov 2(src1),AX
mul -2(src2)
add AX,t1
adc DX,t2
adc 0,t3
mov 3(src1),AX
mul -3(src2)
add AX,t1
adc DX,t2
adc 0,t3

etc

But even with a 12-way unroll and 9 temp vars, the best I could get is 2.6c/l :(
There is very little freedom in the above code to re-arrange it.
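
For anyone who wants to see the column idea outside of assembler, here is a rough C sketch of it. It is only an illustration, not the code I am timing: the names limb_t, mul_basecase_cols and mul_basecase_rows are made up for this example, and it assumes 32-bit limbs so that a limb product fits in a uint64_t (the real code uses 64-bit limbs and the hardware mul). The variables t1, t2, t3 play the same role as in the basic block above, and the row-wise version is included only as a reference to compare against.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb_t; /* assumption: 32-bit limbs, product fits in uint64_t */

/* Column-wise basecase: rp[0..an+bn-1] = ap[0..an-1] * bp[0..bn-1].
   Each column k sums every ap[i]*bp[k-i] into the 3-limb accumulator t3:t2:t1. */
void mul_basecase_cols(limb_t *rp, const limb_t *ap, size_t an,
                       const limb_t *bp, size_t bn)
{
    limb_t t1 = 0, t2 = 0, t3 = 0;
    assert(an >= 1 && bn >= 1);
    for (size_t k = 0; k + 1 < an + bn; k++) {
        size_t lo = (k < bn) ? 0 : k - bn + 1;  /* first valid index into ap */
        size_t hi = (k < an) ? k : an - 1;      /* last valid index into ap  */
        for (size_t i = lo; i <= hi; i++) {
            uint64_t p = (uint64_t)ap[i] * bp[k - i];         /* mov, mul   */
            uint64_t s = (uint64_t)t1 + (limb_t)p;            /* add AX,t1  */
            t1 = (limb_t)s;
            s = (uint64_t)t2 + (limb_t)(p >> 32) + (s >> 32); /* adc DX,t2  */
            t2 = (limb_t)s;
            t3 += (limb_t)(s >> 32);                          /* adc 0,t3   */
        }
        rp[k] = t1;               /* column finished: store its low limb */
        t1 = t2; t2 = t3; t3 = 0; /* shift the accumulator down one limb */
    }
    rp[an + bn - 1] = t1;         /* whatever is left is the top limb */
}

/* Row-wise reference (an addmul_1-style inner loop), for comparison only. */
void mul_basecase_rows(limb_t *rp, const limb_t *ap, size_t an,
                       const limb_t *bp, size_t bn)
{
    for (size_t i = 0; i < an + bn; i++)
        rp[i] = 0;
    for (size_t j = 0; j < bn; j++) {
        limb_t carry = 0;
        for (size_t i = 0; i < an; i++) {
            uint64_t s = (uint64_t)ap[i] * bp[j] + rp[i + j] + carry;
            rp[i + j] = (limb_t)s;
            carry = (limb_t)(s >> 32);
        }
        rp[j + an] = carry;
    }
}
```

The sketch shows why there is so little freedom: every product in a column feeds the same add/adc/adc chain through t1, t2, t3, whereas the row-wise loop writes each partial result back to memory.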

I've not got any code which beats 2.5c/l yet :(

Jason



