Attached is mpn_mul_1, which runs at 2.78 c/l with a 4-way unroll,
2.65 c/l with an 8-way unroll and 2.59 c/l with a 16-way unroll. Note that they
are not complete; I still need to put in the un-unrolled bits. I may be able to
improve them a bit more!

The basic block is

load src into ax
mul cx
mov 0 into temp2
add ax into temp1
adc dx into temp2
store temp1 into dst
// next iteration swap temp1 temp2

which is 7 slots on the K8. At 3 slots per cycle that gives a minimum time of
7/3 = 2.333 c/l. I've not seen anything which uses fewer slots.
I suppose I could try unrolling by a multiple of 3 to see if that would help
with the overhead/scheduling.
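
For anyone who wants to check the carry handling of the mul_1 basic block without reading the assembler, here is a minimal C sketch of the same dataflow. It is only a reference model, not the attached code: the names ref_mul_1 and limb_t are made up for illustration, and it assumes a compiler with the unsigned __int128 extension (GCC or Clang) to stand in for the dx:ax result of mul.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name for a 64-bit limb */

/* Reference model of the mul_1 basic block (not the attached assembler):
   dst[0..n-1] = src[0..n-1] * c, returning the final carry limb. */
limb_t ref_mul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t temp1 = 0;  /* carry limb coming in from the previous iteration */

    for (size_t i = 0; i < n; i++) {
        /* load src into ax; mul cx  ->  128-bit product in dx:ax */
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        limb_t ax = (limb_t)p;
        limb_t dx = (limb_t)(p >> 64);

        limb_t temp2;               /* mov 0 into temp2 */
        temp1 += ax;                /* add ax into temp1 */
        temp2 = dx + (temp1 < ax);  /* adc dx into temp2 (cannot overflow) */
        dst[i] = temp1;             /* store temp1 into dst */

        temp1 = temp2;              /* next iteration: swap temp1 and temp2 */
    }
    return temp1;
}
```

The C compiler will of course not produce the 7-slot schedule; the sketch is only useful as something to compare the assembler's output against.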

For addmul, the basic block I have is

load src into ax
mul cx
mov 0 into temp2
add ax into temp1
adc dx into temp2
add dst into temp1
adc 0 into temp2
store temp1 into dst
// next iteration swap temp1 temp2

which is 9 slots on the K8. At 3 slots per cycle that gives a minimum time of
9/3 = 3.0 c/l. I could change "add dst into temp1" to "add temp1 into dst",
which would save a slot, but add-to-memory has a large latency, maybe?
Anyone got anything better?
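
The addmul basic block can be modelled in C the same way. Again this is only a sketch under assumptions: ref_addmul_1 and limb_t are hypothetical names, unsigned __int128 is a GCC/Clang extension, and the code follows the 9-slot version with a separate load of dst and a separate store, not the add-to-memory variant.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name for a 64-bit limb */

/* Reference model of the addmul basic block (not tuned code):
   dst[0..n-1] += src[0..n-1] * c, returning the final carry limb. */
limb_t ref_addmul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t temp1 = 0;  /* carry limb coming in from the previous iteration */

    for (size_t i = 0; i < n; i++) {
        /* load src into ax; mul cx  ->  128-bit product in dx:ax */
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        limb_t ax = (limb_t)p;
        limb_t dx = (limb_t)(p >> 64);

        limb_t temp2;               /* mov 0 into temp2 */
        temp1 += ax;                /* add ax into temp1 */
        temp2 = dx + (temp1 < ax);  /* adc dx into temp2 */

        limb_t d = dst[i];
        temp1 += d;                 /* add dst into temp1 */
        temp2 += (temp1 < d);       /* adc 0 into temp2 (cannot overflow) */
        dst[i] = temp1;             /* store temp1 into dst */

        temp1 = temp2;              /* next iteration: swap temp1 and temp2 */
    }
    return temp1;
}
```

The second carry never overflows temp2, because src*c + carry-in + dst is at most 2^128 - 1, so one carry limb per iteration is enough.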

I suppose it's the "mul" instruction that's causing the trouble, as it operates
in only one pipe out of 3, and this causes the K8 scheduler problems.
Perhaps floating point or SSE would ease it, although we may have to change
from 64-bit to 53-bit?


Attachment: mul2780
Description: Troff document

Attachment: mul2656
Description: Troff document
