Attached is mpn_mul_1, which runs at 2.78 c/l with a 4-way unroll, 2.65 c/l with an 8-way unroll and 2.59 c/l with a 16-way unroll. Note they are not complete; I still need to put in the un-unrolled bits. I may be able to improve them a bit more!
The basic block is

    load src into ax
    mul cx
    move 0 into temp2
    add ax into temp1
    adc dx into temp2
    store temp1 into dst
    // next iteration: swap temp1 and temp2

which is 7 slots on the K8, which gives us a minimum time of 2.333 c/l (7 slots at 3 per cycle). I've not seen anything which uses fewer slots. I suppose I could try unrolling by a multiple of 3 to see if that would help with the overhead/scheduling.

Considering addmul, I have for the basic block

    load src into ax
    mul cx
    mov 0 into temp2
    add ax into temp1
    adc dx into temp2
    add dst into temp1
    adc 0 into temp2
    store temp1 into dst
    // next iteration: swap temp1 and temp2

which is 9 slots on the K8, which gives us a minimum time of 3.0 c/l. I could change "add dst into temp1" to "add temp1 into dst", which would save a slot (the separate store), but an add to memory has a large latency, maybe? Anyone got anything better?

I suppose it's the "mul" instruction that's causing the trouble, as it operates in only one pipe out of 3, and this causes the K8 scheduler problems. Perhaps floating point or SSE would ease it, although I may have to change from 64-bit to 53-bit limbs?
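For anyone following along without the attachments, here is a rough C rendering of what the two basic blocks above compute, plus the slot arithmetic. It is only a sketch, not the attached assembly: the names ref_mul_1, ref_addmul_1 and min_cycles_per_limb are made up for illustration, the code assumes 64-bit limbs and a compiler with unsigned __int128 (GCC or Clang), and the bound assumes the K8 issues 3 slots per cycle as in the figures above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the mul_1 block:
   dst[0..n-1] = low limbs of src * c, returns the carry-out limb. */
uint64_t ref_mul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t c)
{
    uint64_t carry = 0;                  /* plays the role of temp1 */
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        /* mul cx: 64x64 -> 128-bit product (dx:ax) */
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        p += carry;                      /* add ax into temp1, adc dx into temp2 */
        dst[i] = (uint64_t)p;            /* store temp1 into dst */
        carry = (uint64_t)(p >> 64);     /* temp2 becomes next iteration's temp1 */
    }
    return carry;
}

/* Hypothetical reference for the addmul_1 block:
   dst[0..n-1] += src * c, returns the carry-out limb. */
uint64_t ref_addmul_1(uint64_t *dst, const uint64_t *src, size_t n, uint64_t c)
{
    uint64_t carry = 0;
    assert(n >= 1);
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        p += carry;                      /* add ax / adc dx as in mul_1 */
        p += dst[i];                     /* add dst into temp1, adc 0 into temp2 */
        dst[i] = (uint64_t)p;            /* cannot overflow 128 bits: max is 2^128 - 1 */
        carry = (uint64_t)(p >> 64);
    }
    return carry;
}

/* Lower bound on cycles per limb, assuming 3 slots issue per cycle. */
double min_cycles_per_limb(unsigned slots)
{
    return slots / 3.0;
}
```

With these, min_cycles_per_limb(7) gives the 2.333 c/l bound for mul_1 and min_cycles_per_limb(9) gives 3.0 c/l for addmul_1. The C versions are only useful as a correctness check for the assembly; a compiler will not necessarily produce the same instruction sequence.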
mul2780
Description: Troff document
mul2656
Description: Troff document