That's very impressive! What do you mean by a slot?
I presume by ax you mean rax, etc. There's also going to be some loop overhead, right?

Bill.

2008/11/23 <[EMAIL PROTECTED]>:
> Attached is mpn_mul_1, which runs at 2.78 c/l for 4-way unroll,
> 2.65 c/l for 8-way unroll and 2.59 c/l for 16-way unroll. Note they are
> not complete; I still need to put in the un-unrolled bits. I may be able to
> improve them a bit more!
>
> The basic block is
>
>     load src into ax
>     mul cx
>     mov 0 into temp2
>     add ax into temp1
>     adc dx into temp2
>     store temp1 into dst
>     // next iteration swap temp1 temp2
>
> which is 7 slots on the K8, which gives us a minimum time of 2.333 c/l. I've
> not seen anything which uses fewer slots.
> I suppose I could try unrolling by a multiple of 3 to see if that would help
> with the overhead/scheduling.
>
> Considering addmul, I have for the basic block
>
>     load src into ax
>     mul cx
>     mov 0 into temp2
>     add ax into temp1
>     adc dx into temp2
>     add dst into temp1
>     adc 0 into temp2
>     store temp1 into dst
>     // next iteration swap temp1 temp2
>
> which is 9 slots on the K8, which gives us a minimum time of 3.0 c/l.
> Could change "add dst into temp1" to "add temp1 into dst", which would
> save a slot, but add to mem has a large latency, maybe?
> Anyone got anything better?
>
> I suppose it's the "mul" instruction that's causing the trouble, as it operates
> in only one pipe out of 3, and this causes the K8 scheduler problems.
> Perhaps floating point or SSE would ease it? Although we may have to change
> from 64-bit to 53-bit?
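
For reference, here is a rough C sketch of what the two basic blocks quoted above compute per limb, plus the slot arithmetic behind the 2.333 and 3.0 c/l figures. It is only an illustration, not the attached assembly: the names limb_t, ref_mul_1, ref_addmul_1 and min_cycles_per_limb are made up for this example, it assumes a compiler with unsigned __int128 (GCC or Clang on x86-64), and it takes the 3 slots per cycle from the "one pipe out of 3" remark in the quoted message.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for illustration; not MPIR's actual code. */
typedef uint64_t limb_t;

/* dst[0..n-1] = src[0..n-1] * c, returning the carry-out limb.
   Assumes unsigned __int128 is available (GCC/Clang extension). */
limb_t ref_mul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t carry = 0;                      /* plays the role of temp1 */
    for (size_t i = 0; i < n; i++) {
        /* load src into ax; mul cx  ->  128-bit product in dx:ax */
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        p += carry;                        /* add ax into temp1; adc dx into temp2 */
        dst[i] = (limb_t)p;                /* store temp1 into dst */
        carry = (limb_t)(p >> 64);         /* temp2 becomes the next temp1 */
    }
    return carry;
}

/* dst[0..n-1] += src[0..n-1] * c, returning the carry-out limb. */
limb_t ref_addmul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        p += carry;                        /* add ax into temp1; adc dx into temp2 */
        p += dst[i];                       /* add dst into temp1; adc 0 into temp2 */
        dst[i] = (limb_t)p;                /* store temp1 into dst */
        carry = (limb_t)(p >> 64);
    }
    return carry;
}

/* Lower bound on cycles per limb if the core retires at most
   slots_per_cycle slots each cycle (assumed to be 3 for the K8). */
double min_cycles_per_limb(int slots, int slots_per_cycle)
{
    return (double)slots / (double)slots_per_cycle;
}
```

With these, min_cycles_per_limb(7, 3) gives the 2.333 c/l bound for mul_1 and min_cycles_per_limb(9, 3) gives 3.0 c/l for addmul_1. Neither sum can overflow 128 bits, since (2^64 - 1)^2 + 2(2^64 - 1) = 2^128 - 1.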