That's very impressive!

What do you mean by a slot?

I presume by ax you mean rax, etc.

There's also going to be some loop overhead, right?

Bill.

2008/11/23  <[EMAIL PROTECTED]>:
> Attached is mpn_mul_1, which runs at 2.78 c/l for a 4-way unroll,
> 2.65 c/l for an 8-way unroll and 2.59 c/l for a 16-way unroll. Note they are
> not complete; I still need to put in the un-unrolled bits. I may be able to
> improve them a bit more!
>
> The basic block is
>
> load src into ax
> mul cx
> mov 0 into temp2
> add ax into temp1
> adc dx into temp2
> store temp1 into dst
> // next iteration swap temp1 temp2
>
> which is 7 slots on the K8, which gives us a minimum time of 2.333 c/l. I've
> not seen anything which uses fewer slots.
> I suppose I could try unrolling by a multiple of 3 to see if that would help
> with the overhead/scheduling.
>
> Considering addmul, I have for the basic block
>
> load src into ax
> mul cx
> mov 0 into temp2
> add ax into temp1
> adc dx into temp2
> add dst into temp1
> adc 0 into temp2
> store temp1 into dst
> // next iteration swap temp1 temp2
>
> which is 9 slots on the K8, which gives us a minimum time of 3.0 c/l.
> Could change "add dst into temp1" to "add temp1 into dst", which would
> save a slot, but add to mem has a large latency, maybe?
> Anyone got anything better?
>
> I suppose it's the "mul" instruction that's causing the trouble, as it operates
> in only one pipe out of 3, and this causes the K8 scheduler problems.
> Perhaps floating point or SSE would ease it? Although we may have to change
> from 64-bit to 53-bit?
>
> >
>
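
For reference, the two basic blocks quoted above can be sketched in C. This is
only an illustration under stated assumptions, not the attached assembly: the
names limb_t, sketch_mul_1, sketch_addmul_1 and min_cycles_per_limb are
hypothetical, unsigned __int128 is a GCC/Clang extension standing in for the
dx:ax register pair, and the figure of 3 slots per cycle is inferred from the
quoted "7 slots gives 2.333 c/l".

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;   /* hypothetical name for one 64-bit limb */

/* Lower bound in cycles per limb, ASSUMING 3 slots issue per cycle
   (inferred from the quoted 7 slots -> 2.333 c/l). */
double min_cycles_per_limb(int slots)
{
    return slots / 3.0;
}

/* dst[0..n-1] = src[0..n-1] * c; returns the carry-out limb. */
limb_t sketch_mul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t temp1 = 0;                      /* high limb carried from the previous iteration */
    for (size_t i = 0; i < n; i++) {
        /* load src into ax; mul cx  (128-bit product in dx:ax) */
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        limb_t ax = (limb_t)p;
        limb_t dx = (limb_t)(p >> 64);
        limb_t temp2 = 0;                  /* mov 0 into temp2 */
        temp1 += ax;                       /* add ax into temp1 */
        temp2 += dx + (temp1 < ax);        /* adc dx into temp2 */
        dst[i] = temp1;                    /* store temp1 into dst */
        temp1 = temp2;                     /* next iteration: swap temp1/temp2 */
    }
    return temp1;
}

/* dst[0..n-1] += src[0..n-1] * c; returns the carry-out limb. */
limb_t sketch_addmul_1(limb_t *dst, const limb_t *src, size_t n, limb_t c)
{
    limb_t temp1 = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * c;
        limb_t ax = (limb_t)p;
        limb_t dx = (limb_t)(p >> 64);
        limb_t temp2 = 0;                  /* mov 0 into temp2 */
        temp1 += ax;                       /* add ax into temp1 */
        temp2 += dx + (temp1 < ax);        /* adc dx into temp2 */
        temp1 += dst[i];                   /* add dst into temp1 */
        temp2 += (temp1 < dst[i]);         /* adc 0 into temp2 */
        dst[i] = temp1;                    /* store temp1 into dst */
        temp1 = temp2;                     /* next iteration: swap temp1/temp2 */
    }
    return temp1;
}
```

Each C statement is annotated with the pseudo-instruction it mirrors, so the
7 and 9 slot counts can be read straight off the comments. A compiler will
not necessarily emit that exact sequence; the sketch is meant for checking
the carry logic, not for timing.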
