Ah, this probably won't make that much difference to overall
performance. Here is why:

In rearranging the instructions this way, we have had to mix up
instructions from different iterations of the unrolled loop. That
means one can no longer just jump into the loop at the required spot
as before; the wind-up and wind-down code has to be made more complex.
This is fine, but it probably adds a few cycles for small sizes.

Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
that mul_basecase switches over to Karatsuba after about 30 limbs on
the Opteron.

It also probably takes a number of iterations of the loop before the
hardware settles into a pattern: the data cache needs to prime, the
branch predictor needs to prime, the instruction cache needs to prime,
and the actual picking of instructions in the correct order does not
necessarily happen on the first iteration of the loop.

I might be overstating the case a little. Perhaps you win by about 8
limbs; I don't know.

Anyhow, I believe jason (not Martin) is working on getting fully
working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
has actually done all the really hard work here with the initial
scheduling to get down to 2.75 c/l, I'll let him post any performance
figures once he is done with the code. He deserves the credit!

Bill.

2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>
>
>
> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>> Some other things I forgot to mention:
>>
>> 1) It probably wouldn't have been possible for me to get 2.5c/l
>> without jason's code, in both the mul_1 and addmul_1 cases.
>
> :)
>
>> 2) You can often insert nops next to lone or paired instructions
>> which do not already form a triplet of 3 macro-ops, further
>> confirming that the above analysis is correct.
>>
>> 3) The addmul_1 code I get is very close to the code obtained by
>> someone else through independent means, so I won't post it here. Once
>> the above tricks have been validated on other code, I'll commit the
>> addmul_1 code I have to the repo. Or perhaps someone else will
>> rediscover it from what I have written above.
>>
>> In fact I was only able to find about 16 different versions of
>> addmul_1 that run in 2.5c/l all of which look very much like the
>> solution obtained independently. The order and location of most
>> instructions is fixed by the dual requirements of having triplets of
>> macro-ops and having almost nothing run in ALU0 other than muls. There
>> are very few degrees of freedom.
>>
>> Bill.
>
> This is very, very cool and I am happy that this is discussed in
> public. Any chance to see some performance numbers before and after
> the checkin?
>
> <SNIP>
>
> Cheers,
>
> Michael
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---