Brian and I have been having an interesting discussion off list about
the preliminary GMP 4.3 figures posted here:

http://gmplib.org/gmpbench.html

Note that on a 2.6 GHz Opteron we would score about 11175, with about
60950 in the multiply bench. Unbalanced operands are not relevant
here (the gmpbench multiply test only uses balanced operands).

It will be interesting to see how much improvement we get from the new
mul_1 and addmul_1.

Any ideas about how we can further improve would be most welcome.

Hmm, I wonder if they put new fft code in. That would certainly do the
trick!! I think I see how they could have gotten this much improvement
without improving the fft, but by my calculations it would be tight.

Bill.

2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> I should also add the following.
>
> In the case of addmul_1, I don't think the OoO (out-of-order) hardware
> is relevant. The Opteron has three sets of 8 reservation stations and
> essentially just executes whatever it can. That's pretty simple OoO logic.
>
> It can happen that an entire set of 8 reservation stations fills up
> with dependent instructions. But that is not an issue for mul_1 or
> addmul_1: there are only 30 macro-ops in the whole loop, so nearly the
> whole loop sits in reservation stations at any one time.
>
> Instead the issue is that all the muls need to be executed by ALU0
> (the 64-bit multiply hardware is attached only to ALU0). The problem
> then is that too many macro-ops might be put into the 8 reservation
> stations for ALU0. The hardware which chooses which of the three
> "pipelines" (sets of 8 reservation stations) a macro-op goes into is
> called the pick hardware. I have thus far been unable to find a
> description of how it chooses which pipe to put macro-ops into.
>
> One big drawback of the K8 is that once a macro-op is in a pipe, other
> pipes cannot steal work from it, even if they are idle while
> independent instructions queue in another pipe.
>
> So the only relevant piece of hardware here is the pick hardware.
>
> As I say, ptlsim would give us a definitive answer, if it could be made
> to work.
>
> Bill.
>
> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>> Ah, this probably won't make that much difference to overall
>> performance. Here is why:
>>
>> In rearranging the instructions in this way we have had to mix up the
>> instructions in an unrolled loop. That means that one can't just jump
>> into the loop at the required spot as before. The wind up and wind
>> down code needs to be made more complex. This is fine, but it possibly
>> adds a few cycles for small sizes.
>>
>> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
>> that mul_basecase switches over to Karatsuba after about 30 limbs on
>> the Opteron.
>>
>> But it also probably takes a number of iterations of the loop before
>> the hardware settles into a pattern: the data cache needs to prime,
>> the branch prediction needs to prime, the instruction cache needs to
>> prime, and the actual picking of instructions in the correct order
>> does not necessarily happen on the first iteration of the loop.
>>
>> I might be overstating the case a little. Perhaps by about 8 limbs you
>> win, I don't know.
>>
>> Anyhow, I believe jason (not Martin) is working on getting fully
>> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
>> has actually done all the really hard work here with the initial
>> scheduling to get down to 2.75 c/l, I'll let him post any performance
>> figures once he is done with the code. He deserves the credit!
>>
>> Bill.
>>
>> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>>>
>>>
>>>
>>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>>>> Some other things I forgot to mention:
>>>>
>>>> 1) It probably wouldn't have been possible for me to get 2.5 c/l
>>>> without jason's code, in both the mul_1 and addmul_1 cases.
>>>
>>> :)
>>>
>>>> 2) You can often insert nops alongside lone or paired instructions
>>>> that do not already form a triplet of 3 macro-ops, further
>>>> confirming that the above analysis is correct.
>>>>
>>>> 3) The addmul_1 code I get is very close to the code obtained by
>>>> someone else through independent means, so I won't post it here. Once
>>>> the above tricks have been validated on other code, I'll commit the
>>>> addmul_1 code I have to the repo. Or perhaps someone else will
>>>> rediscover it from what I have written above.
>>>>
>>>> In fact I was only able to find about 16 different versions of
>>>> addmul_1 that run in 2.5 c/l, all of which look very much like the
>>>> solution obtained independently. The order and location of most
>>>> instructions is fixed by the dual requirements of having triplets of
>>>> macro-ops and having almost nothing run in ALU0 other than muls. There
>>>> are very few degrees of freedom.
>>>>
>>>> Bill.
>>>
>>> This is very, very cool and I am happy that this is discussed in
>>> public. Any chance to see some performance numbers before and after
>>> the checkin?
>>>
>>> <SNIP>
>>>
>>> Cheers,
>>>
>>> Michael
>>>
>>
>

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---