It's not surprising the FFT doesn't improve things much. It is only used in about 1/5 of the benchmarks, so it can only make about a 2^(1/5) ≈ 15% difference at best (assuming we happened to hit a point where the FFT was twice as fast).
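As a quick sanity check on that bound: if the overall gmpbench figure behaves like a plain geometric mean of five equally weighted benchmark scores (a simplification of how gmpbench actually weights things), then doubling just the multiply score scales the overall result by 2^(1/5):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Geometric mean of 5 equally weighted scores: doubling one of them
           scales the overall result by 2^(1/5). */
        double factor = pow(2.0, 1.0 / 5.0);
        printf("2^(1/5) = %.4f, i.e. about %.0f%% at best\n",
               factor, 100.0 * (factor - 1.0));
        return 0;
    }

which prints 1.1487, i.e. roughly 15%.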
Also, looking at the rsaapp score, it is clear that they have improved things dramatically for small operands.

Bill.

2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> The Zimmermann et al FFT takes us to about 65881 in the multiply bench score and 11419 overall.
>
> That is totally untuned, however. With correct tuning I am sure it makes a bigger difference.
>
> Bill.
>
> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>> Actually it wouldn't be a completely trivial merge. There is no tuning code in the package; we'd have to write that. Not that this should be hard. There are only prebuilt tuning files for K7 and Pentium 4 or something like that.
>>
>> Bill.
>>
>> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>> Cool!
>>>
>>> Paul Zimmermann, Pierrick Gaudry and Alexander Kruppa (and possibly also Torbjorn Granlund) worked on a new FFT for GMP. It is comparable with the one in FLINT and uses Fermat numbers and Mersenne numbers, I think.
>>>
>>> Also, I think Torbjorn Granlund and Alexander Kruppa worked on an FFT (perhaps a small prime FFT?) which is fast for very large operands. I have never seen the code or performance figures, so I don't know all that much about it. I might just have this mixed up with the Zimmermann one.
>>>
>>> There is an improved fft patch on the gmp website which appears to be written by Paul Zimmermann, but it contains the following lines:
>>>
>>> "TODO:
>>>
>>> Implement some of the tricks published at ISSAC'2007 by Gaudry, Kruppa, and Zimmermann."
>>>
>>> The patch has also been relicensed GPL v3+, which is odd considering that licence wasn't around in 2007 and there is no indication of changes to the patch since then.
>>>
>>> There are plans to dramatically improve the FFT in eMPIRe, but we need to discuss the best strategy for that. Licensing is an issue.
>>>
>>> Looking at Paul Zimmermann's website, there is an fft-mul patch which claims to be up to 2 times faster than the fft distributed with GMP. This appears to have been written by TG, PZ, AK and PG.
>>>
>>> We could just plug that straight into eMPIRe. It is licensed LGPL v2.1+ and wouldn't need merging. I very much doubt Paul would change the license on that, but I've retrieved a version under LGPL v2.1+ anyhow.
>>>
>>> I know the FLINT one is faster again for small operands and occasionally for larger operands, but there is so much work involved in merging it, and it would be GPL only, that I think we should avoid spending our time on it for now. The original idea was for the Zimmermann et al FFT to end up in GMP and the FLINT one to end up in eMPIRe, but now I think we could save ourselves a lot of work by just using the Zimmermann et al one.
>>>
>>> Almost certainly we could improve the Zimmermann et al FFT using some of the tricks from FLINT.
>>>
>>> What does everyone think?
>>>
>>> Bill.
>>>
>>> 2008/11/26 <[EMAIL PROTECTED]>:
>>>>
>>>> On Wednesday 26 November 2008 18:12:15 Bill Hart wrote:
>>>>> 2008/11/26 <[EMAIL PROTECTED]>:
>>>>> > On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
>>>>> >> Brian and I have been having an interesting discussion off list about the preliminary GMP 4.3 figures posted here:
>>>>> >>
>>>>> >> http://gmplib.org/gmpbench.html
>>>>> >>
>>>>> >> Note that on a 2.6 GHz Opteron we would score about 11175 overall, with about 60950 in the multiply bench. Note that unbalanced operands are not relevant here (the gmpbench multiply test only works with balanced operands).
>>>>> >>
>>>>> >> It will be interesting to see how much improvement we get from the new mul_1 and addmul_1.
>>>>> >>
>>>>> >> Any ideas about how we can further improve would be most welcome.
>>>>> >
>>>>> > from my gmpextra-1.0.1/changes
>>>>> > error mpn_newmul_n got thresholds backwards
>>>>>
>>>>> Are you saying the new code caused an error? If so, that is not surprising. It only works for n = 0 mod 4 and needs the prologue and epilogue changed to make it work in general. I modified the timing code you sent me to time it for n = 0 mod 4.
>>>>>
>>>>> > can give 3-20% on fft-sizes
>>>>>
>>>>> Or are you saying code of yours will improve fft multiplication?
>>>>
>>>> Yes and no.
>>>> I calculate x*y mod 2^k-1, whereas the GMP fft uses x*y mod 2^k+1, with k a high power of two, and my mod 2^k-1 is faster than GMP's mod 2^k+1 for a reasonable range of sizes. The "error" was in the thresholds I calculated: I entered the decision logic backwards.
>>>> At the moment it's faster from about 8k limbs to 100k limbs, although it's very uneven (and slower in places).
>>>>
>>>>> In another one of my projects (FLINT) there exists FFT integer multiplication code written by David Harvey and myself which will give up to a factor of 2 improvement on FFT sizes. But it needs to be rewritten to merge with eMPIRe. Also, it is GPL not LGPL, and I can't (and probably don't want to) do anything about that. For now we don't have a GPL version of eMPIRe, and such code would probably not be helpful at this time.
>>>>>
>>>>> I've also been working with Gonzalo Tornaria on code which will multiply absolutely HUGE integers (well beyond what the current FFTs will do). But that won't get completed for another couple of months and will again need merging, which will be a non-trivial job. Again, it relies on GPL'd code.
>>>>>
>>>>> Bill.
>>>>>
>>>>> >> Hmm, I wonder if they put new fft code in. That would certainly do the trick!! I think I see how they could have gotten this much improvement without improving the fft, but by my calculations it would be tight.
>>>>> >>
>>>>> >> Bill.
>>>>> >>
>>>>> >> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>>>> >> > I should also add the following.
>>>>> >> >
>>>>> >> > In the case of addmul_1, I don't think the out-of-order (OoO) hardware is relevant. The Opteron has three sets of 8 reservation stations and essentially it just executes what it can. That's pretty simple OoO logic.
>>>>> >> >
>>>>> >> > It can happen that an entire set of 8 reservation stations becomes full with dependent instructions. But that is not an issue with mul_1 or addmul_1. There are only 30 macro-ops in the whole loop, so nearly the whole loop is in reservation stations at any one time.
>>>>> >> >
>>>>> >> > Instead the issue is that all the muls need to be executed by ALU0 (the 64-bit multiply hardware is only attached to ALU0). The problem then is that too much might be put into the 8 reservation stations for ALU0. The hardware which chooses which of the three "pipelines", or sets of 8 reservation stations, a macro-op goes into is called the pick hardware. I have thus far been unable to find a description of how it chooses which pipe to stick macro-ops into.
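(An aside on jason's mod 2^k-1 point above: the attraction of working modulo B^n - 1, with B the limb base 2^GMP_NUMB_BITS, is that reducing a 2n-limb product is just a wrap-around addition, since B^n ≡ 1 (mod B^n - 1). A rough sketch in terms of the public mpn calls — the helper name here is made up and this is not jason's actual code, just the idea:

    #include <gmp.h>

    /* Reduce a 2n-limb product {tp, 2n} modulo B^n - 1, B = 2^GMP_NUMB_BITS.
       Leaves a semi-reduced residue in {rp, n} (it may still equal B^n - 1). */
    static void
    fold_mod_Bn_minus_1(mp_limb_t *rp, const mp_limb_t *tp, mp_size_t n)
    {
        mp_limb_t cy = mpn_add_n(rp, tp, tp + n, n);  /* low half + high half */
        while (cy != 0)
            cy = mpn_add_1(rp, rp, n, cy);            /* B^n == 1, so fold the carry back in */
    }

The mod 2^k+1 case wraps around with a subtraction and a borrow fix-up instead, so the raw reduction cost is similar in both cases.)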
>>>>> >> >
>>>>> >> > One big drawback of the K8 is that once a macro-op is in a pipe, other pipes cannot steal work from that pipe, even if they are doing nothing and there are independent instructions queueing in another pipe.
>>>>> >> >
>>>>> >> > So the only relevant piece of hardware here is the pick hardware.
>>>>> >> >
>>>>> >> > As I say, ptlsim would give us a definitive answer, if it could be made to work.
>>>>> >> >
>>>>> >> > Bill.
>>>>> >> >
>>>>> >> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>>>> >> >> Ah, this probably won't make that much difference to overall performance. Here is why:
>>>>> >> >>
>>>>> >> >> In rearranging the instructions in this way we have had to mix up the instructions in an unrolled loop. That means that one can't just jump into the loop at the required spot as before. The wind-up and wind-down code needs to be made more complex. This is fine, but it possibly adds a few cycles for small sizes.
>>>>> >> >>
>>>>> >> >> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall that mul_basecase switches over to Karatsuba after about 30 limbs on the Opteron.
>>>>> >> >>
>>>>> >> >> It also probably takes a number of iterations of the loop before the hardware settles into a pattern. The data cache needs to prime, the branch prediction needs to prime, the instruction cache needs to prime, and the actual picking of instructions in the correct order does not necessarily happen on the first iteration of the loop.
>>>>> >> >>
>>>>> >> >> I might be overstating the case a little. Perhaps by about 8 limbs you win, I don't know.
>>>>> >> >>
>>>>> >> >> Anyhow, I believe jason (not Martin) is working on getting fully working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he has actually done all the really hard work here with the initial scheduling to get down to 2.75 c/l, I'll let him post any performance figures once he is done with the code. He deserves the credit!
>>>>> >> >>
>>>>> >> >> Bill.
>>>>> >> >>
>>>>> >> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>>>>> >> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>>>>> >> >>>> Some other things I forgot to mention:
>>>>> >> >>>>
>>>>> >> >>>> 1) It probably wouldn't have been possible for me to get 2.5 c/l without jason's code, in both the mul_1 and addmul_1 cases.
>>>>> >> >>>
>>>>> >> >>> :)
>>>>> >> >>>>
>>>>> >> >>>> 2) You can often insert nops alongside lone or paired instructions which do not form 3 macro-ops together, further proving that the above analysis is correct.
>>>>> >> >>>>
>>>>> >> >>>> 3) The addmul_1 code I get is very close to the code obtained by someone else through independent means, so I won't post it here. Once the above tricks have been validated on other code, I'll commit the addmul_1 code I have to the repo. Or perhaps someone else will rediscover it from what I have written above.
>>>>> >> >>>>
>>>>> >> >>>> In fact I was only able to find about 16 different versions of addmul_1 that run in 2.5 c/l, all of which look very much like the solution obtained independently.
>>>>> >> >>>> The order and location of most instructions are fixed by the dual requirements of having triplets of macro-ops and having almost nothing run in ALU0 other than muls. There are very few degrees of freedom.
>>>>> >> >>>>
>>>>> >> >>>> Bill.
>>>>> >> >>>
>>>>> >> >>> This is very, very cool and I am happy that this is discussed in public. Any chance to see some performance numbers before and after the checkin?
>>>>> >> >>>
>>>>> >> >>> <SNIP>
>>>>> >> >>>
>>>>> >> >>> Cheers,
>>>>> >> >>>
>>>>> >> >>> Michael
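To make the mul_basecase remarks above concrete: schoolbook multiplication is essentially one mpn_mul_1 row followed by mpn_addmul_1 rows, which is why the addmul_1 loop throughput (the 2.5-2.75 c/l figures discussed above) matters so much, and why these routines never see large n once Karatsuba takes over. A rough sketch, not the actual mpn_mul_basecase source:

    #include <gmp.h>

    /* Schoolbook product {rp, un+vn} = {up, un} * {vp, vn}, with un >= vn >= 1.
       One mpn_mul_1 for the first row, then vn-1 calls to mpn_addmul_1, so the
       addmul_1 inner loop accounts for nearly all of the work.  Above the
       Karatsuba threshold (about 30 limbs on the Opteron) the operands are
       split instead, so these calls only ever see small n. */
    static void
    basecase_mul_sketch(mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                        const mp_limb_t *vp, mp_size_t vn)
    {
        mp_size_t i;

        rp[un] = mpn_mul_1(rp, up, un, vp[0]);
        for (i = 1; i < vn; i++)
            rp[un + i] = mpn_addmul_1(rp + i, up, un, vp[i]);
    }

The function name is made up; it is only here to show where mul_1 and addmul_1 sit in the call chain.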