It's not surprising the FFT doesn't improve things much. It is only used in about 1/5 of the benchmarks, so it can only make about a 2^(1/5) ≈ 15% difference at best (assuming we happened to hit a point where the FFT was twice as fast).
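As a quick sanity check on that bound: if the overall gmpbench figure behaves like a plain geometric mean of five equally weighted benchmark scores (a simplification of how gmpbench actually weights things), then doubling just the multiply score scales the overall result by 2^(1/5):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* Geometric mean of 5 equally weighted scores: doubling one of them
           scales the overall result by 2^(1/5). */
        double factor = pow(2.0, 1.0 / 5.0);
        printf("2^(1/5) = %.4f, i.e. about %.0f%% at best\n",
               factor, 100.0 * (factor - 1.0));
        return 0;
    }

which prints 1.1487, i.e. roughly 15%.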
Also, looking at the rsaapp score, it is clear that they have improved things dramatically for small operands.

Bill.

2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> The Zimmermann et al FFT takes us to about 65881 in the multiply bench score and 11419 overall.
>
> That is totally untuned, however. With correct tuning I am sure it makes a bigger difference.
>
> Bill.
>
> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>> Actually it wouldn't be a completely trivial merge. There is no tuning code in the package; we'd have to write that. Not that this should be hard. There are only prebuilt tuning files for K7 and Pentium 4 or something like that.
>>
>> Bill.
>>
>> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>> Cool!
>>>
>>> Paul Zimmermann, Pierrick Gaudry and Alexander Kruppa (and possibly also Torbjorn Granlund) worked on a new FFT for GMP. It is comparable with the one in FLINT and uses Fermat numbers and Mersenne numbers, I think.
>>>
>>> Also, I think Torbjorn Granlund and Alexander Kruppa worked on an FFT (perhaps a small prime FFT?) which is fast for very large operands. I have never seen the code or performance figures, so I don't know all that much about it. I might just have this mixed up with the Zimmermann one.
>>>
>>> There is an improved fft patch on the gmp website which appears to be written by Paul Zimmermann, but it contains the following lines:
>>>
>>> "TODO:
>>>
>>> Implement some of the tricks published at ISSAC'2007 by Gaudry, Kruppa, and Zimmermann."
>>>
>>> The patch has also been relicensed GPL v3+, which is odd considering that licence wasn't around in 2007 and there is no indication of changes to the patch since then.
>>>
>>> There are plans to dramatically improve the FFT in eMPIRe, but we need to discuss the best strategy for that. Licensing is an issue.
>>>
>>> Looking at Paul Zimmermann's website, there is an fft-mul patch which claims to be up to 2 times faster than the fft distributed with GMP. This appears to have been written by TG, PZ, AK and PG.
>>>
>>> We could just plug that straight into eMPIRe. It is licensed LGPL v2.1+ and wouldn't need merging. I very much doubt Paul would change the license on that, but I've retrieved a version under LGPL v2.1+ anyhow.
>>>
>>> I know the FLINT one is faster again for small operands and occasionally for larger operands, but there is so much work involved in merging it, and it would be GPL only, that I think we should avoid spending our time on it for now. The original idea was for the Zimmermann et al FFT to end up in GMP and the FLINT one to end up in eMPIRe, but now I think we could save ourselves a lot of work by just using the Zimmermann et al one.
>>>
>>> Almost certainly we could improve the Zimmermann et al FFT using some of the tricks from FLINT.
>>>
>>> What does everyone think?
>>>
>>> Bill.
>>>
>>> 2008/11/26 <[EMAIL PROTECTED]>:
>>>>
>>>> On Wednesday 26 November 2008 18:12:15 Bill Hart wrote:
>>>>> 2008/11/26 <[EMAIL PROTECTED]>:
>>>>> > On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
>>>>> >> Brian and I have been having an interesting discussion off list about the preliminary GMP 4.3 figures posted here:
>>>>> >>
>>>>> >> http://gmplib.org/gmpbench.html
>>>>> >>
>>>>> >> Note that on a 2.6 GHz Opteron we would score about 11175 overall, with about 60950 in the multiply bench. Note that unbalanced operands are not relevant here (the gmpbench multiply test only works with balanced operands).
>>>>> >>
>>>>> >> It will be interesting to see how much improvement we get from the new mul_1 and addmul_1.
>>>>> >>
>>>>> >> Any ideas about how we can further improve would be most welcome.
>>>>> >
>>>>> > from my gmpextra-1.0.1/changes
>>>>> > error mpn_newmul_n got thresholds backwards
>>>>>
>>>>> Are you saying the new code caused an error? If so, that is not surprising. It only works for n = 0 mod 4 and needs the prologue and epilogue changed to make it work in general. I modified the timing code you sent me to time it for n = 0 mod 4.
>>>>>
>>>>> > can give 3-20% on fft-sizes
>>>>>
>>>>> Or are you saying code of yours will improve fft multiplication?
>>>>
>>>> Yes and no.
>>>> I calculate x*y mod 2^k-1, whereas the GMP fft uses x*y mod 2^k+1, with k a high power of two, and my mod 2^k-1 is faster than GMP's mod 2^k+1 for a reasonable range of sizes. The "error" was in the thresholds I calculated: I entered the decision logic backwards.
>>>> At the moment it's faster from about 8k limbs to 100k limbs, although it's very uneven (and slower in places).
>>>>
>>>>> In another one of my projects (FLINT) there exists FFT integer multiplication code written by David Harvey and myself which will give up to a factor of 2 improvement on FFT sizes. But it needs to be rewritten to merge with eMPIRe. Also, it is GPL not LGPL, and I can't (and probably don't want to) do anything about that. For now we don't have a GPL version of eMPIRe, and such code would probably not be helpful at this time.
>>>>>
>>>>> I've also been working with Gonzalo Tornaria on code which will multiply absolutely HUGE integers (well beyond what the current FFTs will do). But that won't get completed for another couple of months and will again need merging, which will be a non-trivial job. Again, it relies on GPL'd code.
>>>>>
>>>>> Bill.
>>>>>
>>>>> >> Hmm, I wonder if they put new fft code in. That would certainly do the trick!! I think I see how they could have gotten this much improvement without improving the fft, but by my calculations it would be tight.
>>>>> >>
>>>>> >> Bill.
>>>>> >>
>>>>> >> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>>>> >> > I should also add the following.
>>>>> >> >
>>>>> >> > In the case of addmul_1, I don't think the out-of-order (OoO) hardware is relevant. The Opteron has three sets of 8 reservation stations and essentially it just executes what it can. That's pretty simple OoO logic.
>>>>> >> >
>>>>> >> > It can happen that an entire set of 8 reservation stations becomes full with dependent instructions. But that is not an issue with mul_1 or addmul_1. There are only 30 macro-ops in the whole loop, so nearly the whole loop is in reservation stations at any one time.
>>>>> >> >
>>>>> >> > Instead the issue is that all the muls need to be executed by ALU0 (the 64-bit multiply hardware is only attached to ALU0). The problem then is that too much might be put into the 8 reservation stations for ALU0. The hardware which chooses which of the three "pipelines", or sets of 8 reservation stations, a macro-op goes into is called the pick hardware. I have thus far been unable to find a description of how it chooses which pipe to stick macro-ops into.
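(An aside on jason's mod 2^k-1 point above: the attraction of working modulo B^n - 1, with B the limb base 2^GMP_NUMB_BITS, is that reducing a 2n-limb product is just a wrap-around addition, since B^n ≡ 1 (mod B^n - 1). A rough sketch in terms of the public mpn calls — the helper name here is made up and this is not jason's actual code, just the idea:

    #include <gmp.h>

    /* Reduce a 2n-limb product {tp, 2n} modulo B^n - 1, B = 2^GMP_NUMB_BITS.
       Leaves a semi-reduced residue in {rp, n} (it may still equal B^n - 1). */
    static void
    fold_mod_Bn_minus_1(mp_limb_t *rp, const mp_limb_t *tp, mp_size_t n)
    {
        mp_limb_t cy = mpn_add_n(rp, tp, tp + n, n);  /* low half + high half */
        while (cy != 0)
            cy = mpn_add_1(rp, rp, n, cy);            /* B^n == 1, so fold the carry back in */
    }

The mod 2^k+1 case wraps around with a subtraction and a borrow fix-up instead, so the raw reduction cost is similar in both cases.)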
>>>>> >> >
>>>>> >> > One big drawback of the K8 is that once a macro-op is in a pipe, other pipes cannot steal work from that pipe, even if they are doing nothing and there are independent instructions queueing in another pipe.
>>>>> >> >
>>>>> >> > So the only relevant piece of hardware here is the pick hardware.
>>>>> >> >
>>>>> >> > As I say, ptlsim would give us a definitive answer, if it could be made to work.
>>>>> >> >
>>>>> >> > Bill.
>>>>> >> >
>>>>> >> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>>>> >> >> Ah, this probably won't make that much difference to overall performance. Here is why:
>>>>> >> >>
>>>>> >> >> In rearranging the instructions in this way we have had to mix up the instructions in an unrolled loop. That means that one can't just jump into the loop at the required spot as before. The wind-up and wind-down code needs to be made more complex. This is fine, but it possibly adds a few cycles for small sizes.
>>>>> >> >>
>>>>> >> >> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall that mul_basecase switches over to Karatsuba after about 30 limbs on the Opteron.
>>>>> >> >>
>>>>> >> >> It also probably takes a number of iterations of the loop before the hardware settles into a pattern. The data cache needs to prime, the branch prediction needs to prime, the instruction cache needs to prime, and the actual picking of instructions in the correct order does not necessarily happen on the first iteration of the loop.
>>>>> >> >>
>>>>> >> >> I might be overstating the case a little. Perhaps by about 8 limbs you win, I don't know.
>>>>> >> >>
>>>>> >> >> Anyhow, I believe jason (not Martin) is working on getting fully working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he has actually done all the really hard work here with the initial scheduling to get down to 2.75 c/l, I'll let him post any performance figures once he is done with the code. He deserves the credit!
>>>>> >> >>
>>>>> >> >> Bill.
>>>>> >> >>
>>>>> >> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>>>>> >> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>>>>> >> >>>> Some other things I forgot to mention:
>>>>> >> >>>>
>>>>> >> >>>> 1) It probably wouldn't have been possible for me to get 2.5 c/l without jason's code, in both the mul_1 and addmul_1 cases.
>>>>> >> >>>
>>>>> >> >>> :)
>>>>> >> >>>>
>>>>> >> >>>> 2) You can often insert nops alongside lone or paired instructions which do not form 3 macro-ops together, further proving that the above analysis is correct.
>>>>> >> >>>>
>>>>> >> >>>> 3) The addmul_1 code I get is very close to the code obtained by someone else through independent means, so I won't post it here. Once the above tricks have been validated on other code, I'll commit the addmul_1 code I have to the repo. Or perhaps someone else will rediscover it from what I have written above.
>>>>> >> >>>>
>>>>> >> >>>> In fact I was only able to find about 16 different versions of addmul_1 that run in 2.5 c/l, all of which look very much like the solution obtained independently.
>>>>> >> >>>> The order and location of most instructions are fixed by the dual requirements of having triplets of macro-ops and having almost nothing run in ALU0 other than muls. There are very few degrees of freedom.
>>>>> >> >>>>
>>>>> >> >>>> Bill.
>>>>> >> >>>
>>>>> >> >>> This is very, very cool and I am happy that this is discussed in public. Any chance to see some performance numbers before and after the checkin?
>>>>> >> >>>
>>>>> >> >>> <SNIP>
>>>>> >> >>>
>>>>> >> >>> Cheers,
>>>>> >> >>>
>>>>> >> >>> Michael
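To make the mul_basecase remarks above concrete: schoolbook multiplication is essentially one mpn_mul_1 row followed by mpn_addmul_1 rows, which is why the addmul_1 loop throughput (the 2.5-2.75 c/l figures discussed above) matters so much, and why these routines never see large n once Karatsuba takes over. A rough sketch, not the actual mpn_mul_basecase source:

    #include <gmp.h>

    /* Schoolbook product {rp, un+vn} = {up, un} * {vp, vn}, with un >= vn >= 1.
       One mpn_mul_1 for the first row, then vn-1 calls to mpn_addmul_1, so the
       addmul_1 inner loop accounts for nearly all of the work.  Above the
       Karatsuba threshold (about 30 limbs on the Opteron) the operands are
       split instead, so these calls only ever see small n. */
    static void
    basecase_mul_sketch(mp_limb_t *rp, const mp_limb_t *up, mp_size_t un,
                        const mp_limb_t *vp, mp_size_t vn)
    {
        mp_size_t i;

        rp[un] = mpn_mul_1(rp, up, un, vp[0]);
        for (i = 1; i < vn; i++)
            rp[un + i] = mpn_addmul_1(rp + i, up, un, vp[i]);
    }

The function name is made up; it is only here to show where mul_1 and addmul_1 sit in the call chain.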