Actually it wouldn't be a completely trivial merge. There is no tuning
code in the package; we'd have to write that. Not that it should be
hard. There are only prebuilt tuning files for K7 and Pentium 4, or
something like that.
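
The tuning code could start as a simple crossover search. Here's a rough
sketch in C (mul_old and mul_new are placeholder names for the two FFT
entry points, not anything in the patch; real tuning would also have to
pick the FFT split parameter at each size):

  #include <stdlib.h>
  #include <time.h>
  #include <gmp.h>

  typedef void (*mul_fn)(mp_ptr, mp_srcptr, mp_srcptr, mp_size_t);

  static double cost(mul_fn f, mp_ptr rp, mp_srcptr up, mp_srcptr vp,
                     mp_size_t n)
  {
      clock_t t0 = clock();
      for (int r = 0; r < 8; r++)      /* a few reps to steady the timing */
          f(rp, up, vp, n);
      return (double) (clock() - t0);
  }

  /* first size in [lo, hi] where mul_new wins, or 0 if it never does */
  mp_size_t find_crossover(mul_fn mul_old, mul_fn mul_new,
                           mp_size_t lo, mp_size_t hi)
  {
      mp_ptr up = malloc(hi * sizeof(mp_limb_t));
      mp_ptr vp = malloc(hi * sizeof(mp_limb_t));
      mp_ptr rp = malloc(2 * hi * sizeof(mp_limb_t));
      mp_size_t cross = 0;
      mpn_random(up, hi);
      mpn_random(vp, hi);
      for (mp_size_t n = lo; n <= hi && !cross; n += n/8 + 1)
          if (cost(mul_new, rp, up, vp, n) < cost(mul_old, rp, up, vp, n))
              cross = n;
      free(up); free(vp); free(rp);
      return cross;
  }

The prebuilt tuning files are then essentially just tables of the
thresholds this kind of search produces, one set per CPU.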

Bill.

2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> Cool!
>
> Paul Zimmermann, Pierrick Gaudry and Alexander Kruppa (and possibly
> also Torbjorn Granlund) worked on a new FFT for GMP. It is comparable
> with the one in FLINT and uses Fermat numbers and Mersenne numbers, I
> think.
>
> Also I think Torbjorn Granlund and Alexander Kruppa worked on an FFT
> (perhaps a small-prime FFT?) which is fast for very large operands. I
> have never seen the code or performance figures so I don't know all
> that much about this. I might just have this mixed up with the
> Zimmermann one.
>
> There is an improved fft patch on the gmp website which appears to be
> written by Paul Zimmermann, but it contains the following lines:
>
> "TODO:
>
>   Implement some of the tricks published at ISSAC'2007 by Gaudry, Kruppa, and
>   Zimmermann."
>
> The patch has also been relicensed GPL v3+, which is odd considering
> that license wasn't around in 2007 and there is no indication of
> changes to the patch since then.
>
> There are plans to dramatically improve the FFT in eMPIRe. But we need
> to discuss the best strategy for that. Licensing is an issue.
>
> Looking at Paul Zimmermann's website, I see an fft-mul patch which
> claims to be up to 2 times faster than the fft distributed with GMP.
> This appears to have been written by TG, PZ, AK and PG.
>
> We could just plug that straight into eMPIRe. It is licensed LGPL
> v2.1+ and wouldn't need merging. I very much doubt Paul would change
> the license on that, but I've retrieved a version under LGPL v2.1+
> anyhow.
>
> I know the FLINT one is faster again for small operands and
> occasionally for larger operands, but there is so much work involved
> in merging it, and it would be GPL-only, that I think we should avoid
> spending our time on it for now. The original idea was for the
> Zimmermann et al FFT to end up in GMP and the FLINT one to end up in
> eMPIRe, but now I think we could save ourselves a lot of work by just
> using the Zimmermann et al one.
>
> Almost certainly we could improve the Zimmermann et al FFT using some
> of the tricks from FLINT.
>
> What does everyone think?
>
> Bill.
>
> 2008/11/26  <[EMAIL PROTECTED]>:
>>
>> On Wednesday 26 November 2008 18:12:15 Bill Hart wrote:
>>> 2008/11/26  <[EMAIL PROTECTED]>:
>>> > On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
>>> >> Brian and I have been having an interesting discussion off list about
>>> >> the preliminary GMP 4.3 figures posted here:
>>> >>
>>> >> http://gmplib.org/gmpbench.html
>>> >>
>>> >> Note that on a 2.6 GHz Opteron we would score about 11175, with about
>>> >> 60950 in the multiply bench. Unbalanced operands are not relevant
>>> >> here (the gmpbench multiply test only works with balanced operands).
>>> >>
>>> >> It will be interesting to see how much improvement we get from the new
>>> >> mul_1 and addmul_1.
>>> >>
>>> >> Any ideas about how we can further improve would be most welcome.
>>> >
>>> > from my gmpextra-1.0.1/changes
>>> > error mpn_newmul_n  got thresholds backwards
>>>
>>> Are you saying the new code caused an error? If so, that is not
>>> surprising. It only works for n = 0 mod 4 and needs the prologue and
>>> epilogue changed to make it work in general. I modified the timing
>>> code you sent me to time it for n = 0 mod 4.
>>>
>>> > can give 3-20% on fft-sizes
>>>
>>> Or are you saying that code of yours will improve fft multiplication?
>>>
>>
>> Yes, and no.
>> I calculate x*y mod 2^k - 1, whereas the GMP FFT uses x*y mod 2^k + 1,
>> with k a high power of two, and my mod-1 is faster than GMP's mod+1 for a
>> reasonable range of sizes. The "error" was in the thresholds I calculated:
>> I entered the decision logic backwards.
>> At the moment it's faster from about 8k limbs to 100k limbs, although it's
>> very uneven (and slower in places).
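>>
>> The wraparound is the only difference in the reduction step. A minimal
>> sketch of the mod-1 case (mine, for illustration, not the actual gmpextra
>> code), reducing a 2k-limb product t modulo 2^(k*GMP_NUMB_BITS) - 1:
>>
>>   #include <gmp.h>
>>
>>   void reduce_mod_2km1(mp_ptr r, mp_srcptr t, mp_size_t k)
>>   {
>>       mp_limb_t cy = mpn_add_n(r, t, t + k, k); /* low half + high half */
>>       while (cy)                                /* fold the carry back in */
>>           cy = mpn_add_1(r, r, k, cy);
>>   }
>>
>> Modulo 2^(k*GMP_NUMB_BITS) + 1 the high half is subtracted instead (an
>> mpn_sub_n plus a borrow fix-up), which is the negacyclic case the GMP
>> FFT uses.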
>>
>>
>>> In another one of my projects (FLINT) there exists FFT integer
>>> multiplication code written by David Harvey and myself which will give
>>> up to a factor of 2 improvement on FFT sizes. But it needs to be
>>> rewritten to merge with eMPIRe. Also it is GPL, not LGPL, and I can't
>>> (and probably don't want to) do anything about that. For now we don't
>>> have a GPL version of eMPIRe, and such code would probably not be
>>> helpful at this time.
>>>
>>> I've also been working with Gonzalo Tornaria on code which will
>>> multiply absolutely HUGE integers (well beyond what the current FFTs
>>> will do). But that won't get completed for another couple of months
>>> and will again need merging, which will be a non-trivial job. Again it
>>> relies on GPL'd code.
>>>
>>> Bill.
>>>
>>> >> Hmm, I wonder if they put new fft code in. That would certainly do the
>>> >> trick!! I think I see how they could have gotten this much improvement
>>> >> without improving the fft, but by my calculations it would be tight.
>>> >>
>>> >> Bill.
>>> >>
>>> >> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>> >> > I should also add the following.
>>> >> >
>>> >> > In the case of addmul_1, I don't think the out-of-order (OoO)
>>> >> > hardware is relevant. The Opteron has three sets of 8 reservation
>>> >> > stations and essentially it just executes what it can. That's
>>> >> > pretty simple OoO logic.
>>> >> >
>>> >> > It can happen that an entire set of 8 reservation stations becomes
>>> >> > full with dependent instructions. But that is not an issue with
>>> >> > mul_1 or addmul_1. There are only 30 macro-ops in the whole loop, so
>>> >> > nearly the whole loop is in reservation stations at any one time.
>>> >> >
>>> >> > Instead the issue is that all the muls need to be executed by ALU0
>>> >> > (the 64-bit multiply hardware is attached only to ALU0). The problem
>>> >> > then is that too much might be put in the 8 reservation stations for
>>> >> > ALU0. The hardware which chooses which of the three "pipelines", or
>>> >> > sets of 8 reservation stations, a macro-op goes into is called the
>>> >> > pick hardware. I have thus far been unable to find a description of
>>> >> > how it chooses which pipe to put macro-ops into.
>>> >> >
>>> >> > One big drawback of the K8 is that once a macro-op is in a pipe,
>>> >> > other pipes cannot steal it, even if they are doing nothing and
>>> >> > there are independent instructions queueing in another pipe.
>>> >> >
>>> >> > So the only relevant piece of hardware here is the pick hardware.
>>> >> >
>>> >> > As I say, ptlsim would give us a definitive answer, if it could be
>>> >> > made to work.
>>> >> >
>>> >> > Bill.
>>> >> >
>>> >> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
>>> >> >> Ah, this probably won't make that much difference to overall
>>> >> >> performance. Here is why:
>>> >> >>
>>> >> >> In rearranging the instructions in this way we have had to mix up
>>> >> >> the instructions of adjacent iterations in the unrolled loop. That
>>> >> >> means one can't just jump into the loop at the required spot as
>>> >> >> before; the wind-up and wind-down code needs to be made more
>>> >> >> complex. This is fine, but it possibly adds a few cycles for small
>>> >> >> sizes.
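>>> >> >>
>>> >> >> For contrast, in a loop that is merely 4x unrolled you can handle
>>> >> >> the n mod 4 leftovers by falling into the body partway, Duff-style.
>>> >> >> A sketch in C (illustrative only, not our actual asm, with a plain
>>> >> >> multiply standing in for the real operation):
>>> >> >>
>>> >> >>   void scale(unsigned long *rp, const unsigned long *up, long n,
>>> >> >>              unsigned long v)
>>> >> >>   {
>>> >> >>       long i = 0;
>>> >> >>       switch (n & 3) {      /* enter the unrolled body partway */
>>> >> >>       case 3: rp[i] = up[i] * v; i++;
>>> >> >>       case 2: rp[i] = up[i] * v; i++;
>>> >> >>       case 1: rp[i] = up[i] * v; i++;
>>> >> >>       }
>>> >> >>       while (i < n) {       /* main 4x-unrolled loop */
>>> >> >>           rp[i]   = up[i]   * v;
>>> >> >>           rp[i+1] = up[i+1] * v;
>>> >> >>           rp[i+2] = up[i+2] * v;
>>> >> >>           rp[i+3] = up[i+3] * v;
>>> >> >>           i += 4;
>>> >> >>       }
>>> >> >>   }
>>> >> >>
>>> >> >> Once the iterations are software-pipelined, instructions from
>>> >> >> neighbouring iterations interleave, so there is no single spot to
>>> >> >> jump to for a given residue; hence the extra wind-up/wind-down.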
>>> >> >>
>>> >> >> Large mul_1 and addmul_1 calls never arise from GMP's mul_n anyway.
>>> >> >> Recall that mul_basecase switches over to Karatsuba after about 30
>>> >> >> limbs on the Opteron.
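>>> >> >>
>>> >> >> In GMP 4.x terms that dispatch is just a threshold test, roughly
>>> >> >> like this sketch (the internal names are the ones from gmp-impl.h,
>>> >> >> and 30 is the Opteron figure above, not something I've re-measured):
>>> >> >>
>>> >> >>   #define MUL_KARATSUBA_THRESHOLD 30           /* limbs, Opteron */
>>> >> >>
>>> >> >>   void mul_n(mp_ptr rp, mp_srcptr up, mp_srcptr vp, mp_size_t n,
>>> >> >>              mp_ptr scratch)
>>> >> >>   {
>>> >> >>       if (n < MUL_KARATSUBA_THRESHOLD)
>>> >> >>           mpn_mul_basecase(rp, up, n, vp, n);  /* schoolbook, calls
>>> >> >>                                                   addmul_1 per row */
>>> >> >>       else
>>> >> >>           mpn_kara_mul_n(rp, up, vp, n, scratch); /* Karatsuba */
>>> >> >>   }
>>> >> >>
>>> >> >> so addmul_1 only ever sees small n on that path.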
>>> >> >>
>>> >> >> But it also probably takes a number of iterations of the loop before
>>> >> >> the hardware settles into a pattern. The data cache needs to warm up,
>>> >> >> the branch predictors need to train, the instruction cache needs to
>>> >> >> fill, and the actual picking of instructions in the correct order
>>> >> >> does not necessarily happen on the first iteration of the loop.
>>> >> >>
>>> >> >> I might be overstating the case a little. Perhaps by about 8 limbs
>>> >> >> you win, I don't know.
>>> >> >>
>>> >> >> Anyhow, I believe jason (not Martin) is working on getting fully
>>> >> >> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
>>> >> >> has actually done all the really hard work here with the initial
>>> >> >> scheduling to get down to 2.75 c/l, I'll let him post any performance
>>> >> >> figures once he is done with the code. He deserves the credit!
>>> >> >>
>>> >> >> Bill.
>>> >> >>
>>> >> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
>>> >> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
>>> >> >>>> Some other things I forgot to mention:
>>> >> >>>>
>>> >> >>>> 1) It probably wouldn't have been possible for me to get 2.5c/l
>>> >> >>>> without jason's code, in both the mul_1 and addmul_1 cases.
>>> >> >>>>
>>> >> >>> :)
>>> >> >>>>
>>> >> >>>> 2) You can often insert nops next to lone instructions, or pairs,
>>> >> >>>> that don't otherwise form groups of 3 macro-ops, without slowing
>>> >> >>>> the loop, which is further evidence that the above analysis is
>>> >> >>>> correct.
>>> >> >>>>
>>> >> >>>> 3) The addmul_1 code I get is very close to the code obtained by
>>> >> >>>> someone else through independent means, so I won't post it here.
>>> >> >>>> Once the above tricks have been validated on other code, I'll
>>> >> >>>> commit the addmul_1 code I have to the repo. Or perhaps someone
>>> >> >>>> else will rediscover it from what I have written above.
>>> >> >>>>
>>> >> >>>> In fact I was only able to find about 16 different versions of
>>> >> >>>> addmul_1 that run in 2.5 c/l, all of which look very much like the
>>> >> >>>> solution obtained independently. The order and location of most
>>> >> >>>> instructions is fixed by the dual requirements of having triplets
>>> >> >>>> of macro-ops and having almost nothing run in ALU0 other than muls.
>>> >> >>>> There are very few degrees of freedom.
>>> >> >>>>
>>> >> >>>> Bill.
>>> >> >>>
>>> >> >>> This is very, very cool and I am happy that this is discussed in
>>> >> >>> public. Any chance to see some performance numbers before and after
>>> >> >>> the checkin?
>>> >> >>>
>>> >> >>> <SNIP>
>>> >> >>>
>>> >> >>> Cheers,
>>> >> >>>
>>> >> >>> Michael
