[mpir-devel] Re: mpn_mul_1 on K8

jason Wed, 26 Nov 2008 10:43:10 -0800

On Wednesday 26 November 2008 18:12:15 Bill Hart wrote:
> 2008/11/26  <[EMAIL PROTECTED]>:
> > On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
> >> Brian and I have been having an interesting discussion off list about
> >> the preliminary GMP 4.3 figures posted here:
> >>
> >> http://gmplib.org/gmpbench.html
> >>
> >> Note that on a 2.6 GHz Opteron we would score about 11175 with about
> >> 60950 in the multiply bench. Note unbalanced operands are not relevant
> >> here (gmpbench multiply test only works with balanced operands).
> >>
> >> It will be interesting to see how much improvement we get from the new
> >> mul_1 and addmul_1.
> >>
> >> Any ideas about how we can further improve would be most welcome.
> >
> > from my gmpextra-1.0.1/changes
> > error mpn_newmul_n  got thresholds backwards
>
> Are you saying the new code caused an error? If so, that is not
> surprising. It only works for n = 0 mod 4 and needs the prologue and
> epilogue changed to make it work in general. I modified the timing
> code you sent me to time it for n = 0 mod 4.
>
> > can give 3-20% on fft-sizes
>
> Or are you saying code of yours will improve fft multiplication.
>


Yes , and no 
I calculate x*y mod 2^k-1 , whereas GMP-fft uses x*y mod 2^k+1
with k a high power of two , and my mod-1 is faster than gmp-mod+1 for a 
reasonable range of sizes. The "error" was in the thresholds I calculated , I 
entered the decision logic backwards.
At the moment its faster at about 8k limbs to 100k limbs , although it's very 
uneven (and slower in places).


> In another one of my projects (FLINT) there exists FFT integer
> multiplication code written by David Harvey and myself which will give
> up to a factor of 2 improvement on FFT sizes. But it needs to be
> rewritten to merge with eMPIRe. Also it is GPL not LGPL and I can't
> (and probably don't want) to do anything about that. For now we don't
> have a GPL version of eMPIRe, and such code would probably not be
> helpful at this time.
>
> I've also been working with Gonzalo Tornaria on code which will
> multiply absolutely HUGE integers (well beyond what the current FFT's
> will do).  But that won't get completed for another couple of months
> and will again need merging, which will be a non-trivial job. Again it
> relies on GPL'd code.
>
> Bill.
>
> >> Hmm, I wonder if they put new fft code in. That would certainly do the
> >> trick!! I think I see how they could have gotten this much improvement
> >> without improving the fft, but by my calculations it would be tight.
> >>
> >> Bill.
> >>
> >> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> >> > I should also add the following.
> >> >
> >> > In the case of addmul_1, I don't think the oOo hardware is relevant.
> >> > The Opteron has three sets of 8 reservation stations and essentially
> >> > it just executes what it can. That's pretty simple oOo logic.
> >> >
> >> > It can happen that an entire 8 reservation stations become full with
> >> > dependent instructions. But that is not an issue with mul_1 or
> >> > addmul_1. There are only 30 macro-ops in the whole loop, so nearly the
> >> > whole loop is in reservation stations at any one time.
> >> >
> >> > Instead the issue is that all the muls need to be executed by ALU0
> >> > (there is only 64 bit multiply hardware attached to ALU0). The problem
> >> > then is that too much might be put in the 8 reservation stations for
> >> > ALU0. The hardware which chooses which of the three "pipelines" or
> >> > sets of 8 reservation stations that a macro-op goes into is called the
> >> > pick hardware. I have thus far been unable to find a description of
> >> > how it chooses which pipe to stick macro-ops into.
> >> >
> >> > One big drawback of the K8 is that once in a pipe, other pipes cannot
> >> > steal work from that pipe, even if they are doing nothing and there
> >> > are independent instructions to be executed queueing in another pipe.
> >> >
> >> > So the only relevant piece of hardware here is the pick hardware.
> >> >
> >> > As I say, ptlsim would give us a definitive answer, if it could be
> >> > make to work.
> >> >
> >> > Bill.
> >> >
> >> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> >> >> Ah, this probably won't make that much difference to pverall
> >> >> performance. Here is why:
> >> >>
> >> >> In rearranging the instructions in this way we have had to mix up the
> >> >> instructions in an unrolled loop. That means that one can't just jump
> >> >> into the loop at the required spot as before. The wind up and wind
> >> >> down code needs to be made more complex. This is fine, but it
> >> >> possibly adds a few cycles for small sizes.
> >> >>
> >> >> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
> >> >> that mul_basecase switches over to Karatsuba after about 30 limbs on
> >> >> the Opteron.
> >> >>
> >> >> But it also probably takes a number of iterations of the loop before
> >> >> the hardware settles into a pattern. The data cache hardware needs to
> >> >> prime, the branch prediction needs to prime, the instruction cache
> >> >> needs to prime and the actual picking of instructions in the correct
> >> >> order does not necessarily happen on the first iteration of the loop.
> >> >>
> >> >> I might be overstating the case a little. Perhaps by about 8 limbs
> >> >> you win, I don't know.
> >> >>
> >> >> Anyhow, I believe jason (not Martin) is working on getting fully
> >> >> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
> >> >> has actually done all the really hard work here with the initial
> >> >> scheduling to get down to 2.75 c/l, I'll let him post any performance
> >> >> figures once he is done with the code. He deserves the credit!
> >> >>
> >> >> Bill.
> >> >>
> >> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
> >> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
> >> >>>> Some other things I forgot to mention:
> >> >>>>
> >> >>>> 1) It probably wouldn't have been possible for me to get 2.5c/l
> >> >>>> without jason's code, in both the mul_1 and addmul_1 cases.
> >> >>>>
> >> >>> :)
> >> >>>>
> >> >>>> 2) You can often insert nops with lone or pair instructions which
> >> >>>> are not 3 macro ops together, further proving that the above
> >> >>>> analysis is correct.
> >> >>>>
> >> >>>> 3) The addmul_1 code I get is very close to the code obtained by
> >> >>>> someone else through independent means, so I won't post it here.
> >> >>>> Once the above tricks have been validated on other code, I'll
> >> >>>> commit the addmul_1 code I have to the repo. Or perhaps someone
> >> >>>> else will rediscover it from what I have written above.
> >> >>>>
> >> >>>> In fact I was only able to find about 16 different versions of
> >> >>>> addmul_1 that run in 2.5c/l all of which look very much like the
> >> >>>> solution obtained independently. The order and location of most
> >> >>>> instructions is fixed by the dual requirements of having triplets
> >> >>>> of macro-ops and having almost nothing run in ALU0 other than muls.
> >> >>>> There are very few degrees of freedom.
> >> >>>>
> >> >>>> Bill.
> >> >>>
> >> >>> This is very, very cool and I am happy that this is discussed in
> >> >>> public. Any chance to see some performance numbers before and after
> >> >>> the checkin?
> >> >>>
> >> >>> <SNIP>
> >> >>>
> >> >>> Cheers,
> >> >>>
> >> >>> Michael
>
> 


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---

[mpir-devel] Re: mpn_mul_1 on K8

Reply via email to