On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
> Brian and I have been having an interesting discussion off list about
> the preliminary GMP 4.3 figures posted here:
>
> http://gmplib.org/gmpbench.html
>
> Note that on a 2.6 GHz Opteron we would score about 11175 with about
> 60950 in the multiply bench. Note unbalanced operands are not relevant
> here (gmpbench multiply test only works with balanced operands).
>
> It will be interesting to see how much improvement we get from the new
> mul_1 and addmul_1.
>
> Any ideas about how we can further improve would be most welcome.
>
from my gmpextra-1.0.1/changes
error mpn_newmul_n got thresholds backwards can give 3-20% on fft-sizes

> Hmm, I wonder if they put new fft code in. That would certainly do the
> trick!! I think I see how they could have gotten this much improvement
> without improving the fft, but by my calculations it would be tight.
>
> Bill.
>
> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> > I should also add the following.
> >
> > In the case of addmul_1, I don't think the oOo hardware is relevant.
> > The Opteron has three sets of 8 reservation stations and essentially
> > it just executes what it can. That's pretty simple oOo logic.
> >
> > It can happen that an entire 8 reservation stations become full with
> > dependent instructions. But that is not an issue with mul_1 or
> > addmul_1. There are only 30 macro-ops in the whole loop, so nearly the
> > whole loop is in reservation stations at any one time.
> >
> > Instead the issue is that all the muls need to be executed by ALU0
> > (there is only 64 bit multiply hardware attached to ALU0). The problem
> > then is that too much might be put in the 8 reservation stations for
> > ALU0. The hardware which chooses which of the three "pipelines" or
> > sets of 8 reservation stations that a macro-op goes into is called the
> > pick hardware. I have thus far been unable to find a description of
> > how it chooses which pipe to stick macro-ops into.
> >
> > One big drawback of the K8 is that once in a pipe, other pipes cannot
> > steal work from that pipe, even if they are doing nothing and there
> > are independent instructions to be executed queueing in another pipe.
> >
> > So the only relevant piece of hardware here is the pick hardware.
> >
> > As I say, ptlsim would give us a definitive answer, if it could be made
> > to work.
> >
> > Bill.
> >
> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> >> Ah, this probably won't make that much difference to overall
> >> performance.
> >> Here is why:
> >>
> >> In rearranging the instructions in this way we have had to mix up the
> >> instructions in an unrolled loop. That means that one can't just jump
> >> into the loop at the required spot as before. The wind up and wind
> >> down code needs to be made more complex. This is fine, but it possibly
> >> adds a few cycles for small sizes.
> >>
> >> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
> >> that mul_basecase switches over to Karatsuba after about 30 limbs on
> >> the Opteron.
> >>
> >> But it also probably takes a number of iterations of the loop before
> >> the hardware settles into a pattern. The data cache hardware needs to
> >> prime, the branch prediction needs to prime, the instruction cache
> >> needs to prime and the actual picking of instructions in the correct
> >> order does not necessarily happen on the first iteration of the loop.
> >>
> >> I might be overstating the case a little. Perhaps by about 8 limbs you
> >> win, I don't know.
> >>
> >> Anyhow, I believe jason (not Martin) is working on getting fully
> >> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
> >> has actually done all the really hard work here with the initial
> >> scheduling to get down to 2.75 c/l, I'll let him post any performance
> >> figures once he is done with the code. He deserves the credit!
> >>
> >> Bill.
> >>
> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
> >>>> Some other things I forgot to mention:
> >>>>
> >>>> 1) It probably wouldn't have been possible for me to get 2.5c/l
> >>>> without jason's code, in both the mul_1 and addmul_1 cases.
> >>>>
> >>> :)
> >>>>
> >>>> 2) You can often insert nops with lone or pair instructions which are
> >>>> not 3 macro ops together, further proving that the above analysis is
> >>>> correct.
> >>>>
> >>>> 3) The addmul_1 code I get is very close to the code obtained by
> >>>> someone else through independent means, so I won't post it here. Once
> >>>> the above tricks have been validated on other code, I'll commit the
> >>>> addmul_1 code I have to the repo. Or perhaps someone else will
> >>>> rediscover it from what I have written above.
> >>>>
> >>>> In fact I was only able to find about 16 different versions of
> >>>> addmul_1 that run in 2.5c/l all of which look very much like the
> >>>> solution obtained independently. The order and location of most
> >>>> instructions is fixed by the dual requirements of having triplets of
> >>>> macro-ops and having almost nothing run in ALU0 other than muls. There
> >>>> are very few degrees of freedom.
> >>>>
> >>>> Bill.
> >>>
> >>> This is very, very cool and I am happy that this is discussed in
> >>> public. Any chance to see some performance numbers before and after
> >>> the checkin?
> >>>
> >>> <SNIP>
> >>>
> >>> Cheers,
> >>>
> >>> Michael
> >

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---
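
For anyone following the mul_1/addmul_1 discussion in this thread without the source to hand, here is a rough C sketch of what the two routines compute on 64-bit limbs. It is only a reference for the semantics, not the scheduled Opteron assembly being discussed, and it makes no attempt at the 2.5 c/l figure. The names limb_t, mul_64x64, ref_mul_1 and ref_addmul_1 are made up for this illustration; the real entry points are mpn_mul_1 and mpn_addmul_1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t;  /* hypothetical name; GMP/MPIR use mp_limb_t */

/* Portable 64x64 -> 128 bit multiply built from 32-bit halves.
   (Illustrative helper; the assembly uses a single mul instruction.) */
static void mul_64x64(limb_t *hi, limb_t *lo, limb_t a, limb_t b)
{
    limb_t a0 = a & 0xFFFFFFFFu, a1 = a >> 32;
    limb_t b0 = b & 0xFFFFFFFFu, b1 = b >> 32;
    limb_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
    /* Sum of the three 32-bit pieces that land in bits 32..63. */
    limb_t mid = (p00 >> 32) + (p01 & 0xFFFFFFFFu) + (p10 & 0xFFFFFFFFu);
    *lo = (p00 & 0xFFFFFFFFu) | (mid << 32);
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* {rp,n} = {up,n} * v; returns the limb carried out of the top. */
limb_t ref_mul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n == 0 || (rp != NULL && up != NULL));
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t hi, lo;
        mul_64x64(&hi, &lo, up[i], v);
        lo += carry;
        hi += (lo < carry);      /* carry out of the low-limb add */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}

/* {rp,n} += {up,n} * v; returns the limb carried out of the top. */
limb_t ref_addmul_1(limb_t *rp, const limb_t *up, size_t n, limb_t v)
{
    assert(n == 0 || (rp != NULL && up != NULL));
    limb_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        limb_t hi, lo;
        mul_64x64(&hi, &lo, up[i], v);
        lo += carry;
        hi += (lo < carry);      /* carry from adding the incoming carry */
        lo += rp[i];
        hi += (lo < rp[i]);      /* carry from adding the old result limb */
        rp[i] = lo;
        carry = hi;
    }
    return carry;
}
```

Each limb costs one 64-bit multiply, which per the thread can only go through ALU0, plus a handful of adds and moves. As a back-of-envelope check (my inference, not something stated above): if the three pipes each take one macro-op per cycle, the 30 macro-ops in the loop need at least 10 cycles, which would match 2.5 c/l if one pass through the loop handles 4 limbs.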