On Wednesday 26 November 2008 17:27:59 Bill Hart wrote:
> Brian and I have been having an interesting discussion off list about
> the preliminary GMP 4.3 figures posted here:
>
> http://gmplib.org/gmpbench.html
>
> Note that on a 2.6 GHz Opteron we would score about 11175 with about
> 60950 in the multiply bench. Note unbalanced operands are not relevant
> here (gmpbench multiply test only works with balanced operands).
>
> It will be interesting to see how much improvement we get from the new
> mul_1 and addmul_1.
>
> Any ideas about how we can further improve would be most welcome.
>

From the changes file of my gmpextra-1.0.1:

  error: mpn_newmul_n got thresholds backwards

Getting them the right way round can give 3-20% at FFT sizes.


> Hmm, I wonder if they put new fft code in. That would certainly do the
> trick!! I think I see how they could have gotten this much improvement
> without improving the fft, but by my calculations it would be tight.
>
> Bill.
>
> 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> > I should also add the following.
> >
> > In the case of addmul_1, I don't think the oOo hardware is relevant.
> > The Opteron has three sets of 8 reservation stations and essentially
> > it just executes what it can. That's pretty simple oOo logic.
> >
> > It can happen that an entire set of 8 reservation stations becomes full with
> > dependent instructions. But that is not an issue with mul_1 or
> > addmul_1. There are only 30 macro-ops in the whole loop, so nearly the
> > whole loop is in reservation stations at any one time.
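
As a rough back-of-envelope check of the numbers in the paragraph above: the 30 macro-ops and the three sets of 8 reservation stations come from the thread, while the dispatch width of 3 macro-ops per cycle and the idea that this is the 2.5 c/l loop mentioned further down are assumptions made only for this sketch.

```python
# Figures taken from the thread.
MACRO_OPS_PER_LOOP = 30
LANES = 3
STATIONS_PER_LANE = 8

# Assumptions for illustration only.
DISPATCH_WIDTH = 3      # one triplet of macro-ops dispatched per cycle
CYCLES_PER_LIMB = 2.5   # assumes this is the 2.5 c/l loop discussed below

# How much of the loop fits in the reservation stations at once.
stations = LANES * STATIONS_PER_LANE
fraction_in_flight = stations / MACRO_OPS_PER_LOOP

# Dispatch-bound cycle count per loop iteration, and the unroll
# factor that would make it match the assumed cycles per limb.
min_cycles_per_iteration = MACRO_OPS_PER_LOOP / DISPATCH_WIDTH
limbs_per_iteration = min_cycles_per_iteration / CYCLES_PER_LIMB

print(stations, fraction_in_flight, min_cycles_per_iteration, limbs_per_iteration)
```

Under those assumptions, 24 of the 30 macro-ops can sit in reservation stations at once, and a dispatch-bound iteration takes 10 cycles, which would correspond to a loop unrolled over 4 limbs.
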
> >
> > Instead the issue is that all the muls need to be executed by ALU0
> > (the 64-bit multiply hardware is attached only to ALU0). The problem
> > then is that too much might be put in the 8 reservation stations for
> > ALU0. The hardware which chooses which of the three "pipelines" or
> > sets of 8 reservation stations that a macro-op goes into is called the
> > pick hardware. I have thus far been unable to find a description of
> > how it chooses which pipe to stick macro-ops into.
> >
> > One big drawback of the K8 is that once in a pipe, other pipes cannot
> > steal work from that pipe, even if they are doing nothing and there
> > are independent instructions to be executed queueing in another pipe.
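
A toy model may make the cost of "no work stealing" more concrete. This is not a model of the real K8 pick hardware (which, as said above, is undocumented here); it assumes each lane executes one macro-op per cycle from its own queue, ignores dependencies and latencies entirely, and uses made-up instruction mixes.

```python
def cycles_pinned(lanes):
    """Cycles to drain all queues when each lane runs only its own queue."""
    return max(len(queue) for queue in lanes)


def cycles_ideal(lanes):
    """Lower bound if non-mul work could move between lanes.

    Muls are still pinned to lane 0, so the bound is the larger of the
    mul count and the total work spread evenly over all lanes.
    """
    total = sum(len(queue) for queue in lanes)
    muls = sum(op == "mul" for queue in lanes for op in queue)
    return max(muls, -(-total // len(lanes)))  # ceiling division


# Hypothetical assignments of 12 macro-ops (4 of them muls) to 3 lanes.
good_pick = [["mul"] * 4, ["add"] * 4, ["mov"] * 4]
bad_pick = [["mul"] * 4 + ["add", "mov"], ["add"] * 3, ["mov"] * 3]

print(cycles_pinned(good_pick), cycles_pinned(bad_pick), cycles_ideal(bad_pick))
```

In this simplified picture the second assignment takes 6 cycles instead of 4, purely because two extra macro-ops landed in the lane that also has to do all the muls and nothing can be moved out again.
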
> >
> > So the only relevant piece of hardware here is the pick hardware.
> >
> > As I say, ptlsim would give us a definitive answer, if it could be made
> > to work.
> >
> > Bill.
> >
> > 2008/11/26 Bill Hart <[EMAIL PROTECTED]>:
> >> Ah, this probably won't make that much difference to overall
> >> performance. Here is why:
> >>
> >> In rearranging the instructions in this way we have had to mix up the
> >> instructions in an unrolled loop. That means that one can't just jump
> >> into the loop at the required spot as before. The wind up and wind
> >> down code needs to be made more complex. This is fine, but it possibly
> >> adds a few cycles for small sizes.
> >>
> >> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
> >> that mul_basecase switches over to Karatsuba after about 30 limbs on
> >> the Opteron.
> >>
> >> But it also probably takes a number of iterations of the loop before
> >> the hardware settles into a pattern. The data cache hardware needs to
> >> prime, the branch prediction needs to prime, the instruction cache
> >> needs to prime and the actual picking of instructions in the correct
> >> order does not necessarily happen on the first iteration of the loop.
> >>
> >> I might be overstating the case a little. Perhaps by about 8 limbs you
> >> win, I don't know.
> >>
> >> Anyhow, I believe jason (not Martin) is working on getting fully
> >> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
> >> has actually done all the really hard work here with the initial
> >> scheduling to get down to 2.75 c/l, I'll let him post any performance
> >> figures once he is done with the code. He deserves the credit!
> >>
> >> Bill.
> >>
> >> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
> >>> On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
> >>>> Some other things I forgot to mention:
> >>>>
> >>>> 1) It probably wouldn't have been possible for me to get 2.5c/l
> >>>> without jason's code, in both the mul_1 and addmul_1 cases.
> >>>>
> >>> :)
> >>>>
> >>>> 2) You can often insert nops with lone or pair instructions which are
> >>>> not 3 macro ops together, further proving that the above analysis is
> >>>> correct.
> >>>>
> >>>> 3) The addmul_1 code I get is very close to the code obtained by
> >>>> someone else through independent means, so I won't post it here. Once
> >>>> the above tricks have been validated on other code, I'll commit the
> >>>> addmul_1 code I have to the repo. Or perhaps someone else will
> >>>> rediscover it from what I have written above.
> >>>>
> >>>> In fact I was only able to find about 16 different versions of
> >>>> addmul_1 that run in 2.5c/l all of which look very much like the
> >>>> solution obtained independently. The order and location of most
> >>>> instructions is fixed by the dual requirements of having triplets of
> >>>> macro-ops and having almost nothing run in ALU0 other than muls. There
> >>>> are very few degrees of freedom.
> >>>>
> >>>> Bill.
> >>>
> >>> This is very, very cool and I am happy that this is discussed in
> >>> public. Any chance to see some performance numbers before and after
> >>> the checkin?
> >>>
> >>> <SNIP>
> >>>
> >>> Cheers,
> >>>
> >>> Michael
>
> 

