On Wednesday 26 November 2008 14:55:09 Bill Hart wrote:
> Ah, this probably won't make that much difference to overall
> performance. Here is why:
>
Gosh, everything happens at once... AMD have CodeAnalyst software which
I think shows you pipeline details? It's a free download, but I've not
been able to compile it.

> In rearranging the instructions in this way we have had to mix up the
> instructions in an unrolled loop. That means that one can't just jump
> into the loop at the required spot as before. The wind up and wind
> down code needs to be made more complex. This is fine, but it possibly
> adds a few cycles for small sizes.
>
> Large mul_1's and addmul_1's are never used by GMP for mul_n. Recall
> that mul_basecase switches over to Karatsuba after about 30 limbs on
> the Opteron.
>
> But it also probably takes a number of iterations of the loop before
> the hardware settles into a pattern. The data cache hardware needs to
> prime, the branch prediction needs to prime, the instruction cache
> needs to prime, and the actual picking of instructions in the correct
> order does not necessarily happen on the first iteration of the loop.
>
> I might be overstating the case a little. Perhaps by about 8 limbs you
> win, I don't know.
>
> Anyhow, I believe jason (not Martin) is working on getting fully
> working mul_1 and addmul_1 ready for inclusion into eMPIRe. Since he
> has actually done all the really hard work here with the initial
> scheduling to get down to 2.75 c/l, I'll let him post any performance
> figures once he is done with the code. He deserves the credit!

I'll do a mul_basecase (which is what really counts) as well, by the
weekend, and I have some other ideas which may pan out.

> Bill.
>
> 2008/11/26 mabshoff <[EMAIL PROTECTED]>:
> > On Nov 26, 6:18 am, Bill Hart <[EMAIL PROTECTED]> wrote:
> >> Some other things I forgot to mention:
> >>
> >> 1) It probably wouldn't have been possible for me to get 2.5 c/l
> >> without jason's code, in both the mul_1 and addmul_1 cases.
> >>
>
> > :)
>
> >> 2) You can often insert nops with lone or pair instructions which are
> >> not 3 macro-ops together, further proving that the above analysis is
> >> correct.
> >>
> >> 3) The addmul_1 code I get is very close to the code obtained by
> >> someone else through independent means, so I won't post it here. Once
> >> the above tricks have been validated on other code, I'll commit the
> >> addmul_1 code I have to the repo. Or perhaps someone else will
> >> rediscover it from what I have written above.
> >>
> >> In fact I was only able to find about 16 different versions of
> >> addmul_1 that run in 2.5 c/l, all of which look very much like the
> >> solution obtained independently. The order and location of most
> >> instructions is fixed by the dual requirements of having triplets of
> >> macro-ops and having almost nothing run in ALU0 other than muls. There
> >> are very few degrees of freedom.
> >>
> >> Bill.
> >
> > This is very, very cool and I am happy that this is discussed in
> > public. Any chance to see some performance numbers before and after
> > the checkin?
> >
> > <SNIP>
> >
> > Cheers,
> >
> > Michael

--
You received this message because you are subscribed to the Google Groups
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at
http://groups.google.com/group/mpir-devel?hl=en