Wow!!! That's AMAZING work by all of you!!! Great job squeezing every bit of efficiency out of it!!!!!!!
--jason (not the same jason that Bill is talking about) On Wed, Nov 26, 2008 at 9:18 AM, Bill Hart <[EMAIL PROTECTED]> wrote: > > Some other things I forgot to mention: > > 1) It probably wouldn't have been possible for me to get 2.5c/l > without jason's code, in both the mul_1 and addmul_1 cases. > > 2) You can often insert nops with lone or pair instructions which are > not 3 macro ops together, further proving that the above analysis is > correct. > > 3) The addmul_1 code I get is very close to the code obtained by > someone else through independent means, so I won't post it here. Once > the above tricks have been validated on other code, I'll commit the > addmul_1 code I have to the repo. Or perhaps someone else will > rediscover it from what I have written above. > > In fact I was only able to find about 16 different versions of > addmul_1 that run in 2.5c/l all of which look very much like the > solution obtained independently. The order and location of most > instructions is fixed by the dual requirements of having triplets of > macro-ops and having almost nothing run in ALU0 other than muls. There > are very few degrees of freedom. > > Bill. > > On Nov 26, 2:05 pm, Bill Hart <[EMAIL PROTECTED]> wrote: >> I managed to squeeze a bit more out of Jason's mpn_mul_1 code, so it >> is now 2.5c/l. >> >> Jason's original code which runs at 2.75 c/l is as follows: >> >> .align 16 >> loop: >> mov $0,%r9 >> mul %rcx >> add %rax,%r8 >> mov 8(%rsi,%rbx,8),%rax >> adc %rdx,%r9 >> mov %r8,(%rdi,%rbx,8) >> mul %rcx >> mov $0,%r10 >> add %rax,%r9 >> mov 16(%rsi,%rbx,8),%rax >> adc %rdx,%r10 >> mov %r9,8(%rdi,%rbx,8) >> mov $0,%r11 >> mul %rcx >> add %rax,%r10 >> mov 24(%rsi,%rbx,8),%rax >> adc %rdx,%r11 >> mov %r10,16(%rdi,%rbx,8) >> mul %rcx >> mov $0,%r8 >> add %rax,%r11 >> mov 32(%rsi,%rbx,8),%rax >> adc %rdx,%r8 >> mov %r11,24(%rdi,%rbx,8) >> add $4,%rbx >> jne loop >> end: >> >> Now I did some reading on the Opteron processor. It turns out that >> there are two different kinds of instructions in the above, so-called >> direct path instructions and so-called doubles. The mul's are the only >> doubles. Everything else is direct path. >> >> Now direct path are single macro-op instructions, whilst doubles are >> broken into two macro-op instructions. >> >> The Opteron first breaks all the incoming instructions into macro-ops. >> Then they are packed together again into groups of three macro-ops. >> Either three direct path instructions, one direct path and one double >> or three doubles over two cycles. >> >> At the end of the "pipeline" instructions are retired in the order >> they appear (but may be executed out of order in-between). The >> processor can retire 3 macro-ops per cycle. >> >> So first, let's pair up instructions in Jason's code as the processor >> would pair them up, three macro-ops at a time: >> >> .align 16 >> loop: >> mov $0,%r9 >> >> mul %rcx >> add %rax,%r8 >> >> mov 8(%rsi,%rbx,8),%rax >> adc %rdx,%r9 >> mov %r8,(%rdi,%rbx,8) >> >> mul %rcx >> mov $0,%r10 >> >> add %rax,%r9 >> mov 16(%rsi,%rbx,8),%rax >> adc %rdx,%r10 >> >> mov %r9,8(%rdi,%rbx,8) >> mov $0,%r11 >> >> mul %rcx >> add %rax,%r10 >> >> mov 24(%rsi,%rbx,8),%rax >> adc %rdx,%r11 >> mov %r10,16(%rdi,%rbx,8) >> >> mul %rcx >> mov $0,%r8 >> >> add %rax,%r11 >> mov 32(%rsi,%rbx,8),%rax >> adc %rdx,%r8 >> >> mov %r11,24(%rdi,%rbx,8) >> add $4,%rbx >> jne loop >> end: >> >> It's arbitrary whether one pairs the muls with a direct path >> instruction preceding or succeeding it. >> >> Now everything is fine except the lone mov on the first line and the >> two mov's in the middle of the code block. But we can rectify this by >> moving some code around. The mov $, reg lines are fairly independent >> and can be moved around a lot. We'll stick one of those with the lone >> pair of instructions to make a triplet and the one at the start of the >> loop to take its place. >> >> .align 16 >> loop: >> mul %rcx >> add %rax,%r8 >> >> mov 8(%rsi,%rbx,8),%rax >> adc %rdx,%r9 >> mov %r8,(%rdi,%rbx,8) >> >> mul %rcx >> mov $0,%r10 >> >> add %rax,%r9 >> mov 16(%rsi,%rbx,8),%rax >> adc %rdx,%r10 >> >> mov %r9,8(%rdi,%rbx,8) >> mov $0,%r11 >> mov $0,%r8 >> >> mul %rcx >> add %rax,%r10 >> >> mov 24(%rsi,%rbx,8),%rax >> adc %rdx,%r11 >> mov %r10,16(%rdi,%rbx,8) >> >> mul %rcx >> mov $0,%r9 >> >> add %rax,%r11 >> mov 32(%rsi,%rbx,8),%rax >> adc %rdx,%r8 >> >> mov %r11,24(%rdi,%rbx,8) >> add $4,%rbx >> jne loop >> end: >> >> I'm very lucky here that this immediately gives us 2.5 c/l. >> >> The case of addmul_1 is much harder. One starts with the above trick. >> But in order to get 2.5c/l from Jason's 2.75 c/l code, one needs to do >> more work. >> >> Instead of just moving the results out to the destination in memory, >> one needs to add them to memory. The former uses only the AGU unit, >> the latter an ALU and and AGU. The problem with the latter is that we >> need 4 mul's in the loop which already ties up alu0 for 8 of the 10 >> cycles allowed in the loop. Therefore one needs to schedule everything >> carefully so that nothing much else runs in ALU0. That requires >> knowledge of the pick hardware (or for one to fiddle for about half an >> hour). >> >> There is a program called ptlsim which gives a detailed analysis of >> how any particular piece of code is actually broken into macro-ops and >> scheduled in the Opteron (works for most x86 machines). However it >> doesn't compile correctly on sage.math and I have not been able to get >> it to work on another Opteron I have access to either (running it is >> supposedly trivial, it just quits immediately when run). It requires a >> late 2.6 linux kernel for one thing, and perhaps it has bugs. >> >> I haven't been able to find the part of the code where the Opteron >> model is defined, so I can't just look up how the pick hardware works. >> I could just write to the author I guess. It could be a very valuable >> tool for eMPIRe as the above indicates!! >> >> Bill. >> >> On Nov 23, 11:16 pm, [EMAIL PROTECTED] wrote: >> >> > On Sunday 23 November 2008 22:49:21 Jason Martin wrote: >> >> > > > You assume OOO works perfectly. >> >> > > > mov $0,%r11 >> > > > mul %rcx >> > > > add %rax,%r10 >> > > > mov 24(%rsi,%rbx,8),%rax >> > > > adc %rdx,%r11 >> > > > mov %r10,16(%rdi,%rbx,8) >> > > > mul %rcx >> > > > here mov $0,%r8 >> > > > add %rax,%r11 >> > > > mov 32(%rsi,%rbx,8),%rax >> > > > adc %rdx,%r8 >> > > > mov %r11,24(%rdi,%rbx,8) >> >> > > > moving the line at "here" up one before the mul , slows things down >> > > > from >> > > > 2.78 to 3.03 c/l , whereas if OOO was perfect , it should not have any >> > > > effect. This may be due to a cpu scheduler bug , or perhaps the >> > > > shedulers >> > > > not perfect , mul being long latency , two macro ops , two pipes , only >> > > > pipe 0_1 etc >> > > > If its a bug then perhaps K10 is better? >> >> > > I've seen similar wackiness with the core 2 out-of-order engine. It's >> > > strange enough that sometimes sticking in a nop actually saves a >> > > cycle! >> >> > another oddity.. >> >> > loop: >> > mov (%rdi),%rcx >> > adc %rcx,%rcx >> > mov %rcx,(%rdi) >> > ... 8 way unrolled lshift by 1 >> > mov 56(%rdi),%r9 >> > adc %r9,%r9 >> > mov %r9,56(%rdi) >> > lea 64(%rdi),%rdi >> > dec %rsi >> > jnz loop >> >> > runs at 1.11c/l >> >> > whereas the rshift by 1 (ie with rcr instead of adc) does not, you have to >> > bunch them up into 4's to get to 1.11c/l >> >> > mov (%rdi),%rcx >> > mov -8(%rdi),%r8 >> > mov -16(%rdi),%r9 >> > mov -24(%rdi),%r10 >> > rcr $1,%rcx >> > rcr $1,%r8 >> > rcr $1,%r9 >> > rcr $1,%r10 >> > mov %rcx,(%rdi) >> > mov %r8,-8(%rdi) >> > mov %r9,-16(%rdi) >> > mov %r10,-24(%rdi) >> >> > mov -32(%rdi),%rcx >> > mov -40(%rdi),%r8 >> > mov -48(%rdi),%r9 >> > mov -56(%rdi),%r10 >> > rcr $1,%rcx >> > rcr $1,%r8 >> > rcr $1,%r9 >> > rcr $1,%r10 >> > mov %rcx,-32(%rdi) >> > mov %r8,-40(%rdi) >> > mov %r9,-48(%rdi) >> > mov %r10,-56(%rdi) >> >> > lea -64(%rdi),%rdi >> > dec %rsi >> > jnz loop >> >> > Again , it looks like the OOO is broken. >> > But if you look at the gmp-4.2.4 mpn_mul_1 , which runs at 3c/l , the OOO >> > has >> > to get work from three separate iterations to fill out the slots. >> >> > While I'm at it , I got some more complaints :) >> >> > timing mpn_add/sub_n with the gmp speed program the results stay fairly >> > consistent . You may get say 24.5 cycles in one run and 24.6 in another. >> > Ok , >> > occasionally you 200 cycles , but I assume thats an interupt or some such >> > thing. But , for my mpn_com_n , which is mind numbingly simple >> > (mov,not,mov) , sometimes I get 20cycles , 40 cycles, 30 cycles .... . >> > Whats >> > going on there! , I dont know. >> >> > Confused. > > > --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to mpir-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---