Some other things I forgot to mention: 1) It probably wouldn't have been possible for me to get 2.5c/l without jason's code, in both the mul_1 and addmul_1 cases.
2) You can often insert nops with lone or pair instructions which are not 3 macro ops together, further proving that the above analysis is correct. 3) The addmul_1 code I get is very close to the code obtained by someone else through independent means, so I won't post it here. Once the above tricks have been validated on other code, I'll commit the addmul_1 code I have to the repo. Or perhaps someone else will rediscover it from what I have written above. In fact I was only able to find about 16 different versions of addmul_1 that run in 2.5c/l all of which look very much like the solution obtained independently. The order and location of most instructions is fixed by the dual requirements of having triplets of macro-ops and having almost nothing run in ALU0 other than muls. There are very few degrees of freedom. Bill. On Nov 26, 2:05 pm, Bill Hart <[EMAIL PROTECTED]> wrote: > I managed to squeeze a bit more out of Jason's mpn_mul_1 code, so it > is now 2.5c/l. > > Jason's original code which runs at 2.75 c/l is as follows: > > .align 16 > loop: > mov $0,%r9 > mul %rcx > add %rax,%r8 > mov 8(%rsi,%rbx,8),%rax > adc %rdx,%r9 > mov %r8,(%rdi,%rbx,8) > mul %rcx > mov $0,%r10 > add %rax,%r9 > mov 16(%rsi,%rbx,8),%rax > adc %rdx,%r10 > mov %r9,8(%rdi,%rbx,8) > mov $0,%r11 > mul %rcx > add %rax,%r10 > mov 24(%rsi,%rbx,8),%rax > adc %rdx,%r11 > mov %r10,16(%rdi,%rbx,8) > mul %rcx > mov $0,%r8 > add %rax,%r11 > mov 32(%rsi,%rbx,8),%rax > adc %rdx,%r8 > mov %r11,24(%rdi,%rbx,8) > add $4,%rbx > jne loop > end: > > Now I did some reading on the Opteron processor. It turns out that > there are two different kinds of instructions in the above, so-called > direct path instructions and so-called doubles. The mul's are the only > doubles. Everything else is direct path. > > Now direct path are single macro-op instructions, whilst doubles are > broken into two macro-op instructions. > > The Opteron first breaks all the incoming instructions into macro-ops. > Then they are packed together again into groups of three macro-ops. > Either three direct path instructions, one direct path and one double > or three doubles over two cycles. > > At the end of the "pipeline" instructions are retired in the order > they appear (but may be executed out of order in-between). The > processor can retire 3 macro-ops per cycle. > > So first, let's pair up instructions in Jason's code as the processor > would pair them up, three macro-ops at a time: > > .align 16 > loop: > mov $0,%r9 > > mul %rcx > add %rax,%r8 > > mov 8(%rsi,%rbx,8),%rax > adc %rdx,%r9 > mov %r8,(%rdi,%rbx,8) > > mul %rcx > mov $0,%r10 > > add %rax,%r9 > mov 16(%rsi,%rbx,8),%rax > adc %rdx,%r10 > > mov %r9,8(%rdi,%rbx,8) > mov $0,%r11 > > mul %rcx > add %rax,%r10 > > mov 24(%rsi,%rbx,8),%rax > adc %rdx,%r11 > mov %r10,16(%rdi,%rbx,8) > > mul %rcx > mov $0,%r8 > > add %rax,%r11 > mov 32(%rsi,%rbx,8),%rax > adc %rdx,%r8 > > mov %r11,24(%rdi,%rbx,8) > add $4,%rbx > jne loop > end: > > It's arbitrary whether one pairs the muls with a direct path > instruction preceding or succeeding it. > > Now everything is fine except the lone mov on the first line and the > two mov's in the middle of the code block. But we can rectify this by > moving some code around. The mov $, reg lines are fairly independent > and can be moved around a lot. We'll stick one of those with the lone > pair of instructions to make a triplet and the one at the start of the > loop to take its place. > > .align 16 > loop: > mul %rcx > add %rax,%r8 > > mov 8(%rsi,%rbx,8),%rax > adc %rdx,%r9 > mov %r8,(%rdi,%rbx,8) > > mul %rcx > mov $0,%r10 > > add %rax,%r9 > mov 16(%rsi,%rbx,8),%rax > adc %rdx,%r10 > > mov %r9,8(%rdi,%rbx,8) > mov $0,%r11 > mov $0,%r8 > > mul %rcx > add %rax,%r10 > > mov 24(%rsi,%rbx,8),%rax > adc %rdx,%r11 > mov %r10,16(%rdi,%rbx,8) > > mul %rcx > mov $0,%r9 > > add %rax,%r11 > mov 32(%rsi,%rbx,8),%rax > adc %rdx,%r8 > > mov %r11,24(%rdi,%rbx,8) > add $4,%rbx > jne loop > end: > > I'm very lucky here that this immediately gives us 2.5 c/l. > > The case of addmul_1 is much harder. One starts with the above trick. > But in order to get 2.5c/l from Jason's 2.75 c/l code, one needs to do > more work. > > Instead of just moving the results out to the destination in memory, > one needs to add them to memory. The former uses only the AGU unit, > the latter an ALU and and AGU. The problem with the latter is that we > need 4 mul's in the loop which already ties up alu0 for 8 of the 10 > cycles allowed in the loop. Therefore one needs to schedule everything > carefully so that nothing much else runs in ALU0. That requires > knowledge of the pick hardware (or for one to fiddle for about half an > hour). > > There is a program called ptlsim which gives a detailed analysis of > how any particular piece of code is actually broken into macro-ops and > scheduled in the Opteron (works for most x86 machines). However it > doesn't compile correctly on sage.math and I have not been able to get > it to work on another Opteron I have access to either (running it is > supposedly trivial, it just quits immediately when run). It requires a > late 2.6 linux kernel for one thing, and perhaps it has bugs. > > I haven't been able to find the part of the code where the Opteron > model is defined, so I can't just look up how the pick hardware works. > I could just write to the author I guess. It could be a very valuable > tool for eMPIRe as the above indicates!! > > Bill. > > On Nov 23, 11:16 pm, [EMAIL PROTECTED] wrote: > > > On Sunday 23 November 2008 22:49:21 Jason Martin wrote: > > > > > You assume OOO works perfectly. > > > > > mov $0,%r11 > > > > mul %rcx > > > > add %rax,%r10 > > > > mov 24(%rsi,%rbx,8),%rax > > > > adc %rdx,%r11 > > > > mov %r10,16(%rdi,%rbx,8) > > > > mul %rcx > > > > here mov $0,%r8 > > > > add %rax,%r11 > > > > mov 32(%rsi,%rbx,8),%rax > > > > adc %rdx,%r8 > > > > mov %r11,24(%rdi,%rbx,8) > > > > > moving the line at "here" up one before the mul , slows things down from > > > > 2.78 to 3.03 c/l , whereas if OOO was perfect , it should not have any > > > > effect. This may be due to a cpu scheduler bug , or perhaps the > > > > shedulers > > > > not perfect , mul being long latency , two macro ops , two pipes , only > > > > pipe 0_1 etc > > > > If its a bug then perhaps K10 is better? > > > > I've seen similar wackiness with the core 2 out-of-order engine. It's > > > strange enough that sometimes sticking in a nop actually saves a > > > cycle! > > > another oddity.. > > > loop: > > mov (%rdi),%rcx > > adc %rcx,%rcx > > mov %rcx,(%rdi) > > ... 8 way unrolled lshift by 1 > > mov 56(%rdi),%r9 > > adc %r9,%r9 > > mov %r9,56(%rdi) > > lea 64(%rdi),%rdi > > dec %rsi > > jnz loop > > > runs at 1.11c/l > > > whereas the rshift by 1 (ie with rcr instead of adc) does not, you have to > > bunch them up into 4's to get to 1.11c/l > > > mov (%rdi),%rcx > > mov -8(%rdi),%r8 > > mov -16(%rdi),%r9 > > mov -24(%rdi),%r10 > > rcr $1,%rcx > > rcr $1,%r8 > > rcr $1,%r9 > > rcr $1,%r10 > > mov %rcx,(%rdi) > > mov %r8,-8(%rdi) > > mov %r9,-16(%rdi) > > mov %r10,-24(%rdi) > > > mov -32(%rdi),%rcx > > mov -40(%rdi),%r8 > > mov -48(%rdi),%r9 > > mov -56(%rdi),%r10 > > rcr $1,%rcx > > rcr $1,%r8 > > rcr $1,%r9 > > rcr $1,%r10 > > mov %rcx,-32(%rdi) > > mov %r8,-40(%rdi) > > mov %r9,-48(%rdi) > > mov %r10,-56(%rdi) > > > lea -64(%rdi),%rdi > > dec %rsi > > jnz loop > > > Again , it looks like the OOO is broken. > > But if you look at the gmp-4.2.4 mpn_mul_1 , which runs at 3c/l , the OOO > > has > > to get work from three separate iterations to fill out the slots. > > > While I'm at it , I got some more complaints :) > > > timing mpn_add/sub_n with the gmp speed program the results stay fairly > > consistent . You may get say 24.5 cycles in one run and 24.6 in another. Ok > > , > > occasionally you 200 cycles , but I assume thats an interupt or some such > > thing. But , for my mpn_com_n , which is mind numbingly simple > > (mov,not,mov) , sometimes I get 20cycles , 40 cycles, 30 cycles .... . > > Whats > > going on there! , I dont know. > > > Confused. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to mpir-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---