I managed to squeeze a bit more out of Jason's mpn_mul_1 code, so it is now 2.5c/l.
Jason's original code which runs at 2.75 c/l is as follows: .align 16 loop: mov $0,%r9 mul %rcx add %rax,%r8 mov 8(%rsi,%rbx,8),%rax adc %rdx,%r9 mov %r8,(%rdi,%rbx,8) mul %rcx mov $0,%r10 add %rax,%r9 mov 16(%rsi,%rbx,8),%rax adc %rdx,%r10 mov %r9,8(%rdi,%rbx,8) mov $0,%r11 mul %rcx add %rax,%r10 mov 24(%rsi,%rbx,8),%rax adc %rdx,%r11 mov %r10,16(%rdi,%rbx,8) mul %rcx mov $0,%r8 add %rax,%r11 mov 32(%rsi,%rbx,8),%rax adc %rdx,%r8 mov %r11,24(%rdi,%rbx,8) add $4,%rbx jne loop end: Now I did some reading on the Opteron processor. It turns out that there are two different kinds of instructions in the above, so-called direct path instructions and so-called doubles. The mul's are the only doubles. Everything else is direct path. Now direct path are single macro-op instructions, whilst doubles are broken into two macro-op instructions. The Opteron first breaks all the incoming instructions into macro-ops. Then they are packed together again into groups of three macro-ops. Either three direct path instructions, one direct path and one double or three doubles over two cycles. At the end of the "pipeline" instructions are retired in the order they appear (but may be executed out of order in-between). The processor can retire 3 macro-ops per cycle. So first, let's pair up instructions in Jason's code as the processor would pair them up, three macro-ops at a time: .align 16 loop: mov $0,%r9 mul %rcx add %rax,%r8 mov 8(%rsi,%rbx,8),%rax adc %rdx,%r9 mov %r8,(%rdi,%rbx,8) mul %rcx mov $0,%r10 add %rax,%r9 mov 16(%rsi,%rbx,8),%rax adc %rdx,%r10 mov %r9,8(%rdi,%rbx,8) mov $0,%r11 mul %rcx add %rax,%r10 mov 24(%rsi,%rbx,8),%rax adc %rdx,%r11 mov %r10,16(%rdi,%rbx,8) mul %rcx mov $0,%r8 add %rax,%r11 mov 32(%rsi,%rbx,8),%rax adc %rdx,%r8 mov %r11,24(%rdi,%rbx,8) add $4,%rbx jne loop end: It's arbitrary whether one pairs the muls with a direct path instruction preceding or succeeding it. Now everything is fine except the lone mov on the first line and the two mov's in the middle of the code block. But we can rectify this by moving some code around. The mov $, reg lines are fairly independent and can be moved around a lot. We'll stick one of those with the lone pair of instructions to make a triplet and the one at the start of the loop to take its place. .align 16 loop: mul %rcx add %rax,%r8 mov 8(%rsi,%rbx,8),%rax adc %rdx,%r9 mov %r8,(%rdi,%rbx,8) mul %rcx mov $0,%r10 add %rax,%r9 mov 16(%rsi,%rbx,8),%rax adc %rdx,%r10 mov %r9,8(%rdi,%rbx,8) mov $0,%r11 mov $0,%r8 mul %rcx add %rax,%r10 mov 24(%rsi,%rbx,8),%rax adc %rdx,%r11 mov %r10,16(%rdi,%rbx,8) mul %rcx mov $0,%r9 add %rax,%r11 mov 32(%rsi,%rbx,8),%rax adc %rdx,%r8 mov %r11,24(%rdi,%rbx,8) add $4,%rbx jne loop end: I'm very lucky here that this immediately gives us 2.5 c/l. The case of addmul_1 is much harder. One starts with the above trick. But in order to get 2.5c/l from Jason's 2.75 c/l code, one needs to do more work. Instead of just moving the results out to the destination in memory, one needs to add them to memory. The former uses only the AGU unit, the latter an ALU and and AGU. The problem with the latter is that we need 4 mul's in the loop which already ties up alu0 for 8 of the 10 cycles allowed in the loop. Therefore one needs to schedule everything carefully so that nothing much else runs in ALU0. That requires knowledge of the pick hardware (or for one to fiddle for about half an hour). There is a program called ptlsim which gives a detailed analysis of how any particular piece of code is actually broken into macro-ops and scheduled in the Opteron (works for most x86 machines). However it doesn't compile correctly on sage.math and I have not been able to get it to work on another Opteron I have access to either (running it is supposedly trivial, it just quits immediately when run). It requires a late 2.6 linux kernel for one thing, and perhaps it has bugs. I haven't been able to find the part of the code where the Opteron model is defined, so I can't just look up how the pick hardware works. I could just write to the author I guess. It could be a very valuable tool for eMPIRe as the above indicates!! Bill. On Nov 23, 11:16 pm, [EMAIL PROTECTED] wrote: > On Sunday 23 November 2008 22:49:21 Jason Martin wrote: > > > > > > You assume OOO works perfectly. > > > > mov $0,%r11 > > > mul %rcx > > > add %rax,%r10 > > > mov 24(%rsi,%rbx,8),%rax > > > adc %rdx,%r11 > > > mov %r10,16(%rdi,%rbx,8) > > > mul %rcx > > > here mov $0,%r8 > > > add %rax,%r11 > > > mov 32(%rsi,%rbx,8),%rax > > > adc %rdx,%r8 > > > mov %r11,24(%rdi,%rbx,8) > > > > moving the line at "here" up one before the mul , slows things down from > > > 2.78 to 3.03 c/l , whereas if OOO was perfect , it should not have any > > > effect. This may be due to a cpu scheduler bug , or perhaps the shedulers > > > not perfect , mul being long latency , two macro ops , two pipes , only > > > pipe 0_1 etc > > > If its a bug then perhaps K10 is better? > > > I've seen similar wackiness with the core 2 out-of-order engine. It's > > strange enough that sometimes sticking in a nop actually saves a > > cycle! > > another oddity.. > > loop: > mov (%rdi),%rcx > adc %rcx,%rcx > mov %rcx,(%rdi) > ... 8 way unrolled lshift by 1 > mov 56(%rdi),%r9 > adc %r9,%r9 > mov %r9,56(%rdi) > lea 64(%rdi),%rdi > dec %rsi > jnz loop > > runs at 1.11c/l > > whereas the rshift by 1 (ie with rcr instead of adc) does not, you have to > bunch them up into 4's to get to 1.11c/l > > mov (%rdi),%rcx > mov -8(%rdi),%r8 > mov -16(%rdi),%r9 > mov -24(%rdi),%r10 > rcr $1,%rcx > rcr $1,%r8 > rcr $1,%r9 > rcr $1,%r10 > mov %rcx,(%rdi) > mov %r8,-8(%rdi) > mov %r9,-16(%rdi) > mov %r10,-24(%rdi) > > mov -32(%rdi),%rcx > mov -40(%rdi),%r8 > mov -48(%rdi),%r9 > mov -56(%rdi),%r10 > rcr $1,%rcx > rcr $1,%r8 > rcr $1,%r9 > rcr $1,%r10 > mov %rcx,-32(%rdi) > mov %r8,-40(%rdi) > mov %r9,-48(%rdi) > mov %r10,-56(%rdi) > > lea -64(%rdi),%rdi > dec %rsi > jnz loop > > Again , it looks like the OOO is broken. > But if you look at the gmp-4.2.4 mpn_mul_1 , which runs at 3c/l , the OOO has > to get work from three separate iterations to fill out the slots. > > While I'm at it , I got some more complaints :) > > timing mpn_add/sub_n with the gmp speed program the results stay fairly > consistent . You may get say 24.5 cycles in one run and 24.6 in another. Ok , > occasionally you 200 cycles , but I assume thats an interupt or some such > thing. But , for my mpn_com_n , which is mind numbingly simple > (mov,not,mov) , sometimes I get 20cycles , 40 cycles, 30 cycles .... . Whats > going on there! , I dont know. > > Confused. --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "mpir-devel" group. To post to this group, send email to mpir-devel@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/mpir-devel?hl=en -~----------~----~----~----~------~----~------~--~---