Wow!!!  That's AMAZING work by all of you!!!  Great job squeezing
every bit of efficiency out of it!!!!!!!

--jason (not the same jason that Bill is talking about)


On Wed, Nov 26, 2008 at 9:18 AM, Bill Hart <[EMAIL PROTECTED]> wrote:
>
> Some other things I forgot to mention:
>
> 1) It probably wouldn't have been possible for me to get 2.5 c/l
> without Jason's code, in both the mul_1 and addmul_1 cases.
>
> 2) You can often insert nops alongside lone or paired instructions
> that do not already form a triplet of macro-ops, without losing any
> cycles, which is further evidence that the above analysis is correct.
>
> 3) The addmul_1 code I get is very close to the code obtained by
> someone else through independent means, so I won't post it here. Once
> the above tricks have been validated on other code, I'll commit the
> addmul_1 code I have to the repo. Or perhaps someone else will
> rediscover it from what I have written above.
>
> In fact I was only able to find about 16 different versions of
> addmul_1 that run at 2.5 c/l, all of which look very much like the
> solution obtained independently. The order and location of most
> instructions are fixed by the dual requirements of having triplets of
> macro-ops and having almost nothing other than muls run in ALU0. There
> are very few degrees of freedom.
>
> Bill.
>
> On Nov 26, 2:05 pm, Bill Hart <[EMAIL PROTECTED]> wrote:
>> I managed to squeeze a bit more out of Jason's mpn_mul_1 code, so it
>> is now 2.5c/l.
>>
>> Jason's original code which runs at 2.75 c/l is as follows:
>>
>> .align 16
>> loop:
>>         mov $0,%r9
>>         mul %rcx
>>         add %rax,%r8
>>         mov 8(%rsi,%rbx,8),%rax
>>         adc %rdx,%r9
>>         mov %r8,(%rdi,%rbx,8)
>>         mul %rcx
>>         mov $0,%r10
>>         add %rax,%r9
>>         mov 16(%rsi,%rbx,8),%rax
>>         adc %rdx,%r10
>>         mov %r9,8(%rdi,%rbx,8)
>>         mov $0,%r11
>>         mul %rcx
>>         add %rax,%r10
>>         mov 24(%rsi,%rbx,8),%rax
>>         adc %rdx,%r11
>>         mov %r10,16(%rdi,%rbx,8)
>>         mul %rcx
>>         mov $0,%r8
>>         add %rax,%r11
>>         mov 32(%rsi,%rbx,8),%rax
>>         adc %rdx,%r8
>>         mov %r11,24(%rdi,%rbx,8)
>>         add $4,%rbx
>>         jne loop
>> end:
>>
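>> (As a reminder of what this loop actually computes: it is the core of
>> mpn_mul_1, which multiplies a vector of limbs by a single limb and
>> returns the carry-out. A rough C sketch follows; the names and the use
>> of unsigned __int128 for the 64x64 -> 128 bit product are just for
>> illustration, not MPIR's actual code:
>>
>> typedef unsigned long limb_t;   /* one 64-bit limb */
>>
>> /* rp[0..n-1] = up[0..n-1] * v; returns the final carry limb */
>> limb_t mul_1_ref(limb_t *rp, const limb_t *up, long n, limb_t v)
>> {
>>     limb_t carry = 0;
>>     for (long i = 0; i < n; i++) {
>>         unsigned __int128 t = (unsigned __int128) up[i] * v + carry;
>>         rp[i] = (limb_t) t;          /* low half, like %rax after mul */
>>         carry = (limb_t) (t >> 64);  /* high half, like %rdx */
>>     }
>>     return carry;
>> }
>>
>> The asm simply does four iterations of that per pass, using add/adc to
>> fold the carry from the previous limb into each 128-bit product.)
>>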
>> Now I did some reading on the Opteron processor. It turns out that
>> there are two different kinds of instructions in the above: so-called
>> direct path instructions and so-called doubles. The muls are the only
>> doubles; everything else is direct path.
>>
>> Direct path instructions decode to a single macro-op, whilst doubles
>> are broken into two macro-ops.
>>
>> The Opteron first breaks all the incoming instructions into macro-ops,
>> then packs them together again into groups of three macro-ops: either
>> three direct path instructions, one direct path plus one double, or
>> three doubles spread over two cycles.
>>
>> At the end of the "pipeline", instructions are retired in the order
>> they appear (though they may be executed out of order in between). The
>> processor can retire 3 macro-ops per cycle.
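>>
>> (A quick sanity check, counting from the listing above: the loop body
>> has 26 instructions per 4 limbs, of which the 4 muls are doubles. That
>> is 22 x 1 + 4 x 2 = 30 macro-ops, so at 3 macro-ops retired per cycle
>> the floor is 10 cycles per 4 limbs, i.e. 2.5 c/l. The question is
>> whether the dispatch grouping lets us reach that floor.)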
>>
>> So first, let's pair up instructions in Jason's code as the processor
>> would pair them up, three macro-ops at a time:
>>
>> .align 16
>> loop:
>>         mov $0,%r9
>>
>>         mul %rcx
>>         add %rax,%r8
>>
>>         mov 8(%rsi,%rbx,8),%rax
>>         adc %rdx,%r9
>>         mov %r8,(%rdi,%rbx,8)
>>
>>         mul %rcx
>>         mov $0,%r10
>>
>>         add %rax,%r9
>>         mov 16(%rsi,%rbx,8),%rax
>>         adc %rdx,%r10
>>
>>         mov %r9,8(%rdi,%rbx,8)
>>         mov $0,%r11
>>
>>         mul %rcx
>>         add %rax,%r10
>>
>>         mov 24(%rsi,%rbx,8),%rax
>>         adc %rdx,%r11
>>         mov %r10,16(%rdi,%rbx,8)
>>
>>         mul %rcx
>>         mov $0,%r8
>>
>>         add %rax,%r11
>>         mov 32(%rsi,%rbx,8),%rax
>>         adc %rdx,%r8
>>
>>         mov %r11,24(%rdi,%rbx,8)
>>         add $4,%rbx
>>         jne loop
>> end:
>>
>> It's arbitrary whether one pairs the muls with a direct path
>> instruction preceding or succeeding it.
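>>
>> (Counting macro-ops in that grouping: nine of the groups are full
>> triplets, but the lone mov at the top carries only one macro-op and
>> the pair of movs in the middle only two, so the 30 macro-ops are
>> spread over 11 groups. If each group takes a dispatch cycle, that is
>> 11 cycles per 4 limbs, matching the observed 2.75 c/l.)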
>>
>> Now everything is fine except the lone mov on the first line and the
>> pair of movs in the middle of the code block. But we can rectify this
>> by moving some code around. The mov $0, reg lines are fairly
>> independent and can be moved around a lot. We'll stick one of them
>> (the mov $0,%r8 that sat next to the final mul) with the lone pair of
>> instructions to make a triplet, and move the mov $0,%r9 from the start
>> of the loop down to take its place.
>>
>> .align 16
>> loop:
>>     mul %rcx
>>     add %rax,%r8
>>
>>     mov 8(%rsi,%rbx,8),%rax
>>     adc %rdx,%r9
>>     mov %r8,(%rdi,%rbx,8)
>>
>>     mul %rcx
>>     mov $0,%r10
>>
>>     add %rax,%r9
>>     mov 16(%rsi,%rbx,8),%rax
>>     adc %rdx,%r10
>>
>>     mov %r9,8(%rdi,%rbx,8)
>>     mov $0,%r11
>>     mov $0,%r8
>>
>>     mul %rcx
>>     add %rax,%r10
>>
>>     mov 24(%rsi,%rbx,8),%rax
>>     adc %rdx,%r11
>>     mov %r10,16(%rdi,%rbx,8)
>>
>>     mul %rcx
>>     mov $0,%r9
>>
>>     add %rax,%r11
>>     mov 32(%rsi,%rbx,8),%rax
>>     adc %rdx,%r8
>>
>>     mov %r11,24(%rdi,%rbx,8)
>>     add $4,%rbx
>>     jne loop
>> end:
>>
>> I'm very lucky here that this immediately gives us 2.5 c/l.
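>>
>> (Same count as before: the rearranged loop packs the identical 30
>> macro-ops into 10 full triplets, so the retirement-limited floor of
>> 10 cycles per 4 limbs, i.e. 2.5 c/l, is actually achieved.)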
>>
>> The case of addmul_1 is much harder. One starts with the above trick.
>> But in order to get 2.5c/l from Jason's 2.75 c/l code, one needs to do
>> more work.
>>
>> Instead of just moving the results out to the destination in memory,
>> one needs to add them to memory. The former uses only the AGU, the
>> latter an ALU and an AGU. The problem with the latter is that we need
>> 4 muls in the loop, which already ties up ALU0 for 8 of the 10 cycles
>> allowed in the loop. Therefore one needs to schedule everything
>> carefully so that almost nothing else runs in ALU0. That requires
>> knowledge of the pick hardware (or about half an hour of fiddling).
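>>
>> For reference, the whole difference in C terms (same illustrative
>> names and assumptions as the sketch above) is that each product is
>> added into the destination limb instead of overwriting it, which is
>> exactly the extra read-modify-write that drags an ALU in:
>>
>> /* rp[0..n-1] += up[0..n-1] * v; returns the final carry limb */
>> limb_t addmul_1_ref(limb_t *rp, const limb_t *up, long n, limb_t v)
>> {
>>     limb_t carry = 0;
>>     for (long i = 0; i < n; i++) {
>>         unsigned __int128 t = (unsigned __int128) up[i] * v + carry + rp[i];
>>         rp[i] = (limb_t) t;
>>         carry = (limb_t) (t >> 64);
>>     }
>>     return carry;
>> }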
>>
>> There is a program called ptlsim which gives a detailed analysis of
>> how any particular piece of code is actually broken into macro-ops and
>> scheduled on the Opteron (it works for most x86 machines). However, it
>> doesn't compile correctly on sage.math, and I have not been able to
>> get it to work on another Opteron I have access to either (running it
>> is supposedly trivial, but it just quits immediately when run). It
>> requires a late 2.6 Linux kernel for one thing, and perhaps it has
>> bugs.
>>
>> I haven't been able to find the part of the code where the Opteron
>> model is defined, so I can't just look up how the pick hardware works.
>> I could just write to the author, I guess. It could be a very valuable
>> tool for eMPIRe, as the above indicates!
>>
>> Bill.
>>
>> On Nov 23, 11:16 pm, [EMAIL PROTECTED] wrote:
>>
>> > On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
>>
>> > > > You assume OOO works perfectly.
>>
>> > > >     mov $0,%r11
>> > > >        mul %rcx
>> > > >        add %rax,%r10
>> > > >        mov 24(%rsi,%rbx,8),%rax
>> > > >        adc %rdx,%r11
>> > > >        mov %r10,16(%rdi,%rbx,8)
>> > > >        mul %rcx
>> > > > here        mov $0,%r8
>> > > >        add %rax,%r11
>> > > >        mov 32(%rsi,%rbx,8),%rax
>> > > >        adc %rdx,%r8
>> > > >        mov %r11,24(%rdi,%rbx,8)
>>
>> > > > Moving the line at "here" up one, to before the mul, slows things
>> > > > down from 2.78 to 3.03 c/l, whereas if the OOO were perfect it
>> > > > should have no effect. This may be due to a cpu scheduler bug, or
>> > > > perhaps the scheduler's not perfect: mul is long latency, two
>> > > > macro-ops, two pipes, only pipe 0_1, etc.
>> > > > If it's a bug then perhaps K10 is better?
>>
>> > > I've seen similar wackiness with the Core 2 out-of-order engine.
>> > > It's strange enough that sometimes sticking in a nop actually saves
>> > > a cycle!
>>
>> > another oddity..
>>
>> > loop:
>> >         mov     (%rdi),%rcx
>> >         adc     %rcx,%rcx
>> >         mov     %rcx,(%rdi)
>> > ... 8 way unrolled lshift by 1
>> >         mov     56(%rdi),%r9
>> >         adc     %r9,%r9
>> >         mov     %r9,56(%rdi)
>> >         lea     64(%rdi),%rdi
>> >         dec     %rsi
>> >         jnz     loop
>>
>> > runs at 1.11c/l
>>
>> > whereas the rshift by 1 (i.e. with rcr instead of adc) does not; you
>> > have to bunch them up into 4's to get to 1.11 c/l
>>
>> >        mov     (%rdi),%rcx
>> >         mov     -8(%rdi),%r8
>> >         mov     -16(%rdi),%r9
>> >         mov     -24(%rdi),%r10
>> >         rcr     $1,%rcx
>> >         rcr     $1,%r8
>> >         rcr     $1,%r9
>> >         rcr     $1,%r10
>> >         mov     %rcx,(%rdi)
>> >         mov     %r8,-8(%rdi)
>> >         mov     %r9,-16(%rdi)
>> >         mov     %r10,-24(%rdi)
>>
>> >        mov     -32(%rdi),%rcx
>> >         mov     -40(%rdi),%r8
>> >         mov     -48(%rdi),%r9
>> >         mov     -56(%rdi),%r10
>> >         rcr     $1,%rcx
>> >         rcr     $1,%r8
>> >         rcr     $1,%r9
>> >         rcr     $1,%r10
>> >         mov     %rcx,-32(%rdi)
>> >         mov     %r8,-40(%rdi)
>> >         mov     %r9,-48(%rdi)
>> >         mov     %r10,-56(%rdi)
>>
>> >         lea     -64(%rdi),%rdi
>> >         dec     %rsi
>> >         jnz     loop
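>>
>> > (In case the adc idiom in the first loop isn't obvious: adc %rcx,%rcx
>> > computes rcx = 2*rcx + carry and leaves the old top bit in the carry
>> > flag, so that loop really is a left shift by one bit across the whole
>> > array. A rough C sketch, assuming 64-bit limbs with the low limb at
>> > the lowest address, as in the asm:
>>
>> >     /* left shift p[0..n-1] by one bit; returns the bit shifted out */
>> >     unsigned long lshift1_ref(unsigned long *p, long n)
>> >     {
>> >         unsigned long carry = 0;
>> >         for (long i = 0; i < n; i++) {
>> >             unsigned long t = p[i];
>> >             p[i] = (t << 1) | carry;  /* like adc t,t */
>> >             carry = t >> 63;          /* old top bit feeds the next limb */
>> >         }
>> >         return carry;
>> >     }
>>
>> > rcr $1 does the analogous thing in the other direction: the carry
>> > flag goes in at the top bit and the old low bit becomes the new
>> > carry.)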
>>
>> > Again, it looks like the OOO is broken.
>> > But if you look at the gmp-4.2.4 mpn_mul_1, which runs at 3 c/l, the
>> > OOO has to get work from three separate iterations to fill out the
>> > slots.
>>
>> > While I'm at it, I've got some more complaints :)
>>
>> > Timing mpn_add/sub_n with the GMP speed program, the results stay
>> > fairly consistent. You may get, say, 24.5 cycles in one run and 24.6
>> > in another. OK, occasionally you get 200 cycles, but I assume that's
>> > an interrupt or some such thing. But for my mpn_com_n, which is
>> > mind-numbingly simple (mov, not, mov), sometimes I get 20 cycles,
>> > 40 cycles, 30 cycles ... What's going on there, I don't know.
>>
>> > Confused.
> >
>
