I managed to squeeze a bit more out of Jason's mpn_mul_1 code, so it
now runs at 2.5 c/l.

Jason's original code, which runs at 2.75 c/l, is as follows:

.align 16
loop:
        mov $0,%r9
        mul %rcx
        add %rax,%r8
        mov 8(%rsi,%rbx,8),%rax
        adc %rdx,%r9
        mov %r8,(%rdi,%rbx,8)
        mul %rcx
        mov $0,%r10
        add %rax,%r9
        mov 16(%rsi,%rbx,8),%rax
        adc %rdx,%r10
        mov %r9,8(%rdi,%rbx,8)
        mov $0,%r11
        mul %rcx
        add %rax,%r10
        mov 24(%rsi,%rbx,8),%rax
        adc %rdx,%r11
        mov %r10,16(%rdi,%rbx,8)
        mul %rcx
        mov $0,%r8
        add %rax,%r11
        mov 32(%rsi,%rbx,8),%rax
        adc %rdx,%r8
        mov %r11,24(%rdi,%rbx,8)
        add $4,%rbx
        jne loop
end:

Now I did some reading on the Opteron processor. It turns out that
there are two different kinds of instructions in the above, so-called
direct path instructions and so-called doubles. The mul's are the only
doubles. Everything else is direct path.

Direct path instructions are single macro-op instructions, whilst
doubles are broken into two macro-ops.

The Opteron first breaks all the incoming instructions into macro-ops.
Then they are packed together again into groups of three macro-ops:
either three direct path instructions, or one direct path and one
double, or three doubles spread over two cycles.

At the end of the "pipeline" instructions are retired in the order
they appear (but may be executed out of order in-between). The
processor can retire 3 macro-ops per cycle.
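
As a back-of-the-envelope check of what this retirement limit means for
the loop above, here is a small Python sketch. It assumes only what is
stated here (mul is two macro-ops, everything else is one, and three
macro-ops retire per cycle); the names LOOP, DOUBLES and so on are made
up for illustration and are not part of any real tool.

```python
# Mnemonics of the 26 instructions in the loop body above, in order.
LOOP = """mov mul add mov adc mov
mul mov add mov adc mov
mov mul add mov adc mov
mul mov add mov adc mov
add jne""".split()

DOUBLES = {"mul"}        # assumed: two macro-ops each
RETIRE_WIDTH = 3         # macro-ops retired per cycle
LIMBS_PER_ITERATION = 4  # the loop is unrolled four ways


def macro_ops(instructions):
    """Total macro-ops: doubles count twice, direct path once."""
    return sum(2 if name in DOUBLES else 1 for name in instructions)


def retire_bound_cycles_per_limb(instructions, limbs):
    """Lower bound on c/l imposed by the retirement width alone."""
    total = macro_ops(instructions)
    cycles = -(-total // RETIRE_WIDTH)  # ceiling division
    return cycles / limbs


print(macro_ops(LOOP))                                          # 30
print(retire_bound_cycles_per_limb(LOOP, LIMBS_PER_ITERATION))  # 2.5
```

So with this instruction mix, 30 macro-ops per iteration cannot retire
in fewer than 10 cycles, i.e. 2.5 c/l, whatever the ordering.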

So first, let's pair up instructions in Jason's code as the processor
would pair them up, three macro-ops at a time:

.align 16
loop:
        mov $0,%r9

        mul %rcx
        add %rax,%r8

        mov 8(%rsi,%rbx,8),%rax
        adc %rdx,%r9
        mov %r8,(%rdi,%rbx,8)

        mul %rcx
        mov $0,%r10

        add %rax,%r9
        mov 16(%rsi,%rbx,8),%rax
        adc %rdx,%r10

        mov %r9,8(%rdi,%rbx,8)
        mov $0,%r11

        mul %rcx
        add %rax,%r10

        mov 24(%rsi,%rbx,8),%rax
        adc %rdx,%r11
        mov %r10,16(%rdi,%rbx,8)

        mul %rcx
        mov $0,%r8

        add %rax,%r11
        mov 32(%rsi,%rbx,8),%rax
        adc %rdx,%r8

        mov %r11,24(%rdi,%rbx,8)
        add $4,%rbx
        jne loop
end:

It's arbitrary whether one pairs the muls with a direct path
instruction preceding or succeeding it.

Now everything is fine except the lone mov on the first line and the
two mov's in the middle of the code block. But we can rectify this by
moving some code around. The mov $0,reg lines are fairly independent
and can be moved around a lot. We'll move the mov $0,%r8 up to join the
lone pair of mov's, making a triplet, and move the mov $0,%r9 from the
start of the loop down to take its place (so %r9 must already be zero
on entry to the loop).

.align 16
loop:
    mul %rcx
    add %rax,%r8

    mov 8(%rsi,%rbx,8),%rax
    adc %rdx,%r9
    mov %r8,(%rdi,%rbx,8)

    mul %rcx
    mov $0,%r10

    add %rax,%r9
    mov 16(%rsi,%rbx,8),%rax
    adc %rdx,%r10

    mov %r9,8(%rdi,%rbx,8)
    mov $0,%r11
    mov $0,%r8

    mul %rcx
    add %rax,%r10

    mov 24(%rsi,%rbx,8),%rax
    adc %rdx,%r11
    mov %r10,16(%rdi,%rbx,8)

    mul %rcx
    mov $0,%r9

    add %rax,%r11
    mov 32(%rsi,%rbx,8),%rax
    adc %rdx,%r8

    mov %r11,24(%rdi,%rbx,8)
    add $4,%rbx
    jne loop
end:

I'm very lucky here that this immediately gives us 2.5 c/l.
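
To see where the 2.75 and 2.5 come from, one can simply count the
groups in the two listings above. The Python sketch below assumes one
group of up to three macro-ops per cycle and nothing else limiting the
loop, which is a simplification. The layouts are transcribed by hand
from the listings, with "|" marking a group boundary; the function
names are hypothetical.

```python
# Group layouts transcribed from the two listings; "|" separates groups.
ORIGINAL = ("mov | mul add | mov adc mov | mul mov | add mov adc | mov mov | "
            "mul add | mov adc mov | mul mov | add mov adc | mov add jne")
REARRANGED = ("mul add | mov adc mov | mul mov | add mov adc | mov mov mov | "
              "mul add | mov adc mov | mul mov | add mov adc | mov add jne")

LIMBS_PER_ITERATION = 4


def group_sizes(layout):
    """Macro-ops per group: mul counts as two, everything else as one."""
    return [sum(2 if op == "mul" else 1 for op in group.split())
            for group in layout.split("|")]


def cycles_per_limb(layout, limbs=LIMBS_PER_ITERATION):
    sizes = group_sizes(layout)
    assert all(size <= 3 for size in sizes)  # no group exceeds three macro-ops
    return len(sizes) / limbs  # assumes one group per cycle


print(cycles_per_limb(ORIGINAL))    # 2.75
print(cycles_per_limb(REARRANGED))  # 2.5
```

The original layout has 11 groups, two of them not full; the rearranged
one has 10 full groups, which matches the measured timings.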

The case of addmul_1 is much harder. One starts with the above trick.
But in order to get 2.5 c/l from Jason's 2.75 c/l code, one needs to do
more work.

Instead of just moving the results out to the destination in memory,
one needs to add them to memory. The former uses only an AGU, the
latter an ALU and an AGU. The problem with the latter is that we need
4 mul's in the loop, which already tie up ALU0 for 8 of the 10 cycles
allowed in the loop. Therefore one needs to schedule everything
carefully so that not much else runs in ALU0. That requires knowledge
of the pick hardware (or for one to fiddle for about half an hour).
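
For reference, the difference between the two operations can be written
as a short Python model. This is only a sketch of the arithmetic, not
the assembly: it assumes 64-bit limbs stored least significant first,
and the function names mul_1 and addmul_1 are illustrative stand-ins
for mpn_mul_1 and mpn_addmul_1.

```python
LIMB_BITS = 64
MASK = (1 << LIMB_BITS) - 1


def mul_1(src, multiplier):
    """Return (dst limbs, carry-out limb) for src * multiplier."""
    dst, carry = [], 0
    for limb in src:
        product = limb * multiplier + carry
        dst.append(product & MASK)    # plain store to the destination
        carry = product >> LIMB_BITS
    return dst, carry


def addmul_1(dst, src, multiplier):
    """Add src * multiplier into dst in place; return the carry-out limb."""
    carry = 0
    for i, limb in enumerate(src):
        # The destination limb is read and added in before the store.
        total = dst[i] + limb * multiplier + carry
        dst[i] = total & MASK
        carry = total >> LIMB_BITS
    return carry
```

The extra dst[i] term in addmul_1 is the add to memory that needs an
ALU slot on top of the load and store.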

There is a program called ptlsim which gives a detailed analysis of
how any particular piece of code is actually broken into macro-ops and
scheduled in the Opteron (it works for most x86 machines). However, it
doesn't compile correctly on sage.math, and I have not been able to get
it to work on another Opteron I have access to either (running it is
supposedly trivial, but it just quits immediately when run). It
requires a late 2.6 Linux kernel for one thing, and perhaps it has bugs.

I haven't been able to find the part of the code where the Opteron
model is defined, so I can't just look up how the pick hardware works.
I could just write to the author I guess. It could be a very valuable
tool for eMPIRe as the above indicates!!

Bill.


On Nov 23, 11:16 pm, [EMAIL PROTECTED] wrote:
> On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
>
>
>
> > > You assume OOO works perfectly.
>
> > >     mov $0,%r11
> > >        mul %rcx
> > >        add %rax,%r10
> > >        mov 24(%rsi,%rbx,8),%rax
> > >        adc %rdx,%r11
> > >        mov %r10,16(%rdi,%rbx,8)
> > >        mul %rcx
> > > here        mov $0,%r8
> > >        add %rax,%r11
> > >        mov 32(%rsi,%rbx,8),%rax
> > >        adc %rdx,%r8
> > >        mov %r11,24(%rdi,%rbx,8)
>
> > > moving the line at "here" up one, before the mul, slows things down from
> > > 2.78 to 3.03 c/l, whereas if OOO was perfect, it should not have any
> > > effect. This may be due to a cpu scheduler bug, or perhaps the scheduler's
> > > not perfect: mul being long latency, two macro ops, two pipes, only
> > > pipe 0_1 etc
> > > If it's a bug then perhaps K10 is better?
>
> > I've seen similar wackiness with the core 2 out-of-order engine.  It's
> > strange enough that sometimes sticking in a nop actually saves a
> > cycle!
>
> another oddity..
>
> loop:
>         mov     (%rdi),%rcx
>         adc     %rcx,%rcx
>         mov     %rcx,(%rdi)
> ... 8 way unrolled lshift by 1
>         mov     56(%rdi),%r9
>         adc     %r9,%r9
>         mov     %r9,56(%rdi)
>         lea     64(%rdi),%rdi
>         dec     %rsi
>         jnz     loop
>
> runs at 1.11c/l
>
> whereas the rshift by 1 (ie with rcr instead of adc) does not, you have to
> bunch them up into 4's to get to 1.11c/l
>
>        mov     (%rdi),%rcx
>         mov     -8(%rdi),%r8
>         mov     -16(%rdi),%r9
>         mov     -24(%rdi),%r10
>         rcr     $1,%rcx
>         rcr     $1,%r8
>         rcr     $1,%r9
>         rcr     $1,%r10
>         mov     %rcx,(%rdi)
>         mov     %r8,-8(%rdi)
>         mov     %r9,-16(%rdi)
>         mov     %r10,-24(%rdi)
>
>        mov     -32(%rdi),%rcx
>         mov     -40(%rdi),%r8
>         mov     -48(%rdi),%r9
>         mov     -56(%rdi),%r10
>         rcr     $1,%rcx
>         rcr     $1,%r8
>         rcr     $1,%r9
>         rcr     $1,%r10
>         mov     %rcx,-32(%rdi)
>         mov     %r8,-40(%rdi)
>         mov     %r9,-48(%rdi)
>         mov     %r10,-56(%rdi)
>
>         lea     -64(%rdi),%rdi
>         dec     %rsi
>         jnz     loop
>
> Again , it looks like the OOO is broken.
> But if you look at the gmp-4.2.4 mpn_mul_1 , which runs at 3c/l , the OOO has
> to get work from three separate iterations to fill out the slots.
>
> While I'm at it , I got some more complaints :)
>
> timing mpn_add/sub_n with the gmp speed program the results stay fairly
> consistent. You may get say 24.5 cycles in one run and 24.6 in another. Ok,
> occasionally you get 200 cycles, but I assume that's an interrupt or some such
> thing. But, for my mpn_com_n, which is mind numbingly simple
> (mov,not,mov), sometimes I get 20 cycles, 40 cycles, 30 cycles .... What's
> going on there! I don't know.
>
> Confused.