On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
> > You assume OOO works perfectly.
> >
> >     mov $0,%r11
> >        mul %rcx
> >        add %rax,%r10
> >        mov 24(%rsi,%rbx,8),%rax
> >        adc %rdx,%r11
> >        mov %r10,16(%rdi,%rbx,8)
> >        mul %rcx
> > here        mov $0,%r8
> >        add %rax,%r11
> >        mov 32(%rsi,%rbx,8),%rax
> >        adc %rdx,%r8
> >        mov %r11,24(%rdi,%rbx,8)
> >
> > Moving the line at "here" up one, to before the mul, slows things down from
> > 2.78 to 3.03 c/l, whereas if OOO were perfect it should have no effect.
> > This may be due to a CPU scheduler bug, or perhaps the scheduler is just
> > not perfect: mul is long latency, two macro-ops, two pipes, only
> > pipe 0_1, etc.
> > If it's a bug then perhaps K10 is better?
>
> I've seen similar wackiness with the core 2 out-of-order engine.  It's
> strange enough that sometimes sticking in a nop actually saves a
> cycle!

Another oddity:

loop:
        mov     (%rdi),%rcx
        adc     %rcx,%rcx
        mov     %rcx,(%rdi)
... 8 way unrolled lshift by 1
        mov     56(%rdi),%r9
        adc     %r9,%r9
        mov     %r9,56(%rdi)
        lea     64(%rdi),%rdi
        dec     %rsi
        jnz     loop

This runs at 1.11 c/l, whereas the rshift by 1 (i.e. with rcr instead of adc)
does not. You have to bunch the instructions up in fours to get to 1.11 c/l:

loop:
        mov     (%rdi),%rcx
        mov     -8(%rdi),%r8
        mov     -16(%rdi),%r9
        mov     -24(%rdi),%r10
        rcr     $1,%rcx
        rcr     $1,%r8
        rcr     $1,%r9
        rcr     $1,%r10
        mov     %rcx,(%rdi)
        mov     %r8,-8(%rdi)
        mov     %r9,-16(%rdi)
        mov     %r10,-24(%rdi)

        mov     -32(%rdi),%rcx
        mov     -40(%rdi),%r8
        mov     -48(%rdi),%r9
        mov     -56(%rdi),%r10
        rcr     $1,%rcx
        rcr     $1,%r8
        rcr     $1,%r9
        rcr     $1,%r10
        mov     %rcx,-32(%rdi)
        mov     %r8,-40(%rdi)
        mov     %r9,-48(%rdi)
        mov     %r10,-56(%rdi)

        lea     -64(%rdi),%rdi
        dec     %rsi
        jnz     loop
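
For anyone reading along without the semantics in mind, here is a rough C sketch of what the two shift loops compute. It is only a reference for the arithmetic, not the code being timed. The names lshift1 and rshift1 are made up for this example, and unlike the asm (which picks up whatever the carry flag holds on entry) it assumes the incoming carry is 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Shift n limbs left by one bit in place, least significant limb first.
   The carry travels from low limb to high limb, like the adc chain.
   Returns the bit shifted out of the top limb. */
uint64_t lshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;                 /* assumption: carry-in is 0 */
    for (size_t i = 0; i < n; i++) {
        uint64_t out = p[i] >> 63;      /* bit that falls off this limb */
        p[i] = (p[i] << 1) | carry;
        carry = out;
    }
    return carry;
}

/* Shift n limbs right by one bit in place.
   The carry travels from high limb to low limb, like the rcr chain,
   which is why the asm walks down through memory.
   Returns the bit shifted out of the bottom limb. */
uint64_t rshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;                 /* assumption: carry-in is 0 */
    for (size_t i = n; i-- > 0; ) {
        uint64_t out = p[i] & 1;        /* bit that falls off this limb */
        p[i] = (p[i] >> 1) | (carry << 63);
        carry = out;
    }
    return carry;
}
```

Both loops have the same shape: one load, one single-bit carry dependency, and one store per limb.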

Again, it looks like the OOO is broken.
But if you look at the gmp-4.2.4 mpn_mul_1, which runs at 3 c/l, the OOO has
to get work from three separate iterations to fill out the slots.
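
As a reminder of what a mul_1 loop computes per limb, here is a plain C sketch. It is not GMP's code: mul_1_sketch is a hypothetical name, and unsigned __int128 is a GCC/Clang extension rather than standard C. The point is that the only value carried from one iteration to the next is the carry limb, while the multiplies themselves are independent, so in principle they can overlap.

```c
#include <stddef.h>
#include <stdint.h>

/* rp[0..n-1] = up[0..n-1] * v, returning the final carry (high) limb.
   Hypothetical reference sketch, not GMP's implementation. */
uint64_t mul_1_sketch(uint64_t *rp, const uint64_t *up, size_t n, uint64_t v)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* 64x64 -> 128 bit product (the mul), plus the previous high
           limb (the add/adc pair); this sum cannot overflow 128 bits. */
        unsigned __int128 t = (unsigned __int128)up[i] * v + carry;
        rp[i] = (uint64_t)t;            /* low limb is stored */
        carry = (uint64_t)(t >> 64);    /* high limb feeds the next step */
    }
    return carry;
}
```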

While I'm at it, I've got some more complaints :)

Timing mpn_add/sub_n with the gmp speed program, the results stay fairly
consistent. You may get, say, 24.5 cycles in one run and 24.6 in another. OK,
occasionally you get 200 cycles, but I assume that's an interrupt or some such
thing. But for my mpn_com_n, which is mind-numbingly simple
(mov, not, mov), sometimes I get 20 cycles, sometimes 40, sometimes 30 ....
What's going on there, I don't know.
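
One way to look at noisy numbers like these is to collect many samples and compare the minimum, median and maximum instead of trusting a single run. The sketch below is only an illustration of that idea. It is not how the gmp speed program works, and the names timing_summary and summarize are made up here. The samples are assumed to come from whatever timer you already use.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical summary of repeated timings, in cycles. */
typedef struct {
    double min;
    double median;
    double max;
} timing_summary;

/* qsort comparator for doubles, ascending. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts samples in place and returns min, median and max.
   Requires n >= 1. For even n the upper middle sample is used. */
timing_summary summarize(double *samples, size_t n)
{
    timing_summary s;
    qsort(samples, n, sizeof samples[0], cmp_double);
    s.min = samples[0];
    s.median = samples[n / 2];
    s.max = samples[n - 1];
    return s;
}
```

If the minimum and median agree and only the maximum is far off, an occasional interrupt is a plausible explanation. If the median itself jumps between runs, as with the mpn_com_n numbers, something more systematic is probably going on.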

Confused.
