On Sunday 23 November 2008 21:38:07 Bill Hart wrote:
> It seems like unrolling our block2 by 2 could be made optimal in
> theory. You need 2 slots for the loop control. There are 14 slots in
> your block2.
>
> 2*14 + 2 = 30.
>
> That would give 10/4 = 2.5c/l.
>
> By the way, you suggest that perhaps moving the loop control up might
> help. If the processor has out-of-order capability, why would this
> help? Is there something else that prevents that from executing
> earlier regardless?
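
As a quick check of the arithmetic quoted above, here is a small C sketch.
It assumes three issue slots per cycle and two limbs per copy of block2;
neither figure is stated in the quote, both are inferred from 30 slots
giving 10 cycles for 4 limbs. The function name and its parameters are
made up for illustration.

```c
/* Best-case cycles per limb if every issue slot is filled.
 *   slots_per_block: slots in one copy of the block (14 in the quote)
 *   unroll:          copies of the block per loop iteration (2)
 *   loop_slots:      slots spent on loop control (2)
 *   issue_width:     slots per cycle (3 is an ASSUMPTION, see above)
 *   limbs_per_block: limbs handled by one copy (2 is an ASSUMPTION)
 */
double cycles_per_limb(int slots_per_block, int unroll, int loop_slots,
                       int issue_width, int limbs_per_block)
{
    int slots  = unroll * slots_per_block + loop_slots;    /* 2*14 + 2 = 30 */
    int cycles = (slots + issue_width - 1) / issue_width;  /* ceil(30/3) = 10 */
    int limbs  = unroll * limbs_per_block;                 /* 2*2 = 4 */
    return (double)cycles / limbs;
}
```

Calling cycles_per_limb(14, 2, 2, 3, 2) reproduces the quoted 10/4 figure.
This is only a bound on slot usage; it says nothing about latency or
scheduling.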

You assume OOO works perfectly.

        mov $0,%r11
        mul %rcx
        add %rax,%r10
        mov 24(%rsi,%rbx,8),%rax
        adc %rdx,%r11
        mov %r10,16(%rdi,%rbx,8)
        mul %rcx
        mov $0,%r8              # <-- "here"
        add %rax,%r11
        mov 32(%rsi,%rbx,8),%rax
        adc %rdx,%r8
        mov %r11,24(%rdi,%rbx,8)

Moving the line marked "here" up one, to just before the mul, slows things
down from 2.78 to 3.03 c/l, whereas if OOO were perfect it should have no
effect.
This may be due to a CPU scheduler bug, or perhaps the scheduler is just not
perfect: mul being long latency, two macro-ops, two pipes, only pipe 0_1,
etc.
If it's a bug, then perhaps K10 is better?
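
For anyone following along without the full loop, here is a rough C
rendering of what each limb of the fragment above appears to compute,
assuming it comes from a mul_1-style loop (multiply each source limb by
the constant in %rcx and propagate the carry). The name mul_1_ref is
hypothetical, this is not the MPIR source, and unsigned __int128 is a
GCC/Clang extension.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: dst[0..n-1] = src[0..n-1] * c, returns the
 * carry-out limb. Comments map each step to the assembly fragment. */
uint64_t mul_1_ref(uint64_t *dst, const uint64_t *src, size_t n, uint64_t c)
{
    uint64_t carry = 0;                     /* high limb carried from the previous step */
    for (size_t i = 0; i < n; i++) {
        unsigned __int128 p = (unsigned __int128)src[i] * c;  /* mul %rcx -> %rdx:%rax */
        uint64_t lo = (uint64_t)p;
        uint64_t hi = (uint64_t)(p >> 64);
        uint64_t sum = carry + lo;          /* add %rax,%r10 */
        hi += (sum < lo);                   /* adc %rdx,%r11 (with %r11 zeroed first) */
        dst[i] = sum;                       /* mov %r10,16(%rdi,%rbx,8) */
        carry = hi;                         /* becomes the next limb's accumulator */
    }
    return carry;
}
```

The mov $0 that zeroes the next accumulator has no data dependency on the
preceding mul, which is why its position should not matter to a perfect
out-of-order scheduler.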

