On Sunday 23 November 2008 21:38:07 Bill Hart wrote:
> It seems like unrolling our block2 by 2 could be made optimal in
> theory. You need 2 slots for the loop control. There are 14 slots in
> your block2.
>
> 2*14 + 2 = 30.
>
> That would give 10/4 = 2.5c/l.
>
> By the way, you suggest that perhaps moving the loop control up might
> help. If the processor has out-of-order capability, why would this
> help? Is there something else that prevents that from executing
> earlier regardless?
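
The slot arithmetic in the quoted message can be checked with a short sketch. Two figures here are inferred rather than stated in the thread: that 3 slots (macro-ops) issue per cycle, which is what turns 30 slots into 10 cycles, and that block2 handles 2 limbs, which is what makes the unrolled loop cover 4. The function name is hypothetical.

```python
# Assumptions inferred from "2*14 + 2 = 30" and "10/4 = 2.5c/l",
# not stated outright in the thread:
SLOTS_PER_CYCLE = 3   # slots (macro-ops) issued per cycle
LIMBS_PER_BLOCK = 2   # limbs processed by one copy of block2


def cycles_per_limb(block_slots, unroll, loop_slots):
    """Best-case cycles per limb if every issue slot is filled."""
    slots = unroll * block_slots + loop_slots
    # Round up to whole cycles: a partly filled cycle still costs a cycle.
    cycles = -(-slots // SLOTS_PER_CYCLE)
    return cycles / (unroll * LIMBS_PER_BLOCK)


# block2 has 14 slots, unrolled by 2, plus 2 slots of loop control.
print(cycles_per_limb(14, 2, 2))  # 2.5
```

This is only a lower bound under perfect scheduling; it says nothing about latency or pipe restrictions.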

You assume OOO works perfectly.

        mov $0,%r11
        mul %rcx
        add %rax,%r10
        mov 24(%rsi,%rbx,8),%rax
        adc %rdx,%r11
        mov %r10,16(%rdi,%rbx,8)
        mul %rcx
here
        mov $0,%r8
        add %rax,%r11
        mov 32(%rsi,%rbx,8),%rax
        adc %rdx,%r8
        mov %r11,24(%rdi,%rbx,8)

Moving the line at "here" up one, to before the mul, slows things down
from 2.78 to 3.03 c/l, whereas if OOO were perfect it should have no
effect. This may be due to a CPU scheduler bug, or perhaps the scheduler
is simply not perfect: mul is long latency, two macro-ops, two pipes,
only pipe 0_1, etc.

If it's a bug then perhaps K10 is better?
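
For readers following the assembly, here is a minimal Python model of what each limb step in the fragment appears to compute: a multiply of a limb vector by a single limb, with the high half of each product carried into the next limb. This is a reading of the fragment, not something the thread states. The name `mul_1_model` and the list-of-limbs representation are assumptions for illustration, and the model ignores scheduling entirely.

```python
MASK = (1 << 64) - 1  # one 64-bit limb


def mul_1_model(src, multiplier):
    """Multiply little-endian 64-bit limbs `src` by one limb.

    Returns (result_limbs, carry_out). Hypothetical helper that mirrors
    the data flow of the assembly, not its instruction scheduling.
    """
    out = []
    carry = 0  # role played in turn by %r10, %r11, %r8
    for limb in src:
        product = limb * multiplier          # mul %rcx  -> %rdx:%rax
        low, high = product & MASK, product >> 64
        total = carry + low                  # add %rax,%r10
        out.append(total & MASK)             # mov %r10,16(%rdi,%rbx,8)
        carry = high + (total >> 64)         # adc %rdx,%r11
    return out, carry


limbs, carry = mul_1_model([MASK, MASK], MASK)
value = sum(l << (64 * i) for i, l in enumerate(limbs)) + (carry << 128)
print(value == ((1 << 128) - 1) * MASK)
```

The point of the model is that each limb depends on the previous one only through the carry, so the zeroing `mov` at "here" has no data dependence on the preceding `mul`.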