Nice find! I guess this is essentially the problem David Harvey was talking about.
Bill.

On 28/12/2008, jason <ja...@njkfrudils.plus.com> wrote:
>
> On Nov 23, 11:16 pm, ja...@njkfrudils.plus.com wrote:
>> On Sunday 23 November 2008 22:49:21 Jason Martin wrote:
>>
>>>> You assume OOO works perfectly.
>>>>
>>>>      mov $0,%r11
>>>>      mul %rcx
>>>>      add %rax,%r10
>>>>      mov 24(%rsi,%rbx,8),%rax
>>>>      adc %rdx,%r11
>>>>      mov %r10,16(%rdi,%rbx,8)
>>>>      mul %rcx
>>>> here mov $0,%r8
>>>>      add %rax,%r11
>>>>      mov 32(%rsi,%rbx,8),%rax
>>>>      adc %rdx,%r8
>>>>      mov %r11,24(%rdi,%rbx,8)
>>>>
>>>> Moving the line at "here" up one, before the mul, slows things down
>>>> from 2.78 to 3.03 c/l, whereas if OOO were perfect it should have no
>>>> effect. This may be due to a CPU scheduler bug, or perhaps the
>>>> scheduler is just not perfect: mul is long latency, two macro-ops,
>>>> two pipes, only pipe 0_1, etc. If it's a bug, then perhaps K10 is
>>>> better?
>>>
>>> I've seen similar wackiness with the Core 2 out-of-order engine. It's
>>> strange enough that sometimes sticking in a nop actually saves a
>>> cycle!
>>
>> Another oddity...
>>
>> loop:
>>     mov (%rdi),%rcx
>>     adc %rcx,%rcx
>>     mov %rcx,(%rdi)
>>     ...
>>     ... (8-way unrolled lshift by 1)
>>     mov 56(%rdi),%r9
>>     adc %r9,%r9
>>     mov %r9,56(%rdi)
>>     lea 64(%rdi),%rdi
>>     dec %rsi
>>     jnz loop
>>
>> runs at 1.11 c/l, whereas the rshift by 1 (i.e. with rcr instead of
>> adc) does not; you have to bunch them up into 4's to get to 1.11 c/l:
>>
>>     mov (%rdi),%rcx
>>     mov -8(%rdi),%r8
>>     mov -16(%rdi),%r9
>>     mov -24(%rdi),%r10
>>     rcr $1,%rcx
>>     rcr $1,%r8
>>     rcr $1,%r9
>>     rcr $1,%r10
>>     mov %rcx,(%rdi)
>>     mov %r8,-8(%rdi)
>>     mov %r9,-16(%rdi)
>>     mov %r10,-24(%rdi)
>>
>>     mov -32(%rdi),%rcx
>>     mov -40(%rdi),%r8
>>     mov -48(%rdi),%r9
>>     mov -56(%rdi),%r10
>>     rcr $1,%rcx
>>     rcr $1,%r8
>>     rcr $1,%r9
>>     rcr $1,%r10
>>     mov %rcx,-32(%rdi)
>>     mov %r8,-40(%rdi)
>>     mov %r9,-48(%rdi)
>>     mov %r10,-56(%rdi)
>>
>>     lea -64(%rdi),%rdi
>>     dec %rsi
>>     jnz loop
>>
>> Again, it looks like the OOO is broken. But if you look at the
>> gmp-4.2.4 mpn_mul_1, which runs at 3 c/l, the OOO has to get work from
>> three separate iterations to fill out the slots.
>>
>> While I'm at it, I have some more complaints :)
>>
>> Timing mpn_add/sub_n with the GMP speed program, the results stay
>> fairly consistent: you may get, say, 24.5 cycles in one run and 24.6
>> in another. OK, occasionally you get 200 cycles, but I assume that's
>> an interrupt or some such thing. But for my mpn_com_n, which is
>> mind-numbingly simple (mov, not, mov), sometimes I get 20 cycles, 40
>> cycles, 30 cycles... What's going on there, I don't know.
>>
>> Confused.
>
> I think I understand what's going on (at least a bit more!). My above
> mpn_com_n has a cache bank conflict.
>
> My old one was
>
>     load not store
>     load not store
>     load not store
>     load not store
>     etc
>
> and it ran at 1.3 c/l; however, for some alignments of src/dst it
> would run at 2.0 c/l. This also appears to make the timings with the
> speed program vary a lot. (Note this suggests that if your timings
> vary a lot, perhaps you're having these kinds of problems.)
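As an aside on the two shift loops quoted above: they both implement a multi-limb shift by one bit via the carry flag. adc %rcx,%rcx computes (rcx << 1) | CF and carries out the old top bit, and rcr is the mirror image in the other direction; the dec/jnz loop control works precisely because dec does not modify CF. Here is a portable C sketch of the same carry chains (hypothetical helper names, not MPIR's real mpn_lshift/mpn_rshift, which take a general shift count):

```c
#include <stdint.h>
#include <stddef.h>

/* Left shift by 1, low limb first (what the adc loop does);
   returns the bit shifted out of the top limb. */
uint64_t lshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t next_carry = p[i] >> 63;  /* the bit adc carries out */
        p[i] = (p[i] << 1) | carry;        /* adc %reg,%reg */
        carry = next_carry;
    }
    return carry;
}

/* Right shift by 1, high limb first (what the rcr loop does);
   returns the bit shifted out of the bottom limb. */
uint64_t rshift1(uint64_t *p, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = n; i-- > 0; ) {
        uint64_t next_carry = p[i] & 1;      /* the bit rcr carries out */
        p[i] = (p[i] >> 1) | (carry << 63);  /* rcr $1,%reg */
        carry = next_carry;
    }
    return carry;
}
```

Note the inter-limb dependence through `carry`: in the asm that dependence runs through CF, which is what makes rcr hard for the OOO engine to overlap unless the loads/stores are bunched as above.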
> A new one is
>
>     load load not not store store
>     load load not not store store
>     etc
>
> This runs at 1.3 c/l for all alignments, and the timings don't vary
> (just a little bit of jitter).
>
> So far I have checked some of my existing asm functions and found no
> problems: add/sub/addmul/submul/mul/lshift/rshift/addlsh1/sublsh1/com
> are done. mul_basecase is still to do, out of the ones MPIR has.

--
You received this message because you are subscribed to the Google
Groups "mpir-devel" group. To post to this group, send email to
mpir-devel@googlegroups.com. To unsubscribe from this group, send email
to mpir-devel+unsubscr...@googlegroups.com. For more options, visit
this group at http://groups.google.com/group/mpir-devel?hl=en
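For reference, a minimal C sketch of the reordered mpn_com_n schedule described above (hypothetical name; the actual fix is in the hand-written asm, and a C compiler is of course free to reschedule this, so it only illustrates the intended instruction order):

```c
#include <stdint.h>
#include <stddef.h>

/* Two-way grouped limb complement: pair up the loads, then the nots,
   then the stores, instead of alternating load/not/store every limb.
   Per the post above, the alternating schedule hit a cache bank
   conflict at certain src/dst alignments (1.3 -> 2.0 c/l). */
void com_n(uint64_t *dst, const uint64_t *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        uint64_t a = src[i];      /* load  */
        uint64_t b = src[i + 1];  /* load  */
        a = ~a;                   /* not   */
        b = ~b;                   /* not   */
        dst[i] = a;               /* store */
        dst[i + 1] = b;           /* store */
    }
    if (i < n)                    /* odd tail limb */
        dst[i] = ~src[i];
}
```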