If I delete case1, case2 and case3, so that we run the main loop and just fall straight through into case0, then the time is back to 3393, and there are no branches now to get in the way.
ie

    add $4,%r8
    mov %rcx,-16(%rdi,%r8,8)
    jnc lp
    # this is end of main loop
    ALIGN(32)
    skiplp:
    #cmp $2,%r8
    #ja case0
    #je case1
    #jp case2
    case0:
    add %r10,%rax
    neg %rax
    pop %rbp
    pop %rbx
    ret
    # here be dragons

now uncommenting the cmp to give

    add $4,%r8
    mov %rcx,-16(%rdi,%r8,8)
    jnc lp
    # this is end of main loop
    ALIGN(32)
    skiplp:
    cmp $2,%r8
    #ja case0
    #je case1
    #jp case2
    case0:
    add %r10,%rax
    neg %rax
    pop %rbp
    pop %rbx
    ret
    # here be dragons

this takes 3394

and now uncommenting ja case0 to give

    add $4,%r8
    mov %rcx,-16(%rdi,%r8,8)
    jnc lp
    # this is end of main loop
    ALIGN(32)
    skiplp:
    cmp $2,%r8
    ja case0
    #je case1
    #jp case2
    case0:
    add %r10,%rax
    neg %rax
    pop %rbp
    pop %rbx
    ret
    # here be dragons

this gives us 2809 for jumping to case0 , but 3390 from falling thru it !!!!!

On Wednesday 06 May 2009 00:17:56 Jason Moxham wrote:
> On Monday 04 May 2009 19:23:43 David Harvey wrote:
> > Does it make a difference if you permute the case0 block with any of
> > the others?
>
> No difference
>
> > Does it make a difference if you insert a dummy read/write instruction
> > into the case0 block?
>
> if I put a
> mov %r15,%r9
> at the start of case0 , which should do "nothing"
> then the times for case0 increase by 150 cycles to 2957
> if I put another
> mov %r14,%r8
> at the start of case0 , which again should do "nothing" then the time goes
> back down to 2813 which is about 6 cycles longer than originally.
>
> using nop's instead we get
> 1 nop   no effect 2809
> 2 nops  time to 3093
> 3 nops  time to 3228
> 4 nops  time to 2953
>
> > david
> >
> > On May 4, 1:39 pm, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > Making all cases the same ie using jmp case0 then all the times are
> > > fast , and using a jmp case1 then all the times are slow. This looks
> > > like just the case0 epilogue is fast , and case1,2,3 epilogues are
> > > taking 500 cycles.
> > > L1 cache is 32KB and our 2 srcs and 1 dst are 24K
> > > overall , so all data should be in L1 , but the timings look like it's
> > > coming from main memory (not even L2). L1 cache line size is 64 bytes
> > > which is 8 limbs so if this was affecting it we would have an n mod 8
> > > pattern to the times not an n mod 4
> > >
> > > On Monday 04 May 2009 18:02:50 David Harvey wrote:
> > > > What happens if you remove the epilogue, i.e. make it run the main
> > > > loop exactly floor(n/4) times, so that it performs exactly the same
> > > > sequence of instructions for e.g. n = 12, 13, 14, 15?
> > > >
> > > > david
> > > >
> > > > On May 4, 11:44 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > > Yeah , the numbers are consistent , nice surprise for core2 :)
> > > > >
> > > > > And running tests on their own gives us the same numbers.
> > > > >
> > > > > tune$ ./speed -c -s 1000 mpn_test_pppn
> > > > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU
> > > > > freq 1861.91 MHz
> > > > >         mpn_test_pppn
> > > > > 1000    2809.93
> > > > > tune$ ./speed -c -s 1001 mpn_test_pppn
> > > > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU
> > > > > freq 1861.91 MHz
> > > > >         mpn_test_pppn
> > > > > 1001    3385.72
> > > > >
> > > > > As the difference in timings is so large and proportional (mostly)
> > > > > to the loop count , I conclude that it is the loop really running
> > > > > slower and not some delay after the loop. But I've put large
> > > > > alignments at the start *and* end of the main loop so we know it's
> > > > > not a mismatch between decode/execute loops. I can't think what
> > > > > else it could be.
> > > > >
> > > > > I've noticed this on some other functions for core2 as well , but
> > > > > not all!!
> > > > >
> > > > > On Monday 04 May 2009 16:22:15 David Harvey wrote:
> > > > > > Do you get consistent numbers if you run only for a single value
> > > > > > of n? i.e.
> > > > > > it's not an artifact of the way the buffers are
> > > > > > allocated or something?
> > > > > >
> > > > > > david
> > > > > >
> > > > > > On May 4, 10:27 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > > > > Hi
> > > > > > >
> > > > > > > I've been playing with some assembler for the Intel Core2 chips
> > > > > > > and have come across this timing oddity which I can't explain .
> > > > > > > Any ideas?
> > > > > > >
> > > > > > > Attached is an attempt at mpn_addlsh1_n
> > > > > > >
> > > > > > > running timings for a few sizes
> > > > > > > limbs   time in cycles
> > > > > > >  990    3358.04
> > > > > > >  991    3323.79
> > > > > > >  992    2787.45
> > > > > > >  993    3357.63
> > > > > > >  994    3358.74
> > > > > > >  995    3393.34
> > > > > > >  996    2798.41
> > > > > > >  997    3370.40
> > > > > > >  998    3389.18
> > > > > > >  999    3358.13
> > > > > > > 1000    2809.83
> > > > > > > 1001    3385.78
> > > > > > > 1002    3424.43
> > > > > > > 1003    3373.76
> > > > > > > 1004    2820.91
> > > > > > > 1005    3389.62
> > > > > > > 1006    3416.26
> > > > > > > 1007    3339.87
> > > > > > > 1008    2833.34
> > > > > > > 1009    3371.09
> > > > > > > 1010    3429.02
> > > > > > >
> > > > > > > As you can see the timings when n%4=0 are much faster , as it's
> > > > > > > a 4-way unroll we expect it to be a little faster , but
> > > > > > > nothing like this. For example going from 1008 to 1009 limbs
> > > > > > > takes an extra 538 cycles !!!!! You will also notice a useless
> > > > > > > push %rbp , and the alignment for the loop is 32 not 16 ,
> > > > > > > without this I could not get the fast speed for the n%4=0 case.
> > > > > > > This is on a core2 and a penryn
> > > > > > >
> > > > > > > Jason
> > > > > > >
> > > > > > > addlsh1_n.asm
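For anyone following the case0..case3 discussion without the attachment to hand, here is a minimal C sketch of the shape being timed: a main loop that handles 4 limbs per iteration, followed by a dispatch on the 0 to 3 leftover limbs. It assumes mpn_addlsh1_n computes rp = up + 2*vp over n limbs and returns the carry out (0, 1 or 2). The names limb_t, addlsh1_limb, addlsh1_ref, addlsh1_unrolled and addlsh1_selfcheck are made up for illustration, and the case labels here are numbered by leftover count, which need not match the labels in addlsh1_n.asm. This only models the control flow; it says nothing about why the Core2 timings differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t limb_t; /* hypothetical stand-in for a 64-bit limb */

/* One limb of r = u + 2*v + carry.  *carry is 0, 1 or 2 on entry and exit. */
static limb_t addlsh1_limb(limb_t u, limb_t v, unsigned *carry)
{
    assert(*carry <= 2);
    limb_t hi = v >> 63;      /* bit shifted out of 2*v */
    limb_t lo = v << 1;
    limb_t s = lo + u;
    hi += (s < lo);           /* carry out of lo + u */
    limb_t r = s + *carry;
    hi += (r < s);            /* carry out of adding the incoming carry */
    *carry = (unsigned)hi;
    return r;
}

/* Plain one-limb-at-a-time reference. */
limb_t addlsh1_ref(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    unsigned carry = 0;
    for (size_t i = 0; i < n; i++)
        rp[i] = addlsh1_limb(up[i], vp[i], &carry);
    return carry;
}

/* 4-way unrolled main loop, then an epilogue chosen by the leftover count. */
limb_t addlsh1_unrolled(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    unsigned carry = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {          /* main loop: floor(n/4) passes */
        rp[i]     = addlsh1_limb(up[i],     vp[i],     &carry);
        rp[i + 1] = addlsh1_limb(up[i + 1], vp[i + 1], &carry);
        rp[i + 2] = addlsh1_limb(up[i + 2], vp[i + 2], &carry);
        rp[i + 3] = addlsh1_limb(up[i + 3], vp[i + 3], &carry);
    }
    switch (n - i) {                      /* n mod 4 leftover limbs */
    case 3: rp[i] = addlsh1_limb(up[i], vp[i], &carry); i++; /* fall through */
    case 2: rp[i] = addlsh1_limb(up[i], vp[i], &carry); i++; /* fall through */
    case 1: rp[i] = addlsh1_limb(up[i], vp[i], &carry); i++; /* fall through */
    case 0: break;                        /* nothing left, just return */
    }
    return carry;
}

/* Returns 1 if both versions agree for every n from 0 to MAXN. */
int addlsh1_selfcheck(void)
{
    enum { MAXN = 19 };
    limb_t up[MAXN], vp[MAXN], r1[MAXN], r2[MAXN];
    limb_t x = 0x9E3779B97F4A7C15ULL;
    for (size_t i = 0; i < MAXN; i++) {
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        up[i] = x;
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        vp[i] = x | ((limb_t)1 << 63);    /* top bit set so carries occur */
    }
    for (size_t n = 0; n <= MAXN; n++) {
        if (addlsh1_ref(r1, up, vp, n) != addlsh1_unrolled(r2, up, vp, n))
            return 0;
        for (size_t i = 0; i < n; i++)
            if (r1[i] != r2[i])
                return 0;
    }
    return 1;
}
```

Calling addlsh1_selfcheck() from a small main is enough to confirm that the unrolled version and the reference agree for every leftover count. In this model n = 1008 and n = 1009 differ by exactly one extra limb step in the epilogue, which is why an extra 538 cycles between them is so surprising.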