Does it make a difference if you permute the case0 block with any of the others?
Does it make a difference if you insert a dummy read/write instruction into the case0 block?

david

On May 4, 1:39 pm, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> Making all cases the same, i.e. using jmp case0, makes all the times fast,
> and using jmp case1 makes all the times slow. This looks like just the
> case0 epilogue is fast, and the case1/2/3 epilogues are taking ~500 cycles.
>
> The L1 cache is 32KB and our 2 srcs and 1 dst are 24KB overall (3 x 1000
> limbs x 8 bytes), so all the data should be in L1, but the timings look
> like they are coming from main memory (not even L2). The L1 cache line
> size is 64 bytes, which is 8 limbs, so if that were the cause we would
> see an n mod 8 pattern in the times, not n mod 4.
>
> On Monday 04 May 2009 18:02:50 David Harvey wrote:
> > What happens if you remove the epilogue, i.e. make it run the main
> > loop exactly floor(n/4) times, so that it performs exactly the same
> > sequence of instructions for e.g. n = 12, 13, 14, 15?
> >
> > david
> >
> > On May 4, 11:44 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > Yeah, the numbers are consistent; a nice surprise for core2 :)
> > >
> > > And running the tests on their own gives us the same numbers.
> > >
> > > tune$ ./speed -c -s 1000 mpn_test_pppn
> > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs,
> > > CPU freq 1861.91 MHz
> > > mpn_test_pppn
> > >         1000    2809.93
> > > tune$ ./speed -c -s 1001 mpn_test_pppn
> > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs,
> > > CPU freq 1861.91 MHz
> > > mpn_test_pppn
> > >         1001    3385.72
> > >
> > > As the difference in timings is so large and (mostly) proportional to
> > > the loop count, I conclude that it is the loop itself really running
> > > slower, not some delay after the loop. But I've put large alignments
> > > at the start *and* end of the main loop, so we know it's not a
> > > mismatch between the decode and execute loops. I can't think what
> > > else it could be.
> > >
> > > I've noticed this on some other functions for core2 as well, but not
> > > all!
> > >
> > > On Monday 04 May 2009 16:22:15 David Harvey wrote:
> > > > Do you get consistent numbers if you run only for a single value
> > > > of n? i.e. it's not an artifact of the way the buffers are
> > > > allocated or something?
> > > >
> > > > david
> > > >
> > > > On May 4, 10:27 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > > Hi
> > > > >
> > > > > I've been playing with some assembler for the Intel Core2 chips
> > > > > and have come across this timing oddity which I can't explain.
> > > > > Any ideas?
> > > > >
> > > > > Attached is an attempt at mpn_addlsh1_n.
> > > > >
> > > > > Running timings for a few sizes:
> > > > >
> > > > > limbs   time in cycles
> > > > >  990    3358.04
> > > > >  991    3323.79
> > > > >  992    2787.45
> > > > >  993    3357.63
> > > > >  994    3358.74
> > > > >  995    3393.34
> > > > >  996    2798.41
> > > > >  997    3370.40
> > > > >  998    3389.18
> > > > >  999    3358.13
> > > > > 1000    2809.83
> > > > > 1001    3385.78
> > > > > 1002    3424.43
> > > > > 1003    3373.76
> > > > > 1004    2820.91
> > > > > 1005    3389.62
> > > > > 1006    3416.26
> > > > > 1007    3339.87
> > > > > 1008    2833.34
> > > > > 1009    3371.09
> > > > > 1010    3429.02
> > > > >
> > > > > As you can see, the timings when n % 4 == 0 are much faster. As
> > > > > it's a 4-way unroll we expect those sizes to be a little faster,
> > > > > but nothing like this: going from 1008 to 1009 limbs takes an
> > > > > extra 538 cycles!
> > > > >
> > > > > You will also notice a useless push %rbp, and the alignment for
> > > > > the loop is 32 rather than 16; without this I could not get the
> > > > > fast speed for the n % 4 == 0 case. This is on a core2 and a
> > > > > penryn.
> > > > >
> > > > > Jason
> > > > >
> > > > > [attachment: addlsh1_n.asm, 2K]
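For readers following the thread without the attachment, here is a rough
C-level sketch of what mpn_addlsh1_n computes and of the loop shape being
discussed: a 4-way unrolled main loop run floor(n/4) times, then one of four
epilogues (the case0..case3 blocks, case0 falling straight through) selected
by n mod 4. This is a sketch and not the attached asm: the names
addlsh1_n_ref and step are invented here, and the __uint128_t carry handling
assumes GCC or Clang on a 64-bit target.

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t limb_t;

    /* One limb of r = u + 2*v + cy; the outgoing cy is 0, 1 or 2,
       since u + 2*v + cy can exceed the limb by up to two. */
    static inline limb_t step(limb_t u, limb_t v, limb_t *cy)
    {
        __uint128_t t = (__uint128_t)u + ((__uint128_t)v << 1) + *cy;
        *cy = (limb_t)(t >> 64);
        return (limb_t)t;
    }

    /* Sketch of mpn_addlsh1_n: {rp,n} = {up,n} + 2*{vp,n}, returns
       the final carry. The shape mirrors the asm under discussion:
       a 4-way unrolled main loop, then an epilogue picked by n mod 4. */
    limb_t addlsh1_n_ref(limb_t *rp, const limb_t *up,
                         const limb_t *vp, size_t n)
    {
        limb_t cy = 0;
        size_t i = 0;

        for (; i + 4 <= n; i += 4) {        /* floor(n/4) iterations */
            rp[i]     = step(up[i],     vp[i],     &cy);
            rp[i + 1] = step(up[i + 1], vp[i + 1], &cy);
            rp[i + 2] = step(up[i + 2], vp[i + 2], &cy);
            rp[i + 3] = step(up[i + 3], vp[i + 3], &cy);
        }

        switch (n % 4) {                    /* the epilogue dispatch  */
        case 3: rp[i] = step(up[i], vp[i], &cy); i++; /* fall through */
        case 2: rp[i] = step(up[i], vp[i], &cy); i++; /* fall through */
        case 1: rp[i] = step(up[i], vp[i], &cy); i++; /* fall through */
        case 0: break;                      /* the fast path above    */
        }
        return cy;
    }

The switch on n % 4 is the C analogue of the jmp case0/case1/... dispatch
that the "make all cases the same" experiment tampers with.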
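And, for anyone wanting to poke at the n mod 4 pattern without building
MPIR's tune/speed harness, a minimal standalone timing loop around the
sketch above. Assumptions to note: __rdtsc() from x86intrin.h counts TSC
ticks rather than calibrated cycles (on Core 2 the TSC runs at the nominal
core clock), there is no serialization around the timed region, and taking
the minimum over many repetitions is only a crude stand-in for speed's
overhead subtraction, so treat the output as indicative.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <x86intrin.h>      /* __rdtsc(), GCC/Clang on x86 */

    /* as defined in the sketch above */
    uint64_t addlsh1_n_ref(uint64_t *rp, const uint64_t *up,
                           const uint64_t *vp, size_t n);

    int main(void)
    {
        enum { MAXN = 1010, REPS = 1000 };
        /* 3 operands x ~8KB each, about 24KB total: the same working
           set as in the thread, comfortably inside a 32KB L1. */
        static uint64_t up[MAXN], vp[MAXN], rp[MAXN];

        for (size_t i = 0; i < MAXN; i++) {
            up[i] = 0x0123456789abcdefULL * (i + 1);
            vp[i] = 0xfedcba9876543210ULL ^ i;
        }

        for (size_t n = 990; n <= 1010; n++) {
            uint64_t best = UINT64_MAX;
            for (int r = 0; r < REPS; r++) {   /* min over reps */
                uint64_t t0 = __rdtsc();
                addlsh1_n_ref(rp, up, vp, n);
                uint64_t t1 = __rdtsc();
                if (t1 - t0 < best)
                    best = t1 - t0;
            }
            printf("%4zu  %6llu ticks  (n mod 4 = %llu)\n",
                   n, (unsigned long long)best,
                   (unsigned long long)(n % 4));
        }
        return 0;
    }

Of course the compiled C will not have the alignment or scheduling of the
hand-written asm, so it may well not reproduce the anomaly; it only shows
the measurement setup and keeps the working set inside L1, ruling out
capacity misses as an explanation for any pattern it does show.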