Yeah, the numbers are consistent, nice surprise for core2 :) And running the tests on their own gives us the same numbers.

tune$ ./speed -c -s 1000 mpn_test_pppn
overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU freq 1861.91 MHz
        mpn_test_pppn
1000          2809.93
tune$ ./speed -c -s 1001 mpn_test_pppn
overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU freq 1861.91 MHz
        mpn_test_pppn
1001          3385.72

As the difference in timings is so large and (mostly) proportional to the loop count, I conclude that it is the loop itself running slower and not some delay after the loop. But I've put large alignments at the start *and* end of the main loop, so we know it's not a mismatch between decode/execute loops. I can't think what else it could be. I've noticed this on some other functions for core2 as well, but not all!!

On Monday 04 May 2009 16:22:15 David Harvey wrote:
> Do you get consistent numbers if you run only for a single value of n?
> i.e. it's not an artifact of the way the buffers are allocated or
> something?
>
> david
>
> On May 4, 10:27 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > Hi
> >
> > I've been playing with some assembler for the Intel Core2 chips and have
> > come across this timing oddity which I can't explain. Any ideas?
> >
> > Attached is an attempt at mpn_addlsh1_n
> >
> > running timings for a few sizes
> > limbs   time in cycles
> >  990    3358.04
> >  991    3323.79
> >  992    2787.45
> >  993    3357.63
> >  994    3358.74
> >  995    3393.34
> >  996    2798.41
> >  997    3370.40
> >  998    3389.18
> >  999    3358.13
> > 1000    2809.83
> > 1001    3385.78
> > 1002    3424.43
> > 1003    3373.76
> > 1004    2820.91
> > 1005    3389.62
> > 1006    3416.26
> > 1007    3339.87
> > 1008    2833.34
> > 1009    3371.09
> > 1010    3429.02
> >
> > As you can see, the timings when n%4=0 are much faster. As it's a 4-way
> > unroll we expect it to be a little faster, but nothing like this. For
> > example, going from 1008 to 1009 limbs takes an extra 538 cycles !!!!!
> > You will also notice a useless push %rbp, and the alignment for the loop
> > is 32 not 16; without this I could not get the fast speed for the n%4=0
> > case. This is on a core2 and a penryn.
> >
> > Jason
> >
> > addlsh1_n.asm (2K)
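
For anyone who wants to check the n%4 pattern in the quoted timings, here is a small Python sketch that is not part of the original thread. It converts each timing to cycles per limb and averages by n % 4. The names TIMINGS and cycles_per_limb_by_residue are made up for illustration; the numbers are copied from the quoted table.

```python
from collections import defaultdict

# Timings copied from the quoted table: limbs -> cycles.
TIMINGS = {
    990: 3358.04, 991: 3323.79, 992: 2787.45, 993: 3357.63,
    994: 3358.74, 995: 3393.34, 996: 2798.41, 997: 3370.40,
    998: 3389.18, 999: 3358.13, 1000: 2809.83, 1001: 3385.78,
    1002: 3424.43, 1003: 3373.76, 1004: 2820.91, 1005: 3389.62,
    1006: 3416.26, 1007: 3339.87, 1008: 2833.34, 1009: 3371.09,
    1010: 3429.02,
}


def cycles_per_limb_by_residue(timings, unroll=4):
    """Average cycles per limb for each value of n % unroll."""
    buckets = defaultdict(list)
    for n, cycles in timings.items():
        buckets[n % unroll].append(cycles / n)
    # Sort by residue so the report order is stable.
    return {r: sum(v) / len(v) for r, v in sorted(buckets.items())}


if __name__ == "__main__":
    for residue, cpl in cycles_per_limb_by_residue(TIMINGS).items():
        print(f"n % 4 == {residue}: {cpl:.3f} cycles/limb")
```

On these numbers the n%4=0 sizes come out at about 2.81 cycles per limb, and the other three residues at roughly 3.3 to 3.4. A fixed cost for handling the leftover limbs would not change the per-limb rate by that much at around 1000 limbs, so this fits the reading that the whole loop runs slower.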