Yeah, the numbers are consistent, a nice surprise for Core2 :)

And running the tests on their own gives us the same numbers.

tune$ ./speed -c -s 1000 mpn_test_pppn
overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU freq 1861.91 MHz
        mpn_test_pppn
1000          2809.93
tune$ ./speed -c -s 1001 mpn_test_pppn
overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU freq 1861.91 MHz
        mpn_test_pppn
1001          3385.72

As the difference in timings is so large and (mostly) proportional to the loop
count, I conclude that it really is the loop running slower and not some
delay after the loop. But I've put large alignments at the start *and* end of
the main loop, so we know it's not a mismatch between the decode/execute loops.
I can't think what else it could be.
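
As a rough check of the "proportional to the loop count" reasoning, here is a
minimal Python sketch that divides each timing from the table in the quoted
message below by its limb count. The function name mean_cycles_per_limb is
made up for illustration; only the numbers come from this thread.

```python
# Timings (limbs -> cycles) copied from the table in the quoted message.
timings = {
    990: 3358.04, 991: 3323.79, 992: 2787.45, 993: 3357.63,
    994: 3358.74, 995: 3393.34, 996: 2798.41, 997: 3370.40,
    998: 3389.18, 999: 3358.13, 1000: 2809.83, 1001: 3385.78,
    1002: 3424.43, 1003: 3373.76, 1004: 2820.91, 1005: 3389.62,
    1006: 3416.26, 1007: 3339.87, 1008: 2833.34, 1009: 3371.09,
    1010: 3429.02,
}


def mean_cycles_per_limb(timings):
    """Return (fast, slow): mean cycles per limb for n % 4 == 0 and for the rest."""
    fast = [cycles / n for n, cycles in timings.items() if n % 4 == 0]
    slow = [cycles / n for n, cycles in timings.items() if n % 4 != 0]
    return sum(fast) / len(fast), sum(slow) / len(slow)


fast, slow = mean_cycles_per_limb(timings)
print(f"n%4==0   : {fast:.2f} cycles/limb")
print(f"otherwise: {slow:.2f} cycles/limb")
# The main loop is a 4-way unroll, so scale the gap to one loop iteration.
print(f"extra per 4-limb iteration: {4 * (slow - fast):.2f} cycles")
```

The per-limb cost comes out at roughly 2.8 cycles for n%4=0 and roughly 3.4
cycles otherwise. A fixed delay after the loop would add a constant number of
cycles rather than a constant cost per limb, but the sizes here are too close
together to tell those two cases apart from this table alone.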

I've noticed this on some other functions for Core2 as well, but not all!

On Monday 04 May 2009 16:22:15 David Harvey wrote:
> Do you get consistent numbers if you run only for a single value of n?
> i.e. it's not an artifact of the way the buffers are allocated or
> something?
>
> david
>
> On May 4, 10:27 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > Hi
> >
> > I've been playing with some assembler for the Intel Core2 chips and have
> > come across this timing oddity which I can't explain. Any ideas?
> >
> > Attached is an attempt at mpn_addlsh1_n
> >
> > running timings for a few sizes
> >  limbs       time in cycles
> > 990           3358.04
> > 991           3323.79
> > 992           2787.45
> > 993           3357.63
> > 994           3358.74
> > 995           3393.34
> > 996           2798.41
> > 997           3370.40
> > 998           3389.18
> > 999           3358.13
> > 1000          2809.83
> > 1001          3385.78
> > 1002          3424.43
> > 1003          3373.76
> > 1004          2820.91
> > 1005          3389.62
> > 1006          3416.26
> > 1007          3339.87
> > 1008          2833.34
> > 1009          3371.09
> > 1010          3429.02
> >
> > As you can see, the timings when n%4=0 are much faster. As it's a 4-way
> > unroll we expect it to be a little faster, but nothing like this. For
> > example, going from 1008 to 1009 limbs takes an extra 538 cycles!
> > You will also notice a useless push %rbp, and the alignment for the loop
> > is 32, not 16; without this I could not get the fast speed for the n%4=0
> > case. This is on a Core2 and a Penryn.
> >
> > Jason
> >
> >  addlsh1_n.asm
>
> 


