If I delete case1, case2 and case3, so that we run the main loop and fall
straight through into case0, the time goes back to 3393 cycles, and now there
are no branches to get in the way.

i.e.
        add $4,%r8
        mov %rcx,-16(%rdi,%r8,8)
        jnc lp     # this is end of main loop
ALIGN(32)
skiplp:
#cmp $2,%r8
#ja case0
#je case1
#jp case2
case0:
        add %r10,%rax
        neg %rax
        pop %rbp
        pop %rbx
        ret
# here be dragons

Now uncommenting the cmp gives
        add $4,%r8
        mov %rcx,-16(%rdi,%r8,8)
        jnc lp     # this is end of main loop
ALIGN(32)
skiplp:
        cmp $2,%r8
#ja case0
#je case1
#jp case2
case0:
        add %r10,%rax
        neg %rax
        pop %rbp
        pop %rbx
        ret
# here be dragons

This takes 3394 cycles.

And now uncommenting ja case0 gives
        add $4,%r8
        mov %rcx,-16(%rdi,%r8,8)
        jnc lp     # this is end of main loop
ALIGN(32)
skiplp:
        cmp $2,%r8
        ja case0
#je case1
#jp case2
case0:
        add %r10,%rax
        neg %rax
        pop %rbp
        pop %rbx
        ret
# here be dragons

This gives us 2809 cycles when we jump to case0, but 3390 cycles when we fall
through into it!
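
To put the gap in per-iteration terms, here is a rough Python sketch of the arithmetic. It assumes these runs were at n = 1000 limbs (my inference from the matching 2809-cycle figure in the earlier speed output, not something stated here) and uses the 4-way unroll of the main loop. The variable names are only for illustration.

```python
# Timings in cycles, as reported above.
# ASSUMPTION: n = 1000 limbs, inferred from the matching 2809 figure
# in the earlier "./speed -c -s 1000" run.
N_LIMBS = 1000
UNROLL = 4                  # the main loop handles 4 limbs per iteration

jump_to_case0 = 2809        # "ja case0" is taken
fall_through = 3390         # execution falls through into case0

iterations = N_LIMBS // UNROLL
extra = fall_through - jump_to_case0

print(f"loop iterations       : {iterations}")
print(f"extra cycles in total : {extra}")
print(f"extra per iteration   : {extra / iterations:.2f}")
print(f"cycles/limb (jump)    : {jump_to_case0 / N_LIMBS:.3f}")
print(f"cycles/limb (fall)    : {fall_through / N_LIMBS:.3f}")
```

If that assumption holds, the slowdown is a couple of cycles on every pass through the loop, which fits better with the loop itself running slower than with a one-off penalty in the epilogue.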



On Wednesday 06 May 2009 00:17:56 Jason Moxham wrote:
> On Monday 04 May 2009 19:23:43 David Harvey wrote:
> > Does it make a difference if you permute the case0 block with any of
> > the others?
>
> No difference
>
> > Does it make a difference if you insert a dummy read/write instruction
> > into the case0 block?
>
> If I put a
> mov %r15,%r9
> at the start of case0, which should do "nothing",
> then the time for case0 increases by about 150 cycles to 2957.
> If I put another
> mov %r14,%r8
> at the start of case0, which again should do "nothing", then the time goes
> back down to 2813, which is about 6 cycles longer than originally.
>
> Using nops instead we get
> 1 nop   no effect, 2809
> 2 nops  time to 3093
> 3 nops  time to 3228
> 4 nops  time to 2953
>
> > david
> >
> > On May 4, 1:39 pm, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > If I make all the cases the same, i.e. use jmp case0, then all the
> > > > times are fast, and if I use jmp case1 then all the times are slow.
> > > > It looks as if only the case0 epilogue is fast, and the case1, 2
> > > > and 3 epilogues are taking 500 cycles. The L1 cache is 32KB and our
> > > > two sources and one destination are 24KB overall, so all the data
> > > > should be in L1, but the timings look as if it is coming from main
> > > > memory (not even L2). The L1 cache line size is 64 bytes, which is
> > > > 8 limbs, so if that were affecting it we would see an n mod 8
> > > > pattern in the times, not an n mod 4 one.
> > >
> > > On Monday 04 May 2009 18:02:50 David Harvey wrote:
> > > > What happens if you remove the epilogue, i.e. make it run the main
> > > > loop exactly floor(n/4) times, so that it performs exactly the same
> > > > sequence of instructions for e.g. n = 12, 13, 14, 15?
> > > >
> > > > david
> > > >
> > > > On May 4, 11:44 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > > > Yeah, the numbers are consistent, nice surprise for Core 2 :)
> > > > > >
> > > > > > And running the tests on their own gives us the same numbers.
> > > > >
> > > > > tune$ ./speed -c -s 1000 mpn_test_pppn
> > > > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU
> > > > > freq 1861.91 MHz
> > > > >         mpn_test_pppn
> > > > > 1000          2809.93
> > > > > tune$ ./speed -c -s 1001 mpn_test_pppn
> > > > > overhead 7.00 cycles, precision 1000000 units of 5.37e-10 secs, CPU
> > > > > freq 1861.91 MHz
> > > > >         mpn_test_pppn
> > > > > 1001          3385.72
> > > > >
> > > > > > As the difference in timings is so large and (mostly)
> > > > > > proportional to the loop count, I conclude that it really is the
> > > > > > loop running slower and not some delay after the loop. But I've
> > > > > > put large alignments at the start *and* end of the main loop, so
> > > > > > we know it's not a mismatch between decode/execute loops. I
> > > > > > can't think what else it could be.
> > > > > >
> > > > > > I've noticed this on some other functions for Core 2 as well,
> > > > > > but not all!
> > > > >
> > > > > On Monday 04 May 2009 16:22:15 David Harvey wrote:
> > > > > > Do you get consistent numbers if you run only for a single value
> > > > > > of n? i.e. it's not an artifact of the way the buffers are
> > > > > > allocated or something?
> > > > > >
> > > > > > david
> > > > > >
> > > > > > > On May 4, 10:27 am, Jason Moxham <ja...@njkfrudils.plus.com> wrote:
> > > > > > > Hi
> > > > > > >
> > > > > > > > I've been playing with some assembler for the Intel Core 2
> > > > > > > > chips and have come across this timing oddity, which I
> > > > > > > > can't explain. Any ideas?
> > > > > > >
> > > > > > > Attached is an attempt at mpn_addlsh1_n
> > > > > > >
> > > > > > > > Running timings for a few sizes gives:
> > > > > > > > limbs         time in cycles
> > > > > > > 990           3358.04
> > > > > > > 991           3323.79
> > > > > > > 992           2787.45
> > > > > > > 993           3357.63
> > > > > > > 994           3358.74
> > > > > > > 995           3393.34
> > > > > > > 996           2798.41
> > > > > > > 997           3370.40
> > > > > > > 998           3389.18
> > > > > > > 999           3358.13
> > > > > > > 1000          2809.83
> > > > > > > 1001          3385.78
> > > > > > > 1002          3424.43
> > > > > > > 1003          3373.76
> > > > > > > 1004          2820.91
> > > > > > > 1005          3389.62
> > > > > > > 1006          3416.26
> > > > > > > 1007          3339.87
> > > > > > > 1008          2833.34
> > > > > > > 1009          3371.09
> > > > > > > 1010          3429.02
> > > > > > >
> > > > > > > > As you can see, the timings when n%4=0 are much faster. As
> > > > > > > > it's a 4-way unroll we expect it to be a little faster, but
> > > > > > > > nothing like this. For example, going from 1008 to 1009 limbs
> > > > > > > > takes an extra 538 cycles! You will also notice a useless
> > > > > > > > push %rbp, and that the alignment for the loop is 32, not 16;
> > > > > > > > without these I could not get the fast speed for the n%4=0
> > > > > > > > case. This is on a Core 2 and a Penryn.
> > > > > > >
> > > > > > > Jason
> > > > > > >
> > > > > > >  addlsh1_n.asm
> > > > > > > 2KViewDownload
>
> 
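
For anyone who wants to check the n mod 4 pattern and the cache arithmetic from the quoted messages, here is a small Python sketch. The timings are copied from the table in the first message; the function names are just for illustration, and the 8-byte limb size comes from the "64 bytes which is 8 limbs" remark.

```python
from collections import defaultdict
from statistics import mean

# (limbs -> cycles), copied from the table in the first message.
TIMINGS = {
    990: 3358.04, 991: 3323.79, 992: 2787.45, 993: 3357.63,
    994: 3358.74, 995: 3393.34, 996: 2798.41, 997: 3370.40,
    998: 3389.18, 999: 3358.13, 1000: 2809.83, 1001: 3385.78,
    1002: 3424.43, 1003: 3373.76, 1004: 2820.91, 1005: 3389.62,
    1006: 3416.26, 1007: 3339.87, 1008: 2833.34, 1009: 3371.09,
    1010: 3429.02,
}


def mean_by_residue(timings, modulus=4):
    """Average the cycle counts of all sizes sharing the same n % modulus."""
    groups = defaultdict(list)
    for n, cycles in timings.items():
        groups[n % modulus].append(cycles)
    return {r: mean(v) for r, v in sorted(groups.items())}


def working_set_bytes(n_limbs, operands=3, limb_bytes=8):
    """Bytes touched by two sources plus one destination of n_limbs each."""
    return n_limbs * operands * limb_bytes


if __name__ == "__main__":
    for residue, avg in mean_by_residue(TIMINGS).items():
        print(f"n % 4 == {residue}: {avg:8.2f} cycles on average")
    step = TIMINGS[1009] - TIMINGS[1008]
    print(f"1008 -> 1009 limbs: {step:.2f} extra cycles")
    print(f"working set at 1000 limbs: {working_set_bytes(1000)} bytes, "
          f"L1 data cache: {32 * 1024} bytes")
    print(f"limbs per 64-byte cache line: {64 // 8}")
```

Calling mean_by_residue(TIMINGS, 8) groups the same data by n mod 8, which is the pattern we would expect to see if cache lines were the cause.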


--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"mpir-devel" group.
To post to this group, send email to mpir-devel@googlegroups.com
To unsubscribe from this group, send email to 
mpir-devel+unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/mpir-devel?hl=en
-~----------~----~----~----~------~----~------~--~---
