On Mon, May 4, 2009 at 11:27 AM, Jason Moxham wrote:
>
> Hi
>
> I've been playing with some assembler for the Intel Core2 chips and have come
> across this timing oddity which I can't explain. Any ideas?
Maybe it's to do with the branch predictor? Remarks:
1. It seems to me that this starts hap
Does it make a difference if you permute the case0 block with any of
the others?
Does it make a difference if you insert a dummy read/write instruction
into the case0 block?
david
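
The .asm attachment referred to in this thread is not included here. For context: mpn_addlsh1_n computes {rp,n} = {up,n} + 2*{vp,n} and returns the carry (0 to 2), and from the discussion the attempt appears to be a 4-way unrolled loop that ends by jumping to one of four tail blocks, case0..case3, selected by n mod 4. The C sketch below is only my guess at that general shape, with invented names; it is not the attached code.

/* Rough C sketch (my guess, not the attached code) of a 4-way unrolled
   addlsh1 with a per-remainder epilogue, i.e. the case0..case3 blocks
   being discussed above. */
#include <stdint.h>
#include <stddef.h>

typedef uint64_t limb_t;

/* One limb of rp = up + 2*vp: r = u + 2*v + cy, carry out in 0..2. */
static limb_t step(limb_t *r, limb_t u, limb_t v, limb_t cy)
{
    limb_t t1 = u + v;   limb_t c1 = t1 < u;
    limb_t t2 = t1 + v;  limb_t c2 = t2 < t1;
    limb_t t3 = t2 + cy; limb_t c3 = t3 < t2;
    *r = t3;
    return c1 + c2 + c3;
}

limb_t addlsh1_n(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    size_t i = 0;

    /* main loop: 4 limbs per iteration */
    for (; i + 4 <= n; i += 4) {
        cy = step(rp + i,     up[i],     vp[i],     cy);
        cy = step(rp + i + 1, up[i + 1], vp[i + 1], cy);
        cy = step(rp + i + 2, up[i + 2], vp[i + 2], cy);
        cy = step(rp + i + 3, up[i + 3], vp[i + 3], cy);
    }

    /* epilogue: one block per value of n mod 4 (the asm presumably
       reaches case0..case3 with a computed jump) */
    switch (n - i) {
    case 3: cy = step(rp + i, up[i], vp[i], cy); i++;  /* fall through */
    case 2: cy = step(rp + i, up[i], vp[i], cy); i++;  /* fall through */
    case 1: cy = step(rp + i, up[i], vp[i], cy);       /* fall through */
    case 0: break;
    }
    return cy;
}

In these terms, the two questions above amount to swapping the case0 body with one of the other case bodies, or adding a harmless extra load/store inside case0, and seeing whether the fast size moves with it.
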
On May 4, 1:39 pm, Jason Moxham wrote:
> Making all cases the same, i.e. using jmp case0, then all the times are fast, and
> using a jmp case1, then all the times are slow. This looks like just the case0
> epilogue is fast, and the case1,2,3 epilogues are taking 500 cycles.
> The L1 cache is 32KB and our 2 srcs and 1 dst are 24KB overall, so all the data should fit.

What happens if you remove the epilogue, i.e. make it run the main
loop exactly floor(n/4) times, so that it performs exactly the same
sequence of instructions for e.g. n = 12, 13, 14, 15?
david
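
In terms of the sketch above, the experiment suggested here amounts to deleting the switch, so that n = 12, 13, 14 and 15 all execute exactly floor(n/4) = 3 iterations of the unrolled body and nothing else. A hedged fragment (it reuses limb_t and step from that sketch, and deliberately returns a wrong result whenever n mod 4 != 0; it is purely a timing probe):

/* Timing-only variant: no epilogue at all.  Wrong for n % 4 != 0; the
   point is that n = 12, 13, 14, 15 now run an identical instruction
   sequence, so any remaining cycle difference is not the tail blocks. */
limb_t addlsh1_n_noepilogue(limb_t *rp, const limb_t *up, const limb_t *vp, size_t n)
{
    limb_t cy = 0;
    for (size_t i = 0; i + 4 <= n; i += 4) {
        cy = step(rp + i,     up[i],     vp[i],     cy);
        cy = step(rp + i + 1, up[i + 1], vp[i + 1], cy);
        cy = step(rp + i + 2, up[i + 2], vp[i + 2], cy);
        cy = step(rp + i + 3, up[i + 3], vp[i + 3], cy);
    }
    return cy;   /* limbs i..n-1 deliberately left untouched */
}
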
On May 4, 11:44 am, Jason Moxham wrote:
> Yeah, the numbers are consistent, nice surprise for core2 :)
> And running tests on their own gives us the same numbers.
>
> tune$ ./speed -c -s 1000 mpn_test_pppn
> overhead 7.00 cycles, precision 100 units of 5.37e-10 secs, CPU freq 1861.91 MHz
> mpn_test_pppn
>         1000        2809.93
> tune$

Do you get consistent numbers if you run only for a single value of n?
i.e. it's not an artifact of the way the buffers are allocated or
something?
david
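
The tune$ ./speed run quoted above is MPIR's own measurement harness. As an independent cross-check for a single size, with the buffers allocated exactly once, something along these lines would do. This is my own sketch, not the project's code; it reuses limb_t and addlsh1_n from the sketch earlier in the thread and reads the raw TSC via __rdtsc:

/* One fixed n, buffers allocated once, minimum TSC delta over many calls.
   limb_t and addlsh1_n are as in the sketch earlier in the thread. */
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>          /* __rdtsc() on GCC/Clang, x86 */

int main(void)
{
    const size_t n = 1000, reps = 1000;
    limb_t *up = malloc(n * sizeof *up);
    limb_t *vp = malloc(n * sizeof *vp);
    limb_t *rp = malloc(n * sizeof *rp);
    for (size_t i = 0; i < n; i++) {
        up[i] = i * 0x9e3779b97f4a7c15ULL;   /* arbitrary fill data */
        vp[i] = ~up[i];
    }

    unsigned long long best = ~0ULL;
    for (size_t r = 0; r < reps; r++) {
        unsigned long long t0 = __rdtsc();
        addlsh1_n(rp, up, vp, n);
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("n=%zu: %llu cycles (min of %zu runs)\n", n, best, reps);

    free(up); free(vp); free(rp);
    return 0;
}

Taking the minimum over repeated calls on the same buffers sidesteps allocation and warm-up effects, so a discrepancy against the speed numbers would point at the way the buffers are set up rather than at the routine itself.
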
On May 4, 10:27 am, Jason Moxham wrote:
> Hi
>
> I've been playing with some assembler for the Intel Core2 chips and have come
> across this timing oddity which I can't explain. Any ideas?
>
> Attached is an attempt at mpn_addlsh1_n.
> Running timings for a few sizes:
>
>   limbs    time in cycles
>     990    3358.04
>     991    3323.79
>     992
Hi Paul,
Great to hear you have some serious hardware hooked up!
Some of us who will be in Seattle for a Sage Days conference in two weeks (I am
already here visiting Seattle now) are planning a GPU party as a first step
towards doing some GPU computations for MPIR. We'll just be writing some CUDA
at first.
(Changing the thread title to be a little more relevant than "Fast computation
of binomial coefficients".)
I now have a Tesla C1060 plugged into a Dell T7400 box running
RHEL5 and am learning how to use CUDA for non-trivial
computations. I'