Thank you for your help. In fact, one of my initial guesses was that some
of the work happens asynchronously. I saw the same result by using "del rr". But I
was skeptical, because it makes the performance not that impressive. Here is a
comparison with numpy.random.randn() on a Q6600 @ 3.2 GHz.
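The CPU side was timed with something like the following (a rough sketch,
not the exact script I ran; the astype(np.float32) is an assumption, to
match the float32 output of the GPU generator):

import numpy as np
import time as clock

N = 100000000

t1 = clock.time()
# CPU: N normally-distributed random numbers
data = np.random.randn(N).astype(np.float32)
t2 = clock.time()
print "Bench CPU: " + str(t2 - t1) + " sec"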

Bench GPU 2: 4.41615891457 sec
Bench CPU: 7.45132184029 sec

Did I miss something?

Martin

On Tue, Jan 18, 2011 at 2:56 PM, Tomasz Rybak <bogom...@post.pl> wrote:

> On Tue, 2011-01-18 at 09:46 -0500, Martin Laprise wrote:
> > Hi, I just made some experiments with the CURAND wrappers. It seems to
> > work very nicely except for a little detail that I can't figure out. The
> > initialization of the generator and the actual random number generation
> > seem very fast. But for whatever reason, PyCUDA takes a long time to
> > "recover" after the number generation. This pause is significantly longer
> > than the actual computation, and the delay increases with N. Here is an
> > example:
> >
>
> curand kernels are called asynchronously.
> This means that PyCUDA returns immediately after
> initiating the call, and does not wait for the result.
> This lets the hardware or driver better manage the
> order of execution, and run many kernels concurrently
> on modern hardware (compute capability 2.x).
>
> After changing your code to force PyCUDA to wait, I got the
> following results:
>
> import numpy as np
> import pycuda.autoinit
> import pycuda.driver
> import pycuda.gpuarray
> from pycuda.curandom import PseudoRandomNumberGenerator
> import time as clock
>
>
> cuda_stream = pycuda.driver.Stream()
>
>
> def curand_prof():
>     N = 100000000
>
>     t1 = clock.time()
>     # GPU: generate N normally-distributed floats on an explicit stream
>     rr = PseudoRandomNumberGenerator(0,
>         np.random.random(128).astype(np.int32))
>     data = pycuda.gpuarray.empty([N], np.float32)
>     rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
>     # wait for the kernel to actually finish before stopping the clock
>     cuda_stream.synchronize()
>     t2 = clock.time()
>     print "Bench 1: " + str(t2 - t1) + " sec"
>
>
> if __name__ == "__main__":
>     t4 = clock.time()
>     curand_prof()
>     t5 = clock.time()
>     print "Bench 2: " + str(t5 - t4) + " sec"
>
> Bench 1: 1.15405488014 sec
> Bench 2: 1.15947508812 sec
>
> It seems consistent with your results - I was running on a GTX 460
> (Fermi). Your GTX 260 is a Tesla-class device, so 256 threads are used;
> Fermi uses 1024 threads, and so takes about a quarter of the time to
> compute the random numbers.
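> (A small sketch to check what your device reports, using PyCUDA's
> device attribute API; the thread count the wrapper picks follows
> from the compute capability:)
>
> import pycuda.autoinit
> import pycuda.driver as drv
>
> dev = pycuda.autoinit.device
> # compute capability and per-block thread limit of the current device
> print dev.name(), dev.compute_capability()
> print dev.get_attribute(drv.device_attribute.MAX_THREADS_PER_BLOCK)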
>
> Best regards, thanks for noticing this, and thanks for testing
> the CURAND wrapper.
>
> --
> Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
> Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
> http://member.acm.org/~tomaszrybak
>
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
