Thank you for your help. In fact, one of my initial guesses was that some of this happens asynchronously. I saw the same result by using "del rr". But I was skeptical, because it makes the performance not that impressive. Here is a comparison with numpy.random.randn() on a Q6600 @ 3.2 GHz:
Bench GPU 2: 4.41615891457 sec
Bench CPU: 7.45132184029 sec

Did I miss something?

Martin

On Tue, Jan 18, 2011 at 2:56 PM, Tomasz Rybak <bogom...@post.pl> wrote:
> On Tue, 2011-01-18 at 09:46 -0500, Martin Laprise wrote:
> > Hi, I just made some experiments with the CURAND wrappers. It seems to
> > work very nicely except for a little detail that I can't figure out. The
> > initialization of the generator and the actual random number generation
> > seem very fast. But for whatever reason, PyCUDA takes a long time to
> > "recover" after the number generation. This pause is significantly
> > longer than the actual computation, and the delay increases with N.
> > Here is an example:
>
> curand kernels are called asynchronously.
> This means that PyCUDA returns immediately after
> initiating the call, and does not wait for the result.
> This allows the hardware or driver to better manage
> the order of execution, and to run many kernels concurrently
> on modern hardware (compute capability 2.x).
>
> After changing your code to force PyCUDA to wait, I got the
> following results:
>
> import numpy as np
> import pycuda.autoinit
> import pycuda.gpuarray
> from pycuda.curandom import PseudoRandomNumberGenerator, \
>     QuasiRandomNumberGenerator
> import cProfile
> import time as clock
>
>
> cuda_stream = pycuda.driver.Stream()
>
> def curand_prof():
>     N = 100000000
>     t1 = clock.time()
>     # GPU
>     rr = PseudoRandomNumberGenerator(0,
>         np.random.random(128).astype(np.int32))
>     data = pycuda.gpuarray.empty([N], np.float32)
>     rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
>     cuda_stream.synchronize()
>     t2 = clock.time()
>     print "Bench 1: " + str(t2-t1) + " sec"
>
>
> if __name__ == "__main__":
>     t4 = clock.time()
>     curand_prof()
>     t5 = clock.time()
>     print "Bench 2: " + str(t5-t4) + " sec"
>
> Bench 1: 1.15405488014 sec
> Bench 2: 1.15947508812 sec
>
> It seems consistent with your results - I was running on a GTX 460
> with Fermi.
> Your GTX 260 is Tesla-class, so 256 threads are used;
> Fermi uses 1024 threads, which takes a quarter of the time to compute
> the random numbers.
>
> Best regards, thanks for noticing this, and thanks for testing
> the CURAND wrapper.
>
> --
> Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
> Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> http://member.acm.org/~tomaszrybak
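P.S. For anyone else who hits this: the timing pitfall Tomasz describes is not specific to CUDA. A rough sketch of the same idea without a GPU, using a background thread as a stand-in for an asynchronous kernel (the function name and sleep duration below are purely illustrative, not PyCUDA API, and it is written for modern Python 3 print syntax rather than the Python 2 style in the quoted code):

```python
import threading
import time

def fake_kernel():
    # Stand-in for an asynchronous GPU kernel: the real work takes a while.
    time.sleep(0.5)

t0 = time.time()
th = threading.Thread(target=fake_kernel)
th.start()                  # returns immediately, like an async kernel launch
launch_time = time.time() - t0

th.join()                   # like cuda_stream.synchronize(): wait for the work
total_time = time.time() - t0

print(launch_time < 0.1)    # the launch itself looks almost free
print(total_time >= 0.5)    # the real cost only appears once we wait
```

So if you time only up to the launch (or let "del rr" or a later memory copy do the waiting implicitly), the kernel cost gets attributed to whatever operation happened to block first.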
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda