Thanks for your explanation. I don't think your wrapper is too simple. I think it's a very good start and it saved me a lot of work. I was just surprised by the performance of CURAND itself, which is advertised with a speedup of around 10-20X. But I think that with your code below running on a Fermi, we are quite near the advertised speedup. I ran the code on my GTX 260. I was not able to use N = 100000000 (cuLaunchGrid failed: out of memory), but with N = 50000000 I got:

Bench 1: 1.75797915459 sec
Bench 2: 0.303898096085 sec

Impressive!
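In case it helps anyone else hitting the same out-of-memory failure: the 100M-sample buffer alone needs about 400 MB as float32, so it is worth sizing N from the free memory the driver reports. A minimal sketch (my own addition; pycuda.driver.mem_get_info() is the only PyCUDA call used here, and the 0.8 headroom factor is just a guess):

import numpy
import pycuda.autoinit
import pycuda.driver

# Ask the driver how much device memory is still free.
free_bytes, total_bytes = pycuda.driver.mem_get_info()
# Keep some headroom for the CURAND state array and other allocations.
max_n = int(0.8 * free_bytes) // numpy.dtype(numpy.float32).itemsize
print "free: %d MB -> largest safe N is roughly %d" % (free_bytes // (1024 * 1024), max_n)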
On Thu, Jan 20, 2011 at 8:19 AM, Tomasz Rybak <bogom...@post.pl> wrote:
> On Tue, 2011-01-18 at 15:45 -0500, Martin Laprise wrote:
> > Thank you for your help. In fact, it was one of my initial guesses that
> > some stuff happens asynchronously. I saw the same result by using "del rr".
> > But I was skeptical because it makes the performance not that impressive.
> > Here is a comparison with numpy.random.randn() on a Q6600 @ 3.2 GHz:
> >
> > Bench GPU 2: 4.41615891457 sec
> > Bench CPU: 7.45132184029 sec
> >
> > Did I miss something?
>
> No.
> First - I think NVIDIA introduced CURAND to avoid generating random
> values on the CPU and transferring them over to the GPU - which would
> give you some performance benefit.
> Second - you already have a performance gain.
> Third - my wrappers around CURAND are rather simple (maybe too simple)
> because they use only one multiprocessor. I decided to use one MP
> because of troubles with too many threads during initialisation on Tesla.
> I have come up with the code below (warning - for now this is a rather
> dirty hack, but it works for me on both a GTX 460 (Fermi) and an ION 9400
> (Tesla)):
>
> mp_source = """
> // Uses C++ features (templates); do not surround with extern "C"
> #include <curand_kernel.h>
>
> extern "C" {
> __global__ void prepare_with_seed(curandState *s, const int n, const int seed, const int offset) {
>     int id = blockIdx.x*blockDim.x+threadIdx.x;
>     if (id < n) {
>         curand_init(seed, threadIdx.x, offset, &s[id]);
>     }
> }
> __global__ void normal_float(%(data_type)s *s, float *d, const int n) {
>     int id = blockIdx.x*blockDim.x+threadIdx.x;
>     for (int idx = id; idx < n; idx += blockDim.x*gridDim.x) {
>         d[idx] = curand_normal(&s[id]);
>     }
> }
> }
> """
>
> def testMP(data, input_size, stream = None):
>     import pycuda.compiler
>     import pycuda.driver
>     import random
>     if pycuda.driver.get_version() < (3, 2, 0):
>         raise EnvironmentError("Need at least CUDA 3.2")
>     dev = pycuda.driver.Context.get_device()
>     block_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK)
>     block_dimension = dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X)
>     generator_count = min(block_size, block_dimension)
>     mp_count = dev.get_attribute(pycuda.driver.device_attribute.MULTIPROCESSOR_COUNT)
>     state = pycuda.driver.mem_alloc(generator_count * curand_state_size)
>     seed = random.randint(0, ((1 << 31) - 1))
>     source = str(mp_source) % {
>         'data_type': 'curandState',
>     }
>     module = pycuda.compiler.SourceModule(source, no_extern_c=True, keep=True)
>     p = module.get_function("prepare_with_seed")
>     if dev.compute_capability() >= (2, 0):
>         p.prepare("Piii", block=(generator_count, 1, 1))
>         p.prepared_call((mp_count, 1), state, generator_count, seed, 0)
>     else:
>         # Ugly hack for non-Fermi
>         p.prepare("Piii", block=(generator_count/2, 1, 1))
>         p.prepared_call((mp_count*2, 1), state, generator_count, seed, 0)
>     normal_float = module.get_function("normal_float")
>     normal_float.prepare("PPi", block=(generator_count, 1, 1))
>     normal_float.prepared_async_call((mp_count, 1), stream, state, data, input_size)
>
> if __name__ == '__main__':
>     import pycuda.gpuarray
>     import pycuda.autoinit
>     import time as clock
>     import numpy
>     from pycuda.curandom import PseudoRandomNumberGenerator
>     N = 100000000
>     rr = PseudoRandomNumberGenerator(0, numpy.random.random(256).astype(numpy.int32))
>     cuda_stream = pycuda.driver.Stream()
>
>     t1 = clock.time()
>     # GPU
>     data = pycuda.gpuarray.empty([N], numpy.float32)
>     rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
>     cuda_stream.synchronize()
>     t2 = clock.time()
>     print "Bench 1: " + str(t2-t1) + " sec"
>     del data
>     t1 = clock.time()
>     # GPU
>     data = pycuda.gpuarray.empty([N], numpy.float32)
>     testMP(data.gpudata, N, stream=cuda_stream)
>     cuda_stream.synchronize()
>     t2 = clock.time()
>     print "Bench 2: " + str(t2-t1) + " sec"
>
>
> It uses all the MPs it can find and gives the following results:
> ION (had to decrease N to 90M):
> Bench 1: 3.946
> Bench 2: 3.911
> Not much improvement, but the ION has only 2 MPs.
>
> GTX 460 (N = 100M):
> Bench 1: 1.13427710533 sec
> Bench 2: 0.22767996788 sec
> So here you have some improvement, although only 5x, not the 7x that
> would be suggested by the 7 MPs on the GTX 460. Please test on your
> GTX 260, which should give results between the ION and the GTX 460.
>
> But I would like for Andreas to include the first version
> of the CURAND wrappers, and only then start working on optimisations.
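For completeness, here is a quick way to sanity-check the multi-MP output (my own addition; it assumes testMP and the imports from the snippet quoted above): the mean of the samples should be close to 0 and the standard deviation close to 1.

import numpy
import pycuda.autoinit
import pycuda.driver
import pycuda.gpuarray

N = 50000000
cuda_stream = pycuda.driver.Stream()
data = pycuda.gpuarray.empty([N], numpy.float32)
testMP(data.gpudata, N, stream=cuda_stream)    # fill data with curand_normal samples
cuda_stream.synchronize()

host = data.get()                              # copy the samples back to the host
print "mean: %f (expect close to 0)" % host.mean(dtype=numpy.float64)
print "std:  %f (expect close to 1)" % host.std(dtype=numpy.float64)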
>
> >
> > Martin
> >
> > On Tue, Jan 18, 2011 at 2:56 PM, Tomasz Rybak <bogom...@post.pl> wrote:
> >
> > > On Tue, 2011-01-18 at 09:46 -0500, Martin Laprise wrote:
> > > > Hi, I just made some experiments with the CURAND wrappers. It seems to
> > > > work very nicely except for a little detail that I can't figure out.
> > > > The initialization of the generator and the actual random number
> > > > generation seem very fast. But for whatever reason, PyCUDA takes a
> > > > long time to "recover" after the number generation. This pause is
> > > > significantly longer than the actual computation and the delay
> > > > increases with N. Here is an example:
> > > >
> > >
> > > curand kernels are called asynchronously.
> > > This means that PyCUDA returns immediately after
> > > initiating the call, and does not wait for the result.
> > > This allows the hardware or driver to better manage the
> > > order of execution, and to run many kernels concurrently
> > > on modern hardware (compute capability 2.x).
> > >
> > > After changing your code to force PyCUDA to wait, I got
> > > the following results:
> > >
> > > import numpy as np
> > > import pycuda.autoinit
> > > import pycuda.gpuarray
> > > from pycuda.curandom import PseudoRandomNumberGenerator, QuasiRandomNumberGenerator
> > > import cProfile
> > > import time as clock
> > >
> > >
> > > cuda_stream = pycuda.driver.Stream()
> > >
> > > def curand_prof():
> > >
> > >     N = 100000000
> > >
> > >     t1 = clock.time()
> > >     # GPU
> > >     rr = PseudoRandomNumberGenerator(0, np.random.random(128).astype(np.int32))
> > >     data = pycuda.gpuarray.empty([N], np.float32)
> > >     rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
> > >     cuda_stream.synchronize()
> > >     t2 = clock.time()
> > >     print "Bench 1: " + str(t2-t1) + " sec"
> > >
> > >
> > > if __name__ == "__main__":
> > >     t4 = clock.time()
> > >     curand_prof()
> > >     t5 = clock.time()
> > >     print "Bench 2: " + str(t5-t4) + " sec"
> > >
> > > Bench 1: 1.15405488014 sec
> > > Bench 2: 1.15947508812 sec
> > >
> > > It seems consistent with your results - I was running on a GTX 460,
> > > which is Fermi. Your GTX 260 is Tesla, so 256 threads are used;
> > > Fermi uses 1024 threads, which needs a quarter of the time to compute
> > > the random numbers.
> > >
> > > Best regards, thanks for noticing this, and thanks for testing
> > > the CURAND wrapper.
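On the timing side, CUDA events are an alternative to wall-clock time plus an explicit synchronize, and they measure only the GPU work. A minimal sketch of that approach (my own addition; it assumes rr, data, N and the active context from the curand_prof example quoted above):

import pycuda.driver

# Events recorded on the same stream as the kernel bracket only the GPU work.
cuda_stream = pycuda.driver.Stream()
start = pycuda.driver.Event()
stop = pycuda.driver.Event()

start.record(cuda_stream)
rr.fill_normal_float(data.gpudata, N, stream=cuda_stream)
stop.record(cuda_stream)
stop.synchronize()                      # block until the kernel (and the stop event) completes
print "kernel time: %.3f ms" % start.time_till(stop)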
> > >
> > > --
> > > Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
> > > Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> > > http://member.acm.org/~tomaszrybak
> >
>
> --
> Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
> Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860
> http://member.acm.org/~tomaszrybak
>