Dnia 2010-12-15, śro o godzinie 00:30 +0100, Tomasz Rybak pisze: > Dnia 2010-12-14, wto o godzinie 01:14 +0100, Tomasz Rybak pisze: > > Dnia 2010-12-10, pią o godzinie 01:05 +0100, Tomasz Rybak pisze: > > > Hello. > > > NVIDIA introduced library CURAND in CUDA 3.2. > > > It allows for generating pseudo-random numbers on GPU. > > > CURAND library consists of two parts - host part, and device > > > part. > > > The latter is set of __device__ kernels that can be called > > > from custom kernels. > > > I attach my wrapper around those kernels, and patch > > > for PyCUDA that would include my objects to pycuda.curandom. > > > I have checked those classes on both Fermi and 1.1 device. > > > On ION one cannot call 512 threads at one time - there is > > > not enough resources to initialise 512 random states at one time. > > > > > > This is rather early version - so please give some feedback. > > > I am also using camelCase names while PyCUDA uses > > > names_with_underscores; I assume I will need to change it before > > > patch gets accepted. > > > > New version, IMO ready to be included into PyCUDA. > > I have changed naming conventions to be proper PEP 8 style. > > Please give some feedback. > > Added documentation describing new classes.
Final version - added "const" modifiers to kernel arguments. -- Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860 Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860 http://member.acm.org/~tomaszrybak
diff --git a/doc/source/array.rst b/doc/source/array.rst index 5273df9..d3034ff 100644 --- a/doc/source/array.rst +++ b/doc/source/array.rst @@ -345,6 +345,152 @@ Generating Arrays of Random Numbers Return an array of `shape` filled with random values of `dtype` in the range [0,1). +.. warning:: + + The following classes are using random number generators that run on GPU. + Each thread uses own generator. Creation of those generators requires more + resources than subsequent generation of random numbers. After experiments + it looks like maximum number of active generators on Tesla devices + (with compute capabilities 1.x) is 256. Fermi devices allow for creating + 1024 generators without any problems. If there are troubles with creating + objects of class Randomizer or QuasiRandomizer decrease number of created + generators (and therefore number of active threads). + +.. class:: Randomizer(generator_count, offset, seed=None) + + Class surrounding CURAND kernels from CUDA 3.2. + It allows for generating pseudo-random numbers with uniform and normal + probability function of type int, float, double, float2, double2. + + CUDA 3.2 and above. + + .. method fill_in_uniform_int(data, input_size, stream=None) + + Fills in array of input_size integer values pointed by data + with pseudo-random integers with uniform distribution. + + .. method fill_in_uniform_float(data, input_size, stream=None) + + Fills in array of input_size float values pointed by data + with pseudo-random floats with uniform distribution. + + .. method fill_in_uniform_double(data, input_size, stream=None) + + Fills in array of input_size double values pointed by data + with pseudo-random floats with uniform distribution. + + .. method fill_in_normal_float(data, input_size, stream=None) + + Fills in array of input_size float values pointed by data + with pseudo-random floats with normal distribution. + + .. method fill_in_normal_double(data, input_size, stream=None) + + Fills in array of input_size double values pointed by data + with pseudo-random double with normal distribution. + + .. method fill_in_normal_float2(data, input_size, stream=None) + + Fills in array of input_size float pairs pointed by data + with pseudo-random floats with normal distribution. + + .. method fill_in_normal_double2(data, input_size, stream=None) + + Fills in array of input_size double pairs pointed by data + with pseudo-random double with normal distribution. + + .. method fill_in_uniform(data, stream=None) + + Fills in GPUArray with pseudo-random values with uniform distribution. + Detects type of items in GPUArray and calls appropriate method. + + .. method fill_in_normal(data, stream=None) + + Fills in GPUArray with pseudo-random values with normal distribution. + Detects type of items in GPUArray and calls appropriate method. + + .. method call_skip_ahead(i, stream=None) + + Forces all generators to skip i values. Is equivalent to generating + i values and discarding results, but is much faster. + + .. method call_skip_ahead_array(i, stream=None) + + Accepts array i of integer values, telling each generator how many + values to skip. + + .. method __call__(shape, dtype=numpy.float32, stream=None) + + Creates GPUArray, fills it with pseudo-random values and returns it. + +.. class:: QuasiRandomizer(generator_count, offset, seed=None) + + Class surrounding CURAND kernels from CUDA 3.2. + It allows for generating quasi-random numbers with uniform and normal + probability function of type int, float, double, float2, double2. + + CUDA 3.2 and above. + + .. method fill_in_uniform_int(data, input_size, stream=None) + + Fills in array of input_size integer values pointed by data + with quasi-random integers with uniform distribution. + + .. method fill_in_uniform_float(data, input_size, stream=None) + + Fills in array of input_size float values pointed by data + with quasi-random floats with uniform distribution. + + .. method fill_in_uniform_double(data, input_size, stream=None) + + Fills in array of input_size double values pointed by data + with quasi-random floats with uniform distribution. + + .. method fill_in_normal_float(data, input_size, stream=None) + + Fills in array of input_size float values pointed by data + with quasi-random floats with normal distribution. + + .. method fill_in_normal_double(data, input_size, stream=None) + + Fills in array of input_size double values pointed by data + with quasi-random double with normal distribution. + + .. method fill_in_normal_float2(data, input_size, stream=None) + + Fills in array of input_size float pairs pointed by data + with quasi-random floats with normal distribution. + + .. method fill_in_normal_double2(data, input_size, stream=None) + + Fills in array of input_size double pairs pointed by data + with quasi-random double with normal distribution. + + .. method fill_in_uniform(data, stream=None) + + Fills in GPUArray with quasi-random values with uniform distribution. + Detects type of items in GPUArray and calls appropriate method. + + .. method fill_in_normal(data, stream=None) + + Fills in GPUArray with quasi-random values with normal distribution. + Detects type of items in GPUArray and calls appropriate method. + + .. method call_skip_ahead(i, stream=None) + + Forces all generators to skip i values. Is equivalent to generating + i values and discarding results, but is much faster. + + .. method call_skip_ahead_array(i, stream=None) + + Accepts array i of integer values, telling each generator how many + values to skip. + + .. method __call__(shape, dtype=numpy.float32, stream=None) + + Creates GPUArray, fills it with quasi-random values and returns it. + + Single-pass Custom Expression Evaluation ---------------------------------------- diff --git a/pycuda/curandom.py b/pycuda/curandom.py index 2105f40..e5bbb40 100644 --- a/pycuda/curandom.py +++ b/pycuda/curandom.py @@ -232,6 +232,371 @@ def rand(shape, dtype=numpy.float32, stream=None): return result +# Need to allocate memory for curandState - there in not enough memory +# (neither shared nor private) on the chip for one for each thread ;-) +# sizeof(curandState)) +curand_state_size = 32 +curand_state_sobol_size = 32 + +random_source = """ +// Uses C++ features (templates); do not surround with extern C +#include <curand_kernel.h> + +extern "C" { +__global__ void prepare_with_seed(curandState *s, const int n, const int seed, const int offset) { + if (threadIdx.x < n) { + curand_init(seed, threadIdx.x, offset, &s[threadIdx.x]); + } +} + +__global__ void prepare_with_seeds(curandState *s, const int n, const int *seed, const int offset) { + if (threadIdx.x < n) { + curand_init(seed[threadIdx.x], threadIdx.x, offset, + &s[threadIdx.x]); + } +} + +__global__ void uniform_int(curandState *s, int *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand(&s[tidx]); + } +} + +__global__ void uniform_float(curandState *s, float *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_uniform(&s[tidx]); + } +} + +__global__ void uniform_double(curandState *s, double *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_uniform_double(&s[tidx]); + } +} + +__global__ void normal_float(curandState *s, float *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal(&s[tidx]); + } +} + +__global__ void normal_double(curandState *s, double *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal_double(&s[tidx]); + } +} + +__global__ void normal_float2(curandState *s, float2 *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal2(&s[tidx]); + } +} + +__global__ void normal_double2(curandState *s, double2 *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal2_double(&s[tidx]); + } +} + +__global__ void skip_ahead(curandState *s, const int n, const int skip) { + const int idx = threadIdx.x; + if (idx < n) { + skipahead(skip, &s[idx]); + } +} + +__global__ void skip_ahead_array(curandState *s, const int n, const int *skip) { + const int idx = threadIdx.x; + if (idx < n) { + skipahead(skip[idx], &s[idx]); + } +} +} +""" + +class Randomizer: + """ + Class surrounding CURAND kernels from CUDA 3.2. + It allows for generating pseudo-random numbers with uniform + and normal probability function of type int, float, double, + float2, double2. + """ + def __init__(self, generator_count, offset, seed = None): + import pycuda.compiler + import pycuda.driver + import pycuda._driver + import pycuda.gpuarray + if pycuda.driver.get_version() < (3, 2, 0): + raise EnvironmentError("Need at least CUDA 3.2") + self.generator_count = generator_count + self.state = pycuda.driver.mem_alloc(generator_count * + curand_state_size) + m = pycuda.compiler.SourceModule(random_source, no_extern_c=True, + keep=True) + if seed is None: + import random + import sys + seed = random.randint(0, ((1 << 31) - 1)) +# TODO: should we use async call here, or leave as is - it is just a constructor? +# TODO: cudaThreadSetLimit(cudaLimitStackSize, 16k) on Fermi + if isinstance(seed, int): + p = m.get_function("prepare_with_seed") + p.prepare("Piii", block=(generator_count, 1, 1)) + try: +# Max 256 threads on ION during preparing +# Need to limit number of threads. If not there is failed execution: +# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources + p.prepared_call((1, 1), self.state, generator_count, seed, + offset) + except pycuda._driver.LaunchError: + raise ValueError("Initialisation failed. Decrease number of threads.") + else: + if isinstance(seed, list) or isinstance(seed, tuple): + seed = numpy.array(seed, dtype=numpy.int32) + if isinstance(seed, numpy.ndarray): + seed = pycuda.gpuarray.to_gpu(seed).astype(numpy.int32) + if (isinstance(seed, pycuda.gpuarray.GPUArray) and + seed.dtype == numpy.int32): + p = m.get_function("prepare_with_seeds") + p.prepare("PiPi", block=(generator_count, 1, 1)) + try: +# Max 256 threads on ION during preparing +# Need to limit number of threads. If not there is failed execution: +# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources + p.prepared_call((1, 1), self.state, generator_count, + seed.gpudata, offset) + except pycuda._driver.LaunchError: + raise ValueError("Initialisation failed. Decrease number of threads.") + else: + raise ValueError("Need GPUArray of integers") + self.uniform_int = m.get_function("uniform_int") + self.uniform_int.prepare("PPi", block=(generator_count, 1, 1)) + self.uniform_float = m.get_function("uniform_float") + self.uniform_float.prepare("PPi", block=(generator_count, 1, 1)) + self.uniform_double = m.get_function("uniform_double") + self.uniform_double.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_float = m.get_function("normal_float") + self.normal_float.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_double = m.get_function("normal_double") + self.normal_double.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_float2 = m.get_function("normal_float2") + self.normal_float2.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_double2 = m.get_function("normal_double2") + self.normal_double2.prepare("PPi", block=(generator_count, 1, 1)) + self.skip_ahead = m.get_function("skip_ahead") + self.skip_ahead.prepare("Pii", block=(generator_count, 1, 1)) + self.skip_ahead_array = m.get_function("skip_ahead_array") + self.skip_ahead_array.prepare("PiP", block=(generator_count, 1, 1)) + def __del__(self): + self.state.free() + def fill_in_uniform_int(self, data, input_size, stream = None): + self.uniform_int.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform_float(self, data, input_size, stream = None): + self.uniform_float.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform_double(self, data, input_size, stream = None): + self.uniform_double.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_float(self, data, input_size, stream = None): + self.normal_float.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_double(self, data, input_size, stream = None): + self.normal_double.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_float2(self, data, input_size, stream = None): + self.normal_float2.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_double2(self, data, input_size, stream = None): + self.normal_double2.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform(self, data, stream = None): + if data.dtype == numpy.float32: + self.fill_in_uniform_float(data.gpudata, data.size, stream) + elif data.dtype == numpy.float64: + self.fill_in_uniform_double(data.gpudata, data.size, stream) + elif data.dtype in [numpy.int, numpy.int32, numpy.uint32]: + self.fill_in_uniform_int(data.gpudata, data.size, stream) + else: + raise NotImplementedError + def fill_in_normal(self, data, stream = None): + if isinstance(data.dtype, numpy.float32): + self.fill_in_normal_float(data.gpudata, data.size, stream) + elif isinstance(data.dtype, numpy.float64): + self.fill_in_normal_double(data.gpudata, data.size, stream) + else: + raise NotImplementedError + def call_skip_ahead(self, i, stream = None): + self.skip_ahead.prepared_async_call((1, 1), stream, self.state, + self.generator_count, i) + def call_skip_ahead_array(self, i, stream = None): + self.skip_ahead_array.prepared_async_call((1, 1), stream, self.state, + self.generator_count, i.gpudata) + def __call__(self, shape, dtype = numpy.float32, stream = None): + import pycuda.gpuarray + result = pycuda.gpuarray.GPUArray(shape, dtype) +# Do not use GPUArray._blocks like curand.rand does; +# we are limited by number of states, set in constructor + self.fill_in_uniform(result, stream) + return result + +quasi_random_source = """ +// Uses C++ features (templates); do not surround with extern C +#include <curand_kernel.h> + +extern "C" { +__global__ void prepare(curandStateSobol32 *s, const int n, unsigned int *v, + const unsigned int o) { + if (threadIdx.x < n) { + curand_init(v, o, &s[threadIdx.x]); + } +} + +__global__ void uniform_int(curandStateSobol32 *s, int *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand(&s[tidx]); + } +} + +__global__ void uniform_float(curandStateSobol32 *s, float *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_uniform(&s[tidx]); + } +} + +__global__ void uniform_double(curandStateSobol32 *s, double *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_uniform_double(&s[tidx]); + } +} + +__global__ void normal_float(curandStateSobol32 *s, float *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal(&s[tidx]); + } +} + +__global__ void normal_double(curandStateSobol32 *s, double *d, const int n) { + const int tidx = threadIdx.x; + for (int idx = tidx; idx < n; idx += blockDim.x) { + d[idx] = curand_normal_double(&s[tidx]); + } +} + +__global__ void skip_ahead(curandStateSobol32 *s, const int n, const int skip) { + const int idx = threadIdx.x; + if (idx < n) { + skipahead(skip, &s[idx]); + } +} + +__global__ void skip_ahead_array(curandStateSobol32 *s, const int n, + const int *skip) { + const int idx = threadIdx.x; + if (idx < n) { + skipahead(skip[idx], &s[idx]); + } +} +} +""" + +class QuasiRandomizer: + """ + Class surrounding CURAND kernels from CUDA 3.2. + It allows for generating quasi-random numbers with uniform + and normal probability function of type int, float, and double. + """ + def __init__(self, generator_count, vector, offset): + import pycuda.compiler + import pycuda.driver + if pycuda.driver.get_version() < (3, 2, 0): + raise EnvironmentError("Need at least CUDA 3.2") + self.generator_count = generator_count + self.state = pycuda.driver.mem_alloc(generator_count * + curand_state_sobol_size) + m = pycuda.compiler.SourceModule(quasi_random_source, no_extern_c=True, + keep=True) +# TODO: should we use async call here, or leave as is - it is just a constructor? + p = m.get_function("prepare") + p.prepare("PiPi", block=(generator_count, 1, 1)) + try: +# Max 256 threads on ION during preparing +# Need to limit number of threads. If not there is failed execution: +# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources + p.prepared_call((1, 1), self.state, generator_count, vector, offset) + except pycuda._driver.LaunchError: + raise ValueError("Initialisation failed. Decrease number of threads.") + self.uniform_int = m.get_function("uniform_int") + self.uniform_int.prepare("PPi", block=(generator_count, 1, 1)) + self.uniform_float = m.get_function("uniform_float") + self.uniform_float.prepare("PPi", block=(generator_count, 1, 1)) + self.uniform_double = m.get_function("uniform_double") + self.uniform_double.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_float = m.get_function("normal_float") + self.normal_float.prepare("PPi", block=(generator_count, 1, 1)) + self.normal_double = m.get_function("normal_double") + self.normal_double.prepare("PPi", block=(generator_count, 1, 1)) + self.skip_ahead = m.get_function("skip_ahead") + self.skip_ahead.prepare("Pii", block=(generator_count, 1, 1)) + self.skip_ahead_array = m.get_function("skip_ahead_array") + self.skip_ahead_array.prepare("PiP", block=(generator_count, 1, 1)) + def __del__(self): + self.state.free() + def fill_in_uniform_int(self, data, input_size, stream = None): + self.uniform_int.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform_float(self, data, input_size, stream = None): + self.uniform_float.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform_double(self, data, input_size, stream = None): + self.uniform_double.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_float(self, data, input_size, stream = None): + self.normal_float.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_normal_double(self, data, input_size, stream = None): + self.normal_double.prepared_async_call((1, 1), stream, self.state, + data, input_size) + def fill_in_uniform(self, data, stream = None): + if data.dtype == numpy.float32: + self.fill_in_uniform_float(data.gpudata, data.size, stream) + elif data.dtype == numpy.float64: + self.fill_in_uniform_double(data.gpudata, data.size, stream) + elif data.dtype in [numpy.int, numpy.int32, numpy.uint32]: + self.fill_in_uniform_int(data.gpudata, data.size, stream) + else: + raise NotImplementedError + def fill_in_normal(self, data, stream = None): + if isinstance(data.dtype, numpy.float32): + self.fill_in_normal_float(data.gpudata, data.size, stream) + elif isinstance(data.dtype, numpy.float64): + self.fill_in_normal_double(data.gpudata, data.size, stream) + else: + raise NotImplementedError + def call_skip_ahead(self, i): + self.skip_ahead.prepared_call((1, 1), self.state, + self.generator_count, i) + def call_skip_ahead_array(self, i): + self.skip_ahead_array.prepared_call((1, 1), self.state, + self.generator_count, i.gpudata) + def __call__(self, shape, dtype = numpy.float32, stream = None): + import pycuda.gpuarray + result = pycuda.gpuarray.GPUArray(shape, dtype) + self.fill_in_uniform(result, stream) + return result + + if __name__ == "__main__": import sys, pycuda.autoinit @@ -255,3 +620,17 @@ if __name__ == "__main__": subplot(132); plot( r2.get(),"x-") subplot(133); plot( r3.get(),"x-") show() + + import pycuda.gpuarray + rr = Randomizer(256, 0, numpy.random.random(256).astype(numpy.int32)) + data = pycuda.gpuarray.zeros([10000], numpy.float32) + print data[0:200] + rr.fill_in_uniform_float(data.gpudata, 512) + print data[0:200] + data = rr((100, ), numpy.uint32) + del rr + print data[0:200] + data = pycuda.gpuarray.zeros([256], numpy.int) + rr = QuasiRandomizer(256, data.gpudata, 1) + print data[0:200] +
signature.asc
Description: This is a digitally signed message part
_______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda