Re: [PyCUDA] CURAND wrapper for PyCUDA

Tomasz Rybak Wed, 15 Dec 2010 13:06:37 -0800

Dnia 2010-12-15, śro o godzinie 00:30 +0100, Tomasz Rybak pisze:
> Dnia 2010-12-14, wto o godzinie 01:14 +0100, Tomasz Rybak pisze:
> > Dnia 2010-12-10, pią o godzinie 01:05 +0100, Tomasz Rybak pisze:
> > > Hello.
> > > NVIDIA introduced library CURAND in CUDA 3.2.
> > > It allows for generating pseudo-random numbers on GPU.
> > > CURAND library consists of two parts - host part, and device
> > > part.
> > > The latter is set of __device__ kernels that can be called
> > > from custom kernels.
> > > I attach my wrapper around those kernels, and patch
> > > for PyCUDA that would include my objects to pycuda.curandom.
> > > I have checked those classes on both Fermi and 1.1 device.
> > > On ION one cannot call 512 threads at one time - there is
> > > not enough resources to initialise 512 random states at one time.
> > > 
> > > This is rather early version - so please give some feedback.
> > > I am also using camelCase names while PyCUDA uses
> > > names_with_underscores; I assume I will need to change it before
> > > patch gets accepted.
> > 
> > New version, IMO ready to be included into PyCUDA.
> > I have changed naming conventions to be proper PEP 8 style.
> > Please give some feedback.
> 
> Added documentation describing new classes.


Final version - added "const" modifiers to kernel arguments.

-- 
Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860
Fingerprint A481 824E 7DD3 9C0E C40A  488E C654 FB33 2AD5 9860
http://member.acm.org/~tomaszrybak

diff --git a/doc/source/array.rst b/doc/source/array.rst
index 5273df9..d3034ff 100644
--- a/doc/source/array.rst
+++ b/doc/source/array.rst
@@ -345,6 +345,152 @@ Generating Arrays of Random Numbers
     Return an array of `shape` filled with random values of `dtype`
     in the range [0,1).
 
+.. warning::
+
+    The following classes are using random number generators that run on GPU.
+    Each thread uses own generator. Creation of those generators requires more
+    resources than subsequent generation of random numbers. After experiments
+    it looks like maximum number of active generators on Tesla devices
+    (with compute capabilities 1.x) is 256. Fermi devices allow for creating
+    1024 generators without any problems. If there are troubles with creating
+    objects of class Randomizer or QuasiRandomizer decrease number of created
+    generators (and therefore number of active threads).
+
+.. class:: Randomizer(generator_count, offset, seed=None)
+
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating pseudo-random numbers with uniform and normal
+    probability function of type int, float, double, float2, double2.
+
+    CUDA 3.2 and above.
+
+    .. method fill_in_uniform_int(data, input_size, stream=None)
+
+        Fills in array of input_size integer values pointed by data
+	with pseudo-random integers with uniform distribution.
+
+    .. method fill_in_uniform_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with pseudo-random floats with uniform distribution.
+
+    .. method fill_in_uniform_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with pseudo-random floats with uniform distribution.
+
+    .. method fill_in_normal_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with pseudo-random floats with normal distribution.
+
+    .. method fill_in_normal_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with pseudo-random double with normal distribution.
+
+    .. method fill_in_normal_float2(data, input_size, stream=None)
+
+        Fills in array of input_size float pairs pointed by data
+	with pseudo-random floats with normal distribution.
+
+    .. method fill_in_normal_double2(data, input_size, stream=None)
+
+        Fills in array of input_size double pairs pointed by data
+	with pseudo-random double with normal distribution.
+
+    .. method fill_in_uniform(data, stream=None)
+
+        Fills in GPUArray with pseudo-random values with uniform distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method fill_in_normal(data, stream=None)
+
+        Fills in GPUArray with pseudo-random values with normal distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method call_skip_ahead(i, stream=None)
+
+        Forces all generators to skip i values. Is equivalent to generating
+	i values and discarding results, but is much faster.
+	
+    .. method call_skip_ahead_array(i, stream=None)
+
+        Accepts array i of integer values, telling each generator how many
+	values to skip.
+
+    .. method __call__(shape, dtype=numpy.float32, stream=None)
+
+        Creates GPUArray, fills it with pseudo-random values and returns it.
+
+.. class:: QuasiRandomizer(generator_count, offset, seed=None)
+
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating quasi-random numbers with uniform and normal
+    probability function of type int, float, double, float2, double2.
+
+    CUDA 3.2 and above.
+
+    .. method fill_in_uniform_int(data, input_size, stream=None)
+
+        Fills in array of input_size integer values pointed by data
+	with quasi-random integers with uniform distribution.
+
+    .. method fill_in_uniform_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with quasi-random floats with uniform distribution.
+
+    .. method fill_in_uniform_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with quasi-random floats with uniform distribution.
+
+    .. method fill_in_normal_float(data, input_size, stream=None)
+
+        Fills in array of input_size float values pointed by data
+	with quasi-random floats with normal distribution.
+
+    .. method fill_in_normal_double(data, input_size, stream=None)
+
+        Fills in array of input_size double values pointed by data
+	with quasi-random double with normal distribution.
+
+    .. method fill_in_normal_float2(data, input_size, stream=None)
+
+        Fills in array of input_size float pairs pointed by data
+	with quasi-random floats with normal distribution.
+
+    .. method fill_in_normal_double2(data, input_size, stream=None)
+
+        Fills in array of input_size double pairs pointed by data
+	with quasi-random double with normal distribution.
+
+    .. method fill_in_uniform(data, stream=None)
+
+        Fills in GPUArray with quasi-random values with uniform distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method fill_in_normal(data, stream=None)
+
+        Fills in GPUArray with quasi-random values with normal distribution.
+	Detects type of items in GPUArray and calls appropriate method.
+
+    .. method call_skip_ahead(i, stream=None)
+
+        Forces all generators to skip i values. Is equivalent to generating
+	i values and discarding results, but is much faster.
+	
+    .. method call_skip_ahead_array(i, stream=None)
+
+        Accepts array i of integer values, telling each generator how many
+	values to skip.
+
+    .. method __call__(shape, dtype=numpy.float32, stream=None)
+
+        Creates GPUArray, fills it with quasi-random values and returns it.
+
+
 Single-pass Custom Expression Evaluation
 ----------------------------------------
 
diff --git a/pycuda/curandom.py b/pycuda/curandom.py
index 2105f40..e5bbb40 100644
--- a/pycuda/curandom.py
+++ b/pycuda/curandom.py
@@ -232,6 +232,371 @@ def rand(shape, dtype=numpy.float32, stream=None):
     
     return result
 
+# Need to allocate memory for curandState - there in not enough memory
+# (neither shared nor private) on the chip for one for each thread ;-)
+# sizeof(curandState))
+curand_state_size = 32
+curand_state_sobol_size = 32
+
+random_source = """
+// Uses C++ features (templates); do not surround with extern C
+#include <curand_kernel.h>
+
+extern "C" {
+__global__ void prepare_with_seed(curandState *s, const int n, const int seed, const int offset) {
+    if (threadIdx.x < n) {
+        curand_init(seed, threadIdx.x, offset, &s[threadIdx.x]);
+    }
+}
+
+__global__ void prepare_with_seeds(curandState *s, const int n, const int *seed, const int offset) {
+    if (threadIdx.x < n) {
+        curand_init(seed[threadIdx.x], threadIdx.x, offset,
+            &s[threadIdx.x]);
+    }
+}
+
+__global__ void uniform_int(curandState *s, int *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand(&s[tidx]);
+    }
+}
+
+__global__ void uniform_float(curandState *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform(&s[tidx]);
+    }
+}
+
+__global__ void uniform_double(curandState *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform_double(&s[tidx]);
+    }
+}
+
+__global__ void normal_float(curandState *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal(&s[tidx]);
+    }
+}
+
+__global__ void normal_double(curandState *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal_double(&s[tidx]);
+    }
+}
+
+__global__ void normal_float2(curandState *s, float2 *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal2(&s[tidx]);
+    }
+}
+
+__global__ void normal_double2(curandState *s, double2 *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal2_double(&s[tidx]);
+    }
+}
+
+__global__ void skip_ahead(curandState *s, const int n, const int skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip, &s[idx]);
+    }
+}
+
+__global__ void skip_ahead_array(curandState *s, const int n, const int *skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip[idx], &s[idx]);
+    }
+}
+}
+"""
+
+class Randomizer:
+    """
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating pseudo-random numbers with uniform
+    and normal probability function of type int, float, double,
+    float2, double2.
+    """
+    def __init__(self, generator_count, offset, seed = None):
+        import pycuda.compiler
+        import pycuda.driver
+        import pycuda._driver
+        import pycuda.gpuarray
+        if pycuda.driver.get_version() < (3, 2, 0):
+            raise EnvironmentError("Need at least CUDA 3.2")
+        self.generator_count = generator_count
+        self.state = pycuda.driver.mem_alloc(generator_count *
+            curand_state_size)
+        m = pycuda.compiler.SourceModule(random_source, no_extern_c=True,
+            keep=True)
+        if seed is None:
+            import random
+            import sys
+            seed = random.randint(0, ((1 << 31) - 1))
+# TODO: should we use async call here, or leave as is - it is just a constructor?
+# TODO: cudaThreadSetLimit(cudaLimitStackSize, 16k) on Fermi
+        if isinstance(seed, int):
+            p = m.get_function("prepare_with_seed")
+            p.prepare("Piii", block=(generator_count, 1, 1))
+            try:
+# Max 256 threads on ION during preparing
+# Need to limit number of threads. If not there is failed execution:
+# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources
+                p.prepared_call((1, 1), self.state, generator_count, seed,
+                    offset)
+            except pycuda._driver.LaunchError:
+                raise ValueError("Initialisation failed. Decrease number of threads.")
+        else:
+            if isinstance(seed, list) or isinstance(seed, tuple):
+                seed = numpy.array(seed, dtype=numpy.int32)
+            if isinstance(seed, numpy.ndarray):
+                seed = pycuda.gpuarray.to_gpu(seed).astype(numpy.int32)
+            if (isinstance(seed, pycuda.gpuarray.GPUArray) and 
+                seed.dtype == numpy.int32):
+                p = m.get_function("prepare_with_seeds")
+                p.prepare("PiPi", block=(generator_count, 1, 1))
+                try:
+# Max 256 threads on ION during preparing
+# Need to limit number of threads. If not there is failed execution:
+# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources
+                    p.prepared_call((1, 1), self.state, generator_count,
+                        seed.gpudata, offset)
+                except pycuda._driver.LaunchError:
+                    raise ValueError("Initialisation failed. Decrease number of threads.")
+            else:
+                raise ValueError("Need GPUArray of integers")
+        self.uniform_int = m.get_function("uniform_int")
+        self.uniform_int.prepare("PPi", block=(generator_count, 1, 1))
+        self.uniform_float = m.get_function("uniform_float")
+        self.uniform_float.prepare("PPi", block=(generator_count, 1, 1))
+        self.uniform_double = m.get_function("uniform_double")
+        self.uniform_double.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_float = m.get_function("normal_float")
+        self.normal_float.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_double = m.get_function("normal_double")
+        self.normal_double.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_float2 = m.get_function("normal_float2")
+        self.normal_float2.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_double2 = m.get_function("normal_double2")
+        self.normal_double2.prepare("PPi", block=(generator_count, 1, 1))
+        self.skip_ahead = m.get_function("skip_ahead")
+        self.skip_ahead.prepare("Pii", block=(generator_count, 1, 1))
+        self.skip_ahead_array = m.get_function("skip_ahead_array")
+        self.skip_ahead_array.prepare("PiP", block=(generator_count, 1, 1))
+    def __del__(self):
+        self.state.free()
+    def fill_in_uniform_int(self, data, input_size, stream = None):
+        self.uniform_int.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform_float(self, data, input_size, stream = None):
+        self.uniform_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform_double(self, data, input_size, stream = None):
+        self.uniform_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_float(self, data, input_size, stream = None):
+        self.normal_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_double(self, data, input_size, stream = None):
+        self.normal_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_float2(self, data, input_size, stream = None):
+        self.normal_float2.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_double2(self, data, input_size, stream = None):
+        self.normal_double2.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform(self, data, stream = None):
+        if data.dtype == numpy.float32:
+            self.fill_in_uniform_float(data.gpudata, data.size, stream)
+        elif data.dtype == numpy.float64:
+            self.fill_in_uniform_double(data.gpudata, data.size, stream)
+        elif data.dtype in [numpy.int, numpy.int32, numpy.uint32]:
+            self.fill_in_uniform_int(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def fill_in_normal(self, data, stream = None):
+        if isinstance(data.dtype, numpy.float32):
+            self.fill_in_normal_float(data.gpudata, data.size, stream)
+        elif isinstance(data.dtype, numpy.float64):
+            self.fill_in_normal_double(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def call_skip_ahead(self, i, stream = None):
+        self.skip_ahead.prepared_async_call((1, 1), stream, self.state,
+            self.generator_count, i)
+    def call_skip_ahead_array(self, i, stream = None):
+        self.skip_ahead_array.prepared_async_call((1, 1), stream, self.state,
+            self.generator_count, i.gpudata)
+    def __call__(self, shape, dtype = numpy.float32, stream = None):
+        import pycuda.gpuarray
+        result = pycuda.gpuarray.GPUArray(shape, dtype)
+# Do not use GPUArray._blocks like curand.rand does;
+# we are limited by number of states, set in constructor
+        self.fill_in_uniform(result, stream)
+        return result
+
+quasi_random_source = """
+// Uses C++ features (templates); do not surround with extern C
+#include <curand_kernel.h>
+
+extern "C" {
+__global__ void prepare(curandStateSobol32 *s, const int n, unsigned int *v,
+    const unsigned int o) {
+    if (threadIdx.x < n) {
+        curand_init(v, o, &s[threadIdx.x]);
+    }
+}
+
+__global__ void uniform_int(curandStateSobol32 *s, int *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand(&s[tidx]);
+    }
+}
+
+__global__ void uniform_float(curandStateSobol32 *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform(&s[tidx]);
+    }
+}
+
+__global__ void uniform_double(curandStateSobol32 *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_uniform_double(&s[tidx]);
+    }
+}
+
+__global__ void normal_float(curandStateSobol32 *s, float *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal(&s[tidx]);
+    }
+}
+
+__global__ void normal_double(curandStateSobol32 *s, double *d, const int n) {
+    const int tidx = threadIdx.x;
+    for (int idx = tidx; idx < n; idx += blockDim.x) {
+        d[idx] = curand_normal_double(&s[tidx]);
+    }
+}
+
+__global__ void skip_ahead(curandStateSobol32 *s, const int n, const int skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip, &s[idx]);
+    }
+}
+
+__global__ void skip_ahead_array(curandStateSobol32 *s, const int n,
+    const int *skip) {
+    const int idx = threadIdx.x;
+    if (idx < n) {
+        skipahead(skip[idx], &s[idx]);
+    }
+}
+}
+"""
+
+class QuasiRandomizer:
+    """
+    Class surrounding CURAND kernels from CUDA 3.2.
+    It allows for generating quasi-random numbers with uniform
+    and normal probability function of type int, float, and double.
+    """
+    def __init__(self, generator_count, vector, offset):
+        import pycuda.compiler
+        import pycuda.driver
+        if pycuda.driver.get_version() < (3, 2, 0):
+            raise EnvironmentError("Need at least CUDA 3.2")
+        self.generator_count = generator_count
+        self.state = pycuda.driver.mem_alloc(generator_count *
+            curand_state_sobol_size)
+        m = pycuda.compiler.SourceModule(quasi_random_source, no_extern_c=True,
+            keep=True)
+# TODO: should we use async call here, or leave as is - it is just a constructor?
+        p = m.get_function("prepare")
+        p.prepare("PiPi", block=(generator_count, 1, 1))
+        try:
+# Max 256 threads on ION during preparing
+# Need to limit number of threads. If not there is failed execution:
+# pycuda._driver.LaunchError: cuLaunchGrid failed: launch out of resources
+            p.prepared_call((1, 1), self.state, generator_count, vector, offset)
+        except pycuda._driver.LaunchError:
+            raise ValueError("Initialisation failed. Decrease number of threads.")
+        self.uniform_int = m.get_function("uniform_int")
+        self.uniform_int.prepare("PPi", block=(generator_count, 1, 1))
+        self.uniform_float = m.get_function("uniform_float")
+        self.uniform_float.prepare("PPi", block=(generator_count, 1, 1))
+        self.uniform_double = m.get_function("uniform_double")
+        self.uniform_double.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_float = m.get_function("normal_float")
+        self.normal_float.prepare("PPi", block=(generator_count, 1, 1))
+        self.normal_double = m.get_function("normal_double")
+        self.normal_double.prepare("PPi", block=(generator_count, 1, 1))
+        self.skip_ahead = m.get_function("skip_ahead")
+        self.skip_ahead.prepare("Pii", block=(generator_count, 1, 1))
+        self.skip_ahead_array = m.get_function("skip_ahead_array")
+        self.skip_ahead_array.prepare("PiP", block=(generator_count, 1, 1))
+    def __del__(self):
+        self.state.free()
+    def fill_in_uniform_int(self, data, input_size, stream = None):
+        self.uniform_int.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform_float(self, data, input_size, stream = None):
+        self.uniform_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform_double(self, data, input_size, stream = None):
+        self.uniform_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_float(self, data, input_size, stream = None):
+        self.normal_float.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_normal_double(self, data, input_size, stream = None):
+        self.normal_double.prepared_async_call((1, 1), stream, self.state,
+            data, input_size)
+    def fill_in_uniform(self, data, stream = None):
+        if data.dtype == numpy.float32:
+            self.fill_in_uniform_float(data.gpudata, data.size, stream)
+        elif data.dtype == numpy.float64:
+            self.fill_in_uniform_double(data.gpudata, data.size, stream)
+        elif data.dtype in [numpy.int, numpy.int32, numpy.uint32]:
+            self.fill_in_uniform_int(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def fill_in_normal(self, data, stream = None):
+        if isinstance(data.dtype, numpy.float32):
+            self.fill_in_normal_float(data.gpudata, data.size, stream)
+        elif isinstance(data.dtype, numpy.float64):
+            self.fill_in_normal_double(data.gpudata, data.size, stream)
+        else:
+            raise NotImplementedError
+    def call_skip_ahead(self, i):
+        self.skip_ahead.prepared_call((1, 1), self.state,
+            self.generator_count, i)
+    def call_skip_ahead_array(self, i):
+        self.skip_ahead_array.prepared_call((1, 1), self.state,
+            self.generator_count, i.gpudata)
+    def __call__(self, shape, dtype = numpy.float32, stream = None):
+        import pycuda.gpuarray
+        result = pycuda.gpuarray.GPUArray(shape, dtype)
+        self.fill_in_uniform(result, stream)
+        return result
+
+
 
 if __name__ == "__main__":
     import sys, pycuda.autoinit
@@ -255,3 +620,17 @@ if __name__ == "__main__":
         subplot(132); plot( r2.get(),"x-")
         subplot(133); plot( r3.get(),"x-")
         show()
+
+        import pycuda.gpuarray
+        rr = Randomizer(256, 0, numpy.random.random(256).astype(numpy.int32))
+        data = pycuda.gpuarray.zeros([10000], numpy.float32)
+        print data[0:200]
+        rr.fill_in_uniform_float(data.gpudata, 512)
+        print data[0:200]
+        data = rr((100, ), numpy.uint32)
+        del rr
+        print data[0:200]
+        data = pycuda.gpuarray.zeros([256], numpy.int)
+        rr = QuasiRandomizer(256, data.gpudata, 1)
+        print data[0:200]
+

signature.asc
Description: This is a digitally signed message part

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda

Re: [PyCUDA] CURAND wrapper for PyCUDA

Reply via email to