Dnia 2010-12-26, nie o godzinie 17:42 +0100, Andreas Kloeckner pisze: > Hi Tomasz, > > On Tue, 21 Dec 2010 23:21:59 +0100, Tomasz Rybak <bogom...@post.pl> wrote: > > I have wrote attached code that calculates parallel prefix sum. > > There are two variants - the first one (exclusive) is based on article > > by Mark Harris from NVIDIA, second one (inclusive) is based on diagram > > from Wikipedia. > > Inclusive scan could be optimised with regard to shared memory access > > conflicts, similarly to the exclusive version. > > Also it seems that inclusive scan is less stable numerically - results > > differ from CPU version when calculating scan of large arrays. > > Again, thanks for your contribution! Here are a few comments on the > scan: > > - It doesn't seem like the inclusive and the exclusive version are so > dissimilar. As such, I don't think we should duplicate code for the > two. If necessary, I'd even prefer to make this code depend on Mako or > Jinja (template engines) to avoid code duplication.
Fixed code repetition, added class inheritance. I am not sure about joining exclusive and inclusive scans - they use different algorithms. For now I have added functions and classes to pycuda.reduction - feel free to move them to another module. > > - Use warnings.warn, not plain print, for warnings. Done. > > - Tests should go into tests/test_gpu_array or some such. Done, also added documentation (see patch). > > - Formal nitpicks: Please indent comments with the rest of the code. > PEP 8 says a=value (no spaces) for keyword arguments. Camel case in C > is yucky, too. :) Done, sorry. > > - 'Sum' is poor wording for the general associative operation that the > scan uses--use scan_op perhaps. Done. -- Tomasz Rybak <bogom...@post.pl> GPG/PGP key ID: 2AD5 9860 Fingerprint A481 824E 7DD3 9C0E C40A 488E C654 FB33 2AD5 9860 http://member.acm.org/~tomaszrybak
diff --git a/doc/source/array.rst b/doc/source/array.rst index 5273df9..4bd9c92 100644 --- a/doc/source/array.rst +++ b/doc/source/array.rst @@ -448,6 +448,83 @@ Here's a usage example:: my_dot_prod = krnl(a, b).get() +Parallel Prefix Scan +-------------------- + +In module pycuda.reduction. + +.. function get_exclusive_prefix_module(data_type, scan_operation, neutral, name = 'prefix_kernel', optimise = False, keep = False, options = [], preamble = '') + + Generate module containing function named *name*. This function accepts + two pointers of type *data_type* and size of array n. The first pointer + points memory where result will be stored, the second points input array. + Function assumes that both input and output array have the same size, and + that size is power of 2. Function performs exclusive parallel prefix scan, + performing *scan_operation* on all elements. If scan operation is addition + (scan_operation = 'a+b') then input array {a0, a1, a2, ..., aN} is + transformed into {0, a0, a0+a1, a0+a1+a2, ..., a0+a1+a2+...}. *neutral* + is value that is neutral for particular *scan_operation*, e.g. 0 for + addition, 1 for multiplication. *optimise* determines whether to use more + shared memory to avoid bank conflicts. + +.. function get_exclusive_prefix_module_tail(data_type, scan_operation, neutral, name = 'prefix_kernel', optimise = False, keep = False, options = [], preamble = '') + + Function similar to *get_exclusive_prefix_module* but accepting arrays + which sizes are not power of 2. + +.. function get_inclusive_prefix_module(data_type, scan_operation, neutral, name = 'prefix_kernel', optimise = False, keep = False, options = [], preamble = '') + + Generate module containing function named *name*. This function accepts + two pointers of type *data_type* and size of array n. The first pointer + points memory where result will be stored, the second points input array. + Function assumes that both input and output array have the same size, and + that size is power of 2. Function performs inclusive parallel prefix scan, + performing *scan_operation* on all elements. If scan operation is addition + (scan_operation = 'a+b') then input array {a0, a1, a2, ..., aN} is + transformed into {a0, a0+a1, a0+a1+a2, ..., a0+a1+a2+...+aN}. *neutral* + is value that is neutral for particular *scan_operation*, e.g. 0 for + addition, 1 for multiplication. Returned function uses different algorithm + than *get_exclusive_prefix_module*. + +.. function get_inclusive_prefix_module_tail(data_type, scan_operation, neutral, name = 'prefix_kernel', optimise = False, keep = False, options = [], preamble = '') + + Function similar to *get_inclusive_prefix_module* but accepting arrays + which sizes are not power of 2. + +.. class:: ExclusivePrefixKernel(data_type, scan_operation, neutral, name = 'prefix_kernel', keep = False, options = [], preamble = '') + + Generates kernels that perform exclusive prefix scan on GPUArray. + Generated kernels perform *scan_operation* on arrays consisting of + *data_type* with *neutral* element. + + .. method:: __call__(result, material[, size]) + + Invoke the generated kernels. The arguments must either be instances + of :class:`GPUArray`. *size* is optional; if it is missing operation + is performed on entire array, if it is present, only on *size* elements. + +.. class:: InclusivePrefixKernel(data_type, scan_operation, neutral, name = 'prefix_kernel', keep = False, options = [], preamble = '') + + Generates kernels that perform inclusive prefix scan on GPUArray. + Generated kernels perform *scan_operation* on arrays consisting of + *data_type* with *neutral* element. + + .. method:: __call__(result, material[, size]) + + Invoke the generated kernels. The arguments must either be instances + of :class:`GPUArray`. *size* is optional; if it is missing operation + is performed on entire array, if it is present, only on *size* elements. + + +Here's a usage example:: + + a = gpuarray.arange(400, dtype=numpy.float32) + b = gpuarray.arange(400, dtype=numpy.float32) + + krnl = InclusivePrefixKernel('float', 'a+b', '0') + + krnl(a, b) + Fast Fourier Transforms ----------------------- diff --git a/pycuda/reduction.py b/pycuda/reduction.py index 474997f..d349e48 100644 --- a/pycuda/reduction.py +++ b/pycuda/reduction.py @@ -385,3 +385,444 @@ def get_subset_minmax_kernel(what, dtype): "const %(tp)s *in" % { "tp": dtype_to_ctype(dtype), }, preamble="#define MY_INFINITY (1./0)") + + +# Parallel prefix scan + +exclusive_scan_source = """ + // Travel upwards, from leaves to the root of the tree + // During each loop time distance between nodes increases two times, + // and number of nodes decreases two times + for (int number_of_nodes = full_array_size>>1; number_of_nodes > 0; + number_of_nodes >>= 1) { + __syncthreads(); + + if (threadIdx.x < number_of_nodes) { + int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1; + int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1; + source_node_index += SHARED_BANK_PADDING(source_node_index); + target_node_index += SHARED_BANK_PADDING(target_node_index); + + shared_array[target_node_index] = + SUM(shared_array[target_node_index], + shared_array[source_node_index]); + } + tree_node_distance <<= 1; + } + + if (threadIdx.x == 0) { + shared_array[full_array_size-1 + SHARED_BANK_PADDING(full_array_size-1)] = %(neutral)s; + } + + // Travel downwards, from root to the leaves + // During each loop number of nodes increases two times and + // distance between nodes decreases two times + for (int number_of_nodes = 1; number_of_nodes < full_array_size; + number_of_nodes <<= 1) { + tree_node_distance >>= 1; + __syncthreads(); + + if (threadIdx.x < number_of_nodes) { + int source_node_index = tree_node_distance*(2*threadIdx.x+1)-1; + int target_node_index = tree_node_distance*(2*threadIdx.x+2)-1; + source_node_index += SHARED_BANK_PADDING(source_node_index); + target_node_index += SHARED_BANK_PADDING(target_node_index); + + %(data_type)s temp = shared_array[source_node_index]; + shared_array[source_node_index] = shared_array[target_node_index]; + shared_array[target_node_index] = + SUM(shared_array[target_node_index], temp); + } + } + __syncthreads(); +""" + +scan_main_head = """ + #define SUM(a, b) (%(scan_operation)s) + + %(preamble)s + + __global__ void %(name)s(%(data_type)s *result, + const %(data_type)s *material, const int n) { + extern __shared__ %(data_type)s shared_array[]; + + int source_node_index = threadIdx.x; +""" + +scan_tail_head = """ + #define SUM(a, b) (%(scan_operation)s) + + %(preamble)s + + __global__ void %(name)s(%(data_type)s *result, + const %(data_type)s *material, const int n, const int block_offset) { + extern __shared__ %(data_type)s shared_array[]; + + int source_node_index = threadIdx.x; +""" + +# Exclusive version, after Mark Harris atricle +def get_exclusive_prefix_module(data_type, scan_operation, neutral, + name = 'prefix_kernel', optimise = False, keep = False, options = [], + preamble = ''): + """ + Implementation of prefix-sum implemented based on whitepaper + ``Parallel Prefix Sum (Scan) with CUDA'' + by Mark Harris from NVIDIA + """ + from pycuda.compiler import SourceModule + if optimise: + preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n" + else: + preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n" + src = (scan_main_head + + """ + // Size of shared array that can contain input data + const int full_array_size = n; + const int block_offset = 2*blockIdx.x*blockDim.x; + int tree_node_distance = 1; + + int target_node_index = threadIdx.x + (n/2); + shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = + material[source_node_index + block_offset]; + shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = + material[target_node_index + block_offset]; + """ + + exclusive_scan_source + + """ + result[source_node_index + block_offset] = + shared_array[source_node_index + + SHARED_BANK_PADDING(source_node_index)]; + result[target_node_index + block_offset] = + shared_array[target_node_index + + SHARED_BANK_PADDING(target_node_index)]; + } + """) % { + 'data_type': data_type, + 'name': name, + 'neutral': neutral, + 'scan_operation': scan_operation, + 'preamble': preamble, + } + return SourceModule(src, options=options, keep=keep) + +def get_exclusive_prefix_module_tail(data_type, scan_operation, neutral, + name = 'prefix_kernel', optimise = False, keep = False, options = [], + preamble = ''): + """ + Implementation of prefix-sum after whitepaper + Parallel Prefix Sum (Scan) with CUDA + by Mark Harris from NVIDIA + + Can be used for tables which size is not power-of-2. + Excepts size of shared memory larger than size of input data. + """ + from pycuda.compiler import SourceModule + if optimise: + preamble = preamble+"\n#define SHARED_BANK_PADDING(n) ((n) >> 4)\n" + else: + preamble = preamble+"\n#define SHARED_BANK_PADDING(n) 0\n" + src = (scan_tail_head + + """ + // Size of shared array that can contain input data + // The smallest array of size power-of-2 that can contain original array + const int full_array_size = 1<<((int)(floor(log2((float)n))+1)); + int tree_node_distance = 1; + + int target_node_index = threadIdx.x + (n/2); + if (source_node_index < n) { + shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = material[source_node_index + block_offset]; + } else { + shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)] = %(neutral)s; + } + if (target_node_index < n) { + shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = material[target_node_index + block_offset]; + } else { + shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)] = %(neutral)s; + } + """ + + exclusive_scan_source + + """ + if (source_node_index < n) { result[source_node_index + block_offset] = shared_array[source_node_index + SHARED_BANK_PADDING(source_node_index)]; } + if (target_node_index < n) { result[target_node_index + block_offset] = shared_array[target_node_index + SHARED_BANK_PADDING(target_node_index)]; } + } + """) % { + 'data_type': data_type, + 'name': name, + 'neutral': neutral, + 'scan_operation': scan_operation, + 'preamble': preamble, + } + return SourceModule(src, options=options, keep=keep) + +# Inclusive version, after http://en.wikipedia.org/wiki/Prefix_sum + +inclusive_scan_source = """ + // During each loop time distance between nodes increases two times, + // and number of nodes decreases two times + for (int tree_node_distance = 1; tree_node_distance < n; tree_node_distance <<= 1) { + // Ensure that all values are in an array + __syncthreads(); + + int target_node_index = source_node_index + tree_node_distance; + %(data_type)s partial; + if (target_node_index < n) { + partial = SUM(shared_array[target_node_index], + shared_array[source_node_index]); + } + + // Ensure that all sums are in registers + __syncthreads(); + + // Write to an array + if (target_node_index < n) { + shared_array[target_node_index] = partial; + } + } + + __syncthreads(); +""" + +# This function does not use "neutral" but I have left it to maintain the same +# list of arguments as other functions +def get_inclusive_prefix_module(data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + """ + Implementation of prefix-sum implemented based on Wikipedia ``Prefix sum'' article + """ + from pycuda.compiler import SourceModule + src = (scan_main_head + + """ + const int block_offset = blockIdx.x*blockDim.x; + shared_array[source_node_index] = material[source_node_index + block_offset]; + """ + + inclusive_scan_source + + """ + result[source_node_index + block_offset] = shared_array[source_node_index]; + } + """) % { + 'data_type': data_type, + 'name': name, + 'scan_operation': scan_operation, + 'preamble': preamble, + } + return SourceModule(src, options=options, keep=keep) + +def get_inclusive_prefix_module_tail(data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + """ + Implementation of prefix-sum implemented based on Wikipedia ``Prefix sum'' article + """ + from pycuda.compiler import SourceModule + src = (scan_tail_head + + """ + if (source_node_index < n) { + shared_array[source_node_index] = material[source_node_index + block_offset]; + } else { + shared_array[source_node_index] = %(neutral)s; + } + """ + + inclusive_scan_source + + """ + if (source_node_index < n) { + result[source_node_index + block_offset] = shared_array[source_node_index]; + } + } + """) % { + 'data_type': data_type, + 'name': name, + 'scan_operation': scan_operation, + 'neutral': neutral, + 'preamble': preamble, + } + return SourceModule(src, options=options, keep=keep) + +class PrefixKernel(object): + def __init__(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + # Determine the size of used data type + self.numpy_type = pycuda.tools.parse_c_arg("const %s * in" % + data_type).dtype + item_size = self.numpy_type.itemsize + + # Determine the number of threads + dev = pycuda.driver.Context.get_device() + block_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_THREADS_PER_BLOCK) + block_dimension = dev.get_attribute(pycuda.driver.device_attribute.MAX_BLOCK_DIM_X) + self.block_size = min(block_size, block_dimension) + + # Shared memory size: two items per thread, number of threads, + # size of one data item + self.shared_size = self.get_block_size()*item_size + # Padding to avoid bank conflicts + log_num_banks = 4 + self.shared_size += ((self.get_block_size() >> log_num_banks) * + item_size) + + # Check whether we do not exceed available shared memory size + max_shared_size = dev.get_attribute(pycuda.driver.device_attribute.MAX_SHARED_MEMORY_PER_BLOCK) + if self.shared_size > max_shared_size: + import warnings + warnings.warn ("Changed shared size") + self.shared_size = max_shared_size + self.block_size = self.shared_size//(2*item_size) + + self.main_function = self.get_main_function(data_type, scan_operation, + neutral, name, keep, options, preamble) + self.main_function.prepare("PPi", block=(self.block_size, 1, 1), + shared=self.shared_size) + self.tail_function = self.get_tail_function(data_type, scan_operation, + neutral, name, keep, options, preamble) + self.tail_function.prepare("PPii", block=(self.block_size, 1, 1), + shared=self.shared_size) + # Use maximum available shared memory in 2.x devices + # TODO: is it needed as we are more limited by number of threads? + # Might be needed for large data types (?) + if dev.compute_capability() >= (2, 0): + cache_size = pycuda.driver.func_cache.PREFER_SHARED + self.main_function.set_cache_config(cache_size) + self.tail_function.set_cache_config(cache_size) + def get_block_size(self): + return None + def get_main_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + return None + def get_tail_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + return None + def __call__(self, *args, **kwargs): + # Look for keyword arguments, use positional when keyword cannot be found + result = kwargs.get('result') + if result is None: + result = args[0] + material = kwargs.get('material') + if material is None: + material = args[1] + # Accept size of data to operate on as parameter, use size of input + # array if not found + input_size = kwargs.get('size') + if input_size is None: + input_size = material.size + block_count = ((input_size + self.get_block_size() - 1) // + (self.get_block_size())) + if input_size != block_count * self.get_block_size(): + if block_count > 1: + self.main_function.prepared_call((block_count-1, 1), + result.gpudata, material.gpudata, self.get_block_size()) + self.tail_function.prepared_call((1, 1), result.gpudata, + material.gpudata, + input_size - (block_count - 1) * self.get_block_size(), + (block_count - 1) * self.get_block_size()) + else: + self.main_function.prepared_call((block_count, 1), + result.gpudata, material.gpudata, + self.get_block_size()) + # Get final elements from each of blocks into table + # Sum of the last block is not needed + block_sum_array_gpu = pycuda.gpuarray.zeros([block_count], + self.numpy_type) + block_sum_array = numpy.zeros(block_count).astype(self.numpy_type) + for i in range(block_count-1): + index = (i + 1) * self.get_block_size() - 1 + block_sum_array[i] = result[index:index+1].get()[0] + block_sum_array_gpu.set(block_sum_array) + # TODO: is it possible use GPUArray per-element kernel here? + # Rather not (we are adding different number of values for each slice + # of the array, but maybe I am not seeing something) + self.finish_function.prepared_call((block_count, 1), + result.gpudata, material.gpudata, + block_sum_array_gpu.gpudata, input_size) + +class ExclusivePrefixKernel(PrefixKernel): + def __init__(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + super(ExclusivePrefixKernel, self).__init__(data_type, scan_operation, + neutral, name, keep, options, preamble) + + from pycuda.compiler import SourceModule + finish_source = """ + #define SUM(a, b) (%(scan_operation)s) + + %(preamble)s + + __global__ void finish(%(data_type)s *result, + const %(data_type)s *material, const %(data_type)s *partial, + const int n) { + const int x = 2*blockIdx.x*blockDim.x+threadIdx.x; + const int y = 2*blockIdx.x*blockDim.x+threadIdx.x+blockDim.x; + if (x < n) { + for (int i = 1; i <= blockIdx.x; i++) { + // Program produces exclusive scan + // We need to add last element from previous block + // to get correct sum + // It calculates ``inclusive'' sum for internal purposes + result[x] = SUM(result[x], material[2*i*blockDim.x-1]); + result[x] = SUM(result[x], partial[i-1]); + if (y < n) { + result[y] = SUM(result[y], material[2*i*blockDim.x-1]); + result[y] = SUM(result[y], partial[i-1]); + } + } + } + } + """ % { + 'data_type': data_type, + 'scan_operation': scan_operation, + 'preamble': preamble, + } + self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish') + self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1)) + def get_block_size(self): + return 2 * self.block_size + def get_main_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + result = get_exclusive_prefix_module(data_type, scan_operation, + neutral, name, True, keep, options, preamble) + return result.get_function(name) + def get_tail_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + result = get_exclusive_prefix_module_tail(data_type, scan_operation, + neutral, name, True, keep, options, preamble) + return result.get_function(name) + +class InclusivePrefixKernel(PrefixKernel): + def __init__(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + super(InclusivePrefixKernel, self).__init__(data_type, scan_operation, + neutral, name, keep, options, preamble) + + from pycuda.compiler import SourceModule + finish_source = """ + #define SUM(a, b) (%(scan_operation)s) + + %(preamble)s + + __global__ void finish(%(data_type)s *result, + const %(data_type)s *material, const %(data_type)s *partial, + const int n) { + const int x = blockIdx.x*blockDim.x+threadIdx.x; + if (x < n) { + for (int i = 1; i <= blockIdx.x; i++) { + result[x] = SUM(result[x], partial[i-1]); + } + } + } + """ % { + 'data_type': data_type, + 'scan_operation': scan_operation, + 'preamble': preamble, + } + self.finish_function = SourceModule(finish_source, options=options, keep=keep).get_function('finish') + self.finish_function.prepare("PPPi", block=(self.block_size, 1, 1)) + def get_block_size(self): + return self.block_size + def get_main_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + result = get_inclusive_prefix_module(data_type, scan_operation, + neutral, name, keep, options, preamble) + return result.get_function(name) + def get_tail_function(self, data_type, scan_operation, neutral, + name = 'prefix_kernel', keep = False, options = [], preamble = ''): + result = get_inclusive_prefix_module_tail(data_type, scan_operation, + neutral, name, keep, options, preamble) + return result.get_function(name) diff --git a/test/test_gpuarray.py b/test/test_gpuarray.py index a99136f..e13c3f2 100644 --- a/test/test_gpuarray.py +++ b/test/test_gpuarray.py @@ -598,6 +598,114 @@ class TestGPUArray: assert la.norm(z.get().imag - z.imag.get()) == 0 assert la.norm(z.get().conj() - z.conj().get()) == 0 + # Test prefix scan + def compute_exclusive(self, what): + result = numpy.zeros((what.size, ), dtype=what.dtype) + prefix = 0 + result[0] = prefix + for i in range(what.size-1): + prefix += what[i] + result[i+1] = prefix + return result + + def compute_inclusive(self, what): + result = numpy.zeros((what.size, ), dtype=what.dtype) + prefix = what[0] + result[0] = prefix + for i in range(1, what.size): + prefix += what[i] + result[i] = prefix + return result + + @pycuda.tools.mark_cuda_test + def test_exclusive_scan_small(self): + from pycuda.reduction import ExclusivePrefixKernel + p = ExclusivePrefixKernel('float', 'a+b', '0') + size = p.block_size + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(result=b_gpu, material=a_gpu, size=size) + b = self.compute_exclusive(a) + assert (b - b_gpu.get() < 1e-4).all() + @pycuda.tools.mark_cuda_test + def test_exclusive_scan_large(self): + from pycuda.reduction import ExclusivePrefixKernel + p = ExclusivePrefixKernel('float', 'a+b', '0') + size = 16*p.block_size + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(result=b_gpu, material=a_gpu) + b = self.compute_exclusive(a) + assert (b - b_gpu.get() < 1e-3).all() + @pycuda.tools.mark_cuda_test + def test_exclusive_scan_small_non2(self): + from pycuda.reduction import ExclusivePrefixKernel + p = ExclusivePrefixKernel('float', 'a+b', '0') + size = p.block_size-3 + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu, size=size) + b = self.compute_exclusive(a) + assert (b - b_gpu.get() < 1e-4).all() + @pycuda.tools.mark_cuda_test + def test_exclusive_scan_large_non2(self): + from pycuda.reduction import ExclusivePrefixKernel + p = ExclusivePrefixKernel('float', 'a+b', '0') + size = 16*p.block_size-3 + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu) + b = self.compute_exclusive(a) + assert (b - b_gpu.get() < 1e-3).all() + @pycuda.tools.mark_cuda_test + def test_inclusive_scan_small(self): + from pycuda.reduction import InclusivePrefixKernel + p = InclusivePrefixKernel('float', 'a+b', '0') + size = p.block_size + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu, size) + b = self.compute_inclusive(a) + assert (b - b_gpu.get() < 1e-3).all() + @pycuda.tools.mark_cuda_test + def test_inclusive_scan_large(self): + from pycuda.reduction import InclusivePrefixKernel + p = InclusivePrefixKernel('float', 'a+b', '0') + size = 16*p.block_size + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu, size) + b = self.compute_inclusive(a) + assert (b - b_gpu.get() < 1e-2).all() + @pycuda.tools.mark_cuda_test + def test_inclusive_scan_small_non2(self): + from pycuda.reduction import InclusivePrefixKernel + p = InclusivePrefixKernel('float', 'a+b', '0') + size = p.block_size-3 + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu, size) + b = self.compute_inclusive(a) + assert (b - b_gpu.get() < 1e-3).all() + @pycuda.tools.mark_cuda_test + def test_inclusive_scan_large_non2(self): + from pycuda.reduction import InclusivePrefixKernel + p = InclusivePrefixKernel('float', 'a+b', '0') + size = 16*p.block_size-3 + a_gpu = pycuda.curandom.rand((size,), dtype=numpy.float32) + b_gpu = pycuda.gpuarray.zeros([size], numpy.float32) + a = a_gpu.get() + p(b_gpu, a_gpu, size) + b = self.compute_inclusive(a) + assert (b - b_gpu.get() < 1e-2).all() + if __name__ == "__main__": # make sure that import failures get reported, instead of skipping the tests.
signature.asc
Description: This is a digitally signed message part
_______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda