Jerome Kieffer <jerome.kief...@esrf.fr> writes:

> On Thu, 21 May 2015 07:59:35 -0400
> Andreas Kloeckner <li...@weasel.tiker.net> wrote:
>
>> Luke Pfister <lpfis...@illinois.edu> writes:
>> > Is there a suggested way to do the equivalent of np.sum along a particular
>> > axis for a high-dimensional GPUarray?
>> >
>> > I saw that this was discussed in 2009, before GPUarrays carried stride
>> > information.
>> 
>> Hand-writing a kernel is probably still your best option. Just map the
>> non-reduction axes to the grid/thread block axes, and write a for loop
>> to do the summation.
>
> Won't you win by having one workgroup (sorry, that's the OpenCL name; a
> thread block, in CUDA terms) do a partial parallel reduction?
>
> i.e. 1 workgroup = 32 threads.
>
> First stage:
> 32x (read + add) into shared memory, repeated as many times as needed to
> cover the reduction dimension of the gpuarray.
>
> Second stage:
> Parallel reduction within shared memory (no barrier needed, since we stay
> within a single warp).
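
For concreteness, here is a minimal, untested PyCUDA sketch of the two-stage
scheme described above, assuming a C-contiguous 2D float32 array reduced
along its last axis and one 32-thread block per output row. The kernel name
and array shapes are made up for illustration, and the shared-memory stage
relies on the classic warp-synchronous trick mentioned above (newer
architectures would want __syncwarp() or warp shuffles instead of implicit
lockstep):

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void sum_rows_warp(const float *x, float *out, int n_cols)
{
    // one 32-thread block (a single warp) per output row
    volatile __shared__ float buf[32];
    const int row  = blockIdx.x;
    const int lane = threadIdx.x;

    // first stage: strided read + add over the reduction axis
    float acc = 0.0f;
    for (int j = lane; j < n_cols; j += 32)
        acc += x[row * n_cols + j];
    buf[lane] = acc;

    // second stage: tree reduction in shared memory; no __syncthreads(),
    // because the whole block is one warp (warp-synchronous assumption)
    if (lane < 16) buf[lane] += buf[lane + 16];
    if (lane <  8) buf[lane] += buf[lane +  8];
    if (lane <  4) buf[lane] += buf[lane +  4];
    if (lane <  2) buf[lane] += buf[lane +  2];
    if (lane == 0) out[row] = buf[0] + buf[1];
}
""")
sum_rows_warp = mod.get_function("sum_rows_warp")

x   = gpuarray.to_gpu(np.random.rand(1024, 5000).astype(np.float32))
out = gpuarray.empty(x.shape[0], dtype=np.float32)
sum_rows_warp(x.gpudata, out.gpudata, np.int32(x.shape[1]),
              block=(32, 1, 1), grid=(x.shape[0], 1))
# out.get() should agree with x.get().sum(axis=1) up to float32 rounding

The first stage is also what keeps the reads coalesced: the 32 lanes of each
warp read 32 consecutive elements of the row per iteration.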

I'd say that depends on the shape of the array, or, more specifically, on
whether the other axes are big enough to fill the GPU on their own. If they
are, then the parallel reduction is not a winner: it is not more
work-efficient than a plain loop (work = n, span = log(n)), and its shorter
span buys you nothing once the machine is already saturated. On the other
hand, if you're struggling to fill the machine, it may well be worth
considering.
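
For comparison, a rough, untested sketch of the loop-based kernel I mean
(names and shapes are arbitrary), assuming a C-contiguous 3D float32 array
summed along its last axis: the two non-reduction axes map onto the
grid/block, and each thread runs a plain for loop over the reduction axis.

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void sum_last_axis(const float *x, float *out,
                              int n0, int n1, int n2)
{
    // non-reduction axes -> grid/block: one thread per output element
    const int i = blockIdx.y;                              // axis 0
    const int j = blockIdx.x * blockDim.x + threadIdx.x;   // axis 1
    if (j >= n1)
        return;

    // plain sequential loop over the reduction axis (axis 2)
    float acc = 0.0f;
    for (int k = 0; k < n2; ++k)
        acc += x[(i * n1 + j) * n2 + k];
    out[i * n1 + j] = acc;
}
""")
sum_last_axis = mod.get_function("sum_last_axis")

x   = gpuarray.to_gpu(np.random.rand(16, 512, 300).astype(np.float32))
out = gpuarray.empty(x.shape[:2], dtype=np.float32)
n0, n1, n2 = x.shape
sum_last_axis(x.gpudata, out.gpudata,
              np.int32(n0), np.int32(n1), np.int32(n2),
              block=(128, 1, 1), grid=((n1 + 127) // 128, n0))
# out.get() should agree with x.get().sum(axis=2) up to float32 rounding

One caveat with this particular mapping: neighbouring threads read addresses
n2 elements apart, so when reducing over the fastest-varying axis you would
want to restructure the indexing (or use a warp per output, as above) to keep
the loads coalesced.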

Andreas
