It seems striding in kernels can indeed incur a large penalty, though
the overhead depends on the array size. I've just updated the code to
use the noncontiguous kernel only when necessary.
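
The check itself is simple stride arithmetic, roughly along these lines
(a pure-Python sketch with illustrative names, not the exact code in the
patch):

    def is_c_contiguous(shape, strides, itemsize):
        # C-contiguous means each stride equals itemsize times the
        # product of all trailing dimension sizes.
        expected = itemsize
        for dim, stride in zip(reversed(shape), reversed(strides)):
            if dim > 1 and stride != expected:
                return False
            expected *= dim
        return True

    def use_noncontiguous_kernel(ary):
        # GPUArrays expose numpy-style shape/strides/dtype, so the
        # strided kernel is selected only when the layout isn't plain
        # C order.
        return not is_c_contiguous(ary.shape, ary.strides,
                                   ary.dtype.itemsize)

With that check in place, a contiguous array like b2 below keeps hitting
the original flat-index path.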

Using a GTX 970, I get the following results with the pow kernel (I'm
using IPython's %timeit here because, for some reason, Python appears
to hang on my machine if I use the standard timeit module):

>>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
>>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
>>> b2 = b1[::2, :-1].copy()
>>> # force CUDA to compile the kernels before we time
>>> b1[::2,:-1]**2
>>> b2**2

>>> %timeit b1[::2,:-1]**2
1000 loops, best of 3: 654 µs per loop
>>> %timeit b2**2
10000 loops, best of 3: 147 µs per loop

>>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
>>> b2 = b1[::2, :-1].copy()

>>> %timeit b1[::2,:-1]**2
100 loops, best of 3: 2.05 ms per loop
>>> %timeit b2**2
1000 loops, best of 3: 1.66 ms per loop

I'll try to clean up the code and get it into my repo today.

Keegan


On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <
frederic.bast...@gmail.com> wrote:

> I can share our experience in Theano related to that.
>
> I added code to "fuse" dimensions that are contiguous. So if you have a 3d
> tensor and only one dimension is noncontiguous, you don't pay the indexing
> cost of a 3d tensor, only that of a 2d tensor.
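
(If I understand correctly, the fusing amounts to roughly this sketch;
the libgpuarray routine linked below also handles broadcasting, which
this ignores:)

    def fuse_contiguous_dims(shape, strides):
        # Collapse adjacent axes that are contiguous with each other,
        # so e.g. a 3d tensor with only one strided axis gets indexed
        # like a 2d tensor.
        if not shape:
            return (), ()
        new_shape, new_strides = [shape[0]], [strides[0]]
        for dim, stride in zip(shape[1:], strides[1:]):
            if new_strides[-1] == dim * stride:
                # The outer axis steps exactly over this axis, so the
                # two can be merged into one.
                new_shape[-1] *= dim
                new_strides[-1] = stride
            else:
                new_shape.append(dim)
                new_strides.append(stride)
        return tuple(new_shape), tuple(new_strides)
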
>
> I saw a 2x speed-up in such cases. That was with an old GPU and an old CUDA
> version, so this may have changed.
>
> In the new Theano back-end, we have rewritten that in a more readable
> version:
>
> https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
>
> This also takes broadcasting into account.
>
> This could be done before calling the kernel.
>
> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <
> li...@informa.tiker.net> wrote:
>
>> Keegan,
>>
>> Keegan Owsley <keeg...@gmail.com> writes:
>> > I've just slapped together a patch to pycuda that makes most
>> > elementwise operations work with noncontiguous arrays. There are a
>> > bunch of hacks in there, and the code needs some reorg before it's
>> > ready to be considered for upstream (I made these changes while
>> > learning the pycuda codebase, so there's a bunch of crud that can be
>> > cleaned out), but I figure I might as well put it out there in its
>> > current state and see what you guys think. It's also not extremely
>> > well-tested (I have no idea if it interferes with skcuda, for
>> > example), but all of the main functions appear to work.
>> >
>> > You can check out the code at
>> > https://bitbucket.org/owsleyk_omega/pycuda.
>> >
>> > Briefly, this works by adding new parameters into elementwise
>> > kernels that describe the stride and shape of your arrays, then
>> > using a function that computes the location in memory from the
>> > stride, shape, and index. Elementwise kernel ops are modified so
>> > that they use the proper indexing. See an example of a kernel that's
>> > generated below:
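
(The example kernel didn't survive the quoting, but the indexing it
relies on boils down to the following divmod walk over the axes, written
in Python here for readability rather than as the generated CUDA C:)

    def element_offset(i, shape, strides, itemsize):
        # Map flat element index i (C order, last axis fastest) to an
        # offset into the array, using one divmod per axis.  This is
        # roughly the per-element cost that shows up in the strided
        # timings above.
        offset_bytes = 0
        for dim, stride in zip(reversed(shape), reversed(strides)):
            i, sub = divmod(i, dim)
            offset_bytes += sub * stride
        return offset_bytes // itemsize
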
>>
>> Thanks for putting this together and sharing it! I have one main
>> question about this, regarding performance:
>>
>> Modulo (especially variable-denominator modulo) has a habit of being
>> fantastically slow on GPUs. Could you time contiguous
>> vs. noncontiguous for various levels of "gappiness" and number of
>> axes? I'm asking this because I'd be OK with a 50% slowdown, but not
>> necessarily a factor of 5 slowdown on actual GPU hardware.
>>
>> Thanks!
>> Andreas
>>
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda
