It seems striding in kernels can indeed incur a large penalty, though the size of the penalty depends on the array size. I've just updated the code so that it only uses the noncontiguous kernel when necessary.
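For reference, here's a minimal, self-contained sketch of that dispatch (my own illustration, not the actual patch; the kernel names and the square() helper are made up, and the strided path is hard-coded to 2D views to keep it short):

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    // Fast path: flat indexing, valid only for C-contiguous data.
    __global__ void square_contig(double *out, const double *in,
                                  long long n)
    {
        long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i] * in[i];
    }

    // Slow path for a 2D strided view: recover (row, col) from the
    // flat index with div/mod, then apply the input's element strides.
    __global__ void square_strided(double *out, const double *in,
                                   long long rows, long long cols,
                                   long long s0, long long s1)
    {
        long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= rows * cols) return;
        long long r = i / cols, c = i % cols;
        double v = in[r * s0 + c * s1];
        out[i] = v * v;   // output is freshly allocated, so contiguous
    }
    """)
    square_contig = mod.get_function("square_contig")
    square_strided = mod.get_function("square_strided")

    def square(a):
        out = gpuarray.empty(a.shape, a.dtype)
        n = a.size
        block, grid = (256, 1, 1), ((n + 255) // 256, 1)
        if a.flags.c_contiguous:
            # Contiguous input: plain linear indexing, no div/mod.
            square_contig(out.gpudata, a.gpudata, np.int64(n),
                          block=block, grid=grid)
        else:
            # Noncontiguous 2D view: pass shape and element strides.
            s0, s1 = (s // a.dtype.itemsize for s in a.strides)
            square_strided(out.gpudata, a.gpudata,
                           np.int64(a.shape[0]), np.int64(a.shape[1]),
                           np.int64(s0), np.int64(s1),
                           block=block, grid=grid)
        return out

With this, square(b1) never pays for the striding arithmetic, while square(b1[::2, :-1]) does.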
Using a GTX 970, I get the following results with the pow kernel. (I'm using IPython's %timeit here because, for some reason, Python appears to hang on my machine if I use the standard timeit module.)

    >>> import pycuda.autoinit; import pycuda.gpuarray as gpuarray
    >>> b1 = gpuarray.arange(100000, dtype='float64').reshape(1000, 100)
    >>> b2 = b1[::2, :-1].copy()
    >>> # force CUDA to compile the kernels before we time
    >>> b1[::2, :-1]**2
    >>> b2**2
    >>> %timeit b1[::2, :-1]**2
    1000 loops, best of 3: 654 µs per loop
    >>> %timeit b2**2
    10000 loops, best of 3: 147 µs per loop
    >>> b1 = gpuarray.arange(1000000, dtype='float64').reshape(1000, 1000)
    >>> b2 = b1[::2, :-1].copy()
    >>> %timeit b1[::2, :-1]**2
    100 loops, best of 3: 2.05 ms per loop
    >>> %timeit b2**2
    1000 loops, best of 3: 1.66 ms per loop

So the noncontiguous kernel costs roughly 4.5x on the small array, but only roughly 1.2x on the larger one.

I'll try to clean up the code and get it into my repo today.

Keegan

On Thu, Dec 1, 2016 at 10:03 AM, Frédéric Bastien <frederic.bast...@gmail.com> wrote:

> I can share our experience with this in Theano.
>
> I added code to "fuse" dimensions that are contiguous, so if you have a
> 3D tensor and only one dimension is noncontiguous, you don't pay the
> indexing cost of a 3D tensor, only that of a 2D tensor.
>
> I saw a 2x speedup in such cases. That was with an old GPU and an old
> CUDA version, so this may have changed.
>
> In the new Theano back-end, we have rewritten this in a more readable
> form:
>
> https://github.com/Theano/libgpuarray/blob/master/src/gpuarray_util.c#L153
>
> This also takes broadcasting into account.
>
> The fusing can be done before calling the kernel.
>
> On Wed, Nov 30, 2016 at 8:24 PM, Andreas Kloeckner <li...@informa.tiker.net> wrote:
>
>> Keegan,
>>
>> Keegan Owsley <keeg...@gmail.com> writes:
>> > I've just slapped together a patch to pycuda that makes most
>> > elementwise operations work with noncontiguous arrays. There are a
>> > bunch of hacks in there, and the code needs some reorganization
>> > before it's ready to be considered for upstream (I made these
>> > changes while learning the pycuda codebase, so there's a bunch of
>> > crud that can be cleaned out), but I figure I might as well put it
>> > out there in its current state and see what you guys think. It's
>> > also not extremely well tested (I have no idea whether it interferes
>> > with skcuda, for example), but all of the main functions appear to
>> > work.
>> >
>> > You can check out the code at https://bitbucket.org/owsleyk_omega/pycuda.
>> >
>> > Briefly, this works by adding new parameters to elementwise kernels
>> > that describe the strides and shape of your arrays, then using a
>> > function that computes the memory location from the strides, shape,
>> > and index. Elementwise kernel operations are modified so that they
>> > use the proper indexing. See an example of a generated kernel below.
>>
>> Thanks for putting this together and sharing it! I have one main
>> question about this, regarding performance:
>>
>> Modulo (especially variable-denominator modulo) has a habit of being
>> fantastically slow on GPUs. Could you time contiguous vs.
>> noncontiguous for various levels of "gappiness" and numbers of axes?
>> I'm asking because I'd be OK with a 50% slowdown, but not necessarily
>> a factor-of-5 slowdown on actual GPU hardware.
>>
>> Thanks!
>> Andreas
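To make the indexing scheme concrete, here is a pure-Python model of the offset computation described in the quoted message (illustrative only; strided_offset is a made-up name, but the generated CUDA code performs the same arithmetic per thread, which is exactly where the modulo cost Andreas mentions comes from):

    # Map a flat element index i to a memory offset (in elements),
    # using the view's shape and per-axis strides: one variable-
    # denominator div/mod pair per axis.
    def strided_offset(i, shape, strides):
        off = 0
        for dim in reversed(range(len(shape))):
            off += (i % shape[dim]) * strides[dim]
            i //= shape[dim]
        return off

    # Example: the view b1[::2, :-1] of the 1000x100 array above has
    # shape (500, 99) and element strides (200, 1).
    assert strided_offset(0, (500, 99), (200, 1)) == 0     # element (0, 0)
    assert strided_offset(99, (500, 99), (200, 1)) == 200  # wraps to (1, 0)

Each extra axis adds another div/mod pair per element, which is what makes the dimension fusing Frédéric describes attractive.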
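Along the same lines, here is a toy version of that dimension fusing (a simplified sketch, not the libgpuarray implementation linked above, which also handles broadcasting): adjacent axes merge whenever one step along the outer axis equals shape-many steps along the inner one, so a tensor that is noncontiguous along only one axis needs fewer div/mod pairs in the kernel.

    # Merge adjacent axes that are jointly contiguous, reducing the
    # number of dimensions the kernel has to index.
    def fuse_dims(shape, strides):
        shape, strides = list(shape), list(strides)
        d = 0
        while d < len(shape) - 1:
            # Axes d and d+1 can merge if one step along axis d equals
            # shape[d+1] steps along axis d+1.
            if strides[d] == strides[d + 1] * shape[d + 1]:
                shape[d] *= shape[d + 1]
                strides[d] = strides[d + 1]
                del shape[d + 1], strides[d + 1]
            else:
                d += 1
        return tuple(shape), tuple(strides)

    # A 3D view noncontiguous only along axis 0, e.g. a[::2] of a
    # C-contiguous 10x10x20 array, collapses to 2D indexing:
    print(fuse_dims((5, 10, 20), (400, 20, 1)))  # ((5, 200), (400, 1))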
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda