Hello Matthew,

Your hypothesis is correct: the kernel takes too long to execute and the OS just terminates it (the exact threshold depends on the OS; Windows uses something like 5 seconds, while on my Linux box your program does not crash at all and takes about a minute to run).
Now, how to fix it. You have too many loops inside the kernel, which is always a bad sign: you have to think in parallel. I do not know which parts you have cut out, but what is left looks like it can be transformed into the successive application of two operations to the whole array:

1. Map: if a value > 0, add 0.01 to the corresponding accumulator. This can be done in parallel straightforwardly, processing one or several indices per thread, and PyCUDA makes it even easier; see this part of the docs:
   http://documen.tician.de/pycuda/array.html#module-pycuda.elementwise

2. Reduce: sum neighbouring nodes in the accumulator, producing the 'nearest' array of size NV. This is not as straightforward as Map, but it is a well-known problem, and PyCUDA has corresponding functions too:
   http://documen.tician.de/pycuda/array.html#module-pycuda.reduction

Best regards,
Bogdan

On Tue, Jun 21, 2011 at 4:05 PM, Matthew Graham <m...@cacr.caltech.edu> wrote:
> Hi,
>
> I'm having problems with big arrays; this crashes horribly:
>
> from pycuda import driver, compiler, gpuarray, tools
> from pycuda.curandom import rand as curand
> import numpy as np
> import pycuda.autoinit
>
> NV = 5632        # No of vectors
> DIM = 2031       # No of dimensions per vector
> BLOCK_SIZE = 16  # 512 max threads per mp
> GRID_SIZE = 352
> HEIGHT = 8
> WIDTH = 8
>
> gj = curand((NV, DIM))
> nodes = curand((HEIGHT, WIDTH, DIM))  # Each row is a vector
> dissim = curand((DIM, DIM))
> nearest = gpuarray.zeros((NV), np.float32)
>
> kernel_code = '''
> __global__ void NearestPrototypeKernel(float *gj, float *nodes,
>                                        float *dissim, float *nearest)
> {
>     // Element
>     int idx = blockIdx.x * blockDim.x + threadIdx.x;
>     float value = 0.;
>
>     for (int i = 0; i < 8; ++i) {
>         for (int j = 0; j < 8; ++j) {
>             for (int l = 0; l < 2031; ++l) {
>                 float wp_l = gj[idx * 2031 + l];
>                 if (wp_l > 0) {
>                     for (int k = 0; k < 2031; ++k) {
>                         value += 0.01;
>                     }
>                 }
>             }
>         }
>     }
>
>     nearest[idx] = value;
> }
> '''
> mod = compiler.SourceModule(kernel_code)
> npker = mod.get_function("NearestPrototypeKernel")
> npker(
>     # inputs
>     gj, nodes, dissim,
>     # output
>     nearest,
>     # block of multiple threads
>     block = (BLOCK_SIZE, 1, 1),
>     # grid of blocks
>     grid = (GRID_SIZE, 1)
> )
> a = nearest.get()
>
> ---
>
> with this:
>
> /Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/compiler.py:122:
> UserWarning: The CUDA compiler suceeded, but said the following:
> ptxas /tmp/tmpxft_000003e5_00000000-2_kernel.ptx, line 93; warning : Double
> is not supported. Demoting to float
>
>   +stdout+stderr)
> Traceback (most recent call last):
>   File "kernel_test.py", line 54, in <module>
>     a = nearest.get()
>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/gpuarray.py",
> line 177, in get
>     drv.memcpy_dtoh(ary, self.gpudata)
> pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
> line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
> line 13, in _finish_up
>     context.pop()
> LaunchError: cuCtxPopCurrent failed: launch timeout
> Error in sys.exitfunc:
> Traceback (most recent call last):
>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
> line 24, in _run_exitfuncs
>     func(*targs, **kargs)
>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
> line 13, in _finish_up
>     context.pop()
> pycuda._driver.LaunchError: cuCtxPopCurrent failed: launch timeout
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuModuleUnload failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> -------------------------------------------------------------------
> PyCUDA ERROR: The context stack was not empty upon module cleanup.
> -------------------------------------------------------------------
> A context was still active when the context stack was being
> cleaned up. At this point in our execution, CUDA may already
> have been deinitialized, so there is no way we can finish
> cleanly. The program will be aborted now.
> Use Context.pop() to avoid this problem.
> -------------------------------------------------------------------
> Abort
>
> ----
>
> Does anybody have any idea how to get around this?
>
> Cheers,
>
> Matthew
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda
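P.S. To make the two-step decomposition concrete, here is a minimal CPU sketch of the Map and Reduce steps in NumPy (the function name is mine, not from your code; on the GPU the Map step would be written with pycuda.elementwise.ElementwiseKernel and the Reduce step with pycuda.reduction.ReductionKernel from the pages linked above):

```python
import numpy as np

def map_reduce_nearest(gj):
    """Two-step map/reduce over an (NV, DIM) array.

    Map: every positive element contributes 0.01 to its accumulator
    slot; this is embarrassingly parallel (one thread per element on
    the GPU). Reduce: sum each row's accumulator down to one value
    per vector, giving an array of size NV.
    """
    acc = np.where(gj > 0, np.float32(0.01), np.float32(0.0))  # Map
    return acc.sum(axis=1, dtype=np.float32)                   # Reduce

# Small demonstration on random data
rng = np.random.default_rng(0)
gj = rng.standard_normal((4, 8)).astype(np.float32)
print(map_reduce_nearest(gj))  # one value per vector
```

The point is that no single GPU thread ever runs your long nested loops; each thread does a tiny constant amount of work, so the launch stays well under the watchdog limit.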