Hello Matthew,

Your hypothesis is correct: the kernel takes too long to execute, and
the OS simply terminates it (the exact threshold depends on the OS;
Windows uses something like 5 s, and on my Linux box your program does
not crash at all and takes about a minute to execute).

Now, how to fix it. You have too many loops inside the kernel, which
is always a bad sign — you have to think in parallel. I do not know
which parts you have cut out, but what is left looks like it can be
transformed into a successive application of two operations to the
whole array:

1. Map: if a value is > 0, add 0.01 to the corresponding accumulator
(this can be done in parallel straightforwardly, processing one or
several indices per thread). PyCUDA makes this even easier; see this
part of the docs:
http://documen.tician.de/pycuda/array.html#module-pycuda.elementwise

2. Reduce: sum neighbouring nodes in the accumulator, producing the
'nearest' array of size NV. This is not as straightforward as the map
step, but it is a well-known problem, and PyCUDA has corresponding
functions too:
http://documen.tician.de/pycuda/array.html#module-pycuda.reduction
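
To make the decomposition concrete, here is a plain-NumPy sketch on
the CPU of what the two GPU steps would compute (I have left out the
constant factors your cut-out i/j/k loops contribute, since I do not
know what they really do — treat the shapes and the 0.01 rule as
assumptions taken from your posted code):

```python
import numpy as np

# Shapes taken from the original post
NV, DIM = 5632, 2031

rng = np.random.default_rng(0)
gj = rng.random((NV, DIM), dtype=np.float32)  # stand-in for curand((NV, DIM))

# 1. Map: one independent decision per element — this is the step
#    pycuda.elementwise.ElementwiseKernel would run in parallel on the GPU.
acc = np.where(gj > 0, np.float32(0.01), np.float32(0.0))

# 2. Reduce: collapse each row of the accumulator to a single value,
#    producing 'nearest' of size NV — the job of
#    pycuda.reduction.ReductionKernel (or a segmented sum).
nearest = acc.sum(axis=1)

print(nearest.shape)  # (5632,)
```

Each of the two steps is a single bulk operation over the whole array,
so neither one needs a long-running loop inside any single thread.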

Best regards,
Bogdan

On Tue, Jun 21, 2011 at 4:05 PM, Matthew Graham <m...@cacr.caltech.edu> wrote:
> Hi,
>
> I'm having problems with big arrays; this crashes horribly:
>
> from pycuda import driver, compiler, gpuarray, tools
> from pycuda.curandom import rand as curand
> import numpy as np
> import pycuda.autoinit
>
> NV = 5632 # No of vectors
> DIM = 2031 # No of dimensions per vector
> BLOCK_SIZE = 16 # 512 max threads per mp
> GRID_SIZE = 352
> HEIGHT = 8
> WIDTH = 8
>
> gj = curand((NV, DIM))
> nodes = curand((HEIGHT, WIDTH, DIM)) # Each row is a vector
> dissim = curand((DIM,DIM))
> nearest = gpuarray.zeros((NV), np.float32)
>
> kernel_code = '''
>  __global__ void NearestPrototypeKernel(float *gj, float *nodes, float *dissim, float *nearest)
>  {
>    // Element
>    int idx = blockIdx.x * blockDim.x + threadIdx.x;
>    float value = 0.;
>
>    for (int i = 0; i < 8; ++i) {
>      for (int j = 0; j < 8; ++j) {
>        for (int l = 0; l < 2031; ++l) {
>          float wp_l = gj[idx * 2031 + l];
>          if (wp_l > 0) {
>            for (int k = 0; k < 2031; ++k) {
>              value += 0.01;
>            }
>          }
>        }
>      }
>    }
>
>    nearest[idx] = value;
>  }
> '''
> mod = compiler.SourceModule(kernel_code)
> npker = mod.get_function("NearestPrototypeKernel")
> npker(
>    # inputs
>    gj, nodes, dissim,
>    # output
>    nearest,
>    # block of multiple threads
>    block = (BLOCK_SIZE, 1, 1),
>    # grid of blocks
>    grid = (GRID_SIZE, 1)
> )
> a = nearest.get()
>
> ---
>
> with this:
>
> /Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/compiler.py:122:
>  UserWarning: The CUDA compiler suceeded, but said the following:
> ptxas /tmp/tmpxft_000003e5_00000000-2_kernel.ptx, line 93; warning : Double 
> is not supported. Demoting to float
>
>  +stdout+stderr)
> Traceback (most recent call last):
>  File "kernel_test.py", line 54, in <module>
>    a = nearest.get()
>  File 
> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/gpuarray.py",
>  line 177, in get
>    drv.memcpy_dtoh(ary, self.gpudata)
> pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
> Error in atexit._run_exitfuncs:
> Traceback (most recent call last):
>  File 
> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>  line 24, in _run_exitfuncs
>    func(*targs, **kargs)
>  File 
> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>  line 13, in _finish_up
>    context.pop()
> LaunchError: cuCtxPopCurrent failed: launch timeout
> Error in sys.exitfunc:
> Traceback (most recent call last):
>  File 
> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>  line 24, in _run_exitfuncs
>    func(*targs, **kargs)
>  File 
> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>  line 13, in _finish_up
>    context.pop()
> pycuda._driver.LaunchError: cuCtxPopCurrent failed: launch timeout
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuModuleUnload failed: invalid context
> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
> cuMemFree failed: invalid context
> -------------------------------------------------------------------
> PyCUDA ERROR: The context stack was not empty upon module cleanup.
> -------------------------------------------------------------------
> A context was still active when the context stack was being
> cleaned up. At this point in our execution, CUDA may already
> have been deinitialized, so there is no way we can finish
> cleanly. The program will be aborted now.
> Use Context.pop() to avoid this problem.
> -------------------------------------------------------------------
> Abort
>
> ----
>
> Does anybody have any idea to get around this?
>
>        Cheers,
>
>        Matthew
> _______________________________________________
> PyCUDA mailing list
> PyCUDA@tiker.net
> http://lists.tiker.net/listinfo/pycuda
>
