Hi Nathan,

First, the context-related exception was thrown because you did not
release the context when the first exception occurred, namely
"pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch failed". That
LaunchError is the real problem.
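
If you create the context yourself (with make_context rather than
pycuda.autoinit), a try/finally block is a simple way to make sure it is
popped even when a launch fails. A minimal sketch, with the actual work
left as a placeholder:

    import pycuda.driver as cuda

    cuda.init()
    ctx = cuda.Device(0).make_context()
    try:
        # module load, kernel launches and memcpys go here
        pass
    finally:
        # Release the context even if a LaunchError was raised above,
        # so it does not trigger a second, context-related exception.
        ctx.pop()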

The reason seems to be the following:

1. You allocate an int16 CPU array:
>         out = zeros((ywin,xwin),dtype=numpy.int16)

2. You create a GPU buffer sized using the element size of "out" (2 bytes per element):
>         out_gpu = cuda.mem_alloc(out.size * out.dtype.itemsize)

3. You define a kernel that takes a 32-bit int array as its first argument:
>                __global__ void classify(int *out, float *slope, float *tpi)

4. And then you pass this buffer, sized for 16-bit ints, to the function as a 32-bit int array:
>         classify(out_gpu,slope_gpu,tpi_gpu,block=(32,32,1),grid=(209,209))

5. As a result, the kernel writes beyond the bounds of this buffer, and
the GPU-CPU synchronization (which, in your case, happens when data is
transferred from device to host) fails; see the sketch of a fix after
this list.
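
The simplest fix is to make the element sizes agree on both sides. A
sketch under the assumption that you keep "int *out" in the kernel (the
dimensions here are placeholders; alternatively you could declare the
kernel parameter as "short *out" and keep int16 on the host):

    import numpy
    import pycuda.driver as cuda
    import pycuda.autoinit  # creates and cleans up a context

    # Placeholder dimensions; use your real ywin and xwin.
    ywin, xwin = 32 * 209, 32 * 209

    # int32 on the host matches the kernel's "int *out" parameter, so the
    # device buffer below is large enough for everything the kernel writes.
    out = numpy.zeros((ywin, xwin), dtype=numpy.int32)
    out_gpu = cuda.mem_alloc(out.size * out.dtype.itemsize)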

In addition, check that the item size of slopeArray and tpiArray is
sizeof(float).
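
A quick way to check and fix that on the Python side, sketched here with
dummy arrays standing in for your real ones:

    import numpy

    # Dummy stand-ins; in your script these are the arrays you copy to the GPU.
    slopeArray = numpy.zeros((64, 64))   # numpy's default dtype is float64 (8 bytes)
    tpiArray = numpy.zeros((64, 64))

    print(slopeArray.dtype.itemsize)     # 8 here, while the kernel expects 4

    # The kernel's "float *slope" and "float *tpi" parameters expect 4-byte
    # floats, so cast before copying (astype returns a float32 copy):
    slopeArray = slopeArray.astype(numpy.float32)
    tpiArray = tpiArray.astype(numpy.float32)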

Best regards,
Bogdan
