Hi Nathan,

First, the failed context exception was thrown because you did not release the context when the first exception occurred, namely "pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch failed". That LaunchError is the real problem.
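To avoid the secondary failed-context error, make sure the context is released even when a launch fails. A rough sketch, assuming you create the context yourself on a single device (alternatively, "import pycuda.autoinit" sets up the context and tears it down for you at exit):

    import pycuda.driver as cuda

    cuda.init()
    ctx = cuda.Device(0).make_context()
    try:
        # ... mem_alloc, kernel launch, memcpy_dtoh, etc. ...
        pass
    finally:
        # Pop the context even if cuMemcpyDtoH raises a LaunchError,
        # so the failed-context error is not raised on top of it.
        ctx.pop()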
The reason for the LaunchError seems to be the following:

1. You allocate an int16 CPU array:
> out = zeros((ywin,xwin),dtype=numpy.int16)
2. You create a GPU buffer with the same element size as "out":
> out_gpu = cuda.mem_alloc(out.size * out.dtype.itemsize)
3. You define a kernel that takes an int array as its first argument:
> __global__ void classify(int *out, float *slope, float *tpi)
4. And then you pass your 16-bit int buffer to this function as if it were a 32-bit int array:
> classify(out_gpu,slope_gpu,tpi_gpu,block=(32,32,1),grid=(209,209))
5. As a result, the kernel writes beyond the bounds of this buffer, and the GPU-CPU synchronization (which, in your case, happens when you transfer the data from device to host) fails.

In addition, check that the item size of slopeArray and tpiArray is sizeof(float). A short sketch of the fix is in the P.S. below.

Best regards,
Bogdan
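P.S. Below is a minimal sketch of the dtype fix, not your exact code: it reuses the names from your post (ywin, xwin, slopeArray, tpiArray, slope_gpu, tpi_gpu, classify) and assumes classify came from SourceModule(...).get_function("classify").

    import numpy
    import pycuda.driver as cuda

    # Make the host array match the kernel's "int *out" (32-bit ints),
    # so out_gpu is large enough for what the kernel actually writes.
    # (Alternatively, keep int16 on the host and declare the kernel
    # parameter as "short *out".)
    out = numpy.zeros((ywin, xwin), dtype=numpy.int32)
    out_gpu = cuda.mem_alloc(out.nbytes)

    # Make sure the float inputs really are 32-bit before uploading:
    slope = slopeArray.astype(numpy.float32)
    tpi = tpiArray.astype(numpy.float32)
    slope_gpu = cuda.mem_alloc(slope.nbytes)
    tpi_gpu = cuda.mem_alloc(tpi.nbytes)
    cuda.memcpy_htod(slope_gpu, slope)
    cuda.memcpy_htod(tpi_gpu, tpi)

    classify(out_gpu, slope_gpu, tpi_gpu, block=(32, 32, 1), grid=(209, 209))
    cuda.memcpy_dtoh(out, out_gpu)

The key point is simply that out.dtype.itemsize must match the element size the kernel writes into the buffer.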