On 06/21/2011 08:59 AM, Matthew Graham wrote:
> Hi Peter,
>
> Unfortunately it's a GeForce 9400M, and running with smaller dimensions also
> works fine, so this does seem to explain the problem. Is there any GPU
> pattern for deconstructing nested loops to get around this issue?
>
> Cheers,
>
> Matthew
>
Hey,

It depends a little on your problem, I'd say. I'm not a GPU expert and others may know this better, but from what I have seen so far you often have to rethink your programming strategy completely. Right now you are running a 1D array of threads with four nested loops in each kernel; that is threading the way you would do it on a CPU. See whether you can reshuffle the work onto a 2D grid, or even into a 3D block, without getting lost in the indexing scheme (not trivial in the beginning... I advise thorough testing). For instance, your first loop does 8 iterations, so you could use a block of size 16x8 or something like that and get rid of that loop, leaving only three. The next loop also has 8 iterations, but you cannot go directly to a 3D block of 16x8x8, because that would mean 1024 threads per block; you could use an 8x8x8 block instead and double your grid size. You can also play with the grid dimensions... Anyway, there are quite a few ways to rewrite the same code in a more parallel way of thinking. The idea is to replace a loop with a set of threads, and since you can parallelize massively (up to 65535 x 65535 blocks on a 2D grid on your GPU), you can consider schemes that would be horribly slow with CPU threading but very quick on the GPU.

++

Peter

> On Jun 20, 2011, at 11:46 PM, Peter Schmidtke wrote:
>
>> Dear Matthew,
>>
>> First of all, what GPU are you using? A GeForce, a Tesla, or something
>> like that?
>>
>> I see a quadruple loop in your kernel. If it's quick to execute, there's
>> no problem with that. Be careful, however: on a GeForce, for instance,
>> there is a limit on how long a kernel can run (5 s, I think), so the
>> kernel cannot run forever. I believe that's not the case with Teslas.
>> Try executing something shorter and see if it works... then add another
>> sub-loop and you'll see when it becomes too slow.
>>
>> ++
>>
>> Peter
>>
>>
>> On 06/21/2011 08:05 AM, Matthew Graham wrote:
>>
>>> Hi,
>>>
>>> I'm having problems with big arrays; this crashes horribly:
>>>
>>> from pycuda import driver, compiler, gpuarray, tools
>>> from pycuda.curandom import rand as curand
>>> import numpy as np
>>> import pycuda.autoinit
>>>
>>> NV = 5632        # No of vectors
>>> DIM = 2031       # No of dimensions per vector
>>> BLOCK_SIZE = 16  # 512 max threads per mp
>>> GRID_SIZE = 352
>>> HEIGHT = 8
>>> WIDTH = 8
>>>
>>> gj = curand((NV, DIM))
>>> nodes = curand((HEIGHT, WIDTH, DIM))  # Each row is a vector
>>> dissim = curand((DIM, DIM))
>>> nearest = gpuarray.zeros((NV,), np.float32)
>>>
>>> kernel_code = '''
>>> __global__ void NearestPrototypeKernel(float *gj, float *nodes,
>>>                                        float *dissim, float *nearest)
>>> {
>>>     // Element
>>>     int idx = blockIdx.x * blockDim.x + threadIdx.x;
>>>     float value = 0.;
>>>
>>>     for (int i = 0; i < 8; ++i) {
>>>         for (int j = 0; j < 8; ++j) {
>>>             for (int l = 0; l < 2031; ++l) {
>>>                 float wp_l = gj[idx * 2031 + l];
>>>                 if (wp_l > 0) {
>>>                     for (int k = 0; k < 2031; ++k) {
>>>                         value += 0.01;
>>>                     }
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>>     nearest[idx] = value;
>>> }
>>> '''
>>> mod = compiler.SourceModule(kernel_code)
>>> npker = mod.get_function("NearestPrototypeKernel")
>>> npker(
>>>     # inputs
>>>     gj, nodes, dissim,
>>>     # output
>>>     nearest,
>>>     # block of multiple threads
>>>     block = (BLOCK_SIZE, 1, 1),
>>>     # grid of blocks
>>>     grid = (GRID_SIZE, 1)
>>> )
>>> a = nearest.get()
>>>
>>> ---
>>>
>>> with this:
>>>
>>> /Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/compiler.py:122:
>>> UserWarning: The CUDA compiler suceeded, but said the following:
>>> ptxas /tmp/tmpxft_000003e5_00000000-2_kernel.ptx, line 93; warning : Double
>>> is not supported.
>>> Demoting to float
>>>
>>>   +stdout+stderr)
>>> Traceback (most recent call last):
>>>   File "kernel_test.py", line 54, in <module>
>>>     a = nearest.get()
>>>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/gpuarray.py", line 177, in get
>>>     drv.memcpy_dtoh(ary, self.gpudata)
>>> pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
>>> Error in atexit._run_exitfuncs:
>>> Traceback (most recent call last):
>>>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
>>>     func(*targs, **kargs)
>>>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py", line 13, in _finish_up
>>>     context.pop()
>>> LaunchError: cuCtxPopCurrent failed: launch timeout
>>> Error in sys.exitfunc:
>>> Traceback (most recent call last):
>>>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py", line 24, in _run_exitfuncs
>>>     func(*targs, **kargs)
>>>   File "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py", line 13, in _finish_up
>>>     context.pop()
>>> pycuda._driver.LaunchError: cuCtxPopCurrent failed: launch timeout
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuModuleUnload failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> -------------------------------------------------------------------
>>> PyCUDA ERROR: The context stack was not empty upon module cleanup.
>>> -------------------------------------------------------------------
>>> A context was still active when the context stack was being
>>> cleaned up. At this point in our execution, CUDA may already
>>> have been deinitialized, so there is no way we can finish
>>> cleanly. The program will be aborted now.
>>> Use Context.pop() to avoid this problem.
>>> -------------------------------------------------------------------
>>> Abort
>>>
>>> ----
>>>
>>> Does anybody have any idea to get around this?
>>>
>>> Cheers,
>>>
>>> Matthew
>>> _______________________________________________
>>> PyCUDA mailing list
>>> PyCUDA@tiker.net
>>> http://lists.tiker.net/listinfo/pycuda
>>
>> --
>>
>> Peter Schmidtke
>> PhD Student
>> Dept. Physical Chemistry
>> Faculty of Pharmacy
>> University of Barcelona
>

--

Peter Schmidtke
PhD Student
Dept. Physical Chemistry
Faculty of Pharmacy
University of Barcelona

_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
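The loop-to-threads restructuring Peter suggests can be checked without a GPU. The sketch below (plain Python, no PyCUDA required; the variable names are illustrative, not from any API) verifies that an 8x8x8 block, with the grid covering the elements 8 at a time, enumerates exactly the same (element, i, j) work items as the two 8-iteration loops inside Matthew's original 1D kernel:

```python
# CPU-only sketch of the index bookkeeping behind moving two 8-iteration
# loops out of the kernel and into the block dimensions.

NV = 5632            # number of vectors, as in Matthew's script
HEIGHT, WIDTH = 8, 8  # trip counts of the two outer loops

# Original scheme: one thread per element, with i and j loops inside it.
loop_triples = {(idx, i, j)
                for idx in range(NV)
                for i in range(HEIGHT)
                for j in range(WIDTH)}

# Restructured scheme: a 3D block of 8x8x8 = 512 threads (within the
# per-block limit on a GeForce 9400M), where threadIdx.y and threadIdx.z
# replace the i and j loops, and blockIdx.x/threadIdx.x cover the elements.
BLOCK = (8, 8, 8)
GRID_X = NV // BLOCK[0]          # 704 blocks along x

thread_triples = set()
for bx in range(GRID_X):                  # blockIdx.x
    for tx in range(BLOCK[0]):            # threadIdx.x
        for ty in range(BLOCK[1]):        # threadIdx.y -> former i loop
            for tz in range(BLOCK[2]):    # threadIdx.z -> former j loop
                idx = bx * BLOCK[0] + tx  # element index
                thread_triples.add((idx, ty, tz))

# Both schemes cover the identical set of work items.
assert thread_triples == loop_triples
print(len(thread_triples))  # prints 360448 (5632 * 8 * 8 independent threads)
```

On the GPU the same decomposition would be launched with block=(8, 8, 8) and grid=(NV // 8, 1), leaving only the two inner l/k loops in the kernel body; each kernel invocation then does 64x less work, which helps it finish inside the GeForce watchdog limit.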