On 06/21/2011 08:59 AM, Matthew Graham wrote:
> Hi Peter,
>
> Unfortunately it's a GeForce 9400M, and running with smaller dimensions also
> works fine, so this does seem to explain the problem. Is there any GPU pattern
> for deconstructing nested loops to get around this issue?
>
>       Cheers,
>
>       Matthew
>

Hey,

it depends a little on your problem, I'd say. I'm not a GPU expert and
others might know this better, but from what I've seen so far, you often
have to completely rethink your programming strategy. Right now you're
running a 1D array of threads with 4 nested loops inside each kernel,
which is threading the way you'd do it on the CPU.

Try to see if you can restructure the work onto a 2D grid or even a 3D
block without getting lost in the indexing scheme (not trivial in the
beginning... I'd advise thorough testing). For instance, your first loop
does 8 iterations, so you could use a block of size 16x8 or something
like that and get rid of that loop -> only 3 left. The next loop also
has 8 iterations, but you cannot directly go to a 3D block of 16x8x8,
because that would mean 1024 threads and your card only allows 512 per
block. You could use an 8x8x8 block instead and double your grid size;
you can also play with the grid dimensions. There's a rough sketch of
this idea below. Anyway, I think there are quite a few ways to rewrite
the same code in a more parallel way of thinking. The idea is to replace
a loop with a set of threads, and since you can parallelize massively
(up to 65535 x 65535 blocks on a 2D grid on your GPU), you can consider
schemes that would be horribly slow with CPU threading but very quick on
the GPU.
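
Here is roughly what I mean, as an untested sketch. I'm only using the
stub kernel from your mail, so I dropped the unused nodes/dissim
arguments, and "partial" is a scratch array I made up for the example.
The i and j loops become threadIdx.y and threadIdx.z of an 8x8x8 block,
and since every (i, j) thread now produces its own partial sum, the
partials get summed on the host afterwards (a real version might do an
on-device reduction instead):

 __global__ void NearestPrototypeKernel(float *gj, float *partial)
 {
   // one thread per (vector, i, j) combination
   int idx = blockIdx.x * blockDim.x + threadIdx.x;  // vector index
   int i = threadIdx.y;  // replaces the old "for i" loop
   int j = threadIdx.z;  // replaces the old "for j" loop
   float value = 0.f;

   // only the two inner loops over the 2031 dimensions remain
   for (int l = 0; l < 2031; ++l) {
     float wp_l = gj[idx * 2031 + l];
     if (wp_l > 0) {
       for (int k = 0; k < 2031; ++k) {
         value += 0.01f;
       }
     }
   }

   // NB: your stub never actually uses i or j, so all 64 partials per
   // vector come out identical here; a real kernel would index with them
   partial[(idx * 8 + i) * 8 + j] = value;
 }

and on the host side:

 partial = gpuarray.zeros((NV, 8, 8), np.float32)
 npker(gj, partial,
       # 8x8x8 = 512 threads, the per-block maximum on your card
       block = (8, 8, 8),
       # 8 vectors per block in x -> 5632 / 8 = 704 blocks
       grid = (NV // 8, 1))
 nearest = partial.get().reshape(NV, 64).sum(axis=1)

Note that the total amount of work is unchanged, so if this still trips
the 5 s watchdog you may have to split the launch into several smaller
kernel calls as well. You can check whether the run-time limit applies
to your device with

 print pycuda.autoinit.device.get_attribute(driver.device_attribute.KERNEL_EXEC_TIMEOUT)

(1 means a watchdog is active).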


++

Peter

>
> On Jun 20, 2011, at 11:46 PM, Peter Schmidtke wrote:
>
>> Dear Matthew,
>>
>> first of all, what GPU are you using? A GeForce or a Tesla or
>> something like that?
>>
>> I see a quadruple loop in your kernel. If it's quick to execute,
>> there's no problem with that. But be careful: on a GeForce, for
>> instance, there's a limit on how long a kernel can run (5 s, I think),
>> so it cannot run forever. I don't think that's the case with Teslas.
>> Try executing something shorter and see if it works... then add the
>> sub-loops back one at a time and you'll see where it gets too slow.
>>
>> ++
>>
>> Peter
>>
>>
>> On 06/21/2011 08:05 AM, Matthew Graham wrote:
>>> Hi,
>>>
>>> I'm having problems with big arrays; this crashes horribly:
>>>
>>> from pycuda import driver, compiler, gpuarray, tools
>>> from pycuda.curandom import rand as curand
>>> import numpy as np
>>> import pycuda.autoinit
>>>
>>> NV = 5632 # No of vectors
>>> DIM = 2031 # No of dimensions per vector
>>> BLOCK_SIZE = 16 # 512 max threads per mp
>>> GRID_SIZE = 352
>>> HEIGHT = 8
>>> WIDTH = 8
>>>
>>> gj = curand((NV, DIM))
>>> nodes = curand((HEIGHT, WIDTH, DIM)) # Each row is a vector
>>> dissim = curand((DIM,DIM))
>>> nearest = gpuarray.zeros((NV), np.float32)
>>>
>>> kernel_code = '''
>>>  __global__ void NearestPrototypeKernel(float *gj, float *nodes, float *dissim, float *nearest)
>>>  {
>>>    // Element
>>>    int idx = blockIdx.x * blockDim.x + threadIdx.x;
>>>    float value = 0.;
>>>
>>>    for (int i = 0; i < 8; ++i) {
>>>      for (int j = 0; j < 8; ++j) {
>>>        for (int l = 0; l < 2031; ++l) {
>>>          float wp_l = gj[idx * 2031 + l];
>>>          if (wp_l > 0) {
>>>            for (int k = 0; k < 2031; ++k) {
>>>              value += 0.01;
>>>            }
>>>          }
>>>        }
>>>      }
>>>    }
>>>
>>>    nearest[idx] = value;
>>>  }
>>> '''
>>> mod = compiler.SourceModule(kernel_code)
>>> npker = mod.get_function("NearestPrototypeKernel")
>>> npker(
>>>    # inputs
>>>    gj, nodes, dissim,
>>>    # output
>>>    nearest,
>>>    # block of multiple threads
>>>    block = (BLOCK_SIZE, 1, 1),
>>>    # grid of blocks
>>>    grid = (GRID_SIZE, 1)
>>> )
>>> a = nearest.get()
>>>
>>> ---
>>>
>>> with this:
>>>
>>> /Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/compiler.py:122:
>>>  UserWarning: The CUDA compiler suceeded, but said the following:
>>> ptxas /tmp/tmpxft_000003e5_00000000-2_kernel.ptx, line 93; warning : Double 
>>> is not supported. Demoting to float
>>>
>>>  +stdout+stderr)
>>> Traceback (most recent call last):
>>>  File "kernel_test.py", line 54, in <module>
>>>    a = nearest.get()
>>>  File 
>>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/gpuarray.py",
>>>  line 177, in get
>>>    drv.memcpy_dtoh(ary, self.gpudata)
>>> pycuda._driver.LaunchError: cuMemcpyDtoH failed: launch timeout
>>> Error in atexit._run_exitfuncs:
>>> Traceback (most recent call last):
>>>  File 
>>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>>>  line 24, in _run_exitfuncs
>>>    func(*targs, **kargs)
>>>  File 
>>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>>>  line 13, in _finish_up
>>>    context.pop()
>>> LaunchError: cuCtxPopCurrent failed: launch timeout
>>> Error in sys.exitfunc:
>>> Traceback (most recent call last):
>>>  File 
>>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/atexit.py",
>>>  line 24, in _run_exitfuncs
>>>    func(*targs, **kargs)
>>>  File 
>>> "/Library/Frameworks/Python.framework/Versions/6.0.0/lib/python2.6/site-packages/pycuda-2011.1-py2.6-macosx-10.5-i386.egg/pycuda/autoinit.py",
>>>  line 13, in _finish_up
>>>    context.pop()
>>> pycuda._driver.LaunchError: cuCtxPopCurrent failed: launch timeout
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuModuleUnload failed: invalid context
>>> PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
>>> cuMemFree failed: invalid context
>>> -------------------------------------------------------------------
>>> PyCUDA ERROR: The context stack was not empty upon module cleanup.
>>> -------------------------------------------------------------------
>>> A context was still active when the context stack was being
>>> cleaned up. At this point in our execution, CUDA may already
>>> have been deinitialized, so there is no way we can finish
>>> cleanly. The program will be aborted now.
>>> Use Context.pop() to avoid this problem.
>>> -------------------------------------------------------------------
>>> Abort
>>>
>>> ----
>>>
>>> Does anybody have any idea to get around this?
>>>
>>>     Cheers,
>>>
>>>     Matthew
>>
>> -- 
>>
>> Peter Schmidtke
>> PhD Student
>> Dept. Physical Chemistry
>> Faculty of Pharmacy
>> University of Barcelona
>>


-- 

Peter Schmidtke
PhD Student
Dept. Physical Chemistry
Faculty of Pharmacy
University of Barcelona


_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
