Hi PyCUDA community,

I'm rather new to programming with PyCUDA and am currently trying to
implement a shallow water wave equation solver. It has worked pretty well so
far, but the major bottleneck in terms of speed is the iteration process. I
have been searching for several hours for an appropriate solution, and I still
end up with a loop in Python that calls 4 separate kernels in each iteration
step.

The version that has worked so far used direct driver.In() and driver.Out()
calls for each kernel, but that is pretty slow; keeping the data on the device
is significantly faster. However, when I work with explicit CUDA memory
allocation, all elements are zero after the first iteration. What is my error?
I attached the code, but here is the current structure:

    # Allocating and copying all arrays on the device:
    u_gpu = drv.mem_alloc(u.nbytes)
    v_gpu = drv.mem_alloc(v.nbytes)
    eta_gpu = drv.mem_alloc(eta.nbytes)
    …
    drv.memcpy_htod(u_gpu, u)
    … etc.
    …
    for i in range(n_iterations):
        u_old_gpu = u_gpu
        Kernel1(u_gpu, u_old_gpu, v_gpu, … grid, block)
        v_old_gpu = v_gpu
        Kernel2(v_gpu, v_old_gpu, u_gpu, … grid, block)
        # Kernel3: needs kernel2 and kernel1 to finish beforehand
        # Kernel4: needs kernel3 to finish beforehand
        # etc.

and then copying everything back. If n_iterations > 1, all arrays are filled
with zeros, unless I copy the data back to the host and then back to the
device again. The same error shows up when I use gpuarrays, so I guess I have
a logic error in my CUDA application, maybe a missing device synchronization
or something similar?

I'm working with a GeForce GTX 970 and an i7 3770K on Windows 10.

Looking forward to your answer, and thanks in advance!

Cheers,
Andreas
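
Note on the loop structure above: in Python, an assignment such as
u_old_gpu = u_gpu binds a second name to the same allocation and does not copy
any data. Whether this is related to the zeros depends on the kernels in the
attachment, which are not shown here, so the following is only a minimal
sketch of the aliasing behaviour. It uses the standard-library array type as a
stand-in for a device buffer (an assumption, so that it runs without a GPU),
and the in-place loop is a hypothetical placeholder for a kernel launch.

```python
from array import array

# Stand-in for the device buffer u_gpu (assumption: real code uses drv.mem_alloc)
u = array("f", [0.0, 1.0, 2.0, 3.0])

# What `u_old_gpu = u_gpu` does: a second name for the same storage, no copy
u_old_alias = u

# An explicit copy into separate storage
u_old_copy = array("f", u)

# Hypothetical placeholder for a kernel that overwrites u in place
for i in range(len(u)):
    u[i] = -1.0

# The alias sees the overwritten values; the copy still holds the old ones
alias_sees_new_values = all(x == -1.0 for x in u_old_alias)
copy_keeps_old_values = list(u_old_copy) == [0.0, 1.0, 2.0, 3.0]
```

On the device, a separate "old" buffer would need its own drv.mem_alloc, and
PyCUDA provides drv.memcpy_dtod(dest, src, size) for a device-to-device copy
into it.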

Attachment: pycudaTest_self1.py
Description: Binary data

