Hi,
I'm currently working on a CFD code in PyCUDA and have a problem with the 
parallel prefix sum. I need it to copy the elements that are close to the free 
surface into a small linear array, but a call to the PyCUDA scan implementation 
takes too much time.
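
To make the use case concrete, here is a minimal CPU sketch of scan-based 
compaction. It is an illustration only, not the actual solver code: the names 
inclusive_scan, compact, cells and near_surface are made up for this example, 
and on the GPU the running sum would come from the scan kernel instead of a 
Python loop.

```python
def inclusive_scan(values):
    """Running sum: out[i] = values[0] + ... + values[i]."""
    out = []
    total = 0
    for v in values:
        total += v
        out.append(total)
    return out


def compact(values, flags):
    """Copy values[i] where flags[i] == 1 into a dense list, keeping order."""
    positions = inclusive_scan(flags)
    count = positions[-1] if positions else 0
    dense = [None] * count
    for i, flag in enumerate(flags):
        if flag:
            # The inclusive scan gives a 1-based slot for every flagged element.
            dense[positions[i] - 1] = values[i]
    return dense


# Hypothetical data: cell indices and a 0/1 flag marking "near the free surface".
cells = [10, 11, 12, 13, 14, 15]
near_surface = [0, 1, 1, 0, 0, 1]

surface_cells = compact(cells, near_surface)
print(surface_cells)  # [11, 12, 15]
```

The last element of the scan is the number of flagged cells, which is why the 
scan has to finish before the small array can be filled.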

Here is the test code, which executes scan for 5 arrays:

import pycuda.autoinit
import pycuda.gpuarray as gpu
import pycuda.scan as scan
import time
from pycuda.compiler import SourceModule
import numpy as np

N = pow(2,15)
NArrays = 5

arrayH = np.zeros(N,dtype=np.int32)

arrayDList = []
for i in range(NArrays):
    arrayDList.append(gpu.to_gpu(arrayH))

krn = scan.InclusiveScanKernel(np.int32,"a+b")

for i in range(NArrays):
    time1 = time.time()
    krn(arrayDList[i])
    time2 = time.time()
    print("time = " + str(time2 - time1))

Output:

time = 0.000386953353882
time = 0.000221967697144
time = 0.000216960906982
time = 0.00021505355835
time = 0.000216007232666

CUDA Profiler output:
...
method=[ scan_scan_intervals ] gputime=[ 16.640 ] cputime=[ 16.000 ] occupancy=[ 0.500 ]
method=[ scan_scan_intervals ] gputime=[ 9.920 ] cputime=[ 5.000 ] occupancy=[ 0.125 ]
method=[ scan_final_update ] gputime=[ 5.408 ] cputime=[ 4.000 ] occupancy=[ 1.000 ]
...

On the GPU the scan takes about 30 microseconds, but the call from Python takes 
about 200 μs. I need to call the scan on every timestep of my code, and 200 μs 
is too slow (the energy equation solver takes about 150 μs). Is there any way 
to reduce the call time of the parallel scan?
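
For reference, the gap can be worked out from the numbers quoted above. This is 
only a back-of-the-envelope sketch: it assumes that the three profiler lines 
shown belong to a single scan call, and it leaves out the first timing as a 
warm-up call. The variable names are made up for this example.

```python
# gputime values (microseconds) of the three kernels in the profiler excerpt.
gpu_times_us = [16.640, 9.920, 5.408]

# Wall-clock times (seconds) printed by the test code, first call left out.
host_times_s = [
    0.000221967697144,
    0.000216960906982,
    0.00021505355835,
    0.000216007232666,
]

gpu_total_us = sum(gpu_times_us)
host_mean_us = 1e6 * sum(host_times_s) / len(host_times_s)
overhead_us = host_mean_us - gpu_total_us

print("GPU time per scan:  %.1f us" % gpu_total_us)
print("host time per call: %.1f us" % host_mean_us)
print("time not on GPU:    %.1f us" % overhead_us)
```

Under these assumptions most of the time per call is spent outside the kernels 
themselves.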
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
