Dear PyCUDA users,

I have been testing the performance of two implementations of the same
kernel function. One of them launches the kernel from Python using PyCUDA,
while the other launches it from a C program. It appears that the PyCUDA
implementation is systematically slower by 20% under a range of different
conditions.

My kernel takes an array of length M, performs a calculation N times on
each element, and sums the N results. It then stores the M sums in an
output array. The 20% difference in speed persists across many
different values of N, holding M fixed. If the difference merely
corresponded to a longer initialization time, then I would expect the
difference to shrink as N increases.
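For reference, here is a CPU sketch in NumPy of what the kernel computes. The per-element calculation is a placeholder (my real one differs), so the function name and the arithmetic inside the loop are illustrative only:

```python
import numpy

def reference_kernel(inp_array, N):
    """CPU sketch: for each of the M input elements, perform a
    calculation N times and sum the N results per element.
    The per-iteration calculation below is a stand-in, not the
    actual kernel arithmetic."""
    M = len(inp_array)
    outp_array = numpy.zeros(M, dtype=numpy.float32)
    for i in range(N):
        # placeholder per-iteration calculation applied to every element
        outp_array += numpy.sqrt(numpy.abs(inp_array)) / (i + 1)
    return outp_array
```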

This is how I am launching the kernel using PyCUDA:

import numpy
import pycuda.autoinit  # initializes the driver and creates a context
import pycuda.compiler
import pycuda.driver as drv

with open(cu_file_path) as cube_file:
    module = pycuda.compiler.SourceModule(cube_file.read(), no_extern_c=True)
kernel_func = module.get_function("my_kernel")
kernel_func(drv.In(inp_array), numpy.int32(arg2), numpy.float32(arg3), ...,
            drv.Out(outp_array),
            # same launch configuration as the C version
            block=(threads_per_block, 1, 1), grid=(blocks_per_grid, 1))


This is how I compile the C implementation:

nvcc -ccbin /usr/bin -I. -I/usr/local/cuda/include -Xptxas -v -arch sm_20 -c
test_kernel.cu -o test_kernel.cu.o
g++ -fPIC -o test_kernel test_kernel.cu.o  -L/usr/local/cuda/lib64 -lcudart

In both cases I am launching the kernel with the same number of threads per
block and blocks per grid.
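Concretely, both versions derive the grid size from M the same way; a sketch of the arithmetic (the value 256 for threads per block is illustrative, not my actual setting):

```python
def launch_config(M, threads_per_block=256):
    """Compute blocks per grid so that M elements are covered,
    rounding up when M is not a multiple of the block size.
    threads_per_block=256 is an illustrative value."""
    blocks_per_grid = (M + threads_per_block - 1) // threads_per_block
    return blocks_per_grid, threads_per_block
```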

Is this the best way of compiling/launching the kernel from PyCUDA?


-Kevin
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda
