Dear PyCUDA users,

I have been testing the performance of two implementations of the same kernel function. One launches the kernel from Python using PyCUDA, while the other launches it from a C program. The PyCUDA implementation appears to be systematically slower by about 20% under a range of different conditions.
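One way to make this check concrete is to fit total runtime against N for both launchers: a fixed per-launch cost shows up as a difference in intercepts, while a cost that scales with the work shows up as a difference in slopes. A minimal sketch with hypothetical timing numbers (the arrays below are placeholders, not real measurements):

```python
import numpy as np

# Hypothetical per-launch timings (ms) for several values of N;
# substitute real measurements from each launcher.
N        = np.array([1000.0, 2000.0, 4000.0, 8000.0])
t_c      = np.array([10.0, 20.0, 40.0, 80.0])   # C launcher, hypothetical
t_pycuda = np.array([12.0, 24.0, 48.0, 96.0])   # PyCUDA launcher, hypothetical

# Fit t(N) = slope*N + intercept for each implementation.
slope_c,  icept_c  = np.polyfit(N, t_c, 1)
slope_py, icept_py = np.polyfit(N, t_pycuda, 1)

# Equal slopes with a larger intercept would point to fixed per-launch
# overhead; a larger slope means the extra cost grows with the work,
# i.e. the kernels themselves are not running at the same speed.
print(slope_py / slope_c)   # ratio of per-iteration cost
print(icept_py - icept_c)   # extra fixed overhead (ms)
```

With these placeholder numbers the intercepts match and only the slope differs, which is the pattern a constant 20% gap across all N would produce.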
My kernel takes in an array of length M, performs a calculation N times on each element, and sums the N results for each element. It then stores the M sums in an output array. The 20% difference in speed persists across many different values of N with M held fixed. If the difference merely corresponded to a longer initialization time, I would expect it to shrink as N increases.

This is how I am launching the kernel using PyCUDA:

    cube_file = open(cu_file_path)
    module = pycuda.compiler.SourceModule(cube_file.read(), no_extern_c=True)
    cube_file.close()
    kernel_func = module.get_function("my_kernel")
    kernel_func(drv.In(inp_array), numpy.int32(arg2), numpy.float32(arg3),
                ..., drv.Out(outp_array))

This is how I compile the C implementation:

    nvcc -ccbin /usr/bin -I. -I/usr/local/cuda/include -Xptxas -v -arch sm_20 \
        -c test_kernel.cu -o test_kernel.cu.o
    g++ -fPIC -o test_kernel test_kernel.cu.o -L/usr/local/cuda/lib64 -lcudart

In both cases I am launching the kernel with the same number of threads per block and blocks per grid. Is this the best way of compiling/launching the kernel from PyCUDA?

-Kevin
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
http://lists.tiker.net/listinfo/pycuda