Hello everybody, I'm quite new to CUDA and PyCUDA.
I need a kernel that creates a matrix (of dimension n x d) out of an array (1 x d) by simply "repeating" the same array n times. For example, with n = 4 and d = 3, if the array is [1 2 3], the result of my kernel should be the 4x3 matrix:

[1 2 3
 1 2 3
 1 2 3
 1 2 3]

Basically, it's the same as doing numpy.tile(array, (n, 1)).

I've written the code below:

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

kernel_code_template = """
__global__ void TileKernel(float *in, float *out)
{
    // Each thread computes one element of out
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.x * blockDim.x + threadIdx.x;

    if (y > %(n)s || x > %(d)s)
        return;

    out[y * %(d)s + x] = in[x];
}
"""

d = 64
n = 512

blockSizex = 16
blockSizey = 16
# round up so the grid covers the whole matrix (integer division)
gridSizex = (d + blockSizex - 1) // blockSizex
gridSizey = (n + blockSizey - 1) // blockSizey

# get the kernel code from the template
kernel_code = kernel_code_template % {
    'd': d,
    'n': n
}
mod = SourceModule(kernel_code)
TileKernel = mod.get_function("TileKernel")

vec_cpu = np.arange(d).astype(np.float32)  # just as an example
vec_gpu = gpuarray.to_gpu(vec_cpu)
out_gpu = gpuarray.empty((n, d), np.float32)

TileKernel.prepare("PP")
TileKernel.prepared_call((gridSizex, gridSizey), (blockSizex, blockSizey, 1),
                         vec_gpu.gpudata, out_gpu.gpudata)

out_cpu = out_gpu.get()

Now, if I run this code with d equal to a power of 2 >= 16, I get the right result (just like numpy.tile(vec_cpu, (n, 1))). But if I set d to anything else (for example 88), every element of the output matrix has the correct value except in the first column: some entries there are right, but others hold a different value (equal to d), and which entries of the first column are wrong changes from run to run.

I really can't figure out where the problem is, but maybe it's just something simple that I'm missing...

Any help will be appreciated, thanks in advance!

Best regards,
Manuele
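A possible explanation for the symptoms above, offered as a sketch rather than a confirmed diagnosis: the guard in the kernel uses `>` where `>=` would be needed, so a thread with x == d is not rejected. Such a thread writes to out[y * d + d], which is the first-column entry of row y + 1, and it reads in[d], one element past the end of the input. It then races with the thread that legitimately owns that entry, which would account for the run-to-run variation. When d is a multiple of the block width (16 here), the grid covers exactly d columns and no thread ever has x == d. The plain-Python simulation below replays only the index arithmetic of the launch on the CPU (no GPU involved); the function name `stray_threads` and its `strict` flag are hypothetical, introduced purely for this illustration.

```python
def stray_threads(n, d, block_x=16, block_y=16, strict=False):
    """List the (y, x) thread coordinates that pass the bounds guard
    even though they lie outside the n x d output matrix.

    strict=False mimics the guard from the post:   y >  n || x >  d
    strict=True  mimics a tightened guard:          y >= n || x >= d
    """
    # Same round-up grid computation as in the host code of the post.
    grid_x = (d + block_x - 1) // block_x
    grid_y = (n + block_y - 1) // block_y

    stray = []
    for y in range(grid_y * block_y):
        for x in range(grid_x * block_x):
            if strict:
                rejected = (y >= n) or (x >= d)
            else:
                rejected = (y > n) or (x > d)
            outside = (y >= n) or (x >= d)
            if outside and not rejected:
                stray.append((y, x))
    return stray


if __name__ == "__main__":
    for d in (64, 88):
        bad = stray_threads(512, d)
        print("d = %d: %d stray threads" % (d, len(bad)))
        for y, x in bad[:3]:
            flat = y * d + x  # flat index the stray thread writes to
            print("  thread (y=%d, x=%d) writes flat index %d "
                  "= row %d, column %d" % (y, x, flat, flat // d, flat % d))
```

If this is indeed the cause, changing the guard in the kernel to `if (y >= %(n)s || x >= %(d)s) return;` should remove the stray writes, since the simulation reports no stray threads with `strict=True`.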
_______________________________________________
PyCUDA mailing list
PyCUDA@tiker.net
https://lists.tiker.net/listinfo/pycuda