The problem is that the results match the CPU only for dim_x and dim_y smaller than 32. For higher dimensions the CPU and GPU results differ.
From: [email protected] To: [email protected]; [email protected] Date: Wed, 18 Jul 2012 18:39:11 +0200 Subject: Re: [PyCUDA] Thread Problem Hi, this is my code that, keep a 3d array, and for each pixel of the matrix find the minimum and put it to the corresponding pixel of a matrix b. Then compare the result with the cpu. Obviously, with grid=(1,1) the max dimensions of the matrix can be 32x32 (1024 threads per block; GTX 580). So, because i want a matrix of (500,500) , i I thought to declare a block=(32,32,1) and grid=(16,16). But it doesn't work. I noted that it work only with square matrix and with dimensions minor of 32. I have this problem only with the 3d array, in fact whit bi-dimensional array i solved it doing: int idx = threadIdx.x + blockIdx.x * blockDim.x; with blocks=(1024,1,1) and grid=(245,1) (for (500,500) matrix) . i tried to do the same thing but it doesn't work... the code : import pycuda.driver as cuda import pycuda.autoinit from pycuda.compiler import SourceModule import numpy import time from pycuda.gpuarray import to_gpu dim_x=32 dim_y=32 dim_z=10 a = numpy.random.randn(dim_z,dim_y,dim_x) a = a.astype(numpy.int32) b=numpy.zeros((dim_x,dim_y),dtype=numpy.int32) dimz=numpy.array([dim_z],dtype=numpy.int32) a_gpu=to_gpu(a) b_gpu=to_gpu(b) dimz_gpu=to_gpu(dimz) mod = SourceModule(""" __global__ void findmin(int *a,int *b,int *dimz_gpu) { int idx = threadIdx.x + blockIdx.x * blockDim.x; //OK int idy = threadIdx.y + blockIdx.y * blockDim.y; //OK int x_width = blockDim.x * gridDim.x; //OK int y_width = blockDim.y * gridDim.y; int flat_id1 = idx + x_width * idy ; int min=4294967296; for(int idz = 0; idz <dimz_gpu[0]; idz++) { int flat_id = idx + x_width * idy + (x_width * y_width) * idz; //OK if(a[flat_id]<min) { min=a[flat_id]; b[flat_id1]=min; } } } """) func = mod.get_function("findmin") func(a_gpu, b_gpu,dimz_gpu,block=(32,32,1),grid=(1,1)) print a_gpu.get() print "b :\n" b=b_gpu.get() print b minimo=100 
b1=numpy.zeros((dim_x,dim_y),dtype=numpy.int32)
for i in range(0,dim_x):
    for j in range(0,dim_y):
        minimo=min(a[:,i,j])
        b1[i,j]=minimo

print "Difference between CPU:\n"
print b1-b

Thanks!
Andrea

> Date: Thu, 12 Jul 2012 10:18:20 +1000
> Subject: Re: [PyCUDA] Thread Problem
> From: [email protected]
> To: [email protected]
>
> Hi Andrea,
>
> Unfortunately, I am not quite familiar with the topic. Probably the
> issue here is incorrect padding, or an incorrect mode of the numpy function
> you are using for comparison — logically I'd expect the mode to be 'wrap',
> not 'reflect'. Moreover, why did you even prefer to take correlate1d()
> as a reference instead of numpy.convolve() or
> scipy.fftpack.convolve()?
>
> Perhaps it will be helpful to look at the sources of the two
> functions above and see how they do the padding. The scipy one is at
> https://github.com/scipy/scipy/blob/master/scipy/fftpack/src/convolve.c
>
> On Thu, Jul 12, 2012 at 2:38 AM, Andrea Cesari <[email protected]>
> wrote:
> > Hi,
> > excuse me if I write to you in private, but it isn't properly a PyCUDA problem.
> > I would like to ask your opinion.
> > To do the equivalent of scipy.ndimage.convolve1d (that is, a linear
> > convolution), I'm trying to do something like this
> > (a and b are the vectors to convolve):
> >
> > A = FFT(a)
> > B = FFT(b) with zero-padding at the end
> > CONV = A .* B
> > conv = INV_FFT(CONV)
> >
> > Is that right? I tried to do this in MATLAB (I have not yet installed
> > pyfft) but the results are different.
> > Is it a mathematical problem?
> >
> > Thanks for your patience,
> > Andrea
> >
> >> Date: Wed, 11 Jul 2012 22:48:25 +1000
> >> Subject: Re: [PyCUDA] Thread Problem
> >> From: [email protected]
> >> To: [email protected]
> >> CC: [email protected]
> >>
> >> Hi Andrea,
> >>
> >> On Wed, Jul 11, 2012 at 10:25 PM, Andrea Cesari
> >> <[email protected]> wrote:
> >> > __global__ void gpu_kernel(int *corrGpu,int *aMod,int *b,int
> >> > *kernelSize_h)
> >> > {
> >> >     int j,step1=kernelSize_h[0]/2; // <---
> >> ...
> >> > """)
> >>
> >> When I remove the /2 where the arrow points, I get results identical to
> >> the CPU version. Are you sure it is necessary there?
> >>
> >> > About your advice: when I do int idx = threadIdx.x+step, doesn't idx
> >> > start from step1? So when j=0, idx-step1+j = 0? Is that wrong?
> >>
> >> Yes, sorry, that was my mistake. Everything is correct in this part.

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
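[Editor's note] Two things in the posted findmin listing look suspicious, independent of the grid configuration: b is allocated as (dim_x, dim_y) while a has shape (dim_z, dim_y, dim_x), so the shapes only agree for square matrices; and `int min=4294967296;` overflows a 32-bit int (2^32 wraps to 0), where INT_MAX would be safer. The kernel's flat index also matches numpy's C-order layout only when x_width equals dim_x and y_width equals dim_y, i.e. when grid*block exactly covers the array. A CPU-side sketch of the same index arithmetic in pure numpy (my own example values, not from the thread) that agrees with a.min(axis=0) once those points are fixed:

```python
import numpy

dim_x, dim_y, dim_z = 48, 32, 10   # deliberately non-square
a = (numpy.random.randn(dim_z, dim_y, dim_x) * 100).astype(numpy.int32)

# Emulate the kernel's flat indexing on the host. x_width/y_width must
# equal dim_x/dim_y (block*grid exactly covering the array); otherwise
# flat_id strides through memory with the wrong pitch.
x_width, y_width = dim_x, dim_y
flat_a = a.ravel()                                   # C order, same layout to_gpu() sends
b = numpy.empty((dim_y, dim_x), dtype=numpy.int32)   # note: (y, x), not (x, y)
for idy in range(dim_y):
    for idx in range(dim_x):
        m = numpy.iinfo(numpy.int32).max             # INT_MAX, not 4294967296
        for idz in range(dim_z):
            flat_id = idx + x_width * idy + (x_width * y_width) * idz
            if flat_a[flat_id] < m:
                m = flat_a[flat_id]
        b[idy, idx] = m

assert (b == a.min(axis=0)).all()
```

With a (500,500) matrix, block=(32,32,1) and grid=(16,16) give x_width = y_width = 512, so a bounds guard such as `if (idx >= dim_x || idy >= dim_y) return;` (with dim_x/dim_y passed to the kernel) would also be needed, since 512 does not equal 500.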
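[Editor's note] On the quoted FFT question: padding only one input yields a circular convolution, which is why the result disagrees with scipy.ndimage.convolve1d. For a linear convolution, both transforms must be taken at length len(a) + len(b) - 1. A minimal numpy check (my own example data, not from the thread):

```python
import numpy

a = numpy.array([1.0, 2.0, 3.0, 4.0])
b = numpy.array([1.0, -1.0, 0.5])

# Linear convolution via FFT: pad BOTH inputs to len(a)+len(b)-1;
# otherwise the inverse FFT returns a circular convolution.
n = len(a) + len(b) - 1
A = numpy.fft.fft(a, n)            # second argument zero-pads to length n
B = numpy.fft.fft(b, n)
conv = numpy.fft.ifft(A * B).real

assert numpy.allclose(conv, numpy.convolve(a, b))
```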
