Hi Bogdan,

Thanks for the quick response. My operating specs are exactly the same as yours, and when I run your test I get an error of ~3e-7. But I think that number may have to do with dividing by the norm of the convolution in the expression in the last line of your test:

print numpy.linalg.norm(corr_cpu - corr_gpu) / numpy.linalg.norm(corr_gpu)

If the GPU convolution norm is high and the difference between the CPU and GPU values of the convolution is relatively low, you would get a low value for the division. I think you want to show that norm(corr_cpu - corr_gpu) is approximately zero for two matrices corr_cpu and corr_gpu with nearly identical entries. If I replace your statement with

print numpy.linalg.norm(corr_cpu - corr_gpu)
print numpy.linalg.norm(corr_gpu)

I get

69734.4856715
1.94368e+11

The ratio of those two numbers is ~3e-7, but the norm of the difference between the two convolutions (CPU vs. GPU) is itself high (~69734). Is there something I am missing?

Thanks,
Saigopal

On Mon, Jan 17, 2011 at 10:36 PM, Bogdan Opanchuk <manti...@gmail.com> wrote:
> Hi Saigopal,
>
> What pyfft version do you use? Can you please post the full testing
> code which can be executed to reproduce the bug? Because the code I
> composed (basically, added comparison with CPU to your code) works
> normally on my desktop with Tesla C2050 (Ubuntu 10.04 x64, Cuda 3.2,
> PyCuda 0.94.2, pyfft 0.3.4), giving the relative error of ~1e-7:
>
> # ----------
> import numpy
> from pyfft.cuda import Plan
> import pycuda.autoinit
> import pycuda.gpuarray as gpuarray
>
> # w, h, k are the array dimensions (a power of 2)
> # im1, im2 are the input 3d arrays of dtype complex64
>
> w = h = k = 512
> im1 = numpy.random.rand(w, h, k).astype(numpy.complex64)
> im2 = numpy.random.rand(w, h, k).astype(numpy.complex64)
>
> plan = Plan((w, h, k), normalize=True)
>
> # forward transform on device
> im1_gpu = gpuarray.to_gpu(im1)
> plan.execute(im1_gpu)
> im1_ft = im1_gpu.get()
> del im1_gpu
>
> im2_gpu = gpuarray.to_gpu(im2)
> plan.execute(im2_gpu)
> im2_ft = im2_gpu.get()
> del im2_gpu
>
> # do multiplication on host - can be done on device.
> conv = im1_ft * im2_ft
>
> # inverse transform on device
> conv_gpu = gpuarray.to_gpu(conv)
> del conv
> plan.execute(conv_gpu, inverse=True)
> corr_gpu = conv_gpu.get()
>
> # Reference calculation on CPU:
> im1_ft = numpy.fft.fftn(im1)
> im2_ft = numpy.fft.fftn(im2)
> conv = im1_ft * im2_ft
> del im1
> del im2
> del im1_ft
> del im2_ft
> corr_cpu = numpy.fft.ifftn(conv)
>
> print numpy.linalg.norm(corr_cpu - corr_gpu) / numpy.linalg.norm(corr_gpu)
> # ----------
>
> (I inserted all these deletions because otherwise numpy threw
> MemoryError, despite the fact that the desktop has 12GB RAM.)
>
> Best regards,
> Bogdan
>
> P.S. There is a harmless typo in your code: a string instead of a boolean
> value in "inverse='True'".
>
> On Tue, Jan 18, 2011 at 1:21 PM, Saigopal Nelaturi <saigo...@gmail.com> wrote:
> > Hello all,
> >
> > I am implementing a simple 3d convolution on the gpu using pyfft. The
> > basic idea is straightforward - obtain the 3d Fourier transform for each
> > array, multiply and take the inverse transform of the product. I am
> > using pyfft for the implementation. The code below works correctly when
> > my input array is 256^3 but fails (executes but gives garbage results)
> > for a 512^3 voxel grid.
> >
> > # w,h,k are the array dimensions in a power of 2
> > # im1, im2 are the input 3d arrays of dtype complex64
> >
> > plan = Plan((w,h,k), normalize=True)
> >
> > # forward transform on device
> > im1_gpu = gpuarray.to_gpu(im1)
> > plan.execute(im1_gpu)
> > im1_ft = im1_gpu.get()
> > del im1_gpu
> >
> > im2_gpu = gpuarray.to_gpu(im2)
> > plan.execute(im2_gpu)
> > im2_ft = im2_gpu.get()
> > del im2_gpu
> >
> > # do multiplication on host - can be done on device.
> > conv = im1_ft * im2_ft
> >
> > # inverse transform on device
> > conv_gpu = gpuarray.to_gpu(conv)
> > plan.execute(conv_gpu, inverse='True')
> > corr = conv_gpu.get()
> >
> > I don't think there's anything wrong with the code as such (it works for smaller
> > array sizes), but I am perplexed as to why the failure occurs. I
> > am running the code on a Tesla C2050 (2.8GB available memory), and so
> > there's enough space to hold the 512^3 array with complex64 dtype. Does
> > anyone have an explanation?
> >
> > -Saigopal
> >
> > _______________________________________________
> > PyCUDA mailing list
> > PyCUDA@tiker.net
> > http://lists.tiker.net/listinfo/pycuda
>
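For what it's worth, a relative error of ~3e-7 is about what single precision can deliver, and relative (not absolute) error is the usual metric: the absolute error norm grows in proportion to the norm of the data, while the relative error stays near float32 machine epsilon (~1.2e-7). A small CPU-only sketch illustrating the point (an assumption for illustration: plain rounding of random doubles to float32 stands in for the single-precision GPU pipeline; Python 3 syntax):

```python
import numpy

eps32 = numpy.finfo(numpy.float32).eps  # ~1.19e-7, float32 machine epsilon

x = numpy.random.rand(100000)
x32 = x.astype(numpy.float32)  # round to single precision

abs_err = numpy.linalg.norm(x32 - x)
rel_err = abs_err / numpy.linalg.norm(x)

# Scale the data up by 1e6: the absolute error norm grows by roughly the
# same factor, but the relative error is unchanged and stays below eps32.
y = x * 1e6
y32 = y.astype(numpy.float32)
abs_err_big = numpy.linalg.norm(y32 - y)
rel_err_big = abs_err_big / numpy.linalg.norm(y)

print(abs_err, rel_err)
print(abs_err_big, rel_err_big)
```

This is exactly the situation in the test above: norm(corr_gpu) is ~1.9e11, so an absolute difference of ~7e4 is still only a few parts in 1e7, i.e. consistent with complex64 arithmetic rather than a sign of a wrong result.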
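On the "harmless typo" point: it is harmless only by accident. Any non-empty string is truthy in Python, so if the flag is simply tested for truth, inverse='True' behaves like inverse=True -- but so would inverse='False'. A quick sketch (the hypothetical `execute` function and its `if inverse:` check are assumptions about how such a keyword flag is commonly tested, not pyfft's actual source):

```python
# Non-empty strings are truthy, so passing 'True' (or even 'False')
# where a boolean flag is expected still takes the "true" branch.
def execute(data, inverse=False):
    # hypothetical truth-value check on the flag
    return "inverse" if inverse else "forward"

print(execute(None))                     # forward
print(execute(None, inverse=True))       # inverse
print(execute(None, inverse='True'))     # inverse (non-empty string is truthy)
print(execute(None, inverse='False'))    # inverse (still truthy!)
print(bool(''), bool('True'), bool('False'))  # False True True
```

So the string variant happens to work here, but a literal boolean is the safe spelling.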
_______________________________________________ PyCUDA mailing list PyCUDA@tiker.net http://lists.tiker.net/listinfo/pycuda