Bogdan, thanks for your earlier deconstruction of my problem with batched 1D FFT---it's such a simple explanation, and I can't remember if I considered that or not---DOH!
Jayanth, quick note: the multidimensional DFT (and thus FFT) is defined
in terms of the 1D version. The following example demonstrates how a 2D
FFT can be computed with two passes of 1D FFTs:

```
import numpy as np
import numpy.fft as fft

x = np.random.randn(100, 100) + 1j * np.random.randn(100, 100)
x2 = fft.fft2(x)
x11 = fft.fft(fft.fft(x, axis=0), axis=1)
np.allclose(x2, x11)
```

The allclose() will return True.

On Mon, Dec 9, 2013 at 2:55 AM, Jayanth Channagiri <[email protected]> wrote:

> Hello Bogdan
>
> Thank you very much for some interesting ideas.
> The fact that you can run 8192 x 8192 on your C2050 clearly suggests
> that I was limited by my Quadro 2000.
> I had a look at Reikna and it is indeed helpful.
>
> And Ahmed,
> I realised that creating a 2D array and running two separate sequential
> 1D FFTs over it, one horizontal and the other vertical, does not yield
> the same result. Clearly 1D FFT and 2D FFT are different.
> They have done the same in
> http://wiki.tiker.net/PyCuda/Examples/2DFFT
> It is not a 2D FFT but a 1D FFT for each row, reshaped back to 2D.
>
> For my problem, I need to compute a 3D FFT for an array of roughly
> 1024 x 4096 x 4096 using parallel computing with PyCUDA.
> Is it necessary to write a kernel in C, or can I proceed the way I
> showed in my previous mail? With my program, I readily see a 10x
> speedup compared to numpy's FFT, but my GPU is unable to handle huge
> data.
> It would be really helpful if anyone could suggest any
> documentation/blogs/videos etc. regarding this.
> Thank you all.
> Have a good day
>
> Jayanth
>
>
> > Date: Fri, 6 Dec 2013 16:47:17 +1100
> > Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]; [email protected]
> >
> > Hi Jayanth,
> >
> > I can run an 8192x8192 transform on a Tesla C2050 without problems.
> > I think you are limited by the available video memory; see my previous
> > message in this thread --- an 8192x4096 complex64 buffer takes roughly
> > 250 MB, and you have to factor in the temporary buffers PyFFT creates.
> >
> > By the way, I would recommend switching from PyFFT to Reikna
> > (http://reikna.publicfields.net). PyFFT is not supported anymore, and
> > Reikna includes its code along with some additional features and
> > optimizations (a more robust block/grid size finder, temporary array
> > management, launch optimizations and so on). Your code would look
> > like:
> >
> > import numpy
> > import reikna.cluda as cluda
> > from reikna.fft import FFT
> >
> > api = cluda.cuda_api()
> > thr = api.Thread.create()
> >
> > # Or, if you want to use an external stream:
> > #
> > # cuda.init()
> > # context = make_default_context()
> > # stream = cuda.Stream()
> > # thr = api.Thread(stream)
> >
> > data = numpy.ones((4096, 4096), dtype=numpy.complex64)
> > gpu_data = thr.to_device(data)  # converting to gpu array
> >
> > fft = FFT(data).compile(thr)
> > fft(gpu_data, gpu_data)
> > result = gpu_data.get()
> >
> > print result
> >
> >
> > On Fri, Dec 6, 2013 at 3:43 PM, Jayanth Channagiri
> > <[email protected]> wrote:
> > > Dear Ahmed
> > >
> > > Thank you for the resourceful reply.
> > >
> > > But the CUFFT limit is ~2^27, and the benchmarks on the CUFFT site
> > > reach up to 2^25. In my case, I am able to reach only up to 2^24.
> > > In some way, I am missing another factor. Is this limited by my
> > > GPU's memory?
> > > Also, in the same table, you can see that the "Maximum width and
> > > height for a 2D texture reference bound to a CUDA array" is
> > > 65000 x 65000, which is way higher than mine. My GPU has compute
> > > capability 2.x.
> > > Thank you for the idea of performing two separate sequential 1D FFTs.
> > > I will shed more light on it. The thing is, my problem doesn't stop
> > > at 2D.
> > > My goal is to perform a 3D FFT and I am not sure if I can calculate
> > > it that way.
> > >
> > > For others on the list, here is the complete traceback of the error
> > > message:
> > >
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in <module>
> > >   File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 493, in runfile
> > >     execfile(filename, namespace)
> > >   File "/home/jayanth/Dropbox/fft/fft1d_AB.py", line 99, in <module>
> > >     plan.execute(gpu_data)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py", line 271, in _executeInterleaved
> > >     batch, data_in, data_out)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py", line 192, in _execute
> > >     self._tempmemobj = self._context.allocate(buffer_size * 2)
> > > pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
> > >
> > > Also, here is the simple program with which I was calculating the
> > > FFT using pyfft:
> > >
> > > from pyfft.cuda import Plan
> > > import numpy
> > > import pycuda.driver as cuda
> > > from pycuda.tools import make_default_context
> > > import pycuda.gpuarray as gpuarray
> > >
> > > cuda.init()
> > > context = make_default_context()
> > > stream = cuda.Stream()
> > >
> > > plan = Plan((4096, 4096), stream=stream)  # creating the plan
> > > # data of just ones, to calculate the single-precision FFT
> > > data = numpy.ones((4096, 4096), dtype=numpy.complex64)
> > > gpu_data = gpuarray.to_gpu(data)  # converting to gpu array
> > > plan.execute(gpu_data)  # calculating the fft
> > > result = gpu_data.get()  # the result
> > >
> > > This is just a simple program to calculate the FFT for a 4096 x 4096
> > > array in 2D. It works well for this array or a smaller one.
> > > As soon as I increase it to higher values like 8192x8192 or
> > > 8192x4096 or anything larger, it gives an error message saying out
> > > of memory.
> > > So I wanted to know the reason behind it and how to overcome it.
> > > You can execute the same code and kindly let me know if you have the
> > > same limits on your respective GPUs.
> > >
> > > Thank you
> > >
> > >
> > > ________________________________
> > > Date: Thu, 5 Dec 2013 20:27:45 -0500
> > > Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
> > > From: [email protected]
> > > To: [email protected]
> > > CC: [email protected]
> > >
> > > I ran into a similar issue:
> > > http://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and-batches-for-fft-with-scikits-cuda
> > >
> > > The long and short of it is that CUFFT seems to have a limit of
> > > approximately 2^27 elements that it can operate on, in any
> > > combination of dimensions. In the StackOverflow post above, I was
> > > trying to make a plan for large batches of the same 1D FFTs and hit
> > > this limitation. You'll also notice that the benchmarks on the CUFFT
> > > site https://developer.nvidia.com/cuFFT go up to sizes of 2^25.
> > >
> > > I hypothesize that this is related to the 2^27 "Maximum width for a
> > > 1D texture reference bound to linear memory" limit that we see in
> > > Table 12 of the CUDA C Programming Guide
> > > http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities
> > >
> > > So since 4096**2 is 2^24, increasing to 8192 by 8192 gets very close
> > > to the limit, even though you'd think 2D FFTs would not be governed
> > > by the same limits as 1D FFT batches.
> > >
> > > You should be able to achieve 8192 by 8192 and larger 2D FFTs by
> > > performing two separate sequential 1D FFTs, one horizontal and the
> > > other vertical.
> > > The runtimes should nominally be the same (they are for CPU FFTs),
> > > and the answer will be the same, up to machine precision.
> > >
> > >
> > > On Thu, Dec 5, 2013 at 9:53 AM, Jayanth Channagiri
> > > <[email protected]> wrote:
> > >
> > > Hello
> > >
> > > I have an NVIDIA Quadro 2000 GPU. It has 192 CUDA cores and 1 GB of
> > > GDDR5 memory.
> > >
> > > I am trying to calculate FFTs on the GPU using pyfft.
> > > I am able to calculate the FFT only up to arrays of at most
> > > 4096 x 4096.
> > >
> > > As soon as I increase the array size, it gives an error message
> > > saying:
> > > pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
> > >
> > > Can anyone please tell me if this error means that my GPU is not
> > > sufficient to calculate this array? Or is it my computer's memory?
> > > Or a programming error? What is the maximum array size you can
> > > achieve with a GPU?
> > > Is there any information on how else I can calculate such huge
> > > arrays?
> > >
> > > Thank you very much in advance for the help, and sorry if it is too
> > > preliminary a question.
> > >
> > > Jayanth
> > >
> > >
> > > _______________________________________________
> > > PyCUDA mailing list
> > > [email protected]
> > > http://lists.tiker.net/listinfo/pycuda
> >
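Following up on the 2D decomposition discussed in this thread: the same factorization extends to 3D, so the full 3D FFT Jayanth asks about is just three passes of batched 1D FFTs, one along each axis. A minimal NumPy sketch (CPU-only, with small sizes chosen purely for illustration):

```python
import numpy as np
import numpy.fft as fft

# Small 3D array for illustration; the principle is size-independent
x = np.random.randn(8, 16, 32) + 1j * np.random.randn(8, 16, 32)

# Full 3D FFT in one call
x3 = fft.fftn(x)

# The same transform as three passes of 1D FFTs, one per axis;
# the order of the axes does not matter
x111 = fft.fft(fft.fft(fft.fft(x, axis=0), axis=1), axis=2)

print(np.allclose(x3, x111))  # True, up to machine precision
```

On a GPU the same idea applies: each pass is a batched 1D transform, so an array too large for a single plan can in principle be processed axis by axis, at the cost of transposes or strided batches between passes.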
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
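A back-of-the-envelope check of the memory numbers in this thread. The helper below is hypothetical (it is not part of PyFFT or Reikna); the scratch-space factor of 2 is an assumption suggested by the `allocate(buffer_size * 2)` line in the traceback above:

```python
import numpy as np

def fft_memory_mb(shape, dtype=np.complex64, temp_factor=2):
    """Rough device-memory estimate for an in-place FFT: the data
    buffer plus temp_factor times its size in temporaries.  The
    default factor of 2 is an assumption, not a documented figure."""
    buffer_bytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
    return buffer_bytes * (1 + temp_factor) / 1024 ** 2

print(fft_memory_mb((4096, 4096)))  # 384.0 MB -- fits in 1 GB
print(fft_memory_mb((8192, 8192)))  # 1536.0 MB -- exceeds 1 GB of video memory
```

Under this estimate a 4096x4096 complex64 transform needs about 384 MB while 8192x8192 needs about 1.5 GB, which is consistent with a 1 GB Quadro 2000 failing exactly where Jayanth reports.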
