Bogdan, thanks for your earlier deconstruction of my problem with batched
1D FFT---it's such a simple explanation, and I can't remember if I
considered that or not---DOH!

Jayanth, quick note: the multidimensional DFT (and thus FFT) is defined in
terms of the 1D version. The following example demonstrates how FFT2 can be
achieved with two FFT1s:

```
import numpy as np
import numpy.fft as fft

x = np.random.randn(100,100) + 1j * np.random.randn(100,100)
x2 = fft.fft2(x)
x11 = fft.fft(fft.fft(x, axis=0), axis=1)

np.allclose(x2, x11)
```
The allclose() call will return True.
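The same construction extends to 3D, which is relevant to your 3D goal below: fftn likewise factors into a 1D FFT along each axis in turn. A quick sketch (array sizes chosen small purely for illustration):

```
import numpy as np
import numpy.fft as fft

# Small 3D array for illustration; the identity holds for any shape.
x = np.random.randn(16, 16, 16) + 1j * np.random.randn(16, 16, 16)

# Full 3D FFT in one call...
x3 = fft.fftn(x)

# ...versus three sequential 1D FFTs, one along each axis.
x111 = x
for axis in range(x.ndim):
    x111 = fft.fft(x111, axis=axis)

print(np.allclose(x3, x111))  # True
```

So a 3D transform can be built from batched 1D passes that each fit on the GPU. Keep in mind, though, that a 1024 x 4096 x 4096 complex64 array is 2^37 bytes = 128 GiB, so at that size the data has to be streamed through the GPU in slabs no matter how the FFT is factored.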



On Mon, Dec 9, 2013 at 2:55 AM, Jayanth Channagiri
<[email protected]> wrote:

> Hello Bogdan
>
> Thank you very much for some interesting ideas.
> The fact that you can run 8192 x 8192 on your C2050 clearly suggests that
> it was a limitation of my Quadro 2000.
> I had a look at Reikna and it is indeed helpful.
>
> And Ahmed,
> I realised that creating a 2d array and making it into two separate
> sequential 1D FFTs, one horizontal and the other vertical, does not yield
> the same result. Clearly 1D FFT and 2D FFT are different.
> They have done the same in
> http://wiki.tiker.net/PyCuda/Examples/2DFFT. It is not 2D FFT but 1D FFT
> for each row and then reshaping it back to 2D. The result is not 2D FFT.
>
>
>
> For my problem, I need to compute a 3D FFT for an array of size 1024 x
> 4096 x 4096 using parallel computing with PyCUDA.
> Is it necessary to write a kernel in C, or can I proceed the way I
> described in my previous mail? With my program, I readily see a 10x
> speedup compared to numpy fft, but my GPU is unable to handle huge data.
> It would be really helpful if anyone could suggest
> documentation/blogs/videos etc. regarding it.
> Thank you all.
> Have a good day
>
> Jayanth
>
>
>
>
>
> > Date: Fri, 6 Dec 2013 16:47:17 +1100
>
> > Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
> > From: [email protected]
> > To: [email protected]
> > CC: [email protected]; [email protected]
>
> >
> > Hi Jayanth,
> >
> > I can run an 8192x8192 transform on a Tesla C2050 without problems. I
> > think you are limited by the available video memory; see my previous
> > message in this thread --- an 8192x4096 complex64 buffer takes 256 MB,
> > and you have to factor in the temporary buffers PyFFT creates.
> >
> > By the way, I would recommend you to switch from PyFFT to Reikna
> > (http://reikna.publicfields.net). PyFFT is not supported anymore, and
> > Reikna includes its code along with some additional features and
> > optimizations (more robust block/grid size finder, temporary array
> > management, launch optimizations and so on). Your code would look
> > like:
> >
> > import numpy
> > import reikna.cluda as cluda
> > from reikna.fft import FFT
> >
> > api = cluda.cuda_api()
> > thr = api.Thread.create()
> >
> > # Or, if you want to use an external stream,
> > #
> > # cuda.init()
> > # context = make_default_context()
> > # stream = cuda.Stream()
> > # thr = api.Thread(stream)
> >
> > data = numpy.ones((4096, 4096), dtype = numpy.complex64)
> > gpu_data = thr.to_device(data) #converting to gpu array
> >
> > fft = FFT(data).compile(thr)
> > fft(gpu_data, gpu_data)
> > result = gpu_data.get()
> >
> > print result
> >
> >
> > On Fri, Dec 6, 2013 at 3:43 PM, Jayanth Channagiri
> > <[email protected]> wrote:
> > > Dear Ahmed
> > >
> > > Thank you for the resourceful reply.
> > >
> > > But the CUFFT limit is ~2^27, and the benchmarks on the CUFFT site
> > > reach only up to 2^25. In my case, I am able to reach only up to 2^24.
> > > In some way, I am missing another factor. Is this limited by my GPU's
> > > memory?
> > > And also, in the same table, you can see that the "Maximum width and
> > > height for a 2D texture reference bound to a CUDA array" is
> > > 65000 x 65000, which is way too high compared to mine. My GPU has
> > > compute capability 2.x.
> > > Thank you for the idea of performing two separate sequential 1D FFTs.
> > > I will shed more light on it. The thing is, my problem doesn't stop at
> > > 2D. My goal is to perform a 3D FFT, and I am not sure if I can
> > > calculate it that way.
> > >
> > >
> > > For others on the list, here is the complete traceback of the
> > > error message:
> > > Traceback (most recent call last):
> > >   File "<stdin>", line 1, in <module>
> > >   File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py",
> > >     line 493, in runfile
> > >     execfile(filename, namespace)
> > >   File "/home/jayanth/Dropbox/fft/fft1d_AB.py", line 99, in <module>
> > >     plan.execute(gpu_data)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py",
> > >     line 271, in _executeInterleaved
> > >     batch, data_in, data_out)
> > >   File "/usr/local/lib/python2.7/dist-packages/pyfft-0.3.8-py2.7.egg/pyfft/plan.py",
> > >     line 192, in _execute
> > >     self._tempmemobj = self._context.allocate(buffer_size * 2)
> > > pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
> > >
> > > Also, here is the simple program I was referring to, which calculates
> > > the FFT using pyfft:
> > > from pyfft.cuda import Plan
> > > import numpy
> > > import pycuda.driver as cuda
> > > from pycuda.tools import make_default_context
> > > import pycuda.gpuarray as gpuarray
> > >
> > > cuda.init()
> > > context = make_default_context()
> > > stream = cuda.Stream()
> > >
> > > plan = Plan((4096, 4096), stream=stream) # creating the plan
> > > data = numpy.ones((4096, 4096), dtype=numpy.complex64) # data of just
> > > # ones, single precision
> > > gpu_data = gpuarray.to_gpu(data) # converting to gpu array
> > > plan.execute(gpu_data) # calculating the FFT
> > > result = gpu_data.get() # the result
> > >
> > > This is just a simple program to calculate the FFT for a 4096 x 4096
> > > array in 2D. It works well for this array or a smaller one. As soon as
> > > I increase it to larger values like 8192x8192 or 8192x4096 or anything
> > > bigger, it gives an error message saying out of memory.
> > > So I wanted to know the reason behind it and how to overcome it.
> > > You can execute the same code and kindly let me know if you have the
> > > same limits on your respective GPUs.
> > >
> > > Thank you
> > >
> > >
> > >
> > > ________________________________
> > > Date: Thu, 5 Dec 2013 20:27:45 -0500
> > > Subject: Re: [PyCUDA] cuMemAlloc failed: out of memory
> > > From: [email protected]
> > > To: [email protected]
> > > CC: [email protected]
> > >
> > >
> > > I ran into a similar issue:
> > > http://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and-batches-for-fft-with-scikits-cuda
> > >
> > > The long and short of it is that CUFFT seems to have a limit of
> > > approximately 2^27 elements that it can operate on, in any combination
> > > of dimensions. In the StackOverflow post above, I was trying to make a
> > > plan for large batches of the same 1D FFTs and hit this limitation.
> > > You'll also notice that the benchmarks on the CUFFT site
> > > https://developer.nvidia.com/cuFFT go up to sizes of 2^25.
> > >
> > > I hypothesize that this is related to the 2^27 "Maximum width for a 1D
> > > texture reference bound to linear memory" limit that we see in Table
> > > 12 of the CUDA C Programming Guide
> > > http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities.
> > >
> > > So since 4096**2 is 2^24, increasing to 8192 by 8192 gets very close
> > > to the limit, even though you'd think 2D FFTs would not be governed by
> > > the same limits as 1D FFT batches.
> > >
> > > You should be able to achieve 8192 by 8192 and larger 2D FFTs by
> > > performing two separate sequential 1D FFTs, one horizontal and the
> > > other vertical. The runtimes should nominally be the same (they are
> > > for CPU FFTs), and the answer will be the same, up to machine
> > > precision.
> > >
> > >
> > > On Thu, Dec 5, 2013 at 9:53 AM, Jayanth Channagiri
> > > <[email protected]> wrote:
> > >
> > > Hello
> > >
> > > I have an NVIDIA Quadro 2000 GPU. It has 192 CUDA cores and 1 GB
> > > GDDR5 memory.
> > >
> > > I am trying to calculate the FFT on the GPU using pyfft.
> > > I am able to calculate the FFT only up to an array of at most 4096 x
> > > 4096.
> > >
> > > But as soon as I increase the array size, it gives an error message
> > > saying:
> > > pycuda._driver.MemoryError: cuMemAlloc failed: out of memory
> > >
> > > Can anyone please tell me if this error means that my GPU is not
> > > sufficient to calculate this array? Or is it my computer's memory? Or
> > > a programming error? What is the maximum array size you can achieve
> > > with a GPU?
> > > Is there any information on how else I can calculate the huge arrays?
> > >
> > > Thank you very much in advance for the help, and sorry if it is too
> > > preliminary a question.
> > >
> > > Jayanth
> > >
> > >
> > >
> > >
> > >
> > > _______________________________________________
> > > PyCUDA mailing list
> > > [email protected]
> > > http://lists.tiker.net/listinfo/pycuda
> > >
> > >
> > >
> > >
>