Hi Ahmed,

On Fri, Dec 6, 2013 at 12:27 PM, Ahmed Fasih wrote:
> I ran into a similar issue:
> http://stackoverflow.com/questions/13187443/nvidia-cufft-limit-on-sizes-and-batches-for-fft-with-scikits-cuda

A batch of 10000 64x1024 complex64 arrays amounts to about 5 GB of data,
which wouldn't fit in the 2.5 GB of memory on a C2050 anyway :) Even a C2070
with its 6 GB probably couldn't handle it, since the transform would need at
least one temporary intermediate array of the same size.
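
The arithmetic, written out as a rough Python estimate. It assumes 8 bytes
per complex64 element and counts only the arrays themselves, ignoring any
workspace CUFFT allocates for its plan; `array_bytes` is a hypothetical
helper for illustration only.

```python
# Back-of-the-envelope device memory estimate for a batched complex64 FFT.
# complex64 = two float32 values = 8 bytes per element.
COMPLEX64_BYTES = 8
GB = 10**9

def array_bytes(shape, batch=1, itemsize=COMPLEX64_BYTES):
    """Bytes needed to hold `batch` arrays of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * batch * itemsize

data = array_bytes((64, 1024), batch=10000)
print(f"input alone: {data / GB:.2f} GB")           # input alone: 5.24 GB
# One output (or temporary) array of the same size doubles the requirement.
print(f"input + one copy: {2 * data / GB:.2f} GB")  # input + one copy: 10.49 GB
```

Both figures are above the 2.5 GB of the C2050, and the second is above
the 6 GB of the C2070 as well.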

> I hypothesize that this is related to the 2^27 "Maximum width for a 1D
> texture reference bound to linear memory" limit that we see in Table 12 of
> the CUDA C Programming Guide
> http://docs.nvidia.com/cuda/cuda-c-programming-guide/#compute-capabilities.

I doubt that CUFFT uses textures internally; I don't see any advantage
they would have over ordinary global memory. I would guess it has more to
do with grid size limits or with the size of the integer variables used
internally for indexing.
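
A quick check of which thresholds the failing problem size actually
crosses. This is only arithmetic on the two hypotheses (the 2^27 texture
width from the CUDA guide, and a signed 32-bit index); it says nothing
about what CUFFT really does internally, and the constant names are made up
for the example.

```python
# Total element count of the batched transform, compared with the two
# limits suggested as possible culprits. Neither is confirmed by this check.
TEXTURE_1D_LINEAR_MAX = 2**27   # max 1D texture width bound to linear memory
INT32_MAX = 2**31 - 1           # largest value of a signed 32-bit integer

def total_elements(shape, batch):
    """Number of complex elements across the whole batch."""
    n = batch
    for dim in shape:
        n *= dim
    return n

n = total_elements((64, 1024), batch=10000)
print(n)                          # 655360000
print(n > TEXTURE_1D_LINEAR_MAX)  # True
print(n > INT32_MAX)              # False
# A byte offset (8 bytes per complex64 element) would overflow int32, though.
print(n * 8 > INT32_MAX)          # True
```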

Also, I don't think that's what is happening in Jayanth's case; for him it's
probably just a lack of free global memory. An 8192x8192 complex64 array
takes 512 MiB; add an output array and one or two temporary ones, and you
can easily exceed what your video card has available.
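
The same estimate for the 8192x8192 case, assuming one input, one output
and up to two temporaries of equal size. The real number of temporaries
depends on the CUFFT plan and is not known here; `fft_footprint_bytes` is a
hypothetical helper.

```python
# Device memory for several same-sized complex64 arrays of shape 8192x8192.
COMPLEX64_BYTES = 8
MiB = 2**20

def fft_footprint_bytes(n_rows, n_cols, n_arrays):
    """Bytes for n_arrays complex64 arrays of shape (n_rows, n_cols)."""
    return n_rows * n_cols * COMPLEX64_BYTES * n_arrays

# input; + output; + one temporary; + two temporaries
for n_arrays in (1, 2, 3, 4):
    mib = fft_footprint_bytes(8192, 8192, n_arrays) / MiB
    print(f"{n_arrays} array(s): {mib:.0f} MiB")  # 512, 1024, 1536, 2048 MiB
```

On an actual card, the numbers to compare against are the free and total
memory reported by `pycuda.driver.mem_get_info()` once a context exists;
the display and other processes may already be using part of it.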

> You should be able to achieve 8096 by 8096 and larger 2D FFTs by performing
> two separate sequential 1D FFTs, one horizontal and the other vertical. The
> runtimes should nominally be the same (they are for CPU FFTs), and the
> answer will be the same, up to machine precision.

Isn't that how multidimensional FFTs are usually implemented anyway?
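
A CPU-only sketch of that row-column decomposition in plain Python, with no
PyCUDA or scikits.cuda calls. A naive O(n^2) DFT stands in for a real FFT,
and the small 3x4 test array is arbitrary; the point is only that 1D
transforms along the rows followed by 1D transforms along the columns give
the same result as the 2D DFT computed from its definition.

```python
import cmath

def dft(x):
    """Naive 1D DFT, same sign convention as numpy.fft.fft."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
            for u in range(n)]

def fft2_row_column(a):
    """2D DFT as two passes of 1D DFTs: along rows, then along columns."""
    rows = [dft(row) for row in a]                 # horizontal pass
    cols = [dft(list(col)) for col in zip(*rows)]  # vertical pass (on transpose)
    return [list(r) for r in zip(*cols)]           # transpose back to (m, n)

def dft2_direct(a):
    """2D DFT straight from the definition, for comparison."""
    m, n = len(a), len(a[0])
    return [[sum(a[p][q] * cmath.exp(-2j * cmath.pi * (u * p / m + v * q / n))
                 for p in range(m) for q in range(n))
             for v in range(n)]
            for u in range(m)]

a = [[complex(3 * p + q, p - q) for q in range(4)] for p in range(3)]
r, d = fft2_row_column(a), dft2_direct(a)
# Difference should be at the level of floating-point round-off.
print(max(abs(r[u][v] - d[u][v]) for u in range(3) for v in range(4)))
```

On the GPU each pass would be a batched 1D transform, and the vertical pass
needs either a transpose or strided access; that part is not shown here.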

_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda
