On Tue, 26 Nov 2013
"Dinesh Vadhia" wrote:

> Probably a loaded question but is there a significant performance difference 
> between using MKL (or OpenBLAS) on multi-core cpu's and cuBLAS on gpu's.  
> Does anyone have recent experience or link to an independent benchmark?

Using Numpy (Xeon 5520 2.2GHz):

In [1]: import numpy
In [2]: shape = (450,450,450)
In [3]: start=numpy.random.random(shape).astype("complex128")
In [4]: %timeit result = numpy.fft.fftn(start)
1 loops, best of 3: 10.2 s per loop
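
For anyone who wants to repeat the NumPy measurement outside IPython, here is a minimal sketch using the standard timeit module. The helper name time_fftn and the reduced default shape are my own assumptions (chosen to keep memory use small), so the absolute numbers will not match the 450x450x450 run above.

```python
import timeit

import numpy


def time_fftn(shape=(64, 64, 64), repeat=3):
    """Return the best wall-clock time in seconds of numpy.fft.fftn.

    `shape` defaults to a small cube (an assumption for quick runs);
    pass (450, 450, 450) to mirror the benchmark, which needs roughly
    1.5 GB for the input array alone.
    """
    start = numpy.random.random(shape).astype("complex128")
    timer = timeit.Timer(lambda: numpy.fft.fftn(start))
    # One call per measurement, keep the fastest of `repeat` measurements,
    # which is what "best of 3" reports in %timeit.
    return min(timer.repeat(repeat=repeat, number=1))


if __name__ == "__main__":
    print("best of 3: %.3f s per loop" % time_fftn())
```

Taking the minimum rather than the mean follows the "best of 3" convention of %timeit and is less sensitive to other load on the machine.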

Using FFTW (8 threads, 2x quad cores):

In [5]: import fftw3
In [7]: result = numpy.empty_like(start)
In [8]: fft = fftw3.Plan(start, result, direction='forward', flags=['measure'], 
In [9]: %timeit fft()
1 loops, best of 3: 887 ms per loop

Using cuFFT (GeForce Titan):
1) With 2 transfers:
In [10]: import pycuda,pycuda.gpuarray as gpuarray,scikits.cuda.fft as 
In [11]: cuplan = cu_fft.Plan(start.shape, numpy.complex128, numpy.complex128)
In [12]: d_result = gpuarray.empty(start.shape, start.dtype)
In [13]: d_start = gpuarray.empty(start.shape, start.dtype)
In [14]: def cuda_fft(start):
   ....:     d_start.set(start)
   ....:     cu_fft.fft(d_start, d_result, cuplan)
   ....:     return d_result.get()
In [15]: %timeit cuda_fft(start)
1 loops, best of 3: 1.7 s per loop

2) With 1 transfer:
In [18]: def cuda_fft_2():
    cu_fft.fft(d_start, d_result, cuplan)
    return d_result.get()
In [20]: %timeit cuda_fft_2()
1 loops, best of 3: 1.05 s per loop

3) Without transfer:
In [22]: def cuda_fft_3():
    cu_fft.fft(d_start, d_result, cuplan)

In [23]: %timeit cuda_fft_3()
1 loops, best of 3: 202 ms per loop
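
To put the numbers above side by side, here is a small back-of-the-envelope sketch. It only uses the timings reported in this message; the dictionary keys and helper names are mine, and the per-transfer figures assume that the differences between the three GPU runs are pure host/device copy time, which is a simplification.

```python
# Timings reported above, in seconds (best of 3).
timings = {
    "numpy": 10.2,
    "fftw_8_threads": 0.887,
    "cufft_2_transfers": 1.7,
    "cufft_1_transfer": 1.05,
    "cufft_no_transfer": 0.202,
}

# One 450x450x450 complex128 array: 16 bytes per element.
array_bytes = 450 ** 3 * 16


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`."""
    return timings[baseline] / timings[candidate]


def bandwidth_gb_s(seconds):
    """Effective bandwidth for moving one array in `seconds`, in GB/s."""
    return array_bytes / seconds / 1e9


# Assumption: the gap between two runs is the cost of the extra copy.
upload_s = timings["cufft_2_transfers"] - timings["cufft_1_transfer"]
download_s = timings["cufft_1_transfer"] - timings["cufft_no_transfer"]

if __name__ == "__main__":
    print("array size: %.3f GB" % (array_bytes / 1e9))
    print("FFTW vs NumPy: %.1fx" % speedup("numpy", "fftw_8_threads"))
    print("GPU (no transfer) vs FFTW: %.1fx"
          % speedup("fftw_8_threads", "cufft_no_transfer"))
    print("GPU (2 transfers) vs FFTW: %.2fx"
          % speedup("fftw_8_threads", "cufft_2_transfers"))
    print("implied upload: %.2f GB/s" % bandwidth_gb_s(upload_s))
    print("implied download: %.2f GB/s" % bandwidth_gb_s(download_s))
```

Under that assumption the copies dominate: with both transfers included the GPU run is slower than the 8-thread FFTW run, so the GPU only pays off when several operations are chained on the device.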

A GeForce Titan (1000€) can be about 4x faster than a pair of Xeon 5520 (2x 250€),
provided your data are already on the GPU.
Note: plan calculation is much faster on the GPU than on the CPU.
Jérôme Kieffer
tel +33 476 882 445
