Re: [julia-users] FFTW.REDFT00 and FFT.RODFT00 are ~10x slower than other routines?

2015-09-25 Thread Steven G. Johnson


On Wednesday, September 23, 2015 at 9:07:51 PM UTC-4, Sheehan Olver wrote:
>
> OK that makes sense.  But then why is rfft on a vector of length 2*(n-1) more 
> than 2x faster than FFTW.REDFT00?
>
> julia> r=rand(10^5); @time for k=1:100 FFTW.r2r(r,FFTW.REDFT00) end;
>   2.496130 seconds (8.30 k allocations: 76.703 MB, 0.76% gc time)
> julia> r=rand(2*(10^5-1)); @time for k=1:100 rfft(r) end;
>   0.943706 seconds (8.30 k allocations: 152.985 MB, 1.58% gc time)
>

The short answer is that rfft is much more optimized in FFTW than the 
r2r transforms.

The long answer is REDFT00 transforms of *even* lengths n are especially 
bad, because the algorithms for this problem are not very attractive.   For 
an even length, the options are:

1) Use an algorithm from FFTPACK or Numerical Recipes that turns it into 
an rfft of the same length plus O(n) pre/post-processing.  We implemented 
this, but don't use it because it seems to suffer from severe accuracy 
problems: https://github.com/FFTW/fftw3/blob/master/reodft/redft00e-r2hc.c

2) Pad symmetrically to an rfft of twice the length.  This is accurate, 
but sacrifices a factor of 2 in speed, as you noticed: 
https://github.com/FFTW/fftw3/blob/master/reodft/redft00e-r2hc-pad.c

3) If n-1 has a small prime factor r, implement a pruned version of the 
analogous radix-r Cooley-Tukey algorithm.  This would work and be accurate, 
but is a lot of work to implement, and we didn't bother except in the case 
where n is *odd* (so that we can use radix r=2, or actually split 
radix): https://github.com/FFTW/fftw3/blob/master/reodft/reodft00e-splitradix.c
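Option 2 is easy to see in code.  The sketch below (Python/NumPy for illustration, since the identity is language-independent; the helper names are my own) builds the even-symmetric extension of length 2(n-1) and checks that the real part of its rfft matches FFTW's unnormalized REDFT00 definition, evaluated here by a direct O(n^2) sum:

```python
import numpy as np

def redft00_direct(x):
    # FFTW's unnormalized REDFT00 (DCT-I) definition:
    # Y_k = x_0 + (-1)^k x_{n-1} + 2 * sum_{j=1}^{n-2} x_j cos(pi j k / (n-1))
    n = len(x)
    k = np.arange(n)
    y = x[0] + (-1.0) ** k * x[-1]
    for j in range(1, n - 1):
        y += 2.0 * x[j] * np.cos(np.pi * j * k / (n - 1))
    return y

def redft00_via_rfft(x):
    # option 2: even-symmetric padding to length N = 2(n-1),
    # i.e. [x_0, ..., x_{n-1}, x_{n-2}, ..., x_1], then one real FFT
    xe = np.concatenate([x, x[-2:0:-1]])
    return np.fft.rfft(xe).real   # imaginary parts vanish for real-even data

x = np.random.rand(10)
assert np.allclose(redft00_via_rfft(x), redft00_direct(x))
```

The padded transform does twice the work of an rfft of length n, which is exactly the factor-of-2 penalty described above.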
  

> PS  Why doesn't fft(::Vector{Float64}) automatically call rfft and 
> re-interpret the output? 
>

We could, but my feeling was that if you really care about factors of two 
in performance, you should be using rfft directly.  That also lets you save 
a factor of two in memory, a factor of two in any post-processing 
computations, and a factor of two in any subsequent inverse transform (if 
you need one, e.g. for convolutions or filtering).

Matlab, in contrast, doesn't expose an rfft interface, so you can't get all 
of the savings that are possible in this case; they might as well get what 
they can from fft(x).
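The redundancy being exploited here is the Hermitian symmetry of the spectrum of real data.  A minimal sketch (Python/NumPy for illustration) of the half-size rfft output, and of how the full fft result is recoverable from it:

```python
import numpy as np

x = np.random.rand(8)          # real input, even length n = 8
X = np.fft.fft(x)              # n complex outputs
Xr = np.fft.rfft(x)            # only n//2 + 1 complex outputs stored

# the remaining bins are redundant: X[n-k] = conj(X[k])
# (this mirroring step assumes even n; odd n omits the Nyquist bin)
X_full = np.concatenate([Xr, np.conj(Xr[-2:0:-1])])
assert np.allclose(X_full, X)

# the inverse direction gets the same factor-of-two saving:
assert np.allclose(np.fft.irfft(Xr, n=len(x)), x)
```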
 


Re: [julia-users] FFTW.REDFT00 and FFT.RODFT00 are ~10x slower than other routines?

2015-09-23 Thread Sheehan Olver
Hi Steven,

OK that makes sense.  But why is Julia 2x slower than Matlab for some 
values of n  (see below, the timing difference is consistent when looped over)? 
 Shouldn’t they both be using FFTW?


julia> r=rand(2*10^6);plan=plan_fft(r); @time plan*r;
  0.080122 seconds (12 allocations: 61.035 MB, 12.91% gc time)

julia> r=rand(2*10^6+1);plan=plan_fft(r); @time plan*r;
  0.287002 seconds (12 allocations: 61.036 MB, 3.76% gc time)

julia> r=rand(2*10^6-1);plan=plan_fft(r); @time plan*r;
  0.218389 seconds (12 allocations: 61.035 MB, 4.76% gc time)


>> r=rand(2*10^6,1);tic();fft(r);toc()
Elapsed time is 0.023956 seconds.
>> r=rand(2*10^6+1,1);tic();fft(r);toc()
Elapsed time is 0.303320 seconds.
>> r=rand(2*10^6-1,1);tic();fft(r);toc()
Elapsed time is 0.131212 seconds.





> On 23 Sep 2015, at 3:52 am, Steven G. Johnson  wrote:
> 
> 
> 
> On Tuesday, September 22, 2015 at 6:51:50 AM UTC-4, Sheehan Olver wrote:
> I get the following timings with the various FFTW routines, where I ran each 
> line multiple times to make sure it was accurate.  Why are REDFT00 and 
> RODFT00 almost 10x slower?
> 
> This is because a REDFT00 (DCT-I) of n points is exactly equivalent to a DFT 
> of N=2(n-1) points with real-even data.  Although FFTW uses a different 
> algorithm, not a size-N complex FFT, the fast algorithms are essentially 
> equivalent to pruned versions of the FFT algorithms for N, and depend on the 
> factorization of N, not of n.  Similarly for RODFT00 (DST-I), except N=2(n+1) 
> in that case.  In contrast, the other DCT/DST types correspond to DFTs of 
> length N=2n or N=4n or N=8n, so their speed depends on the factorization of n.
> 
> If n=10^5 as in your example, then n-1 = 99999 has large prime factors 
> (41, 271), so the FFT algorithms are much slower, although they are still 
> O(n log n).  If you try n=10^5+1, you will notice that REDFT00 becomes 
> much faster (but other DCT types become slower).
> 
> This is mentioned in the FFTW manual:
> 
> http://www.fftw.org/doc/Real-even_002fodd-DFTs-_0028cosine_002fsine-transforms_0029.html
> 
> "FFTW is most efficient when N is a product of small factors; note that this 
> differs from the factorization of the physical size n for REDFT00 and 
> RODFT00!"
> 
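The factorization dependence quoted above is easy to check numerically.  A small sketch (Python, trial division; it assumes the example size was n = 10^5, consistent with the quoted factors since 99999 = 3^2 * 41 * 271):

```python
def factorize(n):
    """Return the prime factorization of n as a {prime: exponent} dict."""
    factors, d = {}, 2
    while d * d <= n:
        while n % d == 0:
            factors[d] = factors.get(d, 0) + 1
            n //= d
        d += 1
    if n > 1:
        factors[n] = factors.get(n, 0) + 1
    return factors

# REDFT00 of n points behaves like a DFT of N = 2(n-1) points, so its
# speed depends on the factors of n-1, not of n:
assert factorize(10**5 - 1) == {3: 2, 41: 1, 271: 1}  # large factors -> slow
assert factorize(10**5) == {2: 5, 5: 5}  # n = 10^5 + 1: small factors -> fast
```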



Re: [julia-users] FFTW.REDFT00 and FFT.RODFT00 are ~10x slower than other routines?

2015-09-23 Thread Steven G. Johnson


On Wednesday, September 23, 2015 at 3:47:43 AM UTC-4, Sheehan Olver wrote:
>
> OK that makes sense.  But why is Julia 2x slower than Matlab for some 
> values of n  (see below, the timing difference is consistent when looped 
> over)?  Shouldn’t they both be using FFTW?
>

I think that maybe Matlab defaults to using FFTW's real-data routines when 
the data is real, whereas Julia only uses the real-data routines if you 
request them via rfft etc.   (The rfft functions have the advantage of 
requiring half of the storage for the output compared to a complex FFT, 
whereas I think Matlab pads the output back to the full length using the 
mirror symmetries.)


Re: [julia-users] FFTW.REDFT00 and FFT.RODFT00 are ~10x slower than other routines?

2015-09-23 Thread Sheehan Olver
OK that makes sense.  But then why is rfft on a vector of length 2*(n-1) more 
than 2x faster than FFTW.REDFT00?

julia> r=rand(10^5);@time for k=1:100 FFTW.r2r(r,FFTW.REDFT00) end;
  2.496130 seconds (8.30 k allocations: 76.703 MB, 0.76% gc time)
julia> r=rand(2*(10^5-1));@time for k=1:100 rfft(r) end;
  0.943706 seconds (8.30 k allocations: 152.985 MB, 1.58% gc time)



PS  Why doesn't fft(::Vector{Float64}) automatically call rfft and re-interpret 
the output?




> On 24 Sep 2015, at 10:59 am, Steven G. Johnson  wrote:
> 
> 
> 
> On Wednesday, September 23, 2015 at 3:47:43 AM UTC-4, Sheehan Olver wrote:
>   OK that makes sense.  But why is Julia 2x slower than Matlab for some 
> values of n  (see below, the timing difference is consistent when looped 
> over)?  Shouldn’t they both be using FFTW?
> 
> I think that maybe Matlab defaults to using FFTW's real-data routines when 
> the data is real, whereas Julia only uses the real-data routines if you 
> request them via rfft etc.   (The rfft functions have the advantage of 
> requiring half of the storage for the output compared to a complex FFT, 
> whereas I think Matlab pads the output back to the full length using the 
> mirror symmetries.)



[julia-users] FFTW.REDFT00 and FFT.RODFT00 are ~10x slower than other routines?

2015-09-22 Thread Sheehan Olver
I get the following timings with the various FFTW routines, where I ran 
each line multiple times to make sure it was accurate.  Why are REDFT00 and 
RODFT00 almost 10x slower?


r=rand(10^5)
@time FFTW.r2r(r,FFTW.REDFT00) #0.26s
@time FFTW.r2r(r,FFTW.REDFT01) #0.0033s
@time FFTW.r2r(r,FFTW.REDFT10) #0.0035s
@time FFTW.r2r(r,FFTW.REDFT11) #0.0033s
@time FFTW.r2r(r,FFTW.RODFT00) #0.017s
@time FFTW.r2r(r,FFTW.RODFT01) #0.0035s
@time FFTW.r2r(r,FFTW.RODFT10) #0.0035s
@time FFTW.r2r(r,FFTW.RODFT11) #0.0035s
@time fft(r)   #0.0033s
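[As the replies above explain, the RODFT00 slowdown has the same origin: a DST-I of n points is equivalent to a DFT of N = 2(n+1) points with real-odd data.  A sketch of that equivalence (Python/NumPy for illustration; helper names are my own), checking the odd-symmetric padding against FFTW's unnormalized RODFT00 definition evaluated by a direct sum:]

```python
import numpy as np

def rodft00_direct(x):
    # FFTW's unnormalized RODFT00 (DST-I) definition:
    # Y_k = 2 * sum_{j=0}^{n-1} x_j sin(pi (j+1)(k+1) / (n+1))
    n = len(x)
    k = np.arange(n)
    y = np.zeros(n)
    for j in range(n):
        y += 2.0 * x[j] * np.sin(np.pi * (j + 1) * (k + 1) / (n + 1))
    return y

def rodft00_via_rfft(x):
    # odd-symmetric padding to length N = 2(n+1):
    # [0, x_0, ..., x_{n-1}, 0, -x_{n-1}, ..., -x_0]
    n = len(x)
    xe = np.concatenate([[0.0], x, [0.0], -x[::-1]])
    # the DFT of real-odd data is purely imaginary; bins 1..n carry the DST-I
    return -np.fft.rfft(xe).imag[1 : n + 1]

x = np.random.rand(10)
assert np.allclose(rodft00_via_rfft(x), rodft00_direct(x))
```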