On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote:
Can you paste disassemblies of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC
has missed.
Like the other day I noticed GDC was unnecessarily generating a
stack frame
for leaf functions, which Iain already fixed.
I've uploaded it here:
https://github.com/downloads/jerro/pfft/disassembly.zip
For g++ there is the disassembly of the entire object file. For
gdc the disassembly included parts of the standard library and
was about 170k lines, so I copied just the functions that took
the most time. I have also included the percentages of time taken
by each function for "./bench 22". I noticed that 2% of the time
is taken by memset. This could be caused by static array
initialization in fft_passes_stride, so I'll void-initialize the
static arrays there.
For very small sizes the code compiled with g++ is probably
faster because it's more aggressively inlined.
I'd also be interested to try out my experimental std.simd
(portable)
library in the context of your FFT... might give that a shot, I
think it'll
work well.
For that you probably only need to replace the code in sse.d
(create stdsimd.d or something). The type that you pass as the
first parameter of the FFT template should define at least the
following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum "vec_size", which is the number of scalars in a vector
- a function "static vec scalar_to_vector(T)", which takes a
scalar and returns a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1,
T * p2, T * p3, size_t i1, size_t i2)"
- a function "static void bit_reverse_16(T * p0, T * p1, T * p2,
T * p3, size_t i)"
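A minimal skeleton of such a type could look like the sketch below. The member names are the required ones; the struct name is made up, and "vec" is just a plain static array standing in for a real SIMD type, so this is only an outline, not the real sse.d implementation:

```d
// Sketch only: a stand-in type showing the required members.
struct StdSimd(T_)
{
    alias T = T_;
    alias vec = T[4];           // stand-in for the vector type
    enum vec_size = 4;          // number of scalars in a vector

    // Broadcast a scalar to all elements of a vector.
    static vec scalar_to_vector(T s)
    {
        vec r = s;
        return r;
    }

    // static void bit_reverse_16(T* p0, T* p1, T* p2, T* p3, size_t i)
    // static void bit_reverse_swap_16(T* p0, T* p1, T* p2, T* p3,
    //                                 size_t i1, size_t i2)
    // ...to be filled in.
}
```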
You can start by using Scalar!T.bit_reverse_16 and
Scalar!T.bit_reverse_swap_16 from fft_impl.d, but that won't be
very fast. Those two functions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i,
four from p1 + i, four from p2 + i and four from p3 + i. Then it
"bit reverses" them - this means that for each pair of indices j
and k such that k has the reversed bits of j, it swaps the
element at j and the element at k. I define the index here so
that the element that was read from address pn + i + m has index
n * 4 + m. After bit reversing them, it writes the elements back.
You can assume that (p0 + i), (p1 + i), (p2 + i) and (p3 + i)
are all aligned to 4 * T.sizeof.
bit_reverse_swap_16: This one reads 16 elements from (p0 + i),
(p1 + i), (p2 + i) and (p3 + i) and bit reverses them, and also
the 16 elements from (p0 + j), (p1 + j), (p2 + j) and (p3 + j)
and bit reverses those. Then it writes the elements that were
read from offsets i to offsets j and vice versa. You can assume
that (p0 + i), (p1 + i), (p2 + i), (p3 + i), (p0 + j), (p1 + j),
(p2 + j) and (p3 + j) are all aligned to 4 * T.sizeof.
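A scalar reference sketch of the two functions (roughly what the Scalar!T versions have to do; a real SIMD version would use shuffles instead of element-wise swaps) could look like this:

```d
import std.algorithm : swap;

// Reverse the 4 bits of an index in 0 .. 16, e.g. 0b0001 -> 0b1000.
uint reverse4(uint j)
{
    return ((j & 1) << 3) | ((j & 2) << 1) | ((j & 4) >> 1) | ((j & 8) >> 3);
}

// Scalar sketch of bit_reverse_16: the element read from pn + i + m
// has index n * 4 + m; swap elements whose indices are bit
// reversals of each other.
void bit_reverse_16(T)(T* p0, T* p1, T* p2, T* p3, size_t i)
{
    T*[4] p = [p0 + i, p1 + i, p2 + i, p3 + i];
    foreach (uint j; 0 .. 16)
    {
        auto k = reverse4(j);
        if (k > j)                      // touch each pair only once
            swap(p[j / 4][j % 4], p[k / 4][k % 4]);
    }
}

// Scalar sketch of bit_reverse_swap_16: bit reverse the block at
// offset i1 and the block at offset i2, then exchange the blocks.
void bit_reverse_swap_16(T)(T* p0, T* p1, T* p2, T* p3,
                            size_t i1, size_t i2)
{
    bit_reverse_16(p0, p1, p2, p3, i1);
    bit_reverse_16(p0, p1, p2, p3, i2);
    foreach (pn; [p0, p1, p2, p3])
        foreach (m; 0 .. 4)
            swap(pn[i1 + m], pn[i2 + m]);
}
```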
If you want to speed up the fft for large sizes (greater than or
equal to 1 << Options.largeLimit, which is roughly those that do
not fit in the L1 cache) a bit, you can also write these
functions:
- static void interleave(int n)(vec a, vec b, ref vec c, ref vec d)

Here n will be a power of two larger than 1 and less than or
equal to vec_size. The function breaks vector a into n equally
sized parts, does the same for b, then interleaves those parts
and writes them to c and d. For example: for vec_size = 4 and
n = 4 it should write (a0, b0, a1, b1) to c and (a2, b2, a3, b3)
to d. For vec_size = 4 and n = 2 it should write (a0, a1, b0, b1)
to c and (a2, a3, b2, b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)

This is the inverse of interleave!n(a, b, c, d).
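As a sketch of the required semantics (again using a plain static array as a stand-in for vec; a real version would do this with shuffle/unpack intrinsics rather than element copies):

```d
alias vec = double[4];          // stand-in for a 4-lane vector type
enum vec_size = 4;

// Sketch of interleave!n: split a and b into n parts each and
// write the parts to c and d in alternating order.
void interleave(int n)(vec a, vec b, ref vec c, ref vec d)
{
    enum part = vec_size / n;   // scalars per part
    double[2 * vec_size] tmp;
    foreach (k; 0 .. n)
    {
        tmp[2 * k * part .. (2 * k + 1) * part] = a[k * part .. (k + 1) * part];
        tmp[(2 * k + 1) * part .. (2 * k + 2) * part] = b[k * part .. (k + 1) * part];
    }
    c[] = tmp[0 .. vec_size];
    d[] = tmp[vec_size .. 2 * vec_size];
}

// Sketch of deinterleave!n, the inverse of interleave!n.
void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)
{
    enum part = vec_size / n;
    double[2 * vec_size] tmp;
    tmp[0 .. vec_size] = a[];
    tmp[vec_size .. 2 * vec_size] = b[];
    foreach (k; 0 .. n)
    {
        c[k * part .. (k + 1) * part] = tmp[2 * k * part .. (2 * k + 1) * part];
        d[k * part .. (k + 1) * part] = tmp[(2 * k + 1) * part .. (2 * k + 2) * part];
    }
}
```

With a = (0, 1, 2, 3) and b = (4, 5, 6, 7), interleave!4 gives c = (0, 4, 1, 5) and d = (2, 6, 3, 7), matching the description above.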
- static void complex_array_to_real_imag_vec(int n)(float * arr,
ref vec rr, ref vec ri)

This one reads n pairs from arr. It repeats the first element of
each pair vec_size / n times and writes them to rr. It also
repeats the second element of each pair vec_size / n times and
writes them to ri. For example: if vec_size is 4 and n is 2 then
it should write (arr[0], arr[0], arr[2], arr[2]) to rr and
(arr[1], arr[1], arr[3], arr[3]) to ri. You can assume that arr
is aligned to 2 * n * T.sizeof.
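A scalar sketch of those semantics (vec is again a plain static array stand-in; a fast version would use broadcasts/shuffles):

```d
alias vec = float[4];           // stand-in for a 4-lane vector type
enum vec_size = 4;

// Sketch of complex_array_to_real_imag_vec: read n (re, im) pairs
// from arr; repeat each real part vec_size / n times into rr and
// each imaginary part vec_size / n times into ri.
void complex_array_to_real_imag_vec(int n)(float* arr,
                                           ref vec rr, ref vec ri)
{
    enum rep = vec_size / n;    // how often each element is repeated
    foreach (k; 0 .. n)
        foreach (r; 0 .. rep)
        {
            rr[k * rep + r] = arr[2 * k];       // real part of pair k
            ri[k * rep + r] = arr[2 * k + 1];   // imaginary part of pair k
        }
}
```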
The functions interleave, deinterleave and
complex_array_to_real_imag_vec are called in the inner loop, so
if you can't make them very fast, you should just omit them. The
algorithm will then fall back (it checks for the presence of
interleave!2) to a scalar one for the passes where these
functions would otherwise be used.
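Such a presence check can be done with an ordinary compile-time test; a hypothetical sketch (the actual check in fft_impl.d may be written differently) would be:

```d
// Hypothetical sketch: detect at compile time whether a vector
// type V provides interleave!2, and pick the code path accordingly.
enum hasInterleave(V) = is(typeof({
    V.vec a, b;
    V.interleave!2(a, b, a, b);
}));

struct WithInterleave
{
    alias vec = float[4];
    static void interleave(int n)(vec a, vec b, ref vec c, ref vec d) {}
}

struct WithoutInterleave
{
    alias vec = float[4];
}

static assert(hasInterleave!WithInterleave);
static assert(!hasInterleave!WithoutInterleave);
```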