On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote:
Can you paste disassembly's of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC has missed.

Like the other day I noticed GDC was unnecessarily generating a stack frame
for leaf functions, which Iain already fixed.

I've uploaded it here:

https://github.com/downloads/jerro/pfft/disassembly.zip

For g++ there is the disassembly of the entire object file. For gdc that included parts of the standard library and was about 170k lines so I coppied just the functions that took the most time. I have also included percentages of time taken by functions for "./bench 22". I noticed that 2% of time is taken by memset. This could be caused by static array initialization in fft_passes_stride, so I'll void initialize static arrays there.

For very small sizes the code compiled with g++ is probably faster because it's more aggressively inlined.

I'd also be interested to try out my experimental std.simd (portable) library in the context of your FFT... might give that a shot, I think it'll
work well.

For that you probably only need to replace the code in sse.d (create stdsimd.d or something). The type that you pass as a first parameter for the FFT template should define at least the following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum vec_size which is the number of scalars in a vector - a function "static vec scalar_to_vector(T)", which takes a scalar and returns
a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1, T * p2, T * p3, size_t i1, size_t i2)" - a function "static void bit_reverse_16(T * p0, T * p1, T * p2, T * p3, size_t i)"

You can start with using Scalar!T.bit_reverse_16 and Scalar!T.bit_reverse_swap_16 from fft_impl.d but that won't be very fast. Those to funcions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i, four from p1 + i, four from p2 + i and four from p3 + i. Then it "bit reverses" them - this means that for each pair of indices j and k such that k has the reverse bits of j, it swaps element at j and element at k. I define the index here so that the element that was read from address pn + i + m has index n * 4 + m. After bit reversing them, it writes the elements back. You cann assume that (p0 + i), (p1 + i), (p2 + i) and (p3 + i) are all aligned to 4*T.sizeof.

bit_reverse_swap_16: This one reads 16 elements from (p0 + i), (p1 + i), (p2 + i) and (p3 + i) and bitreverses them and also the elements from (p0 + j), (p1 + j), (p2 + j) and (p3 + j) and bitreverses those. Then it writes the elements that were read from offsets i to offsets j and vice versa. You can assue that (p0 + i), (p1 + i), (p2 + i) and (p3 + i), (p0 + j), (p1 + j), (p2 + j) and (p3 + j) are all aligned to 4*T.sizeof.

If you want to speed up the fft for larg sizes (greater or equal to 1<<Options.largeLimit, which is roughly those that do not fit in L1 cache) a bit, you can also write those functions:

- static void interleave(int n)(vec a, vec b, ref vec c, ref vec d) Here n will be a power of two larger than 1 and less or eaqual to vec_size. The function breaks vector a into n equaly sized parts and does the same for b, then interleaves those parts and writes them to c,d. For example: for vec_size = 4 and n = 4 it should write (a0, b0, a1, b1) to c and (a2, b2, a3, b3) to d. For vec_size = 4 and n = 2 it should write (a0, a1, b0, b1) to c and (a2,a3,b2,b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)
This is an inverse of interleave!n(a, b, c, d)

- static void complex_array_to_real_imag_vec(int n)(float * arr, ref vec rr, ref vec ri) This one reads n pairs from arr. It repeats the first element of each pair vec_size / n times and writes them to rr. It also repeats the second element of each pair vec_size / n times and writes them to ri. For example: if vec_size is 4 and n is 2 then the it should write (arr[0], arr[0], arr[2], arr[2]) to rr and (arr[1], arr[1], arr[3], arr[3]) to ri. You can assume that arr
is aligned to 2*n*T.sizeof.

The functions interleave, deinterleave and complex_array_to_real_imag_vec are called in the inner loop, so if you can't make them very fast, you should just ommit them. The algorithm will then fall back (it checks for the presence of interleave!2) to a scalar one for the passes where these functions would be used otherwise.

Reply via email to