On Wednesday, 25 January 2012 at 18:36:35 UTC, Manu wrote:
Can you paste disassemblies of the GDC code and the G++ code?
I imagine there's something trivial in the scheduler that GDC
has missed.
Like the other day I noticed GDC was unnecessarily generating a
stack frame
for leaf functions, which Iain already fixed.
I've uploaded it here:
https://github.com/downloads/jerro/pfft/disassembly.zip
For g++ there is the disassembly of the entire object file. For
gdc the disassembly included parts of the standard library and
was about 170k lines, so I copied just the functions that took
the most time. I have also included the percentages of time taken
by each function for "./bench 22". I noticed that 2% of the time
is taken by memset. This could be caused by static array
initialization in fft_passes_stride, so I'll void-initialize the
static arrays there.
For very small sizes the code compiled with g++ is probably
faster because it's more aggressively inlined.
I'd also be interested to try out my experimental std.simd
(portable)
library in the context of your FFT... might give that a shot, I
think it'll
work well.
For that you probably only need to replace the code in sse.d
(create stdsimd.d or something). The type that you pass as the
first parameter of the FFT template should define at least the
following:

- an alias "vec" for the vector type
- an alias "T" for the scalar type
- an enum "vec_size", which is the number of scalars in a vector
- a function "static vec scalar_to_vector(T)", which takes a
scalar and returns a vector with all elements set to that scalar
- a function "static void bit_reverse_swap_16(T * p0, T * p1,
T * p2, T * p3, size_t i1, size_t i2)"
- a function "static void bit_reverse_16(T * p0, T * p1, T * p2,
T * p3, size_t i)"
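A minimal skeleton of such a type could look like the sketch below. The member names are the required ones; the struct name is made up, and "vec" is just a plain static array standing in for a real SIMD type, so this is only an outline, not the real sse.d implementation:

```d
// Sketch only: a stand-in type showing the required members.
struct StdSimd(T_)
{
    alias T = T_;
    alias vec = T[4];           // stand-in for the vector type
    enum vec_size = 4;          // number of scalars in a vector

    // Broadcast a scalar to all elements of a vector.
    static vec scalar_to_vector(T s)
    {
        vec r = s;
        return r;
    }

    // static void bit_reverse_16(T* p0, T* p1, T* p2, T* p3, size_t i)
    // static void bit_reverse_swap_16(T* p0, T* p1, T* p2, T* p3,
    //                                 size_t i1, size_t i2)
    // ...to be filled in.
}
```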
You can start by using Scalar!T.bit_reverse_16 and
Scalar!T.bit_reverse_swap_16 from fft_impl.d, but that won't be
very fast. Those two functions do the following:

bit_reverse_16: This one reads 16 elements - four from p0 + i,
four from p1 + i, four from p2 + i and four from p3 + i. Then it
"bit reverses" them - this means that for each pair of indices j
and k such that k has the reversed bits of j, it swaps the
element at j and the element at k. I define the index here so
that the element that was read from address pn + i + m has index
n * 4 + m. After bit reversing them, it writes the elements back.
You can assume that (p0 + i), (p1 + i), (p2 + i) and (p3 + i)
are all aligned to 4 * T.sizeof.
bit_reverse_swap_16: This one reads 16 elements from (p0 + i),
(p1 + i), (p2 + i) and (p3 + i) and bit reverses them, and also
the 16 elements from (p0 + j), (p1 + j), (p2 + j) and (p3 + j)
and bit reverses those. Then it writes the elements that were
read from offsets i to offsets j and vice versa. You can assume
that (p0 + i), (p1 + i), (p2 + i), (p3 + i), (p0 + j), (p1 + j),
(p2 + j) and (p3 + j) are all aligned to 4 * T.sizeof.
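A scalar reference sketch of the two functions (roughly what the Scalar!T versions have to do; a real SIMD version would use shuffles instead of element-wise swaps) could look like this:

```d
import std.algorithm : swap;

// Reverse the 4 bits of an index in 0 .. 16, e.g. 0b0001 -> 0b1000.
uint reverse4(uint j)
{
    return ((j & 1) << 3) | ((j & 2) << 1) | ((j & 4) >> 1) | ((j & 8) >> 3);
}

// Scalar sketch of bit_reverse_16: the element read from pn + i + m
// has index n * 4 + m; swap elements whose indices are bit
// reversals of each other.
void bit_reverse_16(T)(T* p0, T* p1, T* p2, T* p3, size_t i)
{
    T*[4] p = [p0 + i, p1 + i, p2 + i, p3 + i];
    foreach (uint j; 0 .. 16)
    {
        auto k = reverse4(j);
        if (k > j)                      // touch each pair only once
            swap(p[j / 4][j % 4], p[k / 4][k % 4]);
    }
}

// Scalar sketch of bit_reverse_swap_16: bit reverse the block at
// offset i1 and the block at offset i2, then exchange the blocks.
void bit_reverse_swap_16(T)(T* p0, T* p1, T* p2, T* p3,
                            size_t i1, size_t i2)
{
    bit_reverse_16(p0, p1, p2, p3, i1);
    bit_reverse_16(p0, p1, p2, p3, i2);
    foreach (pn; [p0, p1, p2, p3])
        foreach (m; 0 .. 4)
            swap(pn[i1 + m], pn[i2 + m]);
}
```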
If you want to speed up the fft for large sizes (greater than or
equal to 1 << Options.largeLimit, which is roughly those that do
not fit in the L1 cache) a bit, you can also write these
functions:
- static void interleave(int n)(vec a, vec b, ref vec c, ref vec d)

Here n will be a power of two larger than 1 and less than or
equal to vec_size. The function breaks vector a into n equally
sized parts, does the same for b, then interleaves those parts
and writes them to c and d. For example: for vec_size = 4 and
n = 4 it should write (a0, b0, a1, b1) to c and (a2, b2, a3, b3)
to d. For vec_size = 4 and n = 2 it should write (a0, a1, b0, b1)
to c and (a2, a3, b2, b3) to d.

- static void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)

This is the inverse of interleave!n(a, b, c, d).
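As a sketch of the required semantics (again using a plain static array as a stand-in for vec; a real version would do this with shuffle/unpack intrinsics rather than element copies):

```d
alias vec = double[4];          // stand-in for a 4-lane vector type
enum vec_size = 4;

// Sketch of interleave!n: split a and b into n parts each and
// write the parts to c and d in alternating order.
void interleave(int n)(vec a, vec b, ref vec c, ref vec d)
{
    enum part = vec_size / n;   // scalars per part
    double[2 * vec_size] tmp;
    foreach (k; 0 .. n)
    {
        tmp[2 * k * part .. (2 * k + 1) * part] = a[k * part .. (k + 1) * part];
        tmp[(2 * k + 1) * part .. (2 * k + 2) * part] = b[k * part .. (k + 1) * part];
    }
    c[] = tmp[0 .. vec_size];
    d[] = tmp[vec_size .. 2 * vec_size];
}

// Sketch of deinterleave!n, the inverse of interleave!n.
void deinterleave(int n)(vec a, vec b, ref vec c, ref vec d)
{
    enum part = vec_size / n;
    double[2 * vec_size] tmp;
    tmp[0 .. vec_size] = a[];
    tmp[vec_size .. 2 * vec_size] = b[];
    foreach (k; 0 .. n)
    {
        c[k * part .. (k + 1) * part] = tmp[2 * k * part .. (2 * k + 1) * part];
        d[k * part .. (k + 1) * part] = tmp[(2 * k + 1) * part .. (2 * k + 2) * part];
    }
}
```

With a = (0, 1, 2, 3) and b = (4, 5, 6, 7), interleave!4 gives c = (0, 4, 1, 5) and d = (2, 6, 3, 7), matching the description above.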
- static void complex_array_to_real_imag_vec(int n)(float * arr,
ref vec rr, ref vec ri)

This one reads n pairs from arr. It repeats the first element of
each pair vec_size / n times and writes them to rr. It also
repeats the second element of each pair vec_size / n times and
writes them to ri. For example: if vec_size is 4 and n is 2 then
it should write (arr[0], arr[0], arr[2], arr[2]) to rr and
(arr[1], arr[1], arr[3], arr[3]) to ri. You can assume that arr
is aligned to 2 * n * T.sizeof.
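A scalar sketch of those semantics (vec is again a plain static array stand-in; a fast version would use broadcasts/shuffles):

```d
alias vec = float[4];           // stand-in for a 4-lane vector type
enum vec_size = 4;

// Sketch of complex_array_to_real_imag_vec: read n (re, im) pairs
// from arr; repeat each real part vec_size / n times into rr and
// each imaginary part vec_size / n times into ri.
void complex_array_to_real_imag_vec(int n)(float* arr,
                                           ref vec rr, ref vec ri)
{
    enum rep = vec_size / n;    // how often each element is repeated
    foreach (k; 0 .. n)
        foreach (r; 0 .. rep)
        {
            rr[k * rep + r] = arr[2 * k];       // real part of pair k
            ri[k * rep + r] = arr[2 * k + 1];   // imaginary part of pair k
        }
}
```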
The functions interleave, deinterleave and
complex_array_to_real_imag_vec are called in the inner loop, so
if you can't make them very fast, you should just omit them. The
algorithm will then fall back (it checks for the presence of
interleave!2) to a scalar one for the passes where these
functions would otherwise be used.
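Such a presence check can be done with an ordinary compile-time test; a hypothetical sketch (the actual check in fft_impl.d may be written differently) would be:

```d
// Hypothetical sketch: detect at compile time whether a vector
// type V provides interleave!2, and pick the code path accordingly.
enum hasInterleave(V) = is(typeof({
    V.vec a, b;
    V.interleave!2(a, b, a, b);
}));

struct WithInterleave
{
    alias vec = float[4];
    static void interleave(int n)(vec a, vec b, ref vec c, ref vec d) {}
}

struct WithoutInterleave
{
    alias vec = float[4];
}

static assert(hasInterleave!WithInterleave);
static assert(!hasInterleave!WithoutInterleave);
```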