The FFTW threading feature doesn't work all that well. Here's a long
dialog about it here.
https://github.com/gnuradio/gnuradio/issues/2802
Ron
On 2/17/21 11:52, Brian Padalino wrote:
On Wed, Feb 17, 2021 at 12:02 PM Marcus Müller <muel...@kit.edu
<mailto:muel...@kit.edu>> wrote:
Hi Brian,
On 17.02.21 16:55, Brian Padalino wrote:
> On Wed, Feb 17, 2021 at 10:05 AM Marcus Müller <muel...@kit.edu
<mailto:muel...@kit.edu> <mailto:muel...@kit.edu
<mailto:muel...@kit.edu>>>
> wrote:
>
> Oh, sorry, didn't mean to imply that! FFT interpolation
might work well (if you can live
> with the sinc sidelobes).
>
>
> If the bandwidth is already constrained to 20MHz/23MHz, then
there would be no sidelobes -
> correct?
What the FFT resampler does is, in the end, interpolation with 23
sincs, each at the
"original" spectral position.
These sincs' sidelobes of course extent left and right beyond the
23 original "carriers"
(thinking of OFDM here)!
Just wanted to clarify - 20 carriers are occupied, but sample rate
goes out to 23 carriers. If we assume the extent of the carriers is
already at a stopband and sufficiently bandlimited, the sinc sidelobes
would never go above the already established stopband - no extra
spectral growth really - just extension of the stopband. Right?
The question here is whether that matters: if the bandwidth of
interest is -10 to +10 MHz,
what do I care about what happens beyond that? The answer to that
quesiton lies in Mark's
application :)
The level of the newly created spectrum should be at the already
established stopband or below it. Right?
> I do have a question, though: Why do you go for a 4600-FFT?
why not simply a 23 FFT?
>
> Larger n yields greater efficiencies, right? Doing lots of
small calls isn't necessarily
> as efficient as doing less large calls, so long as you can
handle the latency and the
> processor can stay fed with data. I figure ~4000 samples is a
good compromise, but maybe
> larger ones work out better, too.
Well, quick calc, and acting as if O-notation was actually
meaningful on hardware (see
David's answer, he's got a point about caches: on my CPU, a 32×32
bit multiplication
instruction (non-SIMD) takes about 1/400 of getting memory from
random positions in RAM
(cache misses), on average, IIRC.
If we've got M samples to process, and we're doing N_1 or
N_2=100·N_1 length transforms,
the complexity of the first is
a = O( (M/N_1) · N_·log(N_1) ) = O( M · log(N_1) )
of the second, it'd be
O( (M/100/N_1) · 100·N_1 log(100·N_1) ) = O( M · log(100·N_1) )
= O( M · log(N_1) + M log(100) )
which is worse, but:
>
> For reference, here's some performance benchmarks for FFTW
(randomly chosen processor):
>
> http://www.fftw.org/speed/Pentium4-3.06GHz/
<http://www.fftw.org/speed/Pentium4-3.06GHz/>
>
> Single precision, non powers of 2, 1d transforms are some good
benchmarks to look at for
> something like this. Obviously some numbers are better than
others, but the general trend
> seems to skew towards the mid-thousands as being a decent number
to target for maximizing
> throughput.
Real-world data, and I fully agree, always trumps O-notation
(which simply is a bad model
for any real-world problem, usually).
Luckily, I've had this same problem before, so... See attachment!
You should be able to call it like
fft_rate -l 46
to set the FFT length to 46, and add a -n 10 to make the FFT do 10
threads (which usually
hurts more than it helps). (Need to multiply the reported rate by
the FFT length, it
counts transforms, not samples.)
The single-threaded 2²⁴-FFT I do runs at ca. 34.5 MS/s throughput.
If I run
fft_rate -l $(( 23*2*100 ))
I get rougly 87.4 MS/s.
If I run at
fft_rate -l $(( 23*2 ))
I get ca 92 MS/s.
I'd call this a clear "meh" result!
Fair enough regarding the smaller FFT size. I didn't think it would
work out this way, but it also got me scratching my head. I wouldn't
have imagined parallelizing a bunch of FFT's would make things worse.
And then I went to the code, and I see this:
https://github.com/gnuradio/gnuradio/blob/fa3267982aa88e3c8f886d803b57f9b1501f43f6/gr-fft/lib/fft.cc#L231
So this call is initializing fftw as an n-thread operation. The call
to fftw is still synchronous, so we're asking fftw to parallelize the
FFT and ensure it's all coalesced by the time the call returns. I can
understand why this would hurt FFT performance.
I then ran the program with 4 or 8 threads as an argument, and looked
at top - but only a single processor was being used. Maybe my
gnuradio wasn't compiled with multithreading fftw enabled?
For your example program with defaults, I get a rate_avg on a single
core of 91 transforms/sec but only a single core is really being
used. When I spin up 4 individual instances, each one gets around 55,
so an aggregate of 220 transforms/sec.
When I switch to 4600 for the block size and 4 separate instances, I
get around 20k transforms/sec for each, or 80k transforms/sec in
aggregate. This translates to 368Msamples/sec - right? Note my
processor is an Intel i7-7700k. Also note that changing the n input
argument doesn't help me much since it doesn't seem that fftw is
really parallelizing anything.
I would imagine setting n would yield the same results as running n
instances, but since it doesn't, should the implementation be modified
to not rely on fftw to perform the multithreading, but instead ask
gnuradio to handle the multithreading via a threadpool?
Marcus, can you see if running n instances versus n threads yields the
same performance results on your system?
Brian