On Wed, Feb 17, 2021 at 12:02 PM Marcus Müller <muel...@kit.edu> wrote:

> Hi Brian,
>
> On 17.02.21 16:55, Brian Padalino wrote:
> > On Wed, Feb 17, 2021 at 10:05 AM Marcus Müller <muel...@kit.edu>
> > wrote:
> >
> >     Oh, sorry, didn't mean to imply that! FFT interpolation might work
> >     well (if you can live with the sinc sidelobes).
> >
> >
> > If the bandwidth is already constrained to 20 MHz/23 MHz, then there
> > would be no sidelobes - correct?
>
> What the FFT resampler does is, in the end, interpolation with 23 sincs,
> each at the
> "original" spectral position.
> These sincs' sidelobes of course extend left and right beyond the 23
> original "carriers" (thinking of OFDM here)!
>

Just wanted to clarify - 20 carriers are occupied, but the sample rate goes
out to 23 carriers.  If we assume the signal is already sufficiently
bandlimited, i.e. the edge of the 20 occupied carriers already sits at
stopband level, then the sinc sidelobes would never rise above that
established stopband - no real spectral growth, just an extension of the
stopband.  Right?
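
Here's a rough numpy sketch of how I'd check that empirically.  The numbers
(one 4600-sample block, 2x interpolation, an ideal zero stopband in the test
signal) are placeholders I made up, not Mark's actual parameters:

import numpy as np

rng = np.random.default_rng(0)
N = 4600                      # block length: 200 "symbols" of 23 samples (made up)

# Build a long test stream whose spectrum only occupies |f| <= 10/23 of the
# sample rate (i.e. 20 of 23 "carriers"), with an ideal (zero) stopband.
M = 64 * N
occ = int(M * 10 / 23)
S = np.zeros(M, dtype=complex)
S[1:occ + 1] = rng.standard_normal(occ) + 1j * rng.standard_normal(occ)
S[-occ:]     = rng.standard_normal(occ) + 1j * rng.standard_normal(occ)
stream = np.fft.ifft(S)

# Cut one N-sample block out of the middle, so it is *not* periodic over N.
blk = stream[M // 2 : M // 2 + N]

# FFT interpolation by 2x: keep the N original bins, zero-pad in the middle.
X = np.fft.fft(blk)
L = 2 * N
Y = np.zeros(L, dtype=complex)
Y[:N // 2]    = X[:N // 2]
Y[-(N // 2):] = X[-(N // 2):]
y = np.fft.ifft(Y) * (L / N)

# Compare the mean in-band level against everything outside |f| <= 10/23 of
# the *old* rate (the old stopband plus the newly created spectrum).
P = np.abs(np.fft.fft(y)) ** 2
f = np.fft.fftfreq(L, d=0.5)          # frequencies in units of the old rate
inband  = P[np.abs(f) <= 10 / 23].mean()
outband = P[np.abs(f) >  10 / 23].mean()
print("out-of-band level: %.1f dB relative to in-band"
      % (10 * np.log10(outband / inband)))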


>
> The question here is whether that matters: if the bandwidth of interest is
> -10 to +10 MHz,
> what do I care about what happens beyond that? The answer to that question
> lies in Mark's
> application :)
>

The level of the newly created spectrum should be at the already
established stopband or below it.  Right?


>
> >     I do have a question, though: Why do you go for a 4600-FFT? Why not
> >     simply a 23-FFT?
> >
> > Larger n yields greater efficiencies, right?  Doing lots of small calls
> > isn't necessarily as efficient as doing fewer large calls, so long as you
> > can handle the latency and the processor can stay fed with data.  I figure
> > ~4000 samples is a good compromise, but maybe larger ones work out better,
> > too.
>
> Well, quick calc, and acting as if O-notation were actually meaningful on
> hardware (see David's answer, he's got a point about caches): on my CPU, a
> 32×32 bit multiplication instruction (non-SIMD) takes about 1/400 of the
> time of a fetch from a random position in RAM (cache miss), on average,
> IIRC.
>
> If we've got M samples to process, and we're doing N_1 or N_2=100·N_1
> length transforms,
> the complexity of the first is
>
> a = O( (M/N_1) · N_1·log(N_1) ) = O( M · log(N_1) )
>
> of the second, it'd be
>
> O( (M/(100·N_1)) · 100·N_1·log(100·N_1) ) = O( M · log(100·N_1) )
> = O( M · log(N_1) + M·log(100) )
>
> which is worse, but:
>
> >
> > For reference, here are some performance benchmarks for FFTW (randomly
> > chosen processor):
> >
> >   http://www.fftw.org/speed/Pentium4-3.06GHz/
> >
> > Single precision, non-powers-of-2, 1d transforms are some good benchmarks
> > to look at for something like this.  Obviously some numbers are better
> > than others, but the general trend seems to skew towards the mid-thousands
> > as being a decent number to target for maximizing throughput.
>
> Real-world data, I fully agree, always trumps O-notation (which is usually
> a bad model for real-world problems anyway).
>
> Luckily, I've had this same problem before, so... See attachment!
>
> You should be able to call it like
>
> fft_rate -l 46
>
> to set the FFT length to 46, and add a -n 10 to make the FFT do 10 threads
> (which usually hurts more than it helps). (You need to multiply the reported
> rate by the FFT length; it counts transforms, not samples.)
>
> The single-threaded 2²⁴-FFT I do runs at ca. 34.5 MS/s throughput. If I run
>
> fft_rate -l $(( 23*2*100 ))
>
> I get roughly 87.4 MS/s.
> If I run at
>
> fft_rate -l $(( 23*2 ))
>
> I get ca. 92 MS/s.
>
> I'd call this a clear "meh" result!
>

Fair enough regarding the smaller FFT size.  I didn't think it would work
out this way, and it got me scratching my head: I wouldn't have imagined
that parallelizing a bunch of FFTs would make things worse.  And then I went
to the code, and I see this:


https://github.com/gnuradio/gnuradio/blob/fa3267982aa88e3c8f886d803b57f9b1501f43f6/gr-fft/lib/fft.cc#L231

So this call is initializing fftw as an n-thread operation.  The call into
fftw is still synchronous, so we're asking fftw to parallelize a single FFT
internally and have it all coalesced by the time the call returns.  I can
understand why this would hurt FFT performance.
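
For what it's worth, the same distinction can be poked at outside of GNU
Radio with pyfftw (FFTW's Python bindings) - assuming it's installed; this
is just my own illustration, not anything gr-fft does:

import numpy as np
import pyfftw  # the "threads" below are FFTW's own, not Python threads

FFT_LEN = 4600
x = pyfftw.empty_aligned(FFT_LEN, dtype='complex64')
x[:] = (np.random.standard_normal(FFT_LEN)
        + 1j * np.random.standard_normal(FFT_LEN))

# One plan that FFTW itself splits across 4 threads: the synchronous
# "parallelize a single transform" mode, which (as far as I understand)
# is what the nthreads plan above amounts to.
fft_threaded = pyfftw.builders.fft(x, threads=4)

# One plain single-threaded plan: what each worker in a pool would own.
fft_single = pyfftw.builders.fft(x, threads=1)

X1 = fft_threaded()   # executes the 4-thread plan
X2 = fft_single()     # executes the single-threaded plan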

I then ran the program with 4 or 8 threads as an argument and looked at top -
but only a single processor was being used.  Maybe my gnuradio wasn't
compiled with multithreaded fftw support enabled?

For your example program with defaults, I get a rate_avg of 91
transforms/sec, but only a single core is really being used.  When I spin up
4 individual instances, each one gets around 55, for an aggregate of about
220 transforms/sec.

When I switch to 4600 for the block size and 4 separate instances, I get
around 20k transforms/sec for each, or 80k transforms/sec in aggregate.
That translates to about 368 Msamples/sec (80,000 transforms/sec × 4600
samples) - right?  Note my processor is an Intel i7-7700K.  Also note that
changing the -n input argument doesn't help me much, since it doesn't seem
that fftw is really parallelizing anything.
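
For reference, the experiment is roughly equivalent to this plain
Python/numpy stand-in (my own sketch, using numpy's FFT rather than FFTW, so
the absolute numbers won't line up with your fft_rate script, but the
scaling across processes should behave the same):

import time
import numpy as np
from concurrent.futures import ProcessPoolExecutor

FFT_LEN = 4600      # block size from above
SECONDS = 5.0       # how long each worker hammers FFTs

def transforms_per_second(_):
    # One "instance": back-to-back single-threaded FFTs on the same buffer.
    x = (np.random.standard_normal(FFT_LEN)
         + 1j * np.random.standard_normal(FFT_LEN)).astype(np.complex64)
    n, t0 = 0, time.perf_counter()
    while time.perf_counter() - t0 < SECONDS:
        np.fft.fft(x)
        n += 1
    return n / SECONDS

if __name__ == "__main__":
    for workers in (1, 4):
        with ProcessPoolExecutor(max_workers=workers) as pool:
            total = sum(pool.map(transforms_per_second, range(workers)))
        print("%d instance(s): %.0f transforms/s = %.1f Msamples/s"
              % (workers, total, total * FFT_LEN / 1e6))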

I would imagine setting -n would yield the same results as running n
instances, but since it doesn't: should the implementation be modified to
not rely on fftw for the multithreading, and instead have gnuradio handle it
via a threadpool - something like the sketch below?
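
This is only a Python stand-in for the structure I mean, not actual gr-fft
code (and with the caveat that whether plain Python *threads* scale here
depends on the FFT backend releasing the GIL - inside gr-fft's C++ that
caveat wouldn't apply): a fixed pool of workers, each doing ordinary
single-threaded transforms on independent blocks, instead of asking fftw to
split one transform across threads.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

FFT_LEN = 4600
WORKERS = 4

def transform(block):
    # Each task is one independent, single-threaded FFT; workers never
    # cooperate on the *same* transform.
    return np.fft.fft(block)

blocks = [(np.random.standard_normal(FFT_LEN)
           + 1j * np.random.standard_normal(FFT_LEN)).astype(np.complex64)
          for _ in range(1000)]

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    spectra = list(pool.map(transform, blocks))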

Marcus, can you see if running n instances versus n threads yields the same
performance results on your system?

Brian
