On Wed, Feb 17, 2021 at 12:02 PM Marcus Müller <muel...@kit.edu> wrote:
> Hi Brian,
>
> On 17.02.21 16:55, Brian Padalino wrote:
> > On Wed, Feb 17, 2021 at 10:05 AM Marcus Müller <muel...@kit.edu> wrote:
> >
> > > Oh, sorry, didn't mean to imply that! FFT interpolation might work well
> > > (if you can live with the sinc sidelobes).
> >
> > If the bandwidth is already constrained to 20 MHz/23 MHz, then there would
> > be no sidelobes - correct?
>
> What the FFT resampler does is, in the end, interpolation with 23 sincs,
> each at the "original" spectral position.
> These sincs' sidelobes of course extend left and right beyond the 23
> original "carriers" (thinking of OFDM here)!

Just to clarify: 20 carriers are occupied, but the sample rate extends out to
23 carriers. If we assume the edge of the occupied carriers is already at a
stopband and the signal is sufficiently bandlimited, the sinc sidelobes would
never rise above the already established stopband - no extra spectral growth,
really, just an extension of the stopband. Right?

> The question here is whether that matters: if the bandwidth of interest is
> -10 to +10 MHz, what do I care about what happens beyond that? The answer to
> that question lies in Mark's application :)

The level of any newly created spectrum should be at or below the already
established stopband. Right?

> > > I do have a question, though: Why do you go for a 4600-FFT? Why not
> > > simply a 23-FFT?
> >
> > Larger n yields greater efficiencies, right? Doing lots of small calls
> > isn't necessarily as efficient as doing fewer large calls, so long as you
> > can handle the latency and the processor can stay fed with data. I figure
> > ~4000 samples is a good compromise, but maybe larger ones work out better,
> > too.
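To picture the FFT-interpolation point above, here's a minimal sketch. A naive pure-Python DFT stands in for FFTW (fine for tiny lengths), and the interpolation factor is assumed to be an integer: zero-stuffing the middle of the spectrum turns each original bin into a sinc at its spectral position in the longer transform, and a tone that stays inside the occupied band lands back exactly on the original sample grid.

```python
import cmath

def dft(x):
    """Naive O(N^2) DFT, a stand-in for FFTW at these tiny lengths."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N)
                for k in range(N)) / N for n in range(N)]

def fft_interpolate(x, L):
    """Interpolate x by integer factor L by zero-stuffing the spectrum.

    Each original bin becomes one sinc at the "original" spectral
    position in the longer transform - the FFT-resampler picture.
    """
    N = len(x)
    X = dft(x)
    h = N // 2
    if N % 2 == 0:
        # even N: share the Nyquist bin half/half to keep symmetry
        X[h] /= 2
        Xp = X[:h + 1] + [0j] * ((L - 1) * N - 1) + [X[h]] + X[h + 1:]
    else:
        Xp = X[:h + 1] + [0j] * ((L - 1) * N) + X[h + 1:]
    return [v * L for v in idft(Xp)]  # rescale to preserve amplitude

# A bandlimited tone interpolated 2x: the original samples reappear
# at every L-th output index.
N, L = 23, 2
x = [cmath.exp(2j * cmath.pi * 3 * n / N) for n in range(N)]
y = fft_interpolate(x, L)
assert all(abs(y[L * n] - x[n]) < 1e-9 for n in range(N))
```

For an in-band tone like this there is no visible sidelobe energy at all; the sidelobe question only bites for content near (or past) the band edge.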
> Well, quick calc, and acting as if O-notation was actually meaningful on
> hardware (see David's answer, he's got a point about caches: on my CPU, a
> 32×32-bit multiplication instruction (non-SIMD) takes about 1/400 of the
> time of fetching memory from a random position in RAM (a cache miss), on
> average, IIRC).
>
> If we've got M samples to process, and we're doing N_1- or
> N_2 = 100·N_1-length transforms, the complexity of the first is
>
> a = O( (M/N_1) · N_1·log(N_1) ) = O( M·log(N_1) )
>
> and of the second, it'd be
>
> O( (M/(100·N_1)) · 100·N_1·log(100·N_1) ) = O( M·log(100·N_1) )
>                                           = O( M·log(N_1) + M·log(100) )
>
> which is worse, but:
>
> > For reference, here are some performance benchmarks for FFTW (randomly
> > chosen processor):
> >
> > http://www.fftw.org/speed/Pentium4-3.06GHz/
> >
> > Single-precision, non-power-of-2, 1d transforms are good benchmarks to
> > look at for something like this. Obviously some numbers are better than
> > others, but the general trend seems to skew towards the mid-thousands as a
> > decent number to target for maximizing throughput.
>
> Real-world data, and I fully agree, always trumps O-notation (which simply
> is a bad model for any real-world problem, usually).
>
> Luckily, I've had this same problem before, so... see attachment!
>
> You should be able to call it like
>
> fft_rate -l 46
>
> to set the FFT length to 46, and add -n 10 to make the FFT use 10 threads
> (which usually hurts more than it helps). (You need to multiply the reported
> rate by the FFT length; it counts transforms, not samples.)
>
> The single-threaded 2²⁴-FFT I do runs at ca. 34.5 MS/s throughput. If I run
>
> fft_rate -l $(( 23*2*100 ))
>
> I get roughly 87.4 MS/s.
> If I run
>
> fft_rate -l $(( 23*2 ))
>
> I get ca. 92 MS/s.
>
> I'd call this a clear "meh" result!

Fair enough regarding the smaller FFT size.
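For what it's worth, the quoted operation-count argument can be checked numerically. This is just the O-notation model from above (cost ≈ M·log2(N) total for length-N transforms over M samples), not a benchmark, and M is an arbitrary placeholder:

```python
import math

def fft_ops(M, N):
    """Operation-count model from the thread: (M/N) transforms,
    each costing ~ N*log2(N) operations, i.e. M*log2(N) in total."""
    return (M / N) * N * math.log2(N)

M = 10_000_000   # samples to process; cancels out of the ratio
ratio = fft_ops(M, 4600) / fft_ops(M, 46)
# ratio reduces to log2(4600)/log2(46), roughly 2.2: the length-4600
# plan does about twice the arithmetic per sample, which cache and
# SIMD effects on real hardware can still outweigh (or not, per the
# "meh" measurements above).
```

So the model predicts the 46-point plan should win by about 2x, while the measured gap (92 vs. 87.4 MS/s) is far smaller - consistent with Marcus's point that O-notation is a bad model for real hardware.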
I didn't think it would work out this way, and it got me scratching my head. I
wouldn't have imagined that parallelizing a bunch of FFTs would make things
worse. Then I went to the code, and I see this:

https://github.com/gnuradio/gnuradio/blob/fa3267982aa88e3c8f886d803b57f9b1501f43f6/gr-fft/lib/fft.cc#L231

So this call initializes FFTW as an n-thread operation. The call into FFTW is
still synchronous, so we're asking FFTW to parallelize a single FFT and ensure
it's all coalesced by the time the call returns. I can understand why that
would hurt FFT performance.

I then ran the program with 4 or 8 threads as an argument and looked at top -
but only a single processor was being used. Maybe my GNU Radio wasn't compiled
with multithreaded FFTW enabled?

For your example program with defaults, I get a rate_avg of 91 transforms/sec,
but only a single core is really being used. When I spin up 4 individual
instances, each one gets around 55, for an aggregate of 220 transforms/sec.
When I switch to a block size of 4600 and 4 separate instances, I get around
20k transforms/sec each, or 80k transforms/sec in aggregate. That translates
to 368 Msamples/sec - right?

Note my processor is an Intel i7-7700K. Also note that changing the n input
argument doesn't help me much, since it doesn't seem that FFTW is really
parallelizing anything. I would expect setting n to yield the same results as
running n instances, but since it doesn't, should the implementation be
modified to not rely on FFTW for the multithreading, and instead have GNU
Radio handle it via a thread pool?

Marcus, can you see if running n instances versus n threads yields the same
performance results on your system?

Brian
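To make the thread-pool idea concrete, here's a hedged Python sketch of the dispatch pattern I mean: each worker executes whole transforms independently (one transform per task), instead of FFTW's planner threads cooperating on a single transform. The naive DFT is only a stand-in for a single-threaded FFTW plan, and in CPython the GIL prevents real parallel speedup for pure-Python work - a real implementation would be C++ with one single-threaded FFTW plan per thread, which is roughly what running n separate instances approximates.

```python
from concurrent.futures import ThreadPoolExecutor
import cmath

def dft(x):
    """Naive DFT, standing in for executing one single-threaded
    FFTW plan; no cross-thread coordination inside one transform."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                for n in range(N)) for k in range(N)]

# 8 independent blocks of 46 samples: one complete transform per task
blocks = [[complex(n, 0) for n in range(46)] for _ in range(8)]

# the pool hands whole transforms to workers, unlike fftw's
# internal threading which splits a single transform across threads
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(dft, blocks))

# sanity check: the DC bin of each result is the block sum,
# 0 + 1 + ... + 45 = 1035
assert all(abs(r[0] - 1035) < 1e-6 for r in results)
```

The point of the pattern is that throughput should then scale with cores the way my 4-instance experiment did, without any per-transform synchronization overhead.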