Tom Rondeau wrote:
> Martin Dvh wrote:
>> Eric Blossom wrote:
>>
>>> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>>>
>>>> Please see answers in-line.
>>>>
>>>> Thanks!
>>>> General curiosity questions:
>>>>
>>>> Are you using oprofile to measure performance?
>>>>
>>>> I am a bit of a maverick, and for various reasons am using a pure
>>>> C++ environment. I hacked my own 'connect_block' function (can't
>>>> wait for v3.2, where these will be part of native gr).
>>>>
>>> The trunk contains C++ code for connect, hier_block2, etc. Some of
>>> the pieces that are still missing include C++ support for the USRP
>>> daughterboards, but Johnathan Corgan is working on that now.
>>>
>>>
>>>> I am measuring the performance using a custom block (gr_throughput)
>>>> that simply reports the average number of samples processed per
>>>> second.
>>>> What h/w platform are you running on / tuning for?
>>>>
>>>> The platform is currently Intel Xeon or Core2 Duo.
>>>>
>>>> You're not trying to run your app on a cache-crippled machine like a
>>>> Celeron, are you? ;)
>>>>
>>>> No, very high end.
>>>>
>>>> Which blocks are causing you the biggest problem?
>>>>
>>>> I got a 2x improvement on all the filtering blocks.
>>>>
>>> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
>>> or the gr_fir_filter* blocks? The FFT ones are _much_ faster with a
>>> break-even point around 16 taps IIRC.
>>>
>>>
>>>> About a 40% improvement for sine/cosine generation blocks. This
>>>> includes gr_expj, gr_rotate.
>>>>
>>> No surprise there, and that's a great example of SIMD code that should
>>> be in GNU Radio.
>>>
>>>
>>>> Are your problems caused primarily by lack of CPU cycles, cache
>>>> misses or mis-predicted branches?
>>>>
>>>> I am not sure, since I am not at all a software expert (mostly
>>>> dsp/comm). My guess is that the SSE instructions are not being used
>>>> (or not used to a full extent). Even the 'multiply' block is VERY
>>>> slow compared to a vector x vector multiplication in the Intel library.
>>>>
>>> OK.
>>>
>>>
>>>> Some of the gr_blocks process each sample using a separate function
>>>> call (e.g. for (n=0; n<noutput_samples; n++)
>>>> scale(in[n])
>>>>
>>>> Replacing this with a single vectorized function call is much faster.
>>>>
>>> OK.
>>>
>>>
>>>>> We would not accept the changes.
>>>>>
>>>> That's what I expected. We'll try to contribute the more
>>>> dsp-centric blocks such as demodulators.
>>> That would be great! Or if you want to code up an SSE Taylor series
>>> expansion for sine/cosine good to 23-bits or so, we'd love that too ;)
>>>
>> I am working on this in the little spare time I have.
>> I already got a SSE taylor series for atan2, working in gnuradio.
>> The atan2 needs some code cleanup and wrapper code to switch
>> implementations (if (processor=X86, processor
>> supports_SSE2)=>optimized else generic)
>> The sin/cos is far from ready.
>>
>> Greetings,
>> Martin
>>
>
> Martin,
>
> Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year
> ago. Have you looked in this, and is the Taylor performance better?

The Taylor performance is much better when you get (a multiple of) 4
atan2s at a time, because the SSE Taylor series works with SIMD in
blocks of 4. When you only get one at a time, the performance is still
better, but not by much.

The Taylor series is also more precise than gr_fast_atan2f.cc.
I don't have the numbers at hand, but I also wrote QA and benchmark
code, so exact numbers on precision and speed can be determined.
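
To make the "blocks of 4" point concrete, here is a minimal sketch of a Taylor-series atan2 that handles the input in groups of four plus a scalar tail. It is not the code discussed in this thread: the names atan_taylor_small, atan2_taylor and atan2_batch are hypothetical, the range reduction (half-angle identity, then 8 Taylor terms) is an assumption chosen for illustration, and the four lanes are written as plain C++ rather than SSE intrinsics so the sketch stays portable. In an SSE version, each lane operation would presumably map onto one packed instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Taylor series atan(t) = t - t^3/3 + t^5/5 - ..., 8 terms in Horner form.
// Only accurate for small |t|; callers reduce the argument first.
static inline float atan_taylor_small(float t) {
    const float t2 = t * t;
    float p = -1.0f / 15.0f;
    p = p * t2 + 1.0f / 13.0f;
    p = p * t2 - 1.0f / 11.0f;
    p = p * t2 + 1.0f / 9.0f;
    p = p * t2 - 1.0f / 7.0f;
    p = p * t2 + 1.0f / 5.0f;
    p = p * t2 - 1.0f / 3.0f;
    p = p * t2 + 1.0f;
    return p * t;
}

// Hypothetical scalar kernel: reduce to the first octant, halve the angle
// with atan(r) = 2*atan(r / (1 + sqrt(1 + r^2))), then undo the reductions.
static inline float atan2_taylor(float y, float x) {
    const float pi = 3.14159265358979f;
    const float ax = std::fabs(x), ay = std::fabs(y);
    const float hi = std::max(ax, ay), lo = std::min(ax, ay);
    if (hi == 0.0f) return 0.0f;          // atan2(0, 0), sign of zero ignored
    const float r = lo / hi;              // 0 <= r <= 1
    const float t = r / (1.0f + std::sqrt(1.0f + r * r));  // t <= tan(pi/8)
    float a = 2.0f * atan_taylor_small(t);
    if (ay > ax) a = 0.5f * pi - a;       // swapped operands: mirror at pi/4
    if (x < 0.0f) a = pi - a;             // left half-plane
    return (y < 0.0f) ? -a : a;           // lower half-plane
}

// Process full groups of 4 first (the part an SSE kernel would take over),
// then finish the remaining 0..3 samples one at a time.
void atan2_batch(const float* y, const float* x, float* out, std::size_t n) {
    assert(n == 0 || (y != nullptr && x != nullptr && out != nullptr));
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t lane = 0; lane < 4; ++lane) {
            out[i + lane] = atan2_taylor(y[i + lane], x[i + lane]);
        }
    }
    for (; i < n; ++i) {
        out[i] = atan2_taylor(y[i], x[i]);
    }
}
```

With this layout a caller that can only supply one sample at a time ends up entirely in the tail loop, which matches the observation above that the gain is small in that case.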

As a side note: I have also been working on a new version of the FFT FIR
filter. This one is more efficient when decimating, because the inverse
FFT is smaller than the forward FFT:

inverse_FFT_size = forward_FFT_size / decimation

This works very well when decimation is 2^n. It also works well for most
other decimation factors, EXCEPT when decimation is a big prime.

This means the theoretical maximum speed improvement is a factor of two
(when decimation is infinite). But when you want multiple parts of the
spectrum, the speed improvement is much better than using a FIR filter
per spectrum part: you can use a single forward FFT with multiple
inverse FFTs.

Greetings,
Martin

> We really need a faster sin/cos. Glad to hear you're working on it.
>
> Tom
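
On the decimating FFT filter in the side note above: one way a smaller inverse FFT can work is to fold (alias) the N-point spectrum down to N/decimation bins before transforming back, since keeping every D-th time sample is equivalent to summing the D spectral images. The sketch below only illustrates that identity and is not the implementation from this thread. The names dft, fold_spectrum and decimate_via_small_ifft are hypothetical, a naive O(N^2) DFT stands in for a real FFT library, and the overlap handling a streaming filter needs is left out.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive O(N^2) DFT, used only to keep the sketch dependency-free.
// sign = -1 is the forward transform, sign = +1 the unscaled inverse.
cvec dft(const cvec& in, int sign) {
    const std::size_t n = in.size();
    const double pi = std::acos(-1.0);
    cvec out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double ang = sign * 2.0 * pi *
                static_cast<double>((k * t) % n) / static_cast<double>(n);
            acc += in[t] * std::polar(1.0, ang);
        }
        out[k] = acc;
    }
    return out;
}

// Fold an N-point spectrum onto N/decim bins: bin k lands on k mod (N/decim).
// The 1/decim factor keeps the amplitude of the time samples unchanged.
cvec fold_spectrum(const cvec& spec, std::size_t decim) {
    assert(decim > 0 && spec.size() % decim == 0);
    const std::size_t m = spec.size() / decim;
    cvec folded(m);  // value-initialized to zero
    for (std::size_t k = 0; k < spec.size(); ++k) {
        folded[k % m] += spec[k];
    }
    for (auto& v : folded) {
        v /= static_cast<double>(decim);
    }
    return folded;
}

// Forward transform at full size, inverse transform at size N/decim.
// In a filter, the spectrum would be multiplied by the tap spectrum
// between the forward transform and the fold.
cvec decimate_via_small_ifft(const cvec& x, std::size_t decim) {
    const cvec spec = dft(x, -1);
    const cvec folded = fold_spectrum(spec, decim);
    cvec y = dft(folded, +1);
    for (auto& v : y) {
        v /= static_cast<double>(y.size());
    }
    return y;  // y[n] equals x[n * decim]
}
```

For several parts of the spectrum, the forward transform would be computed once and only the fold and the small inverse transform repeated per band, which is where the larger saving described above comes from.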