Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Tom Rondeau wrote:
> Martin Dvh wrote:
>> Eric Blossom wrote:
>>> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>>>> Please see answers in-line.
>>>>
>>>> Thanks!
>>>>
>>>> General curiosity questions:
>>>>
>>>> Are you using oprofile to measure performance?
>>>>
>>>> I am a bit of a maverick, and for various reasons am using a pure
>>>> C++ environment. I hacked my own 'connect_block' function (can't
>>>> wait for v3.2, where these will be part of native gr).
>>>
>>> The trunk contains C++ code for connect, hier_block2, etc. Some of
>>> the pieces that are still missing include C++ support for the USRP
>>> daughterboards, but Johnathan Corgan is working on that now.
>>>
>>>> I am measuring the performance using a custom block
>>>> (gr_throughput) that simply reports the average number of samples
>>>> processed per second.
>>>>
>>>> What h/w platform are you running on / tuning for?
>>>>
>>>> The platform is currently Intel Xeon or Core2 Duo.
>>>>
>>>> You're not trying to run your app on a cache-crippled machine like
>>>> a Celeron, are you? ;)
>>>>
>>>> No, very high end.
>>>>
>>>> Which blocks are causing you the biggest problem?
>>>>
>>>> I got a 2x improvement on all the filtering blocks.
>>>
>>> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
>>> or the gr_fir_filter* blocks? The FFT ones are _much_ faster, with a
>>> break-even point around 16 taps IIRC.
>>>
>>>> About a 40% improvement for sine/cosine generation blocks. This
>>>> includes gr_expj, gr_rotate.
>>>
>>> No surprise there, and that's a great example of SIMD code that
>>> should be in GNU Radio.
>>>
>>>> Are your problems caused primarily by lack of CPU cycles, cache
>>>> misses or mis-predicted branches?
>>>>
>>>> I am not sure, since I am not at all a software expert (mostly
>>>> dsp/comm). My guess is that the SSE instructions are not being
>>>> used (or not used to a full extent). Even the 'multiply' block is
>>>> VERY slow compared to a vector x vector multiplication in the
>>>> Intel library.
>>>
>>> OK.
>>>
>>>> Some of the gr_blocks process each sample using a separate
>>>> function call (e.g.
>>>> for (n=0; n [...] scale(in[n])
>>>>
>>>> Replacing this with a single vectorized function call is much
>>>> faster.
>>>
>>> OK.
>>>
>>>>> We would not accept the changes.
>>>>
>>>> That's what I expected. We'll try to contribute the more
>>>> dsp-centric blocks such as demodulators.
>>>
>>> That would be great! Or if you want to code up an SSE Taylor series
>>> expansion for sine/cosine good to 23-bits or so, we'd love that
>>> too ;)
>>
>> I am working on this in the little spare time I have.
>> I already got an SSE Taylor series for atan2 working in gnuradio.
>> The atan2 needs some code cleanup and wrapper code to switch
>> implementations (if (processor=X86, processor
>> supports_SSE2)=>optimized else generic)
>> The sin/cos is far from ready.
>>
>> Greetings,
>> Martin
>
> Martin,
>
> Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a
> year ago. Have you looked at this, and is the Taylor performance
> better?

The Taylor performance is much better when you get (a multiple of) 4
atan2s at a time, because the SSE Taylor series works with SIMD in
blocks of 4. When you only get one at a time, the performance is still
better, but not by much.

The Taylor series is also more precise than gr_fast_atan2f.cc. I don't
have the numbers at hand, but I also wrote QA and benchmark code, so
exact numbers on precision and speed can be determined.

As a side note: I have also been working on a new version of the FFT
FIR filter. This one is more efficient when decimating:

    inverse_FFT_size = forward_FFT_size / decimation

This works very well when the decimation is 2^n. It also works well
for most other decimation factors, EXCEPT when the decimation is a big
prime.

This means the theoretical maximum speed improvement is a factor of
two (as the decimation goes to infinity). But when you want multiple
parts of the spectrum, the speed improvement is much better than using
a FIR filter per spectrum part, because you can then use a single
forward FFT with multiple inverse FFTs.

Greetings,
Martin

> We really need a faster sin/cos. Glad to hear you're working on it.
>
> Tom

___ Discuss-gnuradio mailing list Discuss-gnuradio@gnu.org http://lists.gnu.org/mailman/listinfo/discuss-gnuradio
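The relation inverse_FFT_size = forward_FFT_size / decimation in Martin's message rests on a standard DFT property: keeping every D-th time sample is equivalent to folding (aliasing) the N-bin spectrum into N/D bins and running a shorter inverse transform. The following is only a minimal sketch of that property, not Martin's filter. It uses a naive O(N^2) DFT in place of a real FFT library, ignores the overlap-save bookkeeping a streaming filter needs, and all names (`naive_dft`, `fold_spectrum`, `decimate_via_fold`) are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(N^2) DFT: forward for sign = -1, unnormalized inverse for sign = +1.
// It stands in for a real FFT only to keep the sketch self-contained.
std::vector<cplx> naive_dft(const std::vector<cplx>& x, int sign)
{
    const std::size_t n = x.size();
    const double two_pi = 2.0 * std::acos(-1.0);
    std::vector<cplx> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx acc(0.0, 0.0);
        for (std::size_t t = 0; t < n; ++t) {
            const double phase = sign * two_pi * double((k * t) % n) / double(n);
            acc += x[t] * std::polar(1.0, phase);
        }
        out[k] = acc;
    }
    return out;
}

// Fold an N-bin spectrum into M = N/D bins: Yd[k] = (1/D) * sum_m Y[k + m*M].
std::vector<cplx> fold_spectrum(const std::vector<cplx>& spectrum, std::size_t decim)
{
    assert(decim > 0 && spectrum.size() % decim == 0);
    const std::size_t m = spectrum.size() / decim;
    std::vector<cplx> folded(m, cplx(0.0, 0.0));
    for (std::size_t k = 0; k < spectrum.size(); ++k)
        folded[k % m] += spectrum[k] / double(decim);
    return folded;
}

// Long forward transform, fold, short inverse transform.
// The result equals x[0], x[D], x[2D], ... up to rounding error.
std::vector<cplx> decimate_via_fold(const std::vector<cplx>& x, std::size_t decim)
{
    const std::vector<cplx> folded = fold_spectrum(naive_dft(x, -1), decim);
    std::vector<cplx> y = naive_dft(folded, +1);
    for (cplx& v : y)
        v /= double(y.size());  // 1/M normalization of the inverse DFT
    return y;
}
```

In a filter, the folding would be applied to the product of the signal spectrum and the tap spectrum, so one forward transform can feed several short inverse transforms, which is the multi-band case Martin mentions.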
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Martin Dvh wrote:
> Eric Blossom wrote:
>> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>>> Please see answers in-line.
>>>
>>> Thanks!
>>>
>>> General curiosity questions:
>>>
>>> Are you using oprofile to measure performance?
>>>
>>> I am a bit of a maverick, and for various reasons am using a pure
>>> C++ environment. I hacked my own 'connect_block' function (can't
>>> wait for v3.2, where these will be part of native gr).
>>
>> The trunk contains C++ code for connect, hier_block2, etc. Some of
>> the pieces that are still missing include C++ support for the USRP
>> daughterboards, but Johnathan Corgan is working on that now.
>>
>>> I am measuring the performance using a custom block (gr_throughput)
>>> that simply reports the average number of samples processed per
>>> second.
>>>
>>> What h/w platform are you running on / tuning for?
>>>
>>> The platform is currently Intel Xeon or Core2 Duo.
>>>
>>> You're not trying to run your app on a cache-crippled machine like
>>> a Celeron, are you? ;)
>>>
>>> No, very high end.
>>>
>>> Which blocks are causing you the biggest problem?
>>>
>>> I got a 2x improvement on all the filtering blocks.
>>
>> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
>> or the gr_fir_filter* blocks? The FFT ones are _much_ faster, with a
>> break-even point around 16 taps IIRC.
>>
>>> About a 40% improvement for sine/cosine generation blocks. This
>>> includes gr_expj, gr_rotate.
>>
>> No surprise there, and that's a great example of SIMD code that
>> should be in GNU Radio.
>>
>>> Are your problems caused primarily by lack of CPU cycles, cache
>>> misses or mis-predicted branches?
>>>
>>> I am not sure, since I am not at all a software expert (mostly
>>> dsp/comm). My guess is that the SSE instructions are not being used
>>> (or not used to a full extent). Even the 'multiply' block is VERY
>>> slow compared to a vector x vector multiplication in the Intel
>>> library.
>>
>> OK.
>>
>>> Some of the gr_blocks process each sample using a separate function
>>> call (e.g.
>>> for (n=0; n [...] scale(in[n])
>>>
>>> Replacing this with a single vectorized function call is much
>>> faster.
>>
>> OK.
>>
>>>> We would not accept the changes.
>>>
>>> That's what I expected. We'll try to contribute the more
>>> dsp-centric blocks such as demodulators.
>>
>> That would be great! Or if you want to code up an SSE Taylor series
>> expansion for sine/cosine good to 23-bits or so, we'd love that
>> too ;)
>
> I am working on this in the little spare time I have.
> I already got an SSE Taylor series for atan2 working in gnuradio.
> The atan2 needs some code cleanup and wrapper code to switch
> implementations (if (processor=X86, processor
> supports_SSE2)=>optimized else generic)
> The sin/cos is far from ready.
>
> Greetings,
> Martin

Martin,

Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a
year ago. Have you looked at this, and is the Taylor performance better?

We really need a faster sin/cos. Glad to hear you're working on it.

Tom
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
> General curiosity questions:
>
> Are you using oprofile to measure performance?
>
> I am a bit of a maverick, and for various reasons am using a pure C++
> environment. I hacked my own 'connect_block' function (can't wait for
> v3.2, where these will be part of native gr).
>
> I am measuring the performance using a custom block (gr_throughput)
> that simply reports the average number of samples processed per
> second.

While pure C++ may be desirable for some reasons, performance is not
really one of them. When you use Python, it isn't running anything that
is really performance critical.

> Which blocks are causing you the biggest problem?
>
> I got a 2x improvement on all the filtering blocks.

That isn't surprising. I believe our SSE filtering code was optimized
for prior generations of processors, so a new Core2 optimized version
would be useful, and likely competitive with IPP.

Also, are you sure that when you compile our code with Intel's compiler
you are even getting the SSE versions? Or are the pure C++ versions
called?

Another thing, which I believe was mentioned earlier: if you really
care about FIR filter performance, you should be using the FFT versions
of the filters. The difference in performance can be huge, making the
2x you get from IPP insignificant.

> About a 40% improvement for sine/cosine generation blocks. This
> includes gr_expj, gr_rotate.

There is definitely room for improvement here.

> Are your problems caused primarily by lack of CPU cycles, cache
> misses or mis-predicted branches?
>
> I am not sure, since I am not at all a software expert (mostly
> dsp/comm). My guess is that the SSE instructions are not being used
> (or not used to a full extent). Even the 'multiply' block is VERY slow
> compared to a vector x vector multiplication in the Intel library.
>
> Some of the gr_blocks process each sample using a separate function
> call (e.g.
> for (n=0; n [...] scale(in[n])

Those function calls should be inlined if nothing else. In any case,
GCC is not vectorizing this, but it would be trivial to write it in SSE
assembly or intrinsics, which would allow this to be done in open
source code, without having to resort to IPP. That would be a very
useful contribution.

Matt
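To make Matt's suggestion concrete, here is a minimal sketch of an element-wise float multiply written with SSE intrinsics, four samples per instruction, with a scalar loop for the tail and for builds where SSE is not enabled. It is an illustration under assumptions, not GNU Radio code: the function name `multiply_ff` is hypothetical, unaligned loads are used so no buffer alignment is assumed, and no speedup is claimed without measurement.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics: _mm_loadu_ps, _mm_mul_ps, _mm_storeu_ps
#endif

// Element-wise product out[i] = a[i] * b[i] over n floats.
void multiply_ff(const float* a, const float* b, float* out, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
    std::size_t i = 0;
#if defined(__SSE__)
    // Main loop: four floats per iteration, one multiply instruction each.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(va, vb));
    }
#endif
    // Scalar tail, and the whole job when SSE is not compiled in.
    for (; i < n; ++i)
        out[i] = a[i] * b[i];
}
```

The same shape (vector main loop plus scalar tail) applies to the per-sample `scale(in[n])` loop quoted above once the per-sample call is replaced by a whole-buffer call.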
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Eric Blossom wrote:
> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>> Please see answers in-line.
>>
>> Thanks!
>
>> General curiosity questions:
>>
>> Are you using oprofile to measure performance?
>>
>> I am a bit of a maverick, and for various reasons am using a pure C++
>> environment. I hacked my own 'connect_block' function (can't wait for
>> v3.2, where these will be part of native gr).
>
> The trunk contains C++ code for connect, hier_block2, etc. Some of
> the pieces that are still missing include C++ support for the USRP
> daughterboards, but Johnathan Corgan is working on that now.
>
>> I am measuring the performance using a custom block (gr_throughput)
>> that simply reports the average number of samples processed per
>> second.
>
>> What h/w platform are you running on / tuning for?
>>
>> The platform is currently Intel Xeon or Core2 Duo.
>>
>> You're not trying to run your app on a cache-crippled machine like a
>> Celeron, are you? ;)
>>
>> No, very high end.
>>
>> Which blocks are causing you the biggest problem?
>>
>> I got a 2x improvement on all the filtering blocks.
>
> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
> or the gr_fir_filter* blocks? The FFT ones are _much_ faster, with a
> break-even point around 16 taps IIRC.
>
>> About a 40% improvement for sine/cosine generation blocks. This
>> includes gr_expj, gr_rotate.
>
> No surprise there, and that's a great example of SIMD code that should
> be in GNU Radio.
>
>> Are your problems caused primarily by lack of CPU cycles, cache
>> misses or mis-predicted branches?
>>
>> I am not sure, since I am not at all a software expert (mostly
>> dsp/comm). My guess is that the SSE instructions are not being used
>> (or not used to a full extent). Even the 'multiply' block is VERY
>> slow compared to a vector x vector multiplication in the Intel
>> library.
>
> OK.
>
>> Some of the gr_blocks process each sample using a separate function
>> call (e.g.
>> for (n=0; n [...] scale(in[n])
>>
>> Replacing this with a single vectorized function call is much faster.
>
> OK.
>
>>> We would not accept the changes.
>
>> That's what I expected. We'll try to contribute the more dsp-centric
>> blocks such as demodulators.
>
> That would be great! Or if you want to code up an SSE Taylor series
> expansion for sine/cosine good to 23-bits or so, we'd love that too ;)

I am working on this in the little spare time I have.
I already got an SSE Taylor series for atan2 working in gnuradio.
The atan2 needs some code cleanup and wrapper code to switch
implementations (if (processor=X86, processor
supports_SSE2)=>optimized else generic)
The sin/cos is far from ready.

Greetings,
Martin

> Thanks for telling us about your experience.
>
> Eric
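The wrapper Martin describes ("if x86 and SSE2 is supported, use the optimized version, else the generic one") could be sketched as a function pointer chosen once at startup. This is an assumption-laden illustration, not his code: `atan2_sse2_placeholder` is a stand-in that simply forwards to the generic path because the real SSE2 kernel is not shown in the thread, all function names are hypothetical, and the CPU check relies on the GCC/Clang built-ins `__builtin_cpu_init` and `__builtin_cpu_supports`, which exist only on x86 targets of those compilers.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Common signature for every atan2 implementation (hypothetical).
using atan2_kernel = void (*)(const float* y, const float* x, float* out, std::size_t n);

// Portable reference path: one libm call per sample.
void atan2_generic(const float* y, const float* x, float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::atan2(y[i], x[i]);
}

// Placeholder for a hand-written SSE2 kernel. It forwards to the generic
// code so that the sketch stays runnable on any machine.
void atan2_sse2_placeholder(const float* y, const float* x, float* out, std::size_t n)
{
    atan2_generic(y, x, out, n);
}

// Runtime CPU check; reports "no SSE2" on compilers or targets where the
// GCC/Clang x86 built-ins are unavailable.
bool cpu_has_sse2()
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Pick the implementation once; callers keep the returned pointer.
atan2_kernel select_atan2_kernel()
{
    return cpu_has_sse2() ? &atan2_sse2_placeholder : &atan2_generic;
}
```

Selecting once and storing the pointer keeps the CPU check out of the per-buffer path, so the dispatch costs one indirect call per block of samples.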
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
On Tue, Dec 11, 2007 at 04:03:28PM -0800, Dan Halperin wrote:
> Eugene Grayver wrote:
> > Please see answers in-line.
>
> > Which blocks are causing you the biggest problem?
> >
> > I got a 2x improvement on all the filtering blocks. About a 40%
> > improvement for sine/cosine generation blocks. This includes gr_expj,
> > gr_rotate.
>
> I should mention that gr_rotate's performance can be _greatly improved_
> by a simple change that, rather than rescaling the multiplier every
> iteration, rescales only every k iterations, e.g. k=1000. I think I
> have an earlier mailing list post about this. IIRC, the patch didn't
> go in because there seemed to be no consensus about what k to use...

Ooops, sorry about that. Let me dig through the archived discussion
and we'll get it in.

Eric
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
> Please see answers in-line.
>
> Thanks!

> General curiosity questions:
>
> Are you using oprofile to measure performance?
>
> I am a bit of a maverick, and for various reasons am using a pure C++
> environment. I hacked my own 'connect_block' function (can't wait for
> v3.2, where these will be part of native gr).

The trunk contains C++ code for connect, hier_block2, etc. Some of
the pieces that are still missing include C++ support for the USRP
daughterboards, but Johnathan Corgan is working on that now.

> I am measuring the performance using a custom block (gr_throughput)
> that simply reports the average number of samples processed per
> second.

> What h/w platform are you running on / tuning for?
>
> The platform is currently Intel Xeon or Core2 Duo.
>
> You're not trying to run your app on a cache-crippled machine like a
> Celeron, are you? ;)
>
> No, very high end.
>
> Which blocks are causing you the biggest problem?
>
> I got a 2x improvement on all the filtering blocks.

If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
or the gr_fir_filter* blocks? The FFT ones are _much_ faster, with a
break-even point around 16 taps IIRC.

> About a 40% improvement for sine/cosine generation blocks. This
> includes gr_expj, gr_rotate.

No surprise there, and that's a great example of SIMD code that should
be in GNU Radio.

> Are your problems caused primarily by lack of CPU cycles, cache
> misses or mis-predicted branches?
>
> I am not sure, since I am not at all a software expert (mostly
> dsp/comm). My guess is that the SSE instructions are not being used
> (or not used to a full extent). Even the 'multiply' block is VERY slow
> compared to a vector x vector multiplication in the Intel library.

OK.

> Some of the gr_blocks process each sample using a separate function
> call (e.g.
> for (n=0; n [...] scale(in[n])
>
> Replacing this with a single vectorized function call is much faster.

OK.

> > We would not accept the changes.

> That's what I expected. We'll try to contribute the more dsp-centric
> blocks such as demodulators.

That would be great! Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we'd love that too ;)

Thanks for telling us about your experience.

Eric
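As a starting point for the Taylor-series sine Eric asks for, here is a minimal scalar sketch: the series through the x^9 term, evaluated in Horner form with no branches, applied over a buffer in a plain loop that a compiler may vectorize or that could be ported to SSE intrinsics by hand. It makes several simplifying assumptions: inputs are already reduced to |x| <= pi/4 (no range reduction is included), no cosine is provided, and it is not claimed to reach the 23-bit accuracy mentioned in the thread. The names `sin_taylor9` and `sin_taylor9_block` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// sin(x) ~= x - x^3/3! + x^5/5! - x^7/7! + x^9/9!, Horner form in x^2.
// Intended only for |x| <= pi/4; range reduction is deliberately left out.
inline float sin_taylor9(float x)
{
    const float x2 = x * x;
    float p = 1.0f / 362880.0f;   //  1/9!
    p = p * x2 - 1.0f / 5040.0f;  // -1/7!
    p = p * x2 + 1.0f / 120.0f;   //  1/5!
    p = p * x2 - 1.0f / 6.0f;     // -1/3!
    p = p * x2 + 1.0f;            //  leading coefficient of x
    return p * x;
}

// Whole-buffer entry point: one call per block of samples instead of one
// call per sample. The loop body has no branches or table lookups.
void sin_taylor9_block(const float* in, float* out, std::size_t n)
{
    assert(n == 0 || (in != nullptr && out != nullptr));
    for (std::size_t i = 0; i < n; ++i)
        out[i] = sin_taylor9(in[i]);
}
```

For |x| <= pi/4 the first omitted term, x^11/11!, is below about 2e-9, so in single precision the error is dominated by rounding rather than by truncation of the series.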
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Eugene Grayver wrote:
> Please see answers in-line.

> Which blocks are causing you the biggest problem?
>
> I got a 2x improvement on all the filtering blocks. About a 40%
> improvement for sine/cosine generation blocks. This includes gr_expj,
> gr_rotate.

I should mention that gr_rotate's performance can be _greatly improved_
by a simple change that, rather than rescaling the multiplier every
iteration, rescales only every k iterations, e.g. k=1000. I think I
have an earlier mailing list post about this. IIRC, the patch didn't go
in because there seemed to be no consensus about what k to use...

> We would not accept the changes. Part of what we're up to is building
> an ever expanding universe of free code. Instead of using the
> non-free IPP code, please consider using a free library such as ATLAS,
> or help us find and fix performance challenges in a way that doesn't
> require non-free code. Also, are you sure that your performance
> issues can't be better addressed with an algorithmic change? If
> you're using a lot of very low-level blocks (e.g., add, multiply,
> etc.) you're probably better off writing a block that aggregates some
> of the operations into a single block.
>
> That's what I expected. We'll try to contribute the more dsp-centric
> blocks such as demodulators.

That said, if you put the code (and/or the modified makefiles) up
somewhere, I'm sure there are some users that would benefit even if it
doesn't make it into the main release.

-Dan
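Dan's change can be sketched as a phasor rotator that renormalizes its running multiplier only once every k samples instead of on every sample. This is a hedged illustration, not the actual gr_rotate source or Dan's patch: the class name `rotator_sketch`, its members, and the choice to normalize by dividing by the magnitude are all assumptions, and the right value of k is left open, as it was in the thread.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>

// Multiplies each input sample by a unit phasor that advances by a fixed
// phase increment per sample. Rounding makes the phasor's magnitude drift,
// so it is pulled back to 1 every `renorm_interval` samples.
class rotator_sketch
{
public:
    rotator_sketch(std::complex<float> phase_incr, unsigned renorm_interval)
        : d_phase(1.0f, 0.0f), d_incr(phase_incr),
          d_interval(renorm_interval), d_count(0)
    {
        assert(renorm_interval > 0);
    }

    std::complex<float> rotate(std::complex<float> in)
    {
        const std::complex<float> out = in * d_phase;
        d_phase *= d_incr;                 // advance the phasor
        if (++d_count == d_interval) {     // the costly step, once per k samples
            d_phase /= std::abs(d_phase);  // restore unit magnitude
            d_count = 0;
        }
        return out;
    }

    std::complex<float> phase() const { return d_phase; }

private:
    std::complex<float> d_phase;  // current rotation multiplier
    std::complex<float> d_incr;   // per-sample phase increment, |d_incr| ~= 1
    unsigned d_interval;          // k: samples between renormalizations
    unsigned d_count;             // samples since the last renormalization
};
```

Between renormalizations the magnitude error can accumulate over at most k multiplications, so k trades a small, bounded amplitude error against the cost of the magnitude computation and division.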
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
Please see answers in-line.

Thanks!

From: Eric Blossom <[EMAIL PROTECTED]>
Date: 12/11/2007 02:31 PM
To: Eugene Grayver <[EMAIL PROTECTED]>
cc: discuss-gnuradio@gnu.org
Subject: Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

> On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> > Hello,
> >
> > We are working on some systems that require high sampling rates. I am
> > already using the Intel C++ compiler at the highest optimization ratio,
> > but a lot of the blocks are very slow still. It appears that intel C++
> > does not properly vectorize data type.
>
> General curiosity questions:
>
> Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++
environment. I hacked my own 'connect_block' function (can't wait for
v3.2, where these will be part of native gr).

I am measuring the performance using a custom block (gr_throughput)
that simply reports the average number of samples processed per second.

> What h/w platform are you running on / tuning for?

The platform is currently Intel Xeon or Core2 Duo.

> You're not trying to run your app on a cache-crippled machine like a
> Celeron, are you? ;)

No, very high end.

> Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.

About a 40% improvement for sine/cosine generation blocks. This
includes gr_expj, gr_rotate.

> Are your problems caused primarily by lack of CPU cycles, cache
> misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly
dsp/comm). My guess is that the SSE instructions are not being used (or
not used to a full extent). Even the 'multiply' block is VERY slow
compared to a vector x vector multiplication in the Intel library.

Some of the gr_blocks process each sample using a separate function
call (e.g.
for (n=0; n [...] scale(in[n])

Replacing this with a single vectorized function call is much faster.

> > I have been replacing almost every low level block with a functionally
> > equivalent one using the intel performance libraries (IPP). These
> > libraries are not GPL, but are free for noncommercial use under Linux
> > ($200 otherwise). At some point, I would like to contribute our work
> > back to gnuradio. Would this fit with the gr philosophy? How should
> > we structure the code? (i.e. have a separate set of files, use
> > #defines, or ...)?
> >
> > Eugene
>
> We would not accept the changes. Part of what we're up to is building
> an ever expanding universe of free code. Instead of using the
> non-free IPP code, please consider using a free library such as ATLAS,
> or help us find and fix performance challenges in a way that doesn't
> require non-free code. Also, are you sure that your performance
> issues can't be better addressed with an algorithmic change? If
> you're using a lot of very low-level blocks (e.g., add, multiply,
> etc.) you're probably better off writing a block that aggregates some
> of the operations into a single block.

That's what I expected. We'll try to contribute the more dsp-centric
blocks such as demodulators.

> Eric
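Eugene's gr_throughput block is not shown in the thread, so the following is only a guess at the bookkeeping such a block needs: count the samples that pass through and divide by the elapsed wall-clock time. The class name `throughput_meter` and its interface are hypothetical, and the sketch is a plain C++ helper rather than a GNU Radio block.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Accumulates a sample count and reports the average rate since construction.
class throughput_meter
{
public:
    using clock = std::chrono::steady_clock;  // monotonic, safe for intervals

    throughput_meter() : d_start(clock::now()), d_samples(0) {}

    // Call once per processed buffer with the number of samples in it.
    void add(std::uint64_t nsamples) { d_samples += nsamples; }

    std::uint64_t samples() const { return d_samples; }

    // Pure helper so the arithmetic can be checked without waiting on a clock.
    static double rate(std::uint64_t samples, double seconds)
    {
        return seconds > 0.0 ? double(samples) / seconds : 0.0;
    }

    // Average samples per second since the meter was created.
    double samples_per_second() const
    {
        const std::chrono::duration<double> elapsed = clock::now() - d_start;
        return rate(d_samples, elapsed.count());
    }

private:
    clock::time_point d_start;
    std::uint64_t d_samples;
};
```

A number like this shows which flow graph is faster overall, but not why. A profiler such as oprofile, which Eric asks about, is what separates lack of CPU cycles from cache misses and mispredicted branches.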
Re: [Discuss-gnuradio] Re-writing blocks using intel libraries
On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> Hello,
>
> We are working on some systems that require high sampling rates. I am
> already using the Intel C++ compiler at the highest optimization ratio,
> but a lot of the blocks are very slow still. It appears that intel C++
> does not properly vectorize data type.

General curiosity questions:

Are you using oprofile to measure performance?

What h/w platform are you running on / tuning for?

You're not trying to run your app on a cache-crippled machine like a
Celeron, are you? ;)

Which blocks are causing you the biggest problem?

Are your problems caused primarily by lack of CPU cycles, cache
misses or mis-predicted branches?

> I have been replacing almost every low level block with a functionally
> equivalent one using the intel performance libraries (IPP). These
> libraries are not GPL, but are free for noncommercial use under Linux
> ($200 otherwise). At some point, I would like to contribute our work
> back to gnuradio. Would this fit with the gr philosophy? How should we
> structure the code? (i.e. have a separate set of files, use #defines,
> or ...)?
>
> Eugene

We would not accept the changes. Part of what we're up to is building
an ever expanding universe of free code. Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn't
require non-free code. Also, are you sure that your performance
issues can't be better addressed with an algorithmic change? If
you're using a lot of very low-level blocks (e.g., add, multiply,
etc.) you're probably better off writing a block that aggregates some
of the operations into a single block.

Eric
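Eric's point about aggregating low-level blocks can be illustrated with a multiply followed by an add. Chaining two single-operation stages writes and re-reads an intermediate buffer, while one aggregated stage makes a single pass over the data. The sketch below uses plain functions on `std::vector<float>` rather than real GNU Radio blocks, and the names `multiply_then_add` and `multiply_add_fused` are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two chained "blocks": a multiply stage fills an intermediate buffer,
// then an add stage reads it back.
std::vector<float> multiply_then_add(const std::vector<float>& a,
                                     const std::vector<float>& b,
                                     const std::vector<float>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> product(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        product[i] = a[i] * b[i];       // stage 1: multiply
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = product[i] + c[i];     // stage 2: add
    return out;
}

// One aggregated "block": same arithmetic, one pass, no intermediate buffer.
std::vector<float> multiply_add_fused(const std::vector<float>& a,
                                      const std::vector<float>& b,
                                      const std::vector<float>& c)
{
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] * b[i] + c[i];
    return out;
}
```

In a real flow graph the chained version also pays per-block scheduling and buffer-management overhead, which this sketch does not model.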
[Discuss-gnuradio] Re-writing blocks using intel libraries
Hello,

We are working on some systems that require high sampling rates. I am
already using the Intel C++ compiler at the highest optimization ratio,
but a lot of the blocks are very slow still. It appears that intel C++
does not properly vectorize data type.

I have been replacing almost every low level block with a functionally
equivalent one using the intel performance libraries (IPP). These
libraries are not GPL, but are free for noncommercial use under Linux
($200 otherwise). At some point, I would like to contribute our work
back to gnuradio. Would this fit with the gr philosophy? How should we
structure the code? (i.e. have a separate set of files, use #defines,
or ...)?

Eugene

Eugene Grayver, Ph.D.
Aerospace Corp., Sr. Eng. Spec.
Tel: 310.336.1274