Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-12 Thread Martin Dvh
Tom Rondeau wrote:
> Martin Dvh wrote:
>> Eric Blossom wrote:
>>  
>>> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>>>
>>>> Please see answers in-line.
>>>>
>>>> Thanks!
>>>>
>>>> General curiosity questions:
>>>>
>>>>   Are you using oprofile to measure performance?
>>>>
>>>> I am a bit of a maverick, and for various reasons am using a pure
>>>> C++ environment.  I hacked my own 'connect_block' function (can't
>>>> wait for v3.2, where these will be part of native gr).
>>>>
>>> The trunk contains C++ code for connect, hier_block2, etc.  Some of
>>> the pieces that are still missing include C++ support for the USRP
>>> daughterboards, but Johnathan Corgan is working on that now.
>>>
>>>
>>>> I am measuring the performance using a custom block (gr_throughput)
>>>> that simply reports the average number of samples processed per
>>>> second.
>>>>
>>>>   What h/w platform are you running on / tuning for?
>>>>
>>>> The platform is currently Intel Xeon or Core2 Duo.
>>>>
>>>>   You're not trying to run your app on a cache-crippled machine like a
>>>>   Celeron, are you?  ;)
>>>>
>>>> No, very high end.
>>>>
>>>>   Which blocks are causing you the biggest problem?
>>>>
>>>> I got a 2x improvement on all the filtering blocks.
>>>>
>>> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
>>> or the gr_fir_filter* blocks?  The FFT ones are _much_ faster with a
>>> break-even point around 16 taps IIRC.
>>>
>>>
>>>> About a 40% improvement for sine/cosine generation blocks.  This
>>>> includes gr_expj, gr_rotate.
>>>>
>>> No surprise there, and that's a great example of SIMD code that should
>>> be in GNU Radio.
>>>
>>>
>>>>   Are your problems caused primarily by lack of CPU cycles, cache
>>>>   misses or mis-predicted branches?
>>>>
>>>> I am not sure, since I am not at all a software expert (mostly
>>>> dsp/comm). My guess is that the SSE instructions are not being used
>>>> (or not used to a full extent).  Even the 'multiply' block is VERY
>>>> slow compared to a vector x vector multiplication in the Intel library.
>>>>
>>> OK.
>>>
>>>
>>>> Some of the gr_blocks process each sample using a separate function
>>>> call (e.g. for (n=0; n<N; n++) scale(in[n])).
>>>>
>>>> Replacing this with a single vectorized function call is much faster.
>>>>
>>> OK.
>>>
>>>
>>>>> We would not accept the changes.
>>>>>
>>>> That's what I expected.  We'll try to contribute the more
>>>> dsp-centric blocks such as demodulators.
>>> That would be great!  Or if you want to code up an SSE Taylor series
>>> expansion for sine/cosine good to 23-bits or so, we'd love that too ;)
>>> 
>> I am working on this in the little spare time I have.
>> I already have an SSE Taylor series for atan2 working in gnuradio.
>> The atan2 needs some code cleanup and wrapper code to switch
>> implementations (if the processor is x86 and supports SSE2 => optimized,
>> else generic).
>> The sin/cos is far from ready.
>>
>> Greetings,
>> Martin
>>   
> 
> Martin,
> 
> Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year
> ago. Have you looked at this, and is the Taylor performance better?
The Taylor performance is much better when you get (a multiple of) 4 atan2s at
a time, because the SSE Taylor series works with SIMD in blocks of 4.
When you only get one at a time, the performance is still better, but not by
much.
The Taylor series is also more precise than gr_fast_atan2f.cc.
I don't have the numbers at hand, but I also wrote QA and benchmark code, so
exact numbers on precision and speed can be determined.

As a side note:
I have also been working on a new version of the FFT FIR filter.
This one is more efficient when decimating, because the inverse FFT can be
smaller than the forward FFT:
inverse_FFT_size = forward_FFT_size / decimation
This works very well when decimation is 2^n. It also works well for most other
decimation factors, EXCEPT when decimation is a big prime.

Since only the inverse FFT shrinks, the theoretical maximum speed improvement
is a factor of two (when decimation is infinite).
But when you want multiple parts of the spectrum, the speed improvement is
much better than using a FIR filter per spectrum part, because you can use a
single forward FFT with multiple inverse FFTs.
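
A rough back-of-the-envelope check of the factor-two limit, as a sketch under stated assumptions rather than a measurement: model an FFT of size N as costing C(N) = N log2 N operations, and ignore the spectrum multiply and any folding of the spectrum before the smaller inverse transform. The symbols N (forward FFT size), D (decimation), M (number of spectrum parts) and S (speed-up relative to a same-size forward and inverse FFT per output band) are introduced here for illustration only.

```latex
\begin{align*}
C(N)   &= N \log_2 N \\
S(D)   &= \frac{2\,C(N)}{C(N) + C(N/D)}
        = \frac{2 N \log_2 N}{N \log_2 N + \frac{N}{D}\log_2\frac{N}{D}}
        \;\longrightarrow\; 2 \quad \text{as } D \to N \\
S_M(D) &= \frac{2M\,C(N)}{C(N) + M\,C(N/D)}
\end{align*}
```

Under this model, N = 1024 and D = 4 give C(1024) = 10240 and C(256) = 2048, so S is about 1.67. With M = 4 parts sharing one forward FFT, S_M is about 4.4, which is why the multi-band case gains more than the single-band limit of two.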

Greetings,
Martin

> We really need a faster sin/cos. Glad to hear you're working on it.
> 
> Tom

> 
> 



___
Discuss-gnuradio mailing list
Discuss-gnuradio@gnu.org
http://lists.gnu.org/mailman/listinfo/discuss-gnuradio


Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-12 Thread Tom Rondeau

Martin Dvh wrote:

Eric Blossom wrote:
  

On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:


Please see answers in-line.

Thanks!
  
General curiosity questions:


  Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++ 
environment.  I hacked my own 'connect_block' function (can't wait for 
v3.2, where these will be part of native gr).
  

The trunk contains C++ code for connect, hier_block2, etc.  Some of
the pieces that are still missing include C++ support for the USRP
daughterboards, but Johnathan Corgan is working on that now.



I am measuring the performance using a custom block (gr_throughput)
that simply reports the average number of samples processed per
second.
  
  What h/w platform are you running on / tuning for?


The platform is currently Intel Xeon or Core2 Duo.

  You're not trying to run your app on a cache-crippled machine like a
  Celeron, are you?  ;)

No, very high end.

  Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.
  

If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
or the gr_fir_filter* blocks?  The FFT ones are _much_ faster with a
break-even point around 16 taps IIRC.



About a 40% improvement for sine/cosine generation blocks.  This
includes gr_expj, gr_rotate.
  

No surprise there, and that's a great example of SIMD code that should
be in GNU Radio.



  Are your problems caused primarily by lack of CPU cycles, cache
  misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly dsp/comm). 
My guess is that the SSE instructions are not being used (or not used to a 
full extent).  Even the 'multiply' block is VERY slow compared to a vector 
x vector multiplication in the Intel library.
  

OK.


Some of the gr_blocks 
process each sample using a separate function call (e.g. 
for (n=0; n<N; n++) scale(in[n])).

Replacing this with a single vectorized function call is much faster.
  

OK.



We would not accept the changes.

That's what I expected.  We'll try to contribute the more dsp-centric 
blocks such as demodulators. 
  

That would be great!  Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we'd love that too ;)


I am working on this in the little spare time I have.
I already have an SSE Taylor series for atan2 working in gnuradio.
The atan2 needs some code cleanup and wrapper code to switch implementations
(if the processor is x86 and supports SSE2 => optimized, else generic).
The sin/cos is far from ready.

Greetings,
Martin
  


Martin,

Bob put in a fast atan function (general/gr_fast_atan2f.cc) about a year 
ago. Have you looked at this, and is the Taylor performance better?


We really need a faster sin/cos. Glad to hear you're working on it.

Tom





Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Matt Ettus



General curiosity questions:

 Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++ 
environment.  I hacked my own 'connect_block' function (can't wait for 
v3.2, where these will be part of native gr).  I am measuring the 
performance using a custom block (gr_throughput) that simply reports 
the average number of samples processed per second.


While pure C++ may be desirable for some reasons, performance is not 
really one of them.  When you use Python, it isn't running anything that 
is really performance critical.


 Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.


That isn't surprising.  I believe our SSE filtering code was optimized 
for prior generations of processors, so a new Core2 optimized version 
would be useful, and likely competitive with IPP.  Also, are you sure 
that when you compile our code with Intel's compiler that you are even 
getting the SSE versions?  Or are the pure C++ versions called?


Another thing, which I believe was mentioned earlier -- if you really 
care about FIR filter performance, you should be using the FFT versions 
of the filters.  The difference in performance can be huge, making the 
2x you get from IPP insignificant.


 About a 40% improvement for sine/cosine generation blocks.  This 
includes gr_expj, gr_rotate.

There is definitely room for improvement here.


 Are your problems caused primarily by lack of CPU cycles, cache
 misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly 
dsp/comm).  My guess is that the SSE instructions are not being used 
(or not used to a full extent).  Even the 'multiply' block is VERY 
slow compared to a vector x vector multiplication in the Intel 
library.  Some of the gr_blocks process each sample using a separate 
function call (e.g.

for (n=0; n<N; n++) scale(in[n]))

Those function calls should be inlined if nothing else.

In any case,  GCC is not vectorizing this, but it would be trivial to 
write it in SSE or intrinsics, which would allow this to be done in open 
source code, without having to resort to IPP.  That would be a very 
useful contribution.
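
As an illustration of the kind of intrinsics code meant here, the following is a minimal sketch that scales a float buffer four samples at a time with SSE and falls back to a plain loop elsewhere. The function name `scale_vec` and its signature are hypothetical and are not taken from GNU Radio; only the `_mm_*` intrinsics from `<xmmintrin.h>` are real.

```cpp
#include <cstddef>
#if defined(__SSE__)
#include <xmmintrin.h>  // SSE intrinsics (_mm_set1_ps, _mm_mul_ps, ...)
#endif

// Hypothetical helper (not a GNU Radio function): out[i] = k * in[i].
void scale_vec(float* out, const float* in, float k, std::size_t n)
{
    std::size_t i = 0;
#if defined(__SSE__)
    const __m128 vk = _mm_set1_ps(k);            // broadcast k to 4 lanes
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);         // unaligned load of 4 floats
        _mm_storeu_ps(out + i, _mm_mul_ps(v, vk));
    }
#endif
    // Scalar tail (or the whole buffer when SSE is not available).
    for (; i < n; ++i)
        out[i] = k * in[i];
}
```

Unaligned loads and stores are used so the sketch works on any buffer; a tuned version would probably require aligned buffers and use the aligned variants.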


Matt





Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Martin Dvh
Eric Blossom wrote:
> On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
>> Please see answers in-line.
>>
>> Thanks!
> 
>> General curiosity questions:
>>
>>   Are you using oprofile to measure performance?
>>
>> I am a bit of a maverick, and for various reasons am using a pure C++ 
>> environment.  I hacked my own 'connect_block' function (can't wait for 
>> v3.2, where these will be part of native gr).
> 
> The trunk contains C++ code for connect, hier_block2, etc.  Some of
> the pieces that are still missing include C++ support for the USRP
> daughterboards, but Johnathan Corgan is working on that now.
> 
>> I am measuring the performance using a custom block (gr_throughput)
>> that simply reports the average number of samples processed per
>> second.
> 
>>   What h/w platform are you running on / tuning for?
>>
>> The platform is currently Intel Xeon or Core2 Duo.
>>
>>   You're not trying to run your app on a cache-crippled machine like a
>>   Celeron, are you?  ;)
>>
>> No, very high end.
>>
>>   Which blocks are causing you the biggest problem?
>>
>> I got a 2x improvement on all the filtering blocks.
> 
> If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
> or the gr_fir_filter* blocks?  The FFT ones are _much_ faster with a
> break-even point around 16 taps IIRC.
> 
>> About a 40% improvement for sine/cosine generation blocks.  This
>> includes gr_expj, gr_rotate.
> 
> No surprise there, and that's a great example of SIMD code that should
> be in GNU Radio.
> 
>>   Are your problems caused primarily by lack of CPU cycles, cache
>>   misses or mis-predicted branches?
>>
>> I am not sure, since I am not at all a software expert (mostly dsp/comm). 
>> My guess is that the SSE instructions are not being used (or not used to a 
>> full extent).  Even the 'multiply' block is VERY slow compared to a vector 
>> x vector multiplication in the Intel library.
> 
> OK.
> 
>> Some of the gr_blocks 
>> process each sample using a separate function call (e.g. 
>> for (n=0; n<N; n++) scale(in[n])).
>>
>> Replacing this with a single vectorized function call is much faster.
> 
> OK.
> 
>>> We would not accept the changes.
> 
>> That's what I expected.  We'll try to contribute the more dsp-centric 
>> blocks such as demodulators. 
> 
> That would be great!  Or if you want to code up an SSE Taylor series
> expansion for sine/cosine good to 23-bits or so, we'd love that too ;)
I am working on this in the little spare time I have.
I already have an SSE Taylor series for atan2 working in gnuradio.
The atan2 needs some code cleanup and wrapper code to switch implementations
(if the processor is x86 and supports SSE2 => optimized, else generic).
The sin/cos is far from ready.

Greetings,
Martin
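
The wrapper described above could look roughly like the sketch below: pick an implementation once at run time and call it through a function pointer. All names here (`atan2_vec_fn`, `atan2_vec_generic`, `atan2_vec_sse2`, `select_atan2_vec`) are hypothetical, and the "SSE2" function is only a placeholder that forwards to the generic one, since the actual Taylor-series code is not shown in this thread. The CPU check assumes GCC or Clang on x86, where `__builtin_cpu_supports` is available; elsewhere it simply reports no SSE2.

```cpp
#include <cmath>
#include <cstddef>

typedef void (*atan2_vec_fn)(float* out, const float* y, const float* x,
                             std::size_t n);

// Portable reference implementation.
static void atan2_vec_generic(float* out, const float* y, const float* x,
                              std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = std::atan2(y[i], x[i]);
}

// Placeholder for an SSE2 version; a real one would process 4 values at a
// time. Here it forwards to the generic code so the sketch stays runnable.
static void atan2_vec_sse2(float* out, const float* y, const float* x,
                           std::size_t n)
{
    atan2_vec_generic(out, y, x, n);
}

// Assumption: GCC/Clang builtin on x86; any other platform reports false.
static bool cpu_has_sse2()
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}

// Select the implementation once; callers keep the returned pointer.
atan2_vec_fn select_atan2_vec()
{
    return cpu_has_sse2() ? atan2_vec_sse2 : atan2_vec_generic;
}
```

Resolving the pointer once (for example in a block's constructor) keeps the per-call cost to a single indirect call.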




> Thanks for telling us about your experience.
> 
> Eric
> 
> 
> 





Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Eric Blossom
On Tue, Dec 11, 2007 at 04:03:28PM -0800, Dan Halperin wrote:
> Eugene Grayver wrote:
> > Please see answers in-line.
> >   Which blocks are causing you the biggest problem?
> > 
> > I got a 2x improvement on all the filtering blocks.  About a 40% 
> > improvement for sine/cosine generation blocks.  This includes gr_expj, 
> > gr_rotate.
> 
> I should mention that gr_rotate's performance can be _greatly improved_
> by a simple change that, rather than rescaling the multiplier every
> iteration, rescales every k, e.g. k=1000. I think I have an earlier
> mailing list post about this. IIRC, the patch didn't go in because there
> seemed to be no consensus about what k to use...

Ooops, sorry about that.  Let me dig through the archived discussion
and we'll get it in.

Eric




Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Eric Blossom
On Tue, Dec 11, 2007 at 03:41:46PM -0800, Eugene Grayver wrote:
> Please see answers in-line.
> 
> Thanks!

> General curiosity questions:
> 
>   Are you using oprofile to measure performance?
> 
> I am a bit of a maverick, and for various reasons am using a pure C++ 
> environment.  I hacked my own 'connect_block' function (can't wait for 
> v3.2, where these will be part of native gr).

The trunk contains C++ code for connect, hier_block2, etc.  Some of
the pieces that are still missing include C++ support for the USRP
daughterboards, but Johnathan Corgan is working on that now.

> I am measuring the performance using a custom block (gr_throughput)
> that simply reports the average number of samples processed per
> second.

>   What h/w platform are you running on / tuning for?
> 
> The platform is currently Intel Xeon or Core2 Duo.
> 
>   You're not trying to run your app on a cache-crippled machine like a
>   Celeron, are you?  ;)
> 
> No, very high end.
> 
>   Which blocks are causing you the biggest problem?
> 
> I got a 2x improvement on all the filtering blocks.

If these are FIR filters, were you using gr_fft_filter_{fff,ccc}
or the gr_fir_filter* blocks?  The FFT ones are _much_ faster with a
break-even point around 16 taps IIRC.

> About a 40% improvement for sine/cosine generation blocks.  This
> includes gr_expj, gr_rotate.

No surprise there, and that's a great example of SIMD code that should
be in GNU Radio.

>   Are your problems caused primarily by lack of CPU cycles, cache
>   misses or mis-predicted branches?
> 
> I am not sure, since I am not at all a software expert (mostly dsp/comm). 
> My guess is that the SSE instructions are not being used (or not used to a 
> full extent).  Even the 'multiply' block is VERY slow compared to a vector 
> x vector multiplication in the Intel library.

OK.

> Some of the gr_blocks 
> process each sample using a separate function call (e.g. 
> for (n=0; n<N; n++) scale(in[n])).
> 
> Replacing this with a single vectorized function call is much faster.

OK.

> > We would not accept the changes.

> That's what I expected.  We'll try to contribute the more dsp-centric 
> blocks such as demodulators. 

That would be great!  Or if you want to code up an SSE Taylor series
expansion for sine/cosine good to 23-bits or so, we'd love that too ;)

Thanks for telling us about your experience.

Eric




Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Dan Halperin

Eugene Grayver wrote:
> Please see answers in-line.
>   Which blocks are causing you the biggest problem?
> 
> I got a 2x improvement on all the filtering blocks.  About a 40% 
> improvement for sine/cosine generation blocks.  This includes gr_expj, 
> gr_rotate.

I should mention that gr_rotate's performance can be _greatly improved_
by a simple change that, rather than rescaling the multiplier every
iteration, rescales every k, e.g. k=1000. I think I have an earlier
mailing list post about this. IIRC, the patch didn't go in because there
seemed to be no consensus about what k to use...
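
A minimal sketch of that idea, assuming the rotator keeps a unit-magnitude complex phasor that is multiplied by a fixed increment for every sample. The class below is hypothetical and is not the actual gr_rotate code; the default k = 1000 is just the example value mentioned above.

```cpp
#include <complex>
#include <cstddef>

// Hypothetical rotator: multiplies each input by a running phasor and
// renormalizes the phasor only every k samples instead of every sample.
class rotator
{
public:
    explicit rotator(float phase_inc, unsigned renorm_every = 1000)
        : d_phase(1.0f, 0.0f),
          d_incr(std::polar(1.0f, phase_inc)),
          d_k(renorm_every),
          d_count(0)
    {
    }

    std::complex<float> rotate(std::complex<float> in)
    {
        std::complex<float> out = in * d_phase;
        d_phase *= d_incr;                   // advance the phase
        if (++d_count == d_k) {              // occasional magnitude fix-up
            d_phase /= std::abs(d_phase);
            d_count = 0;
        }
        return out;
    }

private:
    std::complex<float> d_phase;  // current phasor, magnitude close to 1
    std::complex<float> d_incr;   // per-sample rotation
    unsigned d_k;                 // renormalization interval
    unsigned d_count;             // samples since last renormalization
};
```

The rounding error in the phasor magnitude grows slowly per multiply, so the divide (and the square root inside `std::abs`) can be paid once per k samples; the choice of k trades accuracy for speed.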
> We would not accept the changes.  Part of what we're up to is building
> an ever expanding universe of free code.  Instead of using the
> non-free IPP code, please consider using a free library such as ATLAS,
> or help us find and fix performance challenges in a way that doesn't
> require non-free code.  Also, are you sure that your performance
> issues can't be better addressed with an algorithmic change?  If
> you're using a lot of very low-level blocks (e.g., add, multiply,
> etc.) you're probably better off writing a block that aggregates some
> of the operations into a single block.
> 
> That's what I expected.  We'll try to contribute the more dsp-centric 
> blocks such as demodulators. 

That said, if you put the code (and/or the modified makefiles) up
somewhere, I'm sure there are some users that would benefit even if it
doesn't make it into the main release.

-Dan




Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Eugene Grayver
Please see answers in-line.

Thanks!

Eric Blossom <[EMAIL PROTECTED]> wrote on 12/11/2007 02:31 PM
(To: Eugene Grayver <[EMAIL PROTECTED]>; cc: discuss-gnuradio@gnu.org):

On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> Hello,
> 
> We are working on some systems that require high sampling rates.  I am 
> already using the Intel C++ compiler at the highest optimization ratio, 
> but a lot of the blocks are very slow still.  It appears that intel C++ 
> does not properly vectorize  data type. 

General curiosity questions:

  Are you using oprofile to measure performance?

I am a bit of a maverick, and for various reasons am using a pure C++ 
environment.  I hacked my own 'connect_block' function (can't wait for 
v3.2, where these will be part of native gr).  I am measuring the 
performance using a custom block (gr_throughput) that simply reports the 
average number of samples processed per second.
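
The gr_throughput block itself is not shown in the thread. As a rough sketch of what such a measurement involves, the hypothetical class below counts samples and divides by elapsed wall-clock time; the name `throughput_meter` and its methods are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical throughput counter (not the poster's gr_throughput block).
class throughput_meter
{
    typedef std::chrono::steady_clock clock_type;

public:
    throughput_meter() : d_start(clock_type::now()), d_samples(0) {}

    // Call once per work() invocation with the number of items processed.
    void add(std::uint64_t nsamples) { d_samples += nsamples; }

    std::uint64_t samples() const { return d_samples; }

    // Average rate since construction, in samples per second.
    double samples_per_second() const
    {
        std::chrono::duration<double> dt = clock_type::now() - d_start;
        return rate(d_samples, dt.count());
    }

    // Pure helper so the arithmetic can be checked without timing.
    static double rate(std::uint64_t nsamples, double seconds)
    {
        return seconds > 0.0 ? static_cast<double>(nsamples) / seconds : 0.0;
    }

private:
    clock_type::time_point d_start;
    std::uint64_t d_samples;
};
```

A monotonic clock is used so that system time adjustments do not distort the average.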

  What h/w platform are you running on / tuning for?

The platform is currently Intel Xeon or Core2 Duo.

  You're not trying to run your app on a cache-crippled machine like a
  Celeron, are you?  ;)

No, very high end.

  Which blocks are causing you the biggest problem?

I got a 2x improvement on all the filtering blocks.  About a 40% 
improvement for sine/cosine generation blocks.  This includes gr_expj, 
gr_rotate.

  Are your problems caused primarily by lack of CPU cycles, cache
  misses or mis-predicted branches?

I am not sure, since I am not at all a software expert (mostly dsp/comm). 
My guess is that the SSE instructions are not being used (or not used to a 
full extent).  Even the 'multiply' block is VERY slow compared to a vector 
x vector multiplication in the Intel library.  Some of the gr_blocks 
process each sample using a separate function call (e.g. 
for (n=0; n<N; n++) scale(in[n])).

Replacing this with a single vectorized function call is much faster.

> I have been replacing almost every low level block with a functionally 
> equivalent using the intel performance libraries (IPP).  These libraries 
> are not GPL, but are free for noncommercial use under Linux ($200 
> otherwise).  At some point, I would like to contribute our work back to 
> gnuradio.  Would this fit with the gr philosophy?  How should we structure 
> the code?  (i.e. have a separate set of files, use #defines, or ...)?
> 
> Eugene

We would not accept the changes.  Part of what we're up to is building
an ever expanding universe of free code.  Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn't
require non-free code.  Also, are you sure that your performance
issues can't be better addressed with an algorithmic change?  If
you're using a lot of very low-level blocks (e.g., add, multiply,
etc.) you're probably better off writing a block that aggregates some
of the operations into a single block.

That's what I expected.  We'll try to contribute the more dsp-centric 
blocks such as demodulators. 



Eric



Re: [Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Eric Blossom
On Tue, Dec 11, 2007 at 10:13:32AM -0800, Eugene Grayver wrote:
> Hello,
> 
> We are working on some systems that require high sampling rates.  I am 
> already using the Intel C++ compiler at the highest optimization ratio, 
> but a lot of the blocks are very slow still.  It appears that intel C++ 
> does not properly vectorize  data type. 

General curiosity questions:

  Are you using oprofile to measure performance?

  What h/w platform are you running on / tuning for?

  You're not trying to run your app on a cache-crippled machine like a
  Celeron, are you?  ;)

  Which blocks are causing you the biggest problem?

  Are your problems caused primarily by lack of CPU cycles, cache
  misses or mis-predicted branches?

> I have been replacing almost every low level block with a functionally 
> equivalent using the intel performance libraries (IPP).  These libraries 
> are not GPL, but are free for noncommercial use under Linux ($200 
> otherwise).  At some point, I would like to contribute our work back to 
> gnuradio.  Would this fit with the gr philosophy?  How should we structure 
> the code?  (i.e. have a separate set of files, use #defines, or ...)?
> 
> Eugene

We would not accept the changes.  Part of what we're up to is building
an ever expanding universe of free code.  Instead of using the
non-free IPP code, please consider using a free library such as ATLAS,
or help us find and fix performance challenges in a way that doesn't
require non-free code.  Also, are you sure that your performance
issues can't be better addressed with an algorithmic change?  If
you're using a lot of very low-level blocks (e.g., add, multiply,
etc.) you're probably better off writing a block that aggregates some
of the operations into a single block.
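
To illustrate the aggregation point, here is a small sketch comparing a multiply followed by an add done as two separate passes (as two chained low-level blocks would do, with an intermediate buffer) against one fused loop. The function names are hypothetical and this is plain C++, not GNU Radio block code.

```cpp
#include <cstddef>
#include <vector>

// Two passes with an intermediate buffer, similar in spirit to chaining a
// 'multiply' block into an 'add' block.
void multiply_then_add(float* out, const float* a, const float* b,
                       const float* c, std::size_t n)
{
    std::vector<float> tmp(n);
    for (std::size_t i = 0; i < n; ++i)
        tmp[i] = a[i] * b[i];
    for (std::size_t i = 0; i < n; ++i)
        out[i] = tmp[i] + c[i];
}

// One pass: each sample is loaded once and no intermediate buffer is
// written, which reduces memory traffic and per-block overhead.
void multiply_add_fused(float* out, const float* a, const float* b,
                        const float* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];
}
```

Both functions compute the same result up to floating-point rounding; the fused version simply touches memory less, which is usually where chains of very small blocks lose time.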

Eric




[Discuss-gnuradio] Re-writing blocks using intel libraries

2007-12-11 Thread Eugene Grayver
Hello,

We are working on some systems that require high sampling rates.  I am 
already using the Intel C++ compiler at the highest optimization level, 
but a lot of the blocks are very slow still.  It appears that intel C++ 
does not properly vectorize  data type. 

I have been replacing almost every low level block with a functional 
equivalent using the Intel performance libraries (IPP).  These libraries 
are not GPL, but are free for noncommercial use under Linux ($200 
otherwise).  At some point, I would like to contribute our work back to 
gnuradio.  Would this fit with the gr philosophy?  How should we structure 
the code?  (i.e. have a separate set of files, use #defines, or ...)?

Eugene

Eugene Grayver, Ph.D.
Aerospace Corp., Sr. Eng. Spec.
Tel: 310.336.1274
