On Sat, 6 Jan 2007 05:59:55 +0100
"J.D. Bakker" <[EMAIL PROTECTED]> wrote:

> typedef struct {
>    short header_len;  // This field is always present, and always first
>    short data_len;    // This field is always present, and always second
>    [ header contents, including header version, type, etc. goes here ]
>    char data[NET_MULTICAST_PAYLOAD];
> } NET_RX_STRUCT;
> 
> NET_RX_STRUCT msg;
> rxin_char=(void*)(&timf1_char[timf1p_pa]);
> timf1p_pa=(timf1p_pa+ad_read_bytes)&timf1_bytemask;
> for(j=0; j<ad_read_bytes; j+=NET_MULTICAST_PAYLOAD)
>    {
>    recvfrom(netfd.rec_rx,&msg,sizeof(NET_RX_STRUCT),0,
>                                   (struct sockaddr *) &rx_addr,&addrlen);
>    memcpy(&rxin_char[j], ((void *)&msg) + msg.header_len,
>                                   NET_MULTICAST_PAYLOAD);
>    }
OK, but now you have added an extra copy operation, and ReadData
does not return a pointer. This is exactly what I was referring to
from the start.
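
One possible way to avoid the extra copy is a scatter read with
recvmsg(), which places the header in a small struct and the payload
directly in the circular buffer. This is only a sketch, and it assumes
the header has a fixed size known before the packet arrives. With a
variable header_len the split point is not known in advance and this
would not apply. The names recv_split, rx_header and PAYLOAD_BYTES are
hypothetical and do not come from Linrad.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define PAYLOAD_BYTES 1024   /* hypothetical stand-in for NET_MULTICAST_PAYLOAD */

/* Hypothetical fixed-size header; the real header contents are not given. */
typedef struct {
    short header_len;
    short data_len;
} rx_header;

/* Receive one datagram: the header goes into *hdr, the payload goes
   straight into dest (for example a position in the circular buffer).
   Returns the number of payload bytes stored, or -1 on error or if the
   datagram is shorter than the header. */
ssize_t recv_split(int fd, rx_header *hdr, char *dest)
{
    struct iovec iov[2];
    struct msghdr mh;
    ssize_t n;

    iov[0].iov_base = hdr;
    iov[0].iov_len  = sizeof *hdr;
    iov[1].iov_base = dest;
    iov[1].iov_len  = PAYLOAD_BYTES;

    memset(&mh, 0, sizeof mh);
    mh.msg_iov    = iov;
    mh.msg_iovlen = 2;

    n = recvmsg(fd, &mh, 0);
    if (n < (ssize_t)sizeof *hdr)
        return -1;
    return n - (ssize_t)sizeof *hdr;
}
```

The kernel still copies each packet once, as recvfrom() does, but no
second memcpy is needed in user space.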

> For the time being I kept the data size constant, to not mix the 
> issues (variable data size adds 5-6 lines). And yes, there is a 
> memcpy. But read below...
> 
> By the way, there appears to be an inconsistency in your program. Here:
> 
>    timf1p_pa=(timf1p_pa+ad_read_bytes)&timf1_bytemask;
> 
> you make sure that the address pointer for the circular buffer wraps 
> around, but I see no such protection in the for() loop. Or am I 
> missing something ?
Yes. ad_read_bytes is a power of two, so it always divides evenly into
the buffer.
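
The power-of-two argument can be checked with a small sketch. It
assumes that the buffer size is also a power of two (as the mask
implies), that the block is no larger than the buffer, and that the
write pointer starts at a multiple of the block size. The sizes and the
names advance and count_straddles are hypothetical.

```c
#define BUF_BYTES   65536u            /* hypothetical buffer size, power of two */
#define BUF_MASK    (BUF_BYTES - 1u)  /* plays the role of timf1_bytemask */
#define BLOCK_BYTES 4096u             /* hypothetical ad_read_bytes, power of two */

/* Advance the write pointer by one block, wrapping with the mask. */
unsigned advance(unsigned pa)
{
    return (pa + BLOCK_BYTES) & BUF_MASK;
}

/* Count how many of n_blocks consecutive blocks would cross the
   end of the buffer when the pointer starts at 'start'. */
unsigned count_straddles(unsigned start, unsigned n_blocks)
{
    unsigned pa = start & BUF_MASK;
    unsigned bad = 0;
    unsigned i;

    for (i = 0; i < n_blocks; i++) {
        if (pa + BLOCK_BYTES > BUF_BYTES)
            bad++;
        pa = advance(pa);
    }
    return bad;
}
```

With an aligned start, count_straddles() returns 0, so the inner loop
needs no wraparound check. With an unaligned start (for example 100),
some blocks would cross the end of the buffer.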

> An extra memcpy makes little difference in this loop. Note that the 
> recvfrom() needs to do the equivalent of a memcpy() anyway. I wrote a 
> little test program (see the bottom of this mail) to test the speed 
> difference between 1 and 2 copy instructions if the destination 
> buffer is larger than the cache.
> 
> On a Pentium MMX 166MHz, a Thinkpad laptop with X running, I get:
> 
> Single copy: 10000000 loops in 129.44 seconds, or 79.11 MiBps.
> Double copy: 10000000 loops in 147.21 seconds, or 69.56 MiBps.
Hmmm, actually changing the Linrad code to do the double copy as
you suggest increases the CPU load from 47.5% to 48.5% when
16 bit raw data is received on a 200 MHz Pentium MMX. This makes
me believe that receiving fft1 transforms, which will be necessary
for this computer to do meaningful work, will lead to an extra
load of 4 percentage points, and that is not insignificant at all.
(32 bit floats, interleaved transforms)

The fft1 mode does not yet work so I cannot test right now, but
I have reason to believe that the MMX routines will allow this
computer to work well as a slave with two channels at 96 kHz
bandwidth.

> Is there any way at all that you can avoid that, and process the data 
> as it comes in ? My first big multi-threaded program was a real-time 
> streaming video encoder for a quad Pentium Pro machine, and switching 
> processing from a frame at a time to a macroblock (16x16pixels) at a 
> time sped the encoder up tremendously, even though the required 
> number of operations almost doubled.
I do not know, but it is going to be difficult, if it is possible at all.

> >The most demanding task is the full bandwidth, full dynamic
> >range FFT. It would be identical in all computers and it does
> >not make any sense to do it in more than one computer.
> 
> Why ? Because this one computer would be much faster than the others ?
Yes. Amateurs typically have one modern computer
(running Windows) plus a couple of scrap computers that 
would be perfectly adequate as slaves.

> So would it be correct to say that:
> 
> (a) if all computers were equally fast, there is little advantage in 
> cooked mode over raw mode (other than energy conservation), and
Right now yes, but with future add-ons this might change because
the master might do only front-end processing because it is not
fast enough to do everything. Actually I think it will be possible
to use two Pentium MMX with one doing fft1 and nothing more
while the other is doing fft2 and down conversion. It will be
near the limit of what these computers can do - one of them
is just a little too slow.

At the moment CPU time is no problem at all because we do not
yet have the wideband hardware that modern computers could
serve.
 
> (b) cooked mode allows slow slaves that would normally not be able to 
> keep up with the FFTs to still display the data.
There are many FFTs in Linrad:
fft1(forward full bw) -> fft1(backward 2 * full bw) ->
fft2(forward full bw) -> fft2(backward narrow bw) ->
fft3(forward narrow bw) -> fft3(backward narrow bw)

There is DSP processing alternating between the frequency domain and the
time domain. Only the very first fft needs full dynamic range and uses
float. The remaining wideband processing uses MMX. This was necessary
when a modern computer was a Pentium 3. Today floats can be used, but
not if one wants to monitor several microwave bands on the same computer.

Because of the much higher speed of 16 bit MMX routines, the first
fft takes as much time as all the other tasks together. Since the
processing of all the other transforms depends on what frequency
the user has selected for his loudspeaker signal, it will be fine to
do fft1 once and for all in the fast master computer.

  73

 Leif / SM5BSZ



#############################################################
This message is sent to you because you are subscribed to
  the mailing list <linrad@antennspecialisten.se>.