Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-13 Thread Jim Kuzdrall
Greetings Bruce,

Still thinking about your problem...

On Sunday 11 October 2009 17:01, Bruce Labitt wrote:
 I did do an experiment that had curious results.  Instead of sending
 double precision binary data, I sent single precision or 'float'.  I
 was expecting to halve my transmission time, since it is half the
 size. Instead, there was only a 10-15% speed increase, not 100%. 
 This result is telling me something, although at this time, I'm too
 brain dead to really ascertain what it really means.

The float option should work to get you a factor of two.

First, make certain that float is 4 bytes for your compiler by 
printing out sizeof(float) from a compiled program.  The C Standard 
defines float as having a range of at least 10^-38 to 10^+38 and at least 
6 decimal digits of precision, but that leaves the door open for a 
compiler to implement float with the size of a double.

More significantly, C promotes floats to doubles when it passes them 
to a function.  I am guessing that is what happened.

Assuming you have the data in a float array, cast the array to an 
array of 4 byte character arrays.  Send it as if it were characters 
rather than the numeric values.  The receiving end should not care what 
the bytes represent.  When the array is retrieved as characters, cast 
it back to the floats.  Since it is the same compiler, byte order 
should not cause a problem.
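A minimal sketch of that raw-byte send, assuming an already-connected TCP socket (the name `send_floats` and the error handling are illustrative, not from Bruce's code):

```c
#include <assert.h>        /* for the self-check */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send n floats as raw bytes.  send() may accept fewer bytes than
   asked, so loop until the whole buffer has gone out.  Returns 0 on
   success, -1 on error (caller inspects errno). */
int send_floats(int sock, const float *buf, size_t n)
{
    const char *p = (const char *)buf;   /* view the floats as bytes */
    size_t left = n * sizeof(float);
    while (left > 0) {
        ssize_t sent = send(sock, p, left, 0);
        if (sent < 0)
            return -1;
        p += sent;
        left -= (size_t)sent;
    }
    return 0;
}
```

Note this does no byte-order conversion, per the same-compiler assumption above; if the two ends differ in endianness, a swap step is needed before the send.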

As a second enhancement, try sending the number pairs as they are 
generated, rather than waiting for them all to complete.  The 
relatively slow communication hardware has its own formatter and shift 
register - and most likely a FIFO.  It will take care of issuing the 
bits while the processor does other things - like computing FFTs.

To make this work, you might divide the FFT computation into 16 
parts.  Start sending the first part as soon as it is completed.
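The overlap described above might look like the following sketch, where `compute_part()` and `send_all()` are hypothetical stand-ins for the real FFT slice and socket code:

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>

enum { NPARTS = 16, CHUNK = 256 };    /* illustrative sizes only */

/* Placeholder for one slice of the real computation. */
static void compute_part(int part, float *chunk)
{
    for (int i = 0; i < CHUNK; i++)
        chunk[i] = (float)part;
}

/* Loop send() until the whole chunk is out. */
static int send_all(int sock, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = send(sock, p, len, 0);
        if (n < 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hand each part to the kernel as soon as it is ready; the NIC drains
   its FIFO while the CPU computes the next part. */
int compute_and_stream(int sock)
{
    float chunk[CHUNK];
    for (int part = 0; part < NPARTS; part++) {
        compute_part(part, chunk);
        if (send_all(sock, chunk, sizeof chunk) < 0)
            return -1;
    }
    return 0;
}
```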

If both of these things worked, you would be 4X faster.  Better than 
standing pat.

Jim Kuzdrall
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-13 Thread Ralph A. Mack
Bruce Labitt bruce.lab...@myfairpoint.net wrote:
 
  What I'm trying to do:  Optimizer for a radar power spectral density 
problem
 
  Problem:  FFTs required in optimization loop take too long on current
  workstation for the optimizer to even be viable.
 
  Attempted solution:  FFT engine on remote server to reduce overall
  execution time
 
  Builds client - server app implementing above solution.  Server uses
  OpenMP and FFTW to exploit all cores.
[...]
  Implements better binary packing unpacking in code.  Stuff works
 
  Nit in solution:  TCP transport time > FFT execution time, rendering
  attempted solution non-viable
 
[...]
  Hey, that is my bigger picture...  Any and all suggestions are
  appreciated.  Undoubtedly, a few dumb questions will follow.  I appear
  to be good at it.  :P  Maybe this context will help list subscribers
  frame their answers if they have any, or ask insightful questions.

I don't understand anything about your domain of application,
so take this for what it's worth...

I've gleaned the following from the previous posts. Is it a fair summary?

- The local FFT is taking ~200 ms, which isn't fast enough.
- The remote FFT is substantially faster than this once the data gets 
there.
- However, it takes substantially longer (~1.2 seconds) to move the data
  than to process it locally.

What does "fast enough" mean here? What is your time budget per data set?
Is it only constrained by catching and cooking one data set before it is
overwritten by a new one (or before you choke on the stream buffers :) )?
Are there latency/timeliness requirements from downstream?
If so, what are they?
Provided your processing rate keeps up with the arrival rate,
how far behind can you afford to deliver results?
(i.e. how much pipelining is permitted in a solution?)

How fast is the remote FFT? I didn't catch a number for this one.
Or was the 200 ms the remote processing time?
(In which case, what's the local processing time?)
Do you have the actual server you're targeting to benchmark this on?

This helps to frame the external requirements more clearly.

You've stated the problem in the implementation domain.
It sounds like your range of solutions could leave very little headroom.
My instinctive response is to ask
Is there a more frugal approach in the application domain?

Do you need to grind down the whole field of potential interest?
Are there ways to narrow and intensify your focus partway through?
Perhaps to do a much faster but weaker FFT,
analyze it quickly to identify a narrower problem of interest,
and then do the slower, much stronger FFT on a lot less data?
Reducing the data load for the hard part may help with on-chip or
off-chip solutions. It may also help to identify hybrid solutions.

Alternatively, a mid-stream focusing analysis might be so expensive
as to negate the benefit, or any performant mid-stream analysis might
be merely a too-risky heuristic, or the problem may simply not lend
itself to that kind of decomposition. You did say that you had already
encountered a number of dead-ends - this may be familiar ground :)

I don't know your domain. I don't have answers, just questions.
I just figured those kinds of questions were worth asking
before we try squeezing the last Mbps out of the network...

Lupestro



Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-13 Thread bruce . labitt
gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:13:12 AM:

 Greetings Bruce,
 
 Still thinking about your problem...

Well, thank you for that!

 
 On Sunday 11 October 2009 17:01, Bruce Labitt wrote:
  I did do an experiment that had curious results.  Instead of sending
  double precision binary data, I sent single precision or 'float'.  I
  was expecting to halve my transmission time, since it is half the
  size. Instead, there was only a 10-15% speed increase, not 100%. 
  This result is telling me something, although at this time, I'm too
  brain dead to really ascertain what it really means.
 
 The float option should work to get you a factor of two.

I would agree...  It really makes me think that either there is an 
unknown bottleneck in the client, or a misconfiguration of the socket or 
the OS... or, more likely, some combination of all of the above...
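One concrete place a socket misconfiguration can hide is the kernel's per-socket buffer size, which bounds the TCP window and therefore throughput on a gigabit link. A sketch of inspecting and enlarging it (the 4 MB request is an arbitrary illustration, not a figure from the thread):

```c
#include <assert.h>
#include <stdio.h>
#include <sys/socket.h>

/* Ask for larger socket buffers and report what the kernel actually
   granted.  Returns the granted SO_SNDBUF size, or -1 on error. */
int tune_buffers(int sock)
{
    int size = 4 * 1024 * 1024;          /* illustrative 4 MB request */
    setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof size);
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof size);

    socklen_t len = sizeof size;
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, &len) < 0)
        return -1;
    printf("SO_SNDBUF granted: %d bytes\n", size);
    return size;     /* Linux reports double the value it stores */
}
```

On Linux the grant is capped by the `net.core.wmem_max` / `net.core.rmem_max` sysctls, so those may need raising on both machines as well.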

 
 First, make certain that float is 4 bytes for your compiler by 
 printing out sizeof(float) from a compiled program.  The C Standard 
 defines float as having a range of at least 10^-38 to 10^+38 and at least 
 6 decimal digits of precision, but that leaves the door open for a 
 compiler to implement float with the size of a double.
 

printf tells me sizeof(float) = 4 for both client and server

 More significantly, C promotes floats to doubles when it passes them 

 to a function.  I am guessing that is what happened.
 
 Assuming you have the data in a float array, cast the array to an 
 array of 4 byte character arrays.  Send it as if it were characters 
 rather than the numeric values.  The receiving end should not care what 
 the bytes represent.  When the array is retrieved as characters, cast 
 it back to the floats.  Since it is the same compiler, byte order 
 should not cause a problem.
 

Data is converted to string prior to transmit.  This is an area for 
improvement.  However, the timing numbers I've indicated 'should' be just 
the transmission of the string buffer.

As for 'the same compiler', well, it is gcc on both, but to make things 
more fun, the client is little-endian and the server is big-endian.  So 
byte order does matter. 
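Since the two ends disagree on byte order, each 4-byte float needs converting to an agreed wire order before it is sent. A sketch that views each float's bit pattern as a 32-bit integer and uses htonl() (the function name is mine, not from the thread):

```c
#include <arpa/inet.h>   /* htonl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Convert an array of floats to network (big-endian) byte order in
   place.  memcpy through a uint32_t sidesteps strict-aliasing issues.
   The swap is its own inverse, so the same routine converts back. */
void floats_to_net(float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t bits;
        memcpy(&bits, &buf[i], sizeof bits);
        bits = htonl(bits);
        memcpy(&buf[i], &bits, sizeof bits);
    }
}
```

Calling the same routine on the receiving end restores host order, whichever endianness each machine has.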

 As a second enhancement, try sending the number pairs as they are 
 generated, rather than waiting for them all to complete.  The 
 relatively slow communication hardware has its own formatter and shift 
 register - and most likely a FIFO.  It will take care of issuing the 
 bits while the processor does other things - like computing FFts.
 
 To make this work, you might divide the FFT computation into 16 
 parts.  Start sending the first part as soon as it is completed.
 

I will see if I can do something here.  As it stands, the FFTs are broken 
up...  For one calculation I need to do 80-100 FFTs.  Hence my desire for 
reducing any of the large time consuming processes.

 If both of these things worked, you would be 4X faster.  Better than 

 standing pat.

Much better than standing pat!

 
 Jim Kuzdrall

Thanks for thinking about this,
Bruce


**
Neither the footer nor anything else in this E-mail is intended to or 
constitutes an electronic signature and/or legally binding agreement in the 
absence of an express statement or Autoliv policy and/or procedure to the 
contrary. This E-mail and any attachments hereto are Autoliv property and 
may contain legally privileged, confidential and/or proprietary 
information. The recipient of this E-mail is prohibited from distributing, 
copying, forwarding or in any way disseminating any material contained 
within this E-mail without prior written permission from the author. If you 
receive this E-mail in error, please immediately notify the author and 
delete this E-mail. Autoliv disclaims all responsibility and liability for 
the consequences of any person who fails to abide by the terms herein.
**



Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-13 Thread bruce . labitt
gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:44:49 AM:

 Bruce Labitt bruce.lab...@myfairpoint.net wrote:
  
   What I'm trying to do:  Optimizer for a radar power spectral density 
 problem
  
   Problem:  FFTs required in optimization loop take too long on current
   workstation for the optimizer to even be viable.
  
   Attempted solution:  FFT engine on remote server to reduce overall
   execution time
  
   Builds client - server app implementing above solution.  Server uses
   OpenMP and FFTW to exploit all cores.
 [...]
   Implements better binary packing unpacking in code.  Stuff works
  
   Nit in solution:  TCP transport time > FFT execution time, rendering
   attempted solution non-viable
  
 [...]
   Hey, that is my bigger picture...  Any and all suggestions are
   appreciated.  Undoubtedly, a few dumb questions will follow.  I 
appear
   to be good at it.  :P  Maybe this context will help list subscribers
   frame their answers if they have any, or ask insightful questions.
 
 I don't understand anything about your domain of application,
 so take this for what it's worth...
 
 I've gleaned the following from the previous posts. Is it a fair 
summary?

Not exactly, minor corrections below

 - The local FFT is taking ~200 ms, which isn't fast enough.

The remote FFT takes 200 ms; the local machine takes ~20 sec.

 - The remote FFT is substantially faster than this once the data gets 
 there.

Remote is 100x faster than local.

 - However, it takes substantially longer (~1.2 seconds) to move the data
   than to process it locally.
 

Transfer time to server is indeed ~1.2 sec if running at full link 
bandwidth.  I have yet to achieve this in my application.  I'd be 
'deliriously happy' to get this bandwidth.  Could I use more bandwidth? 
Yes.

 What does "fast enough" mean here? What is your time budget per data set?

This is a very good question.  ITBP (in the big picture) I need to compute 
100 FFTs for 1 data set.  This data will be fed to an optimizer, which 
will modify coefficients and ask for another 100 FFTs.  If things are set 
up properly, stuff eventually converges.  If not, it can be ages before 
you know you have a poor run.  This makes debugging difficult.  So I like 
to build in some sort of parameter that I can monitor for simulation 
progress. 

Hmm, I digressed.  At 20 sec/cycle it takes about 40 minutes to do the 
100 FFTs that make a dataset. 
If I could get ~2 sec per FFT cycle, I'd be much happier.  It would be a 
lot easier to debug and to watch the convergence indicators. 


 Is it only constrained by catching and cooking one data set before it 
is
 overwritten by a new one (or before you choke on the stream buffers :) 
)?
 Are there latency/timeliness requirements from downstream?
 If so, what are they?
 Provided your processing rate keeps up with the arrival rate,
 how far behind can you afford to deliver results?
 (i.e. how much pipelining is permitted in a solution?)
 
 How fast is the remote FFT? I didn't catch a number for this one.
 Or was the 200 ms the remote processing time?
  (In which case, what's the local processing time?)
 Do you have the actual server you're targeting to benchmark this on?
 
 This helps to frame the external requirements more clearly.
 
 You've stated the problem in the implementation domain.
 It sounds like your range of solutions could leave very little headroom.
 My instinctive response is to ask
 Is there a more frugal approach in the application domain?

This is an insightful observation.  Good question!

 
 Do you need to grind down the whole field of potential interest?
 Are there ways to narrow and intensify your focus partway through?
 Perhaps to do a much faster but weaker FFT,
 analyze it quickly to identify a narrower problem of interest,
 and then do the slower, much stronger FFT on a lot less data?
 Reducing the data load for the hard part may help with on-chip or
 off-chip solutions. It may also help to identify hybrid solutions.
 
 Alternatively, a mid-stream focusing analysis might be so expensive
 as to negate the benefit, or any performant mid-stream analysis might
 be merely a too-risky heuristic, or the problem may simply not lend
 itself to that kind of decomposition. You did say that you had already
 encountered a number of dead-ends - this may be familiar ground :)
 

Believe me, a lot of it is familiar :)  Nonetheless, things are worth 
mulling over.  Despite the time invested, nothing really is cast in 
concrete...

 I don't know your domain. I don't have answers, just questions.
 I just figured those kinds of questions were worth asking
 before we try squeezing the last Mbps out of the network...
 
 Lupestro
 

In my current solution space, the network transport IS the dominant 
bottleneck.  I truly was not expecting such slow performance there.  Of 
course, once 'fixed' there will be a new bottleneck.  Some bottlenecks 
cannot be corrected; either one accepts the performance, 

Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-13 Thread bruce . labitt
Jim Kuzdrall gnh...@intrel.com wrote on 10/13/2009 02:04:36 PM:

 On Tuesday 13 October 2009 10:54, bruce.lab...@autoliv.com wrote:
  gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:13:12 
 AM:
   Greetings Bruce,
  
   Still thinking about your problem...
 
  Well, thank you for that!
 
 
  Data is converted to string prior to transmit.  This is an area for
  improvement.  However, the timing numbers I've indicated 'should' be
  just the transmission of the string buffer.
 
 If I can remember my C programming, you do something non-portable 
 like:
 ...
 /* number of pairs (set to the correct number) */
 #define NPAIR 128
 /* the double forces boundary alignment */
 union {
     double align;
     float  ansf[NPAIR][2];
     char   ansc[NPAIR * 2 * sizeof(float)];
 } answer;
 
 /* check that they are the same size (this does not guarantee
    that they are aligned right; check the received data for that) */
 printf("float array size is %d and char array size is %d\n",
        (int)sizeof(answer.ansf), (int)sizeof(answer.ansc));
 
 size_t cnt = fwrite((char *)answer.ansc, sizeof(float) * 2, NPAIR, output_dev);
 printf("The count was %zu and it should be %d\n", cnt, NPAIR);
 exit(0);
 ...
 

The transfer is complete, the numbers do get across, and they are even 
correct.  I separated the problem into two parts - packing into strings, 
and sending the string over the socket.  The packing works well enough for 
now.  It is just the network transfer rate that isn't fast enough. 

 Wow, C seems so unfamiliar after a few years away from it.  I hope 
 that gives you the gist of what I am suggesting to try.  There may be a 
 better choice of functions, but I thought a for-loop on fputc() would 
 cost a lot more cpu cycles.
 

fputc is REALLY slow.  fwrite using big blocks is MUCH faster, orders of 
magnitude faster.

 To get this to work, you may have to get the FFT program and the I/O 

 program to be separate processes - pipe or socket or something does 
 that.
 
 Maybe somebody more experienced will make a suggestion.
 
 My speed estimate was more optimistic than justified, but you can 
 get about 4x by adding another data link.  Send the odd chunks to one 
 and the even chunks to the other.
 

I have been toying with sending the real and imaginary parts on separate 
sockets in their own threads.  That is what psockets was supposed to do. 
(Striped network transfer).  It still bugs me that the utilization of the 
network is so bad.  A striped transfer, even if it is just two stripes, 
will be better than just one socket.  That is if I can get it to work... 
If I were to go to multiple sockets, I might as well use someone else's 
library; after all, why reinvent the wheel?
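A two-stripe send along those lines might be sketched as below, with one thread per socket; the halving split and all names are illustrative, not from psockets:

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>

/* One stripe: a socket plus the slice of the buffer it carries. */
struct stripe { int sock; const char *buf; size_t len; };

static void *send_stripe(void *arg)
{
    struct stripe *s = arg;
    size_t off = 0;
    while (off < s->len) {
        ssize_t n = send(s->sock, s->buf + off, s->len - off, 0);
        if (n < 0)
            return (void *)1;            /* non-NULL signals failure */
        off += (size_t)n;
    }
    return NULL;
}

/* Push the first half of buf down sock_a and the second half down
   sock_b concurrently; returns 0 when both stripes completed. */
int send_striped(int sock_a, int sock_b, const char *buf, size_t len)
{
    struct stripe a = { sock_a, buf, len / 2 };
    struct stripe b = { sock_b, buf + len / 2, len - len / 2 };
    pthread_t ta, tb;
    void *ra, *rb;
    if (pthread_create(&ta, NULL, send_stripe, &a) != 0) return -1;
    if (pthread_create(&tb, NULL, send_stripe, &b) != 0) return -1;
    pthread_join(ta, &ra);
    pthread_join(tb, &rb);
    return (ra == NULL && rb == NULL) ? 0 : -1;
}
```

The receiving end must read both stripes and reassemble them in order, which is where most of the bookkeeping in a real striped transfer lives.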

I still think it is a misconfigured/mistuned system.  I hope it is just 
a mismatch of parameters between the two computers.  Still, the most 
likely answer is that the problem lies in my code.  Time to dive in 
again...

Thanks!
Bruce





Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-12 Thread Tom Buskey
IBM was working on a Cell on a card to plug into a motherboard for use
in supercomputer clusters.  Much like Nvidia's Tesla GPU cards.



On 10/11/09, Jim Kuzdrall gnh...@intrel.com wrote:
 Greetings Bruce,

 Interesting and challenging project!

 On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
 For anyone that is remotely interested, here is the big picture for
 the problem I'm trying to solve.  If you are not interested, hey
 delete the post.  Won't irritate me in the least!

 If you just transferred the data (no framing or error checking), how
 many bits per second must you transfer to keep up with the FFT data
 production?

 Did you explore adding a dedicated FFT card to your control
 computer?   The algorithms they build into the hardware are much, much
 faster than compiled software.  The local board would keep the data in
 your control computer - with DMA, I assume - eliminating the transfer
 problem.

 I know a fellow who now works for Apple whose job is to optimize FFT
 algorithms to the processor they use.  Assembly language, of course.
 Why is Apple interested?  Faster FFT, faster MP3 translation, longer
 battery life.  A very high payoff.

 Jim Kuzdrall


-- 
Sent from my mobile device


Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-12 Thread Bruce Labitt
Tom Buskey wrote:
  IBM was working on a Cell on a card to plug into a motherboard for use
  in supercomputer clusters.  Much like Nvidia's Tesla GPU cards.


   
Fixstars, formerly known as Terra Soft Solutions, has an equivalent item: 
http://www.fixstars.com/en/products/gigaaccel180/
If this was around when I started, it would have been good.  However, 
for the moment I am stuck with 
http://www.fixstars.com/en/products/bladecenter/qs22/

If I had the high speed link, I'd be golden.  Right now, I don't have 
the budget for it.  So back to the slow 1Gb network stuff... :(
 On 10/11/09, Jim Kuzdrall gnh...@intrel.com wrote:
   
 Greetings Bruce,

 Interesting and challenging project!

 On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
 
 For anyone that is remotely interested, here is the big picture for
 the problem I'm trying to solve.  If you are not interested, hey
 delete the post.  Won't irritate me in the least!

   
 If you just transferred the data (no framing or error checking), how
 many bits per second must you transfer to keep up with the FFT data
 production?

 Did you explore adding a dedicated FFT card to your control
 computer?   The algorithms they build into the hardware are much, much
 faster than compiled software.  The local board would keep the data in
 your control computer - with DMA, I assume - eliminating the transfer
 problem.

 I know a fellow who now works for Apple whose job is to optimize FFT
 algorithms to the processor they use.  Assembly language, of course.
 Why is Apple interested?  Faster FFT, faster MP3 translation, longer
 battery life.  A very high payoff.

 Jim Kuzdrall

 

   



Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Jim Kuzdrall
Greetings Bruce,

Interesting and challenging project!

On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
 For anyone that is remotely interested, here is the big picture for
 the problem I'm trying to solve.  If you are not interested, hey
 delete the post.  Won't irritate me in the least!

If you just transferred the data (no framing or error checking), how 
many bits per second must you transfer to keep up with the FFT data 
production?

Did you explore adding a dedicated FFT card to your control 
computer?   The algorithms they build into the hardware are much, much 
faster than compiled software.  The local board would keep the data in 
your control computer - with DMA, I assume - eliminating the transfer 
problem.

I know a fellow who now works for Apple whose job is to optimize FFT 
algorithms to the processor they use.  Assembly language, of course.  
Why is Apple interested?  Faster FFT, faster MP3 translation, longer 
battery life.  A very high payoff.

Jim Kuzdrall 


Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Joshua Judson Rosen
Bruce Labitt bruce.lab...@myfairpoint.net writes:

 What I'm trying to do:  Optimizer for a radar power spectral density problem
 
 Problem:  FFTs required in optimization loop take too long on current 
 workstation for the optimizer to even be viable. 
 
 Attempted solution:  FFT engine on remote server to reduce overall 
 execution time
 
 Builds client - server app implementing above solution.  Server uses 
 OpenMP and FFTW to exploit all cores.
[...]
 Implements better binary packing unpacking in code.  Stuff works
 
 Nit in solution:  TCP transport time > FFT execution time, rendering 
 attempted solution non-viable
 
 Researches TCP optimization: Reads countless papers on tcp optimization 
 techniques... Fails to find a robust solution or methodology for the 
 problem.  Tries most techniques written in papers, only realizing a 10% 
 gain.  Not good enough.  Still needs to be faster
 
 Driven to more exotic techniques to reduce transport time.  Explores 
 parallel sockets, other techniques
[...]
 Hey, that is my bigger picture...  Any and all suggestions are 
 appreciated.  Undoubtedly, a few dumb questions will follow.  I appear 
 to be good at it.  :P  Maybe this context will help list subscribers 
 frame their answers if they have any, or ask insightful questions.

Where exactly does NFS fit into this?

-- 
Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr.


Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Drew Van Zandt
Have you considered using a fast compression/decompression algorithm before
you transmit, one that isn't too computationally intensive for either
compression or decompression?  You won't get high compression (factor of 10)
as you might with slower ones, but if you get even a factor of 2, you've
just doubled the effective network speed.

It appears to me that this would only gain you 50% or so, as the obvious
fast compression algorithms are only about twice as fast (
http://www.quicklz.com/bench.html) as Gigabit Ethernet's theoretical speed,
but it's worth consideration.  Decompression is faster than compression,
which would hopefully result in a net win on processing time on the
remote server.

--DTVZ

On Sun, Oct 11, 2009 at 10:29 AM, Joshua Judson Rosen
roz...@geekspace.com wrote:

 Bruce Labitt bruce.lab...@myfairpoint.net writes:
 
  What I'm trying to do:  Optimizer for a radar power spectral density
 problem
 
  Problem:  FFTs required in optimization loop take too long on current
  workstation for the optimizer to even be viable.
 
  Attempted solution:  FFT engine on remote server to reduce overall
  execution time
 
  Builds client - server app implementing above solution.  Server uses
  OpenMP and FFTW to exploit all cores.
 [...]
  Implements better binary packing unpacking in code.  Stuff works
 
  Nit in solution:  TCP transport time > FFT execution time, rendering
  attempted solution non-viable
 
  Researches TCP optimization: Reads countless papers on tcp optimization
  techniques... Fails to find a robust solution or methodology for the
  problem.  Tries most techniques written in papers, only realizing a 10%
  gain.  Not good enough.  Still needs to be faster
 
  Driven to more exotic techniques to reduce transport time.  Explores
  parallel sockets, other techniques
 [...]
  Hey, that is my bigger picture...  Any and all suggestions are
  appreciated.  Undoubtedly, a few dumb questions will follow.  I appear
  to be good at it.  :P  Maybe this context will help list subscribers
  frame their answers if they have any, or ask insightful questions.

 Where exactly does NFS fit into this?

 --
 Don't be afraid to ask (Lf.((Lx.xx) (Lr.f(rr.



Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Bruce Labitt
Lloyd Kvam wrote:
 On Sat, 2009-10-10 at 15:20 -0400, Bruce Labitt wrote:
   
 Nit in solution:  TCP transport time > FFT execution time, rendering 
 attempted solution non-viable

 Researches TCP optimization: Reads countless papers on tcp
 optimization techniques... Fails to find a robust solutions or
 methodology for problem.  Tries most techniques written in papers,
 only realizing a 10% gain.  Not good enough.  Still needs to be faster

 Driven to more exotic techniques to reduce transport time.  Explores 
 parallel sockets, other techniques
 

 Does a simple netcat transfer go fast enough?  In other words, can
 normal TCP in a simple case do the job?

   
I'll try it when I go back.  Netperf seems to indicate I can do it - I 
got ~770 Mbit/sec.  It is kind of baffling that my stuff only gets 
141 Mbps... :(  I was avoiding looking at the netperf source, but now 
seems like a good time...
 If not, would you be better off talking raw Ethernet?  Presumably you do
 not need the routing capabilities of TCP/IP.  As I understand it, your
 client and server are on the same LAN.

 http://aschauf.landshut.org/fh/linux/udp_vs_raw/index.html

 I don't know if this is helpful.  Without knowing the timings from basic
 test cases, it's hard to know where to find the best point of attack.

   
Thanks for the suggestion and link.  I'll check it out.
-Bruce




Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Bruce Labitt
Jim Kuzdrall wrote:
 Greetings Bruce,

 Interesting and challenging project!

 On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
   
 For anyone that is remotely interested, here is the big picture for
 the problem I'm trying to solve.  If you are not interested, hey
 delete the post.  Won't irritate me in the least!

 
 If you just transferred the data (no framing or error checking), how 
 many bits per second must you transfer to keep up with the FFT data 
 production?
   
For the problems I'm doing now, the net cannot keep up.  At 800Mbps it 
would take ~1.6 sec to push the data and the engine computes a ~10M 
point complex double precision FFT in ~200ms.  10Gb ethernet would be 
nice, but I don't have the budget for this.  Even then, the transport 
would be 0.16sec vs the 0.2s compute.
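Those numbers check out: a 10M-point complex double input is 10e6 x 2 x 8 bytes = 160 MB, i.e. 1.28 Gbit, which at 800 Mbit/s is 1.6 s on the wire. A small helper to redo the sums (the function name is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Wire time for an n-point complex transform at the given payload
   size per real value; complex samples carry two values each. */
double transfer_seconds(double npoints, double bytes_per_value,
                        double link_mbps)
{
    double bits = npoints * 2.0 * bytes_per_value * 8.0;
    return bits / (link_mbps * 1e6);
}
```

At a full 10 Gbit/s the same payload takes about 0.13 s, in line with the ~0.16 s estimate above against the 0.2 s compute time, so even 10GbE would only roughly break even.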
 Did you explore adding a dedicated FFT card to your control 
 computer?   The algorithms they build into the hardware are much, much 
 faster than compiled software.  The local board would keep the data in 
 your control computer - with DMA, I assume - eliminating the transfer 
 problem.

   
I will look into it again.  Maybe the landscape has changed.  At one 
point I had to do 128M point FFTs - there wasn't any hardware to do that!

 I know a fellow who now works for Apple whose job is to optimize FFT 
 algorithms to the processor they use.  Assembly language, of course.  
 Why is Apple interested?  Faster FFT, faster MP3 translation, longer 
 battery life.  A very high payoff.

 Jim Kuzdrall 


   
I am using the open source FFTW.  It is quite fast and it uses the 
platform's assets quite effectively.  Fortunately, it has been optimized 
for the Cell processor.  It runs 50-100X faster on my Cell than on my 
3.4 GHz P4, or whatever boat anchor I have.  I also tested the software 
on a couple of our servers.  The ratio is still way up near 50x.  The 
problem is that the cache gets exhausted and then the memory bus 
bandwidth gets saturated; this forms the upper limit of performance for 
the P4 / AMD64 class machines. 

The problem is indeed quite challenging.  I've gone down quite a few 
dead ends.  The list has seen some of my dead end attempts, but not all 
of them :)  I spared you some...

Bruce




Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Bruce Labitt
Joshua Judson Rosen wrote:
 Bruce Labitt bruce.lab...@myfairpoint.net writes:
   
 What I'm trying to do:  Optimizer for a radar power spectral density problem

 Problem:  FFTs required in optimization loop take too long on current 
 workstation for the optimizer to even be viable. 

 Attempted solution:  FFT engine on remote server to reduce overall 
 execution time

 Builds client - server app implementing above solution.  Server uses 
 OpenMP and FFTW to exploit all cores.
 
 [...]
   
 Implements better binary packing/unpacking in code.  Stuff works

 Nit in solution:  TCP transport time > FFT execution time, rendering 
 attempted solution non-viable

 Researches TCP optimization: Reads countless papers on TCP optimization 
 techniques... Fails to find a robust solution or methodology for the 
 problem.  Tries most techniques described in the papers, only realizing 
 a 10% gain.  Not good enough.  Still needs to be faster

 Driven to more exotic techniques to reduce transport time.  Explores 
 parallel sockets, other techniques
 
 [...]
   
 Hey, that is my bigger picture...  Any and all suggestions are 
 appreciated.  Undoubtedly, a few dumb questions will follow.  I appear 
 to be good at it.  :P  Maybe this context will help list subscribers 
 frame their answers if they have any, or ask insightful questions.
 

 Where exactly does NFS fit into this?

   
NFS is used for the file system for the server.  The 1Gb link is used 
for both NFS and the data transport.  The data being transferred to the 
server never exists as a file, so there is no hit for writing/reading 
to/from a file.




Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-11 Thread Bruce Labitt
Drew Van Zandt wrote:
 Have you considered using a fast compression/decompression algorithm 
 before you transmit, one that isn't too computationally intensive for 
 either compression or decompression?  You won't get high compression 
 (factor of 10) as you might with slower ones, but if you get even a 
 factor of 2, you've just doubled the effective network speed.

I hadn't thought of this.  Typically a lot of the data is random, so 
compression doesn't help too much.  Or at least, that is what I 
thought.  If I'm wrong, I hope someone will cheerfully correct me :)

I did an experiment that had curious results.  Instead of sending 
double-precision binary data, I sent single-precision 'float'.  I was 
expecting to halve my transmission time, since the data is half the 
size.  Instead, there was only a 10-15% speed increase, not the 100% I 
expected.  This result is telling me something, although at this time 
I'm too brain-dead to ascertain what it really means.
 It appears to me that this would only gain you 50% or so, as the 
 obvious fast compression algorithms are only about twice as fast 
 (http://www.quicklz.com/bench.html) as Gigabit Ethernet's theoretical 
 speed, but it's worth consideration.  The decompression is faster than 
 the compression, which would hopefully result in a net win on 
 processing time on the remote server.

 --DTVZ

Thanks for your suggestions.  Compression would certainly help out. 
Bruce



Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...

2009-10-10 Thread Lloyd Kvam
On Sat, 2009-10-10 at 15:20 -0400, Bruce Labitt wrote:
 Nit in solution:  TCP transport time > FFT execution time, rendering 
 attempted solution non-viable
 
 Researches TCP optimization: Reads countless papers on TCP optimization 
 techniques... Fails to find a robust solution or methodology for the 
 problem.  Tries most techniques described in the papers, only realizing 
 a 10% gain.  Not good enough.  Still needs to be faster
 
 Driven to more exotic techniques to reduce transport time.  Explores 
 parallel sockets, other techniques

Does a simple netcat transfer go fast enough?  In other words, can
normal TCP in a simple case do the job?

If not, would you be better off talking raw Ethernet?  Presumably you do
not need the routing capabilities of TCP/IP.  As I understand it, your
client and server are on the same LAN.

http://aschauf.landshut.org/fh/linux/udp_vs_raw/index.html

I don't know if this is helpful.  Without knowing the timings from basic
test cases, it's hard to know where to find the best point of attack.

-- 
Lloyd Kvam
Venix Corp
DLSLUG/GNHLUG library
http://dlslug.org/library.html
http://www.librarything.com/catalog/dlslug
http://www.librarything.com/rsshtml/recent/dlslug
http://www.librarything.com/rss/recent/dlslug
