Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Greetings Bruce,

Still thinking about your problem...

On Sunday 11 October 2009 17:01, Bruce Labitt wrote:
> I did do an experiment that had curious results. Instead of sending
> double precision binary data, I sent single precision or 'float'. I was
> expecting to halve my transmission time, since it is half the size.
> Instead, there was only a 10-15% speed increase, not a 100%. This result
> is telling me something, although at this time, I'm too brain dead to
> really ascertain what it really means.

The float option should work to get you a factor of two.

First, make certain that float is 4 bytes for your compiler by printing out sizeof(float) from a compiled program. The C Standard defines float as having a range of 10^-38 to 10^+38 and at least 6 decimal digits of precision, but that leaves the door open for a compiler to use doubles for floats.

More significantly, C promotes floats to doubles when it passes them to a function. I am guessing that is what happened. Assuming you have the data in a float array, cast the array to an array of 4-byte character arrays. Send it as if it were characters rather than the numeric values. The receiving end should not care what the bytes represent. When the array is retrieved as characters, cast it back to the floats. Since it is the same compiler, byte order should not cause a problem.

As a second enhancement, try sending the number pairs as they are generated, rather than waiting for them all to complete. The relatively slow communication hardware has its own formatter and shift register - and most likely a FIFO. It will take care of issuing the bits while the processor does other things - like computing FFTs. To make this work, you might divide the FFT computation into 16 parts. Start sending the first part as soon as it is completed.

If both of these things worked, you would be 4X faster. Better than standing pat.
Jim Kuzdrall

___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Bruce Labitt <bruce.lab...@myfairpoint.net> wrote:
> What I'm trying to do: Optimizer for a radar power spectral density problem
> Problem: FFTs required in optimization loop take too long on current
> workstation for the optimizer to even be viable.
> Attempted solution: FFT engine on remote server to reduce overall
> execution time. Builds client-server app implementing above solution.
> Server uses OpenMP and FFTW to exploit all cores. [...] Implements
> better binary packing/unpacking in code. Stuff works.
> Nit in solution: TCP transport time >> FFT execution time, rendering
> attempted solution non-viable [...]
> Hey, that is my bigger picture... Any and all suggestions are
> appreciated. Undoubtedly, a few dumb questions will follow. I appear to
> be good at it. :P Maybe this context will help list subscribers frame
> their answers if they have any, or ask insightful questions.

I don't understand anything about your domain of application, so take this for what it's worth... I've gleaned the following from the previous posts. Is it a fair summary?

- The local FFT is taking ~200 ms, which isn't fast enough.
- The remote FFT is substantially faster than this once the data gets there.
- However, it takes substantially longer (~1.2 seconds) to move the data than to process it locally.

What does "fast enough" mean here? What is your time budget per data set? Is it only constrained by catching and cooking one data set before it is overwritten by a new one (or before you choke on the stream buffers :) )? Are there latency/timeliness requirements from downstream? If so, what are they? Provided your processing rate keeps up with the arrival rate, how far behind can you afford to deliver results? (i.e. how much pipelining is permitted in a solution?)

How fast is the remote FFT? I didn't catch a number for this one. Or was the 200 ms the remote processing time? (In which case, what's the local processing time?)
Do you have the actual server you're targeting to benchmark this on? This helps to frame the external requirements more clearly.

You've stated the problem in the implementation domain. It sounds like your range of solutions could leave very little headroom. My instinctive response is to ask: is there a more frugal approach in the application domain? Do you need to grind down the whole field of potential interest? Are there ways to narrow and intensify your focus partway through? Perhaps do a much faster but weaker FFT, analyze it quickly to identify a narrower problem of interest, and then do the slower, much stronger FFT on a lot less data? Reducing the data load for the hard part may help with on-chip or off-chip solutions. It may also help to identify hybrid solutions.

Alternatively, a mid-stream focusing analysis might be so expensive as to negate the benefit, or any performant mid-stream analysis might be merely a too-risky heuristic, or the problem may simply not lend itself to that kind of decomposition. You did say that you had already encountered a number of dead ends - this may be familiar ground :)

I don't know your domain. I don't have answers, just questions. I just figured those kinds of questions were worth asking before we try squeezing the last Mbps out of the network...

Lupestro
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:13:12 AM:
> Greetings Bruce,
> Still thinking about your problem...

Well, thank you for that!

> On Sunday 11 October 2009 17:01, Bruce Labitt wrote:
> > I did do an experiment that had curious results. Instead of sending
> > double precision binary data, I sent single precision or 'float'. I
> > was expecting to halve my transmission time, since it is half the
> > size. Instead, there was only a 10-15% speed increase, not a 100%.
> > This result is telling me something, although at this time, I'm too
> > brain dead to really ascertain what it really means.
> The float option should work to get you a factor of two.

I would agree... It really makes me think that either there is an unknown bottleneck in the client, or a misconfiguration of the socket, or the OS... or, more likely, some combination of all of the above...

> First, make certain that float is 4 bytes for your compiler by printing
> out sizeof(float) from a compiled program. The C Standard defines float
> as having a range of 10^-38 to 10^+38 and at least 6 decimal digits of
> precision, but that leaves the door open for a compiler to use doubles
> for floats.

printf tells me sizeof(float) = 4 for both client and server.

> More significantly, C promotes floats to doubles when it passes them to
> a function. I am guessing that is what happened. Assuming you have the
> data in a float array, cast the array to an array of 4-byte character
> arrays. Send it as if it were characters rather than the numeric
> values. The receiving end should not care what the bytes represent.
> When the array is retrieved as characters, cast it back to the floats.
> Since it is the same compiler, byte order should not cause a problem.

Data is converted to string prior to transmit. This is an area for improvement. However, the timing numbers I've indicated 'should' be just the transmission of the string buffer.
As for 'the same compiler' - well, it is gcc on both, but to make things more fun, the client is little endian and the server is big endian. So byte order does matter.

> As a second enhancement, try sending the number pairs as they are
> generated, rather than waiting for them all to complete. The relatively
> slow communication hardware has its own formatter and shift register -
> and most likely a FIFO. It will take care of issuing the bits while the
> processor does other things - like computing FFTs. To make this work,
> you might divide the FFT computation into 16 parts. Start sending the
> first part as soon as it is completed.

I will see if I can do something here. As it stands, the FFTs are broken up... For one calculation I need to do 80-100 FFTs. Hence my desire to reduce any of the large, time-consuming processes.

> If both of these things worked, you would be 4X faster. Better than
> standing pat.

Much better than standing pat!

> Jim Kuzdrall

Thanks for thinking about this,
Bruce

**
Neither the footer nor anything else in this E-mail is intended to or constitutes an electronic signature and/or legally binding agreement in the absence of an express statement or Autoliv policy and/or procedure to the contrary. This E-mail and any attachments hereto are Autoliv property and may contain legally privileged, confidential and/or proprietary information. The recipient of this E-mail is prohibited from distributing, copying, forwarding or in any way disseminating any material contained within this E-mail without prior written permission from the author. If you receive this E-mail in error, please immediately notify the author and delete this E-mail. Autoliv disclaims all responsibility and liability for the consequences of any person who fails to abide by the terms herein.
**
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:44:49 AM:
> Bruce Labitt <bruce.lab...@myfairpoint.net> wrote:
> > What I'm trying to do: Optimizer for a radar power spectral density
> > problem
> > Problem: FFTs required in optimization loop take too long on current
> > workstation for the optimizer to even be viable.
> > Attempted solution: FFT engine on remote server to reduce overall
> > execution time. Builds client-server app implementing above solution.
> > Server uses OpenMP and FFTW to exploit all cores. [...] Implements
> > better binary packing/unpacking in code. Stuff works.
> > Nit in solution: TCP transport time >> FFT execution time, rendering
> > attempted solution non-viable [...]
> > Hey, that is my bigger picture... Any and all suggestions are
> > appreciated. Undoubtedly, a few dumb questions will follow. I appear
> > to be good at it. :P Maybe this context will help list subscribers
> > frame their answers if they have any, or ask insightful questions.
> I don't understand anything about your domain of application, so take
> this for what it's worth... I've gleaned the following from the
> previous posts. Is it a fair summary?

Not exactly; minor corrections below.

> - The local FFT is taking ~200 ms, which isn't fast enough.

The remote FFT takes 200 ms. The local machine takes ~20 sec.

> - The remote FFT is substantially faster than this once the data gets
>   there.

Remote is 100x faster than local.

> - However, it takes substantially longer (~1.2 seconds) to move the
>   data than to process it locally.

Transfer time to the server is indeed ~1.2 sec if running at full link bandwidth. I have yet to achieve this in my application. I'd be 'deliriously happy' to get this bandwidth. Could I use more bandwidth? Yes.

> What does "fast enough" mean here? What is your time budget per data
> set?

This is a very good question. ITBP (in the big picture) I need to compute 100 FFTs for 1 data set. This data will be fed to an optimizer, which will modify coefficients and ask for another 100 FFTs.
If things are set up properly, stuff eventually converges. If not, it can be ages before you know you have a poor run. This makes debug difficult. So I like to build in some sort of parameter that I can monitor for simulation progress. Hmm, I digressed. At 20 sec/cycle it takes about 40 minutes to do the 100 FFTs that make a dataset. If I could get ~2 sec per FFT cycle, I'd be much happier. It would be a lot easier to debug and watch the convergence indicators.

> Is it only constrained by catching and cooking one data set before it
> is overwritten by a new one (or before you choke on the stream buffers
> :) )? Are there latency/timeliness requirements from downstream? If so,
> what are they? Provided your processing rate keeps up with the arrival
> rate, how far behind can you afford to deliver results? (i.e. how much
> pipelining is permitted in a solution?)
> How fast is the remote FFT? I didn't catch a number for this one. Or
> was the 200 ms the remote processing time? (In which case, what's the
> local processing time?)
> Do you have the actual server you're targeting to benchmark this on?
> This helps to frame the external requirements more clearly.
> You've stated the problem in the implementation domain. It sounds like
> your range of solutions could leave very little headroom. My
> instinctive response is to ask: is there a more frugal approach in the
> application domain?

This is an insightful observation. Good question!

> Do you need to grind down the whole field of potential interest? Are
> there ways to narrow and intensify your focus partway through? Perhaps
> do a much faster but weaker FFT, analyze it quickly to identify a
> narrower problem of interest, and then do the slower, much stronger FFT
> on a lot less data? Reducing the data load for the hard part may help
> with on-chip or off-chip solutions. It may also help to identify hybrid
> solutions.
> Alternatively, a mid-stream focusing analysis might be so expensive as
> to negate the benefit, or any performant mid-stream analysis might be
> merely a too-risky heuristic, or the problem may simply not lend itself
> to that kind of decomposition. You did say that you had already
> encountered a number of dead ends - this may be familiar ground :)

Believe me, a lot of it is familiar :) Nonetheless, things are worth mulling over. Despite the time invested, nothing really is cast in concrete...

> I don't know your domain. I don't have answers, just questions. I just
> figured those kinds of questions were worth asking before we try
> squeezing the last Mbps out of the network...
> Lupestro

In my current solution space, the network transport IS the dominant bottleneck. I truly was not expecting such slow performance there. Of course, once 'fixed' there will be a new bottleneck. Some bottlenecks cannot be corrected; either one accepts the performance,
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Jim Kuzdrall <gnh...@intrel.com> wrote on 10/13/2009 02:04:36 PM:
> On Tuesday 13 October 2009 10:54, bruce.lab...@autoliv.com wrote:
> > gnhlug-discuss-boun...@mail.gnhlug.org wrote on 10/13/2009 08:13:12 AM:
> > > Greetings Bruce, Still thinking about your problem...
> > Well, thank you for that!
> > Data is converted to string prior to transmit. This is an area for
> > improvement. However, the timing numbers I've indicated 'should' be
> > just the transmission of the string buffer.
>
> If I can remember my C programming, you do something non-portable like:
>
>     ...
>     /* number of pairs (a compile-time constant, since union members
>        need a fixed size) */
>     #define NPAIR 128
>
>     /* the double forces boundary alignment */
>     union {
>         double align;
>         float ansf[NPAIR][2];
>         char  ansc[NPAIR * 2 * 4];
>     } answer;
>
>     /* check that they are the same size (this does not guarantee that
>        they are aligned right; check the received data for that) */
>     printf("float array size is %d and text array size is %d\n",
>            (int)sizeof(answer.ansf), (int)sizeof(answer.ansc));
>
>     cnt = fwrite((char *)answer.ansc, 4 * 2, NPAIR, output_dev);
>     printf("The count was %d and it should be %d\n", cnt, NPAIR);
>     exit(0);
>     ...

The transfer is complete, the numbers do get across, and they are even correct. I separated the problem into two parts - packing into strings, and sending the string over the socket. The packing works well enough for now. It is just the network transfer rate that isn't fast enough.

> Wow, C seems so unfamiliar after a few years away from it. I hope that
> gives you the gist of what I am suggesting to try. There may be a
> better choice of functions, but I thought a for-loop on fputc() would
> cost a lot more cpu cycles.

fputc is REALLY slow. fwrite using big blocks is MUCH faster, orders of magnitude faster.

> To get this to work, you may have to get the FFT program and the I/O
> program to be separate processes - a pipe or socket or something does
> that. Maybe somebody more experienced will make a suggestion.
> I got the speed increase more optimistic than justified, but you can
> get about 4x by adding another data link. Send the odd chunks to one
> and the even chunks to the other.

I have been toying with sending the real and imaginary parts on separate sockets in their own threads. That is what psockets was supposed to do (striped network transfer). It still bugs me that the utilization of the network is so bad. A striped transfer, even if it is just two stripes, will be better than just one socket. That is, if I can get it to work... If I were to go to multiple sockets, I might as well use someone else's library; after all, why reinvent the wheel?

I still think it is a misconfigured/mistuned system. I hope it is just a mismatch of parameters between the two computers. Still, the most likely answer is that the problem lies in my code. Time to dive in again...

Thanks!
Bruce
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
IBM was working on a Cell on a card to plug into a motherboard for use in supercomputer clusters. Much like Nvidia's Tesla GPU cards.

On 10/11/09, Jim Kuzdrall <gnh...@intrel.com> wrote:
> Greetings Bruce,
> Interesting and challenging project!
> On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
> > For anyone that is remotely interested, here is the big picture for
> > the problem I'm trying to solve. If you are not interested, hey,
> > delete the post. Won't irritate me in the least!
> If you just transferred the data (no framing or error checking), how
> many bits per second must you transfer to keep up with the FFT data
> production?
> Did you explore adding a dedicated FFT card to your control computer?
> The algorithms they build into the hardware are much, much faster than
> compiled software. The local board would keep the data in your control
> computer - with DMA, I assume - eliminating the transfer problem.
> I know a fellow who now works for Apple whose job is to optimize FFT
> algorithms to the processor they use. Assembly language, of course. Why
> is Apple interested? Faster FFT, faster MP3 translation, longer battery
> life. A very high payoff.
> Jim Kuzdrall

--
Sent from my mobile device
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Tom Buskey wrote:
> IBM was working on a Cell on a card to plug into a motherboard for use
> in supercomputer clusters. Much like Nvidia's Tesla GPU cards.

Fixstars, formerly known as Terra Soft Solutions, has an equivalent item: http://www.fixstars.com/en/products/gigaaccel180/

If this was around when I started, it would have been good. However, for the moment I am stuck with http://www.fixstars.com/en/products/bladecenter/qs22/

If I had the high speed link, I'd be golden. Right now, I don't have the budget for it. So back to the slow 1Gb network stuff... :(

> On 10/11/09, Jim Kuzdrall <gnh...@intrel.com> wrote:
> > Greetings Bruce,
> > Interesting and challenging project!
> > On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
> > > For anyone that is remotely interested, here is the big picture for
> > > the problem I'm trying to solve. If you are not interested, hey,
> > > delete the post. Won't irritate me in the least!
> > If you just transferred the data (no framing or error checking), how
> > many bits per second must you transfer to keep up with the FFT data
> > production?
> > Did you explore adding a dedicated FFT card to your control computer?
> > The algorithms they build into the hardware are much, much faster
> > than compiled software. The local board would keep the data in your
> > control computer - with DMA, I assume - eliminating the transfer
> > problem.
> > I know a fellow who now works for Apple whose job is to optimize FFT
> > algorithms to the processor they use. Assembly language, of course.
> > Why is Apple interested? Faster FFT, faster MP3 translation, longer
> > battery life. A very high payoff.
> > Jim Kuzdrall
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Greetings Bruce,

Interesting and challenging project!

On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
> For anyone that is remotely interested, here is the big picture for the
> problem I'm trying to solve. If you are not interested, hey, delete the
> post. Won't irritate me in the least!

If you just transferred the data (no framing or error checking), how many bits per second must you transfer to keep up with the FFT data production?

Did you explore adding a dedicated FFT card to your control computer? The algorithms they build into the hardware are much, much faster than compiled software. The local board would keep the data in your control computer - with DMA, I assume - eliminating the transfer problem.

I know a fellow who now works for Apple whose job is to optimize FFT algorithms to the processor they use. Assembly language, of course. Why is Apple interested? Faster FFT, faster MP3 translation, longer battery life. A very high payoff.

Jim Kuzdrall
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Bruce Labitt <bruce.lab...@myfairpoint.net> writes:
> What I'm trying to do: Optimizer for a radar power spectral density
> problem
> Problem: FFTs required in optimization loop take too long on current
> workstation for the optimizer to even be viable.
> Attempted solution: FFT engine on remote server to reduce overall
> execution time. Builds client-server app implementing above solution.
> Server uses OpenMP and FFTW to exploit all cores. [...] Implements
> better binary packing/unpacking in code. Stuff works.
> Nit in solution: TCP transport time >> FFT execution time, rendering
> attempted solution non-viable
> Researches TCP optimization: Reads countless papers on TCP optimization
> techniques... Fails to find a robust solution or methodology for the
> problem. Tries most techniques written in papers, only realizing a 10%
> gain. Not good enough. Still needs to be faster.
> Driven to more exotic techniques to reduce transport time. Explores
> parallel sockets, other techniques [...]
> Hey, that is my bigger picture... Any and all suggestions are
> appreciated. Undoubtedly, a few dumb questions will follow. I appear to
> be good at it. :P Maybe this context will help list subscribers frame
> their answers if they have any, or ask insightful questions.

Where exactly does NFS fit into this?

--
"Don't be afraid to ask" (Lf.((Lx.xx) (Lr.f(rr.
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Have you considered using a fast compression/decompression algorithm before you transmit - one that isn't too computationally intensive for either compression or decompression? You won't get high compression (a factor of 10) as you might with slower ones, but if you get even a factor of 2, you've just doubled the effective network speed.

It appears to me that this would only gain you 50% or so, as the obvious fast compression algorithms are only about twice as fast (http://www.quicklz.com/bench.html) as Gigabit Ethernet's theoretical speed, but it's worth consideration. The decompression is faster than the compression, which would hopefully result in a net win on processing time on the remote server.

--DTVZ

On Sun, Oct 11, 2009 at 10:29 AM, Joshua Judson Rosen <roz...@geekspace.com> wrote:
> Bruce Labitt <bruce.lab...@myfairpoint.net> writes:
> > What I'm trying to do: Optimizer for a radar power spectral density
> > problem
> > Problem: FFTs required in optimization loop take too long on current
> > workstation for the optimizer to even be viable.
> > Attempted solution: FFT engine on remote server to reduce overall
> > execution time. Builds client-server app implementing above solution.
> > Server uses OpenMP and FFTW to exploit all cores. [...] Implements
> > better binary packing/unpacking in code. Stuff works.
> > Nit in solution: TCP transport time >> FFT execution time, rendering
> > attempted solution non-viable
> > Researches TCP optimization: Reads countless papers on TCP
> > optimization techniques... Fails to find a robust solution or
> > methodology for the problem. Tries most techniques written in papers,
> > only realizing a 10% gain. Not good enough. Still needs to be faster.
> > Driven to more exotic techniques to reduce transport time. Explores
> > parallel sockets, other techniques [...]
> > Hey, that is my bigger picture... Any and all suggestions are
> > appreciated. Undoubtedly, a few dumb questions will follow. I appear
> > to be good at it.
> > :P Maybe this context will help list subscribers frame their answers
> > if they have any, or ask insightful questions.
>
> Where exactly does NFS fit into this?
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Lloyd Kvam wrote:
> On Sat, 2009-10-10 at 15:20 -0400, Bruce Labitt wrote:
> > Nit in solution: TCP transport time >> FFT execution time, rendering
> > attempted solution non-viable
> > Researches TCP optimization: Reads countless papers on TCP
> > optimization techniques... Fails to find a robust solution or
> > methodology for the problem. Tries most techniques written in papers,
> > only realizing a 10% gain. Not good enough. Still needs to be faster.
> > Driven to more exotic techniques to reduce transport time. Explores
> > parallel sockets, other techniques
> Does a simple netcat transfer go fast enough? In other words, can
> normal TCP in a simple case do the job?

I'll try it when I go back. Netperf seems to indicate I can do it - I got ~770 Mbit/sec. It is kind of baffling that my stuff only gets 141 Mbps... :( I was avoiding looking at the netperf source, but now appears to be a good time...

> If not, would you be better off talking raw Ethernet? Presumably you do
> not need the routing capabilities of TCP/IP. As I understand it, your
> client and server are on the same LAN.
> http://aschauf.landshut.org/fh/linux/udp_vs_raw/index.html
> I don't know if this is helpful. Without knowing the timings from basic
> test cases, it's hard to know where to find the best point of attack.

Thanks for the suggestion and link. I'll check it out.

-Bruce
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Jim Kuzdrall wrote:
> Greetings Bruce,
> Interesting and challenging project!
> On Saturday 10 October 2009 15:20, Bruce Labitt wrote:
> > For anyone that is remotely interested, here is the big picture for
> > the problem I'm trying to solve. If you are not interested, hey,
> > delete the post. Won't irritate me in the least!
> If you just transferred the data (no framing or error checking), how
> many bits per second must you transfer to keep up with the FFT data
> production?

For the problems I'm doing now, the net cannot keep up. At 800 Mbps it would take ~1.6 sec to push the data, and the engine computes a ~10M point complex double precision FFT in ~200 ms. 10Gb Ethernet would be nice, but I don't have the budget for it. Even then, the transport would be 0.16 sec vs the 0.2 s compute.

> Did you explore adding a dedicated FFT card to your control computer?
> The algorithms they build into the hardware are much, much faster than
> compiled software. The local board would keep the data in your control
> computer - with DMA, I assume - eliminating the transfer problem.

I will look into it again. Maybe the landscape has changed. At one point I had to do 128M point FFTs - there wasn't any hardware to do that!

> I know a fellow who now works for Apple whose job is to optimize FFT
> algorithms to the processor they use. Assembly language, of course. Why
> is Apple interested? Faster FFT, faster MP3 translation, longer battery
> life. A very high payoff.
> Jim Kuzdrall

I am using the open source FFTW. It is quite fast, and it uses the platform's assets quite effectively. Fortunately, it has been optimized for the Cell processor. It runs 50-100X faster on my Cell than on my 3.4 GHz P4, or whatever boat anchor I have. I also tested the software on a couple of our servers. The ratio is still way up near 50x. The problem is that the cache gets exhausted and then the memory bus bandwidth gets saturated; this forms the upper limit of performance for the P4 / AMD64 class machines.

The problem is indeed quite challenging.
I've gone down quite a few dead ends. The list has seen some of my dead-end attempts, but not all of them :) I spared you some...

Bruce
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Joshua Judson Rosen wrote:
> Bruce Labitt <bruce.lab...@myfairpoint.net> writes:
> > What I'm trying to do: Optimizer for a radar power spectral density
> > problem
> > Problem: FFTs required in optimization loop take too long on current
> > workstation for the optimizer to even be viable.
> > Attempted solution: FFT engine on remote server to reduce overall
> > execution time. Builds client-server app implementing above solution.
> > Server uses OpenMP and FFTW to exploit all cores. [...] Implements
> > better binary packing/unpacking in code. Stuff works.
> > Nit in solution: TCP transport time >> FFT execution time, rendering
> > attempted solution non-viable
> > Researches TCP optimization: Reads countless papers on TCP
> > optimization techniques... Fails to find a robust solution or
> > methodology for the problem. Tries most techniques written in papers,
> > only realizing a 10% gain. Not good enough. Still needs to be faster.
> > Driven to more exotic techniques to reduce transport time. Explores
> > parallel sockets, other techniques [...]
> > Hey, that is my bigger picture... Any and all suggestions are
> > appreciated. Undoubtedly, a few dumb questions will follow. I appear
> > to be good at it. :P Maybe this context will help list subscribers
> > frame their answers if they have any, or ask insightful questions.
> Where exactly does NFS fit into this?

NFS is used for the file system for the server. The 1Gb link is used for both NFS and the data transport. The data being transferred to the server never exists as a file, so there is no hit for writing/reading to/from a file.
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
Drew Van Zandt wrote: Have you considered using a fast compression/decompression algorithm before you transmit, one that isn't too computationally intensive for either compression or decompression? You won't get high compression (a factor of 10) as you might with slower ones, but if you get even a factor of 2, you've just doubled the effective network speed.

I hadn't thought of this. Typically a lot of the data is random, so compression doesn't help too much. Or at least, that is what I thought. If I'm wrong, I hope someone will cheerfully correct me :)

I did do an experiment that had curious results. Instead of sending double precision binary data, I sent single precision, or 'float'. I was expecting to halve my transmission time, since it is half the size. Instead, there was only a 10-15% speed increase, not 100%. This result is telling me something, although at this time, I'm too brain dead to really ascertain what it means.

It appears to me that this would only gain you 50% or so, as the obvious fast compression algorithms are only about twice as fast (http://www.quicklz.com/bench.html) as Gigabit Ethernet's theoretical speed, but it's worth consideration. The decompression is faster than the compression, which would hopefully result in a net win on processing time on the remote server. --DTVZ

Thanks for your suggestions. Compression would certainly help out. Bruce
Re: FWIW: The bigger picture... Or why I have been asking a lot of questions lately...
On Sat, 2009-10-10 at 15:20 -0400, Bruce Labitt wrote: Nit in solution: TCP transport time > FFT execution time, rendering attempted solution non-viable. Researches TCP optimization: Reads countless papers on TCP optimization techniques... Fails to find a robust solution or methodology for the problem. Tries most techniques written in papers, only realizing a 10% gain. Not good enough. Still needs to be faster. Driven to more exotic techniques to reduce transport time. Explores parallel sockets, other techniques.

Does a simple netcat transfer go fast enough? In other words, can normal TCP in a simple case do the job? If not, would you be better off talking raw Ethernet? Presumably you do not need the routing capabilities of TCP/IP. As I understand it, your client and server are on the same LAN. http://aschauf.landshut.org/fh/linux/udp_vs_raw/index.html I don't know if this is helpful. Without knowing the timings from basic test cases, it's hard to know where to find the best point of attack.

-- Lloyd Kvam Venix Corp DLSLUG/GNHLUG library http://dlslug.org/library.html http://www.librarything.com/catalog/dlslug http://www.librarything.com/rsshtml/recent/dlslug http://www.librarything.com/rss/recent/dlslug