On 2010-04-14, at 11:08, Ronald K Long wrote:
> We've narrowed down the problem quite a bit.
>
> The problematic code snippet is not actually doing any reads or writes;
> it's just doing a massive number of fseek() operations within a couple
> of nested loops. (Note: The production code is doing some I/O, but this
> snippet was narrowed down to the bare minimum example that exhibited the
> problem, which was how we discovered that fseek was the culprit.)
>
> The issue appears to be the behavior of the glibc implementation of
> fseek(). Apparently, a call to fseek() on a buffered file stream causes
> glibc to flush the stream (regardless of whether a flush is actually
> needed). If we modify the snippet to call setvbuf() and disable
> buffering on the file stream before any of the fseek() calls, then it
> finishes more or less instantly, as you would expect.
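A minimal sketch of the setvbuf() workaround described above, in C. It assumes the caller holds the FILE * immediately after fopen() or tmpfile(), since setvbuf() must come before any other operation on the stream. The helper names disable_buffering() and seek_many() are hypothetical, and seek_many() only mimics the shape of the reduced test case (nested loops of fseek() with no reads or writes); it is not the HDF code.

```c
#include <stdio.h>

/* Switch a freshly opened stream to unbuffered mode (_IONBF), as in the
 * workaround described above. Must be called before any other operation
 * on the stream. Returns 0 on success, nonzero on failure. */
static int disable_buffering(FILE *fp)
{
    return setvbuf(fp, NULL, _IONBF, 0);
}

/* Hypothetical stand-in for the reduced test case: many fseek() calls in
 * nested loops, with no reads or writes. Returns how many seeks succeeded. */
static long seek_many(FILE *fp, int outer, int inner, long stride)
{
    long ok = 0;
    for (int i = 0; i < outer; i++) {
        for (int j = 0; j < inner; j++) {
            /* Absolute offset of the (i, j) position, in bytes. */
            long offset = ((long)i * inner + j) * stride;
            if (fseek(fp, offset, SEEK_SET) == 0)
                ok++;
        }
    }
    return ok;
}
```

For example, calling disable_buffering(fp) on a stream from tmpfile() and then seek_many(fp, 10, 100, 4096) performs 1000 seeks and leaves ftell(fp) at 999 * 4096. One trade-off to keep in mind: with _IONBF every fread()/fwrite() goes straight to a system call, so code that mixes the seeks with many small reads may get slower. An explicit smaller buffer via setvbuf(fp, NULL, _IOFBF, size) might be a middle ground, but that has not been measured here. In the HDF case the call would also have to go wherever the library opens its files.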
I'd encourage you to file a bug (preferably with a patch) against glibc to
fix this. I've had reasonable success in getting problems like this fixed
upstream.

> The problem is that this offending code is actually buried deep within a
> COTS library that we're using to do image processing (the Hierarchical
> Data Format (HDF) library). While we do have access to the source code
> for this library and could conceivably modify it, this is a large and
> complex library, and a change of this nature would require us to do a
> large amount of regression testing to ensure that nothing was broken.
>
> So at the end of the day this is really not a "Lustre problem" per se,
> though we would still be interested in any suggestions as to how we can
> minimize the effects of this glibc "flush penalty". This penalty is not
> particularly onerous when reading and writing to local disk, but is
> obviously more of an issue with a distributed filesystem.

Similarly, HDF + Lustre usage is very common, and I expect that the HDF
developers would be interested in fixing this if possible.

> On Wed, 2010-04-14 at 07:08 -0500, Ronald K Long wrote:
> >
> > Andreas - Here is a snippet of the strace output.
> >
> > read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2097152) = 2097152
>
> As Andreas suspected, your application is doing 2MB reads every time.
> Does it really need 2MB of data on each read? If not, can you fix your
> application to only read as much data as it actually wants?

Cheers, Andreas
--
Andreas Dilger
Principal Engineer, Lustre Group
Oracle Corporation Canada Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss