Hi,

On Franklin, a Cray XT at NERSC with a Lustre /scratch filesystem, we have noticed excessively long return times on ftruncate calls issued through HDF5 or the MPI-IO layer (through MPI_File_set_size(), for instance). Here is an I/O trace plot that shows 235GB written to a shared HDF5 file in 65s, followed by an ftruncate that lasts about 50s:
http://vis.lbl.gov/~mhowison/vorpal/n2048.cb.align.183/tag.png

(Full details: with collective buffering enabled in the MPI-IO layer, the I/O pattern is essentially a series of 4MB writes issued from 48 nodes that have been designated as aggregator/writer nodes. The number of writer nodes matches the 48 OSTs that store the file, and the write size matches the 4MB stripe width. This sets up a pattern that *looks* to the OSTs essentially the same as if we had 48 single-stripe files and 48 nodes each writing to its own file. This has been the most effective way we have found to stage shared-file writes on Lustre.)

However, we've also seen this long-ftruncate problem with several other I/O patterns besides collective buffering in MPI-IO: for instance, when bypassing MPI-IO in HDF5 and instead using the MPI-POSIX driver, and with unstructured 1D grids.

Any ideas on what might cause these long ftruncates? We plan to analyze LMT data from the metadata server to determine whether it is simply contention with other users, but the consistency and magnitude of these hangs make us suspicious.

Thanks,
Mark Howison
NERSC Analytics
[email protected]

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
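P.S. For anyone who wants to reproduce the symptom outside of HDF5/MPI-IO, here is a minimal sketch (not code from our runs) that times a bare POSIX ftruncate on a scratch file, to separate filesystem latency from anything the higher layers add. The path and sizes below are placeholders, not the ones from the trace above.

```c
/* Minimal sketch: time a bare POSIX ftruncate() call.
 * The path and sizes are placeholders; point the path at a Lustre
 * /scratch file to measure the behavior discussed in this thread. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/trunc_test"; /* stand-in for a /scratch path */
    int fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* Grow the file first, then time the truncate back down. */
    if (ftruncate(fd, 64L * 1024 * 1024) != 0) { perror("grow"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    int rc = ftruncate(fd, 4L * 1024 * 1024);
    gettimeofday(&t1, NULL);
    if (rc != 0) { perror("ftruncate"); return 1; }

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("ftruncate took %.6f s\n", secs);

    close(fd);
    unlink(path);
    return 0;
}
```

On a local filesystem this should return almost instantly; a 50s gap on Lustre would point at the MDS/OST side rather than at HDF5 or MPI-IO.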
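P.P.S. For reference, the aggregator/stripe matching described above corresponds to settings along these lines. This is a sketch only: the path is a placeholder, older Lustre uses -s where newer releases use -S for stripe size, and the exact hint spellings may vary across MPT versions.

```shell
# Pre-create the shared file with 48 stripes of 4MB each,
# matching the 48 OSTs (path is a placeholder):
lfs setstripe -c 48 -s 4m /scratch/scratchdirs/$USER/output.h5

# Ask ROMIO for 48 collective-buffering aggregators with a 4MB buffer;
# Cray MPT reads MPI-IO hints from MPICH_MPIIO_HINTS:
export MPICH_MPIIO_HINTS="*.h5:romio_cb_write=enable:cb_nodes=48:cb_buffer_size=4194304"
```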
