I ran into something similar that turned out to be an issue with our parallel file system, a version of Lustre older than 2.7. The Lustre clients in those versions lock the output file, so multiple ranks on the same compute node get serialized. Here is the start of that thread: https://www.mail-archive.com/[email protected]/msg11807.html. This sounds different, but I thought I'd mention it in case it helps with diagnosing the issue. It might be interesting to spread your job across multiple compute nodes and see how it scales.
best,
David Schneider

________________________________________
From: Hdf-forum [[email protected]] on behalf of Peter Colberg [[email protected]]
Sent: Sunday, June 28, 2015 4:38 PM
To: [email protected]
Subject: [Hdf-forum] Slow H5Ocopy using parallel HDF5

Dear HDF developers,

I have stumbled upon a grave performance bug in H5Ocopy when using parallel HDF5. Please see the attached test programs for reproducing the issue.

In my MPI program I achieve collective write speeds of 2000 MB/s from 16 nodes on a GPFS filesystem, so parallel HDF5 is working fine in general. However, when copying datasets between two parallel files, the copy time increases roughly linearly with the number of nodes.

In the following, each test was repeated 10 times and the smallest time was chosen. The environment was parallel HDF5 1.8.14, Intel MPI 4.1.2.040, GPFS 3.5.0 and CentOS 6.4 on Linux x86_64.

Consider first a small compact dataset (32K):

# mpirun -np 1 -ppn 1 ./h5copy_mpio_compact
0.0292 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio_compact
0.0343 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio_compact
0.0411 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio_compact
0.0409 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio_compact
0.0407 s

The copy time is constant with the number of MPI nodes. The dataset has a compact layout, so it consists purely of metadata. This test indicates that metadata copying is working fine.

Now consider a larger contiguous dataset (32M):

# mpirun -np 1 -ppn 1 ./h5copy_mpio
0.0723 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio
0.371 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio
1.91 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio
4.02 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio
9.49 s

The copy time increases roughly linearly with the number of MPI nodes, even though the amount of raw data being copied is the same in all cases.

Could it be that all processes are trying to write the same raw data to the destination object, causing serious write contention? I would expect that, while all processes copy the metadata to their respective metadata caches, only one process copies the raw data to the output file. However, while trying to understand the source code of H5Ocopy, I could not find any special handling of the MPIO case.

Can you reproduce the issue on your parallel filesystem? Which part of H5Ocopy might be causing the issue?

Regards,
Peter
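[Editor's note: the test programs attached to the original message are not reproduced in the archive. The sketch below is a minimal stand-in for the contiguous-dataset case along the lines described above, not the original attachment: the file names ("source.h5", "dest.h5"), the dataset name "data", the 32 MiB size, and the choice to have rank 0 populate the source data independently are all assumptions. Every rank opens both files with the MPI-IO driver and participates in the collective H5Ocopy/H5Fclose calls. For the compact variant one would instead create a small dataset with a dataset creation property list on which H5Pset_layout(dcpl, H5D_COMPACT) has been set. Build with h5pcc and run under mpirun as shown above.]

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* File access property list selecting the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* Create a source file with one 32 MiB contiguous dataset.
       File and dataset creation are collective; the raw data is
       written independently by rank 0 only. */
    hsize_t n = 8 * 1024 * 1024;  /* 8 Mi ints = 32 MiB */
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t src = H5Fcreate("source.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    hid_t dset = H5Dcreate2(src, "data", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    if (rank == 0) {
        int *buf = malloc(n * sizeof *buf);
        for (hsize_t i = 0; i < n; i++)
            buf[i] = (int)i;
        H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        free(buf);
    }
    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(src);

    /* Reopen the source read-only and copy the dataset into a second
       parallel file, timing the copy plus the flush on close. */
    hid_t srcr = H5Fopen("source.h5", H5F_ACC_RDONLY, fapl);
    hid_t dst = H5Fcreate("dest.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    H5Ocopy(srcr, "data", dst, "data", H5P_DEFAULT, H5P_DEFAULT);
    H5Fclose(dst);   /* flushes the copied raw data and metadata */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("%.3g s\n", t1 - t0);

    H5Fclose(srcr);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}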
