Hi Peter,

H5Ocopy was not really intended to be used in parallel; it was written as a 
support routine for the h5copy tool, which operates in serial mode.
So yes, the performance hit you are seeing is just that: every process does its 
own copy of the same data, i.e. we do not support the parallel use case.
As a workaround, you can open the files with process 0 only, do the copy, close 
them, and then reopen the file with all processes, if that helps. You can also 
use the h5copy tool to do the copy outside your program. A rough sketch of the 
first approach is below.
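
Something along these lines should work (an untested sketch; file and object 
names are placeholders and error checking is omitted):

#include <hdf5.h>
#include <mpi.h>

/* Untested sketch: rank 0 copies the object serially, then all ranks
   reopen the destination file in parallel.  Names are placeholders. */
void copy_then_reopen(MPI_Comm comm, const char *src_name,
                      const char *dst_name, const char *obj_name)
{
    int rank;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* Default (serial) file access: only rank 0 touches the files. */
        hid_t src = H5Fopen(src_name, H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dst = H5Fopen(dst_name, H5F_ACC_RDWR, H5P_DEFAULT);
        H5Ocopy(src, obj_name, dst, obj_name, H5P_DEFAULT, H5P_DEFAULT);
        H5Fclose(dst);
        H5Fclose(src);
    }

    /* Make sure the copy has completed before anyone reopens the file. */
    MPI_Barrier(comm);

    /* Reopen collectively with the MPIO driver for the parallel phase. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t dst = H5Fopen(dst_name, H5F_ACC_RDWR, fapl);
    /* ... parallel reads/writes on dst ... */
    H5Fclose(dst);
    H5Pclose(fapl);
}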

I have entered an issue for parallel support and improvements to H5Ocopy() in 
our Jira database (HDFFV-9435). To be honest, I am not sure we will have time 
to address the parallel case unless someone funds it, since this is not a 
high-priority feature at the moment.

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of 
Peter Colberg
Sent: Sunday, June 28, 2015 6:38 PM
To: [email protected]
Subject: [Hdf-forum] Slow H5Ocopy using parallel HDF5

Dear HDF developers,

I have stumbled upon a grave performance bug in H5Ocopy when using parallel 
HDF5. Please see the attached test programs for reproducing the issue.
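
In essence, each test program opens a source and a destination file with the 
MPIO driver and times a single collective H5Ocopy of one dataset. A simplified 
sketch (file and dataset names are placeholders, error checking omitted) looks 
roughly like this:

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Open both files collectively with the MPIO driver.
       "src.h5" is assumed to already contain a dataset "data". */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t src = H5Fopen("src.h5", H5F_ACC_RDONLY, fapl);
    hid_t dst = H5Fcreate("dst.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Time the collective copy of the dataset. */
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    H5Ocopy(src, "data", dst, "data", H5P_DEFAULT, H5P_DEFAULT);
    MPI_Barrier(MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("%.3g s\n", t1 - t0);

    H5Fclose(dst);
    H5Fclose(src);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}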

In my MPI program I achieve collective write speeds of 2000 MB/s from 16 nodes 
on a GPFS filesystem, so parallel HDF5 is working fine in general. However, 
when copying datasets between two parallel files with H5Ocopy, the copy time 
increases roughly linearly with the number of nodes.

In the timings below, each test was repeated 10 times and the smallest time 
was taken. The environment was Parallel HDF5 1.8.14, Intel MPI 4.1.2.040, GPFS 
3.5.0 and CentOS 6.4 on Linux x86_64.

Consider first a small compact dataset (32K):

# mpirun -np 1 -ppn 1 ./h5copy_mpio_compact
0.0292 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio_compact
0.0343 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio_compact
0.0411 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio_compact
0.0409 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio_compact
0.0407 s

The copy time stays constant as the number of MPI nodes grows. The dataset has 
a compact layout and thus consists purely of metadata, so this test indicates 
that copying metadata works fine.
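
For reference, a compact dataset of this size would presumably be created with 
a dataset creation property list along these lines (a sketch; names and the 
element type are placeholders):

/* Sketch: create a 32K dataset with compact layout, so the raw data
   is stored inside the object header (i.e. as metadata). */
hsize_t dims[1] = {32 * 1024 / sizeof(int)};
hid_t space = H5Screate_simple(1, dims, NULL);
hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_layout(dcpl, H5D_COMPACT);
hid_t dset  = H5Dcreate2(src, "data", H5T_NATIVE_INT, space,
                         H5P_DEFAULT, dcpl, H5P_DEFAULT);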

Now consider a larger contiguous dataset (32M):

# mpirun -np 1 -ppn 1 ./h5copy_mpio
0.0723 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio
0.371 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio
1.91 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio
4.02 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio
9.49 s

The copy time increases roughly linearly with the number of MPI nodes, even 
though the size of the raw data being copied is the same for all cases. Could 
it be that all processes are trying to write the same raw data to the 
destination object, causing serious write contention?

I would expect that while all processes copy the metadata into their 
respective metadata caches, only one process copies the raw data to the output 
file. However, while trying to understand the source code of H5Ocopy, I could 
not find any special handling of the MPIO case.

Can you reproduce the issue on your parallel filesystem?

Which part of H5Ocopy might be causing the issue?

Regards,
Peter
