Dear HDF developers,

I have stumbled upon a serious performance bug in H5Ocopy when using
parallel HDF5. Please see the attached test programs, which reproduce
the issue.

In my MPI program I achieve collective write speeds of 2000 MB/s from
16 nodes on a GPFS filesystem, so parallel HDF5 is working fine in
general. However, when copying datasets between two parallel files,
the copy time increases roughly linearly with the number of nodes.
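
(For reference, collective writes of that kind are set up roughly as in
the sketch below; the dataset name, sizes, and helper name are
placeholders, not my actual benchmark code.)

#include <hdf5.h>

/*
 * Sketch only: each rank writes its own hyperslab of a shared 1-D dataset
 * using a collective transfer enabled via H5Pset_dxpl_mpio.
 */
static void collective_write(hid_t file, int rank, int nprocs,
                             const double *chunk, hsize_t chunk_len)
{
  hsize_t dims[1]  = {chunk_len * (hsize_t)nprocs};
  hsize_t start[1] = {chunk_len * (hsize_t)rank};
  hsize_t count[1] = {chunk_len};

  hid_t fspace = H5Screate_simple(1, dims, NULL);
  hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, fspace,
                         H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  hid_t mspace = H5Screate_simple(1, count, NULL);
  H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* collective MPI-IO */
  H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, chunk);

  H5Pclose(dxpl);
  H5Sclose(mspace);
  H5Sclose(fspace);
  H5Dclose(dset);
}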

In the following, each test was repeated 10 times and the smallest
time was taken. The environment was Parallel HDF5 1.8.14, Intel MPI
4.1.2.040, GPFS 3.5.0, and CentOS 6.4 on Linux x86_64.

Consider first a small compact dataset (32K):

# mpirun -np 1 -ppn 1 ./h5copy_mpio_compact
0.0292 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio_compact
0.0343 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio_compact
0.0411 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio_compact
0.0409 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio_compact
0.0407 s

The copy time remains constant as the number of MPI nodes increases.
The dataset has a compact layout, so it consists purely of metadata.
This test indicates that metadata copying works fine.

Now consider a larger contiguous dataset (32M):

# mpirun -np 1 -ppn 1 ./h5copy_mpio
0.0723 s
# mpirun -np 2 -ppn 1 ./h5copy_mpio
0.371 s
# mpirun -np 4 -ppn 1 ./h5copy_mpio
1.91 s
# mpirun -np 8 -ppn 1 ./h5copy_mpio
4.02 s
# mpirun -np 16 -ppn 1 ./h5copy_mpio
9.49 s

The copy time increases roughly linearly with the number of MPI nodes,
even though the amount of raw data being copied is the same in all
cases. Could it be that all processes try to write the same raw data
to the destination object, causing serious write contention?

I would expect that, while all processes copy the metadata into their
respective metadata caches, only one process copies the raw data to
the output file (see the sketch below). However, when reading through
the source code of H5Ocopy, I could not find any special handling of
the MPIO case.
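
For illustration, the following sketch (untested, with a made-up
helper name) does the copy by hand in the way I would expect: every
rank takes part in the collective dataset creation, but only rank 0
selects any elements, so the raw data is transferred once. It reuses
file1, file2 and rank from the attached h5copy_mpio.c.

#include <hdf5.h>
#include <stdlib.h>

static void copy_position_by_hand(hid_t file1, hid_t file2, int rank)
{
  hid_t src    = H5Dopen(file1, "position", H5P_DEFAULT);
  hid_t fspace = H5Dget_space(src);
  /* dataset creation is collective, so all ranks call it */
  hid_t dst    = H5Dcreate(file2, "position", H5T_NATIVE_DOUBLE, fspace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

  hsize_t npoints = (hsize_t)H5Sget_simple_extent_npoints(fspace);
  double *buf = malloc(npoints * sizeof(double));
  hid_t mspace = H5Scopy(fspace);

  if (rank == 0) {
    /* rank 0 reads the full extent and will write it once */
    H5Dread(src, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);
  } else {
    /* the other ranks select nothing and transfer no raw data */
    H5Sselect_none(fspace);
    H5Sselect_none(mspace);
  }
  /* independent transfer; ranks with an empty selection write nothing */
  H5Dwrite(dst, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

  free(buf);
  H5Sclose(mspace);
  H5Sclose(fspace);
  H5Dclose(src);
  H5Dclose(dst);
}

Replacing the H5Ocopy call in the attached test with such a manual
copy would presumably avoid the contention, but of course the point of
H5Ocopy is not having to do this by hand.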

Can you reproduce the issue on your parallel filesystem?

Which part of H5Ocopy might be causing the issue?

Regards,
Peter

h5copy_mpio.c:

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

/*
 * This program measures the time for copying a dataset of 32M size
 * between two parallel HDF5 files. The dataset has a contiguous layout,
 * which means the dataset object consists of metadata and raw data.
 *
 * mpicc -Wall -O2 -o h5copy_mpio h5copy_mpio.c -lhdf5
 */
int main(int argc, char **argv)
{
  double start, stop;
  int rank;
  hid_t file1, file2, fapl, space, dset;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

  file1 = H5Fcreate("h5copy_mpio_1.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  hsize_t dims[1] = {4194304};
  space = H5Screate_simple(1, dims, NULL);
  dset = H5Dcreate(file1, "position", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
  H5Sclose(space);
  H5Dclose(dset);

  H5Fflush(file1, H5F_SCOPE_LOCAL);
  H5Fclose(file1);

  file1 = H5Fopen("h5copy_mpio_1.h5", H5F_ACC_RDONLY, fapl);
  file2 = H5Fcreate("h5copy_mpio_2.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  /* time the collective copy of the contiguous dataset, including the flush */
  start = MPI_Wtime();
  H5Ocopy(file1, "position", file2, "position", H5P_DEFAULT, H5P_DEFAULT);
  H5Fflush(file2, H5F_SCOPE_LOCAL);
  stop = MPI_Wtime();
  if (rank == 0) printf("%.3g s\n", stop-start);

  H5Fclose(file1);
  H5Fclose(file2);
  H5Pclose(fapl);

  MPI_Finalize();
  return 0;
}

h5copy_mpio_compact.c:

#include <hdf5.h>
#include <mpi.h>
#include <stdio.h>

/*
 * This program measures the time for copying a dataset of 32K size
 * between two parallel HDF5 files. The dataset has a compact layout,
 * which means the dataset object consists of metadata only.
 *
 * mpicc -Wall -O2 -o h5copy_mpio_compact h5copy_mpio_compact.c -lhdf5
 */
int main(int argc, char **argv)
{
  double start, stop;
  int rank;
  hid_t file1, file2, fapl, space, dcpl, dset;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

  file1 = H5Fcreate("h5copy_mpio_1.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  hsize_t dims[1] = {4096};
  space = H5Screate_simple(1, dims, NULL);
  dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_layout(dcpl, H5D_COMPACT);
  dset = H5Dcreate(file1, "position", H5T_NATIVE_DOUBLE, space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
  H5Sclose(space);
  H5Pclose(dcpl);
  H5Dclose(dset);

  H5Fflush(file1, H5F_SCOPE_LOCAL);
  H5Fclose(file1);

  file1 = H5Fopen("h5copy_mpio_1.h5", H5F_ACC_RDONLY, fapl);
  file2 = H5Fcreate("h5copy_mpio_2.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

  /* time the collective copy of the compact dataset, including the flush */
  start = MPI_Wtime();
  H5Ocopy(file1, "position", file2, "position", H5P_DEFAULT, H5P_DEFAULT);
  H5Fflush(file2, H5F_SCOPE_LOCAL);
  stop = MPI_Wtime();
  if (rank == 0) printf("%.3g s\n", stop-start);

  H5Fclose(file1);
  H5Fclose(file2);
  H5Pclose(fapl);

  MPI_Finalize();
  return 0;
}