Hi,

I have an MPI application where each process samples some data. Each
process can have an arbitrary number of sampling points (or no points at
all). During the simulation, each process buffers the sample values in
local memory until the buffer is full. At that point each process sends
its data to designated IO processes, and the IO processes open an HDF5
file, extend a dataset and write the data into the file.
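
Schematically, the write step on an IO process looks like the following
(a condensed, 1-D sketch with placeholder names and no error checking,
not my actual code):

  SUBROUTINE write_buffer(dset_id, xfer_plist, buf, nbuf, chunk_size, old_size)
    USE hdf5
    IMPLICIT NONE
    INTEGER(HID_T),   INTENT(IN) :: dset_id, xfer_plist
    INTEGER(HSIZE_T), INTENT(IN) :: nbuf, chunk_size, old_size
    DOUBLE PRECISION, INTENT(IN) :: buf(nbuf)
    INTEGER(HID_T)   :: filespace, memspace
    INTEGER(HSIZE_T) :: new_dims(1), buf_dims(1)
    INTEGER          :: ierr

    new_dims(1) = old_size + chunk_size            ! grow by exactly one chunk
    CALL h5dset_extent_f(dset_id, new_dims, ierr)
    CALL h5dget_space_f(dset_id, filespace, ierr)  ! filespace of extended dataset
    ! ... the (complicated) hyperslab selection on filespace goes here ...
    buf_dims(1) = nbuf
    CALL h5screate_simple_f(1, buf_dims, memspace, ierr)  ! contiguous memspace
    CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, buf_dims, ierr, &
                    mem_space_id=memspace, file_space_id=filespace, &
                    xfer_prp=xfer_plist)
    CALL h5sclose_f(memspace, ierr)
    CALL h5sclose_f(filespace, ierr)
  END SUBROUTINE write_buffer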

The filespace can be quite complicated, constructed with numerous calls
to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
block of data. The chunk size is equal to the buffer size, i.e. each
time the dataset is extended it is extended by exactly one chunk.
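
The selection on "filespace" is built more or less like the fragment
below (again simplified; "point_offset" and "npts" stand in for my
actual index bookkeeping, and the handling of an empty round is only
meant as an illustration):

  INTEGER(HSIZE_T) :: start(1), count(1)
  INTEGER          :: i, ierr

  IF (npts == 0) THEN
     ! Nothing to write in this round: select nothing, but still take
     ! part in the collective h5dwrite_f call
     CALL h5sselect_none_f(filespace, ierr)
  ELSE
     count(1) = 1
     start(1) = point_offset(1)
     CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, ierr)
     DO i = 2, npts
        start(1) = point_offset(i)
        CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, start, count, ierr)
     END DO
  END IF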

The problem is that in some cases the application hangs in h5dwrite_f
(it is a Fortran application), and I cannot see why. It happens on
multiple systems with different MPI implementations, so I believe the
problem is in my application or in the HDF5 library, not in the MPI
implementation or at the system level.

The problem disappears if I turn off collective IO.
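
For completeness, the transfer property list is set up like this
(where "xfer_plist" is the placeholder used in the sketches above):

  CALL h5pcreate_f(H5P_DATASET_XFER_F, xfer_plist, ierr)
  CALL h5pset_dxpl_mpio_f(xfer_plist, H5FD_MPIO_COLLECTIVE_F, ierr)
  ! Switching to H5FD_MPIO_INDEPENDENT_F here (or omitting xfer_prp in
  ! h5dwrite_f) is what I mean by "turning off collective IO", and then
  ! the hang does not occur.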

I have tried to compile HDF5 with as much error checking as possible
(--enable-debug=all --disable-production) and I do not get any errors or
warnings from the HDF5 library.

I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.

Does anyone have any ideas on how to continue debugging this problem?

I currently use HDF5 version 1.8.17.

Best regards,
Håkon Strandenes