(sorry, forgot to cc mailing list in prev. mail)
A standalone test program would be quite an effort, but I will think
about it. I know that at least all simple test cases pass, so I need a
"complicated" problem to generate the error.
One thing I wonder about is:
Are the requirements for collective IO in this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?
The reason I ask is that my filespace is complicated. Each IO process
creates the filespace with MANY calls to select_hyperslab. Hence the
selection is neither regular nor singular, and according to the
above-mentioned document the HDF5 library should not be able to do
collective IO in this case. Still, it seems to hang in some collective
writing routine.
Am I onto something? Could this be a problem?
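To be concrete, in C-API terms the selection is built roughly like this
(a minimal, hypothetical sketch; the one-dimensional layout, the made-up
block offsets and the function name are just for illustration, not my
actual code):

#include "hdf5.h"

/* Build an irregular selection by OR-ing many small hyperslab
 * blocks into one filespace, the way my code does through
 * h5sselect_hyperslab_f. "starts" holds made-up block offsets. */
void select_blocks(hid_t filespace, const hsize_t *starts, int nblocks)
{
    hsize_t count[1] = {1};
    for (int i = 0; i < nblocks; i++) {
        hsize_t start[1] = { starts[i] };
        /* The first call sets the selection, the rest are OR-ed in,
         * so the result is in general neither regular nor singular. */
        H5Sselect_hyperslab(filespace,
                            i == 0 ? H5S_SELECT_SET : H5S_SELECT_OR,
                            start, NULL, count, NULL);
    }
}

With one element per call the final selection is essentially a scattered
set of points, which is about as far from "regular" as it gets.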
Regards,
Håkon
On 05/19/2017 04:46 PM, Quincey Koziol wrote:
Hmm, sounds like you’ve varied a lot of things, which is good. But, the
constant seems to be your code now. :-/ Can you replicate the error with a
small standalone C test program?
Quincey
On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
The behavior is there with both SGI MPT and Intel MPI. I can try OpenMPI
as well, but it is not as well tested on the systems we are using as the
two previously mentioned implementations.
I also tested and can confirm that the problem is present with HDF5
1.10.1 as well.
Regards,
Håkon
On 05/19/2017 04:29 PM, Quincey Koziol wrote:
Hi Håkon,
Actually, given this behavior, it’s reasonably possible that you have
found a bug in the MPI implementation that you have, so I wouldn’t rule that
out. What implementation and version of MPI are you using?
Quincey
On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:
Hi,
I have an MPI application where each process samples some data. Each
process can have an arbitrary number of sampling points (or none at
all). During the simulation each process buffers the sample values in
local memory until the buffer is full. At that point each process sends
its data to designated IO processes, and the IO processes open an HDF5
file, extend a dataset and write the data into the file.
The filespace can be quite complicated, constructed with numerous calls
to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
block of data. The chunk size is equal to the buffer size, i.e. each
time the dataset is extended it is extended by exactly one chunk.
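In C-API terms the write step looks roughly like this (a hedged sketch,
not my real code: the names, the 1-D layout, the chunk size and the
H5T_NATIVE_DOUBLE type are illustrative assumptions, and the single
contiguous filespace slab stands in for the many hyperslab calls):

#include "hdf5.h"

#define CHUNK 4096  /* illustrative; equals the buffer size */

/* Sketch of one flush: extend the 1-D dataset by exactly one chunk,
 * then write this process's nlocal values collectively. In the real
 * code the filespace selection is many OR-ed hyperslabs, not the
 * single contiguous slab used here. */
void flush_buffer(hid_t dset, const double *buf, hsize_t nlocal,
                  hsize_t my_offset)
{
    hsize_t dims[1];
    hid_t fspace = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace, dims, NULL);
    H5Sclose(fspace);

    dims[0] += CHUNK;                /* grow by one chunk */
    H5Dset_extent(dset, dims);

    fspace = H5Dget_space(dset);     /* re-fetch after the extend */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET,
                        &my_offset, NULL, &nlocal, NULL);

    hid_t mspace = H5Screate_simple(1, &nlocal, NULL);  /* contiguous */

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
}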
The problem is that in some cases the application hangs in h5dwrite_f
(it is a Fortran application). I cannot see why. It happens on multiple
systems with different MPI implementations, so I believe the problem is
in my application or in the HDF5 library, not in the MPI implementation
or at the system level.
The problem disappears if I turn off collective IO.
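Concretely, "turning off collective IO" just means changing the transfer
mode on the dataset transfer property list (C-API sketch; the helper
name is made up):

#include "hdf5.h"

/* Returns a dataset transfer property list in the requested mode;
 * switching this flag is the only difference between the run that
 * hangs and the run that works. */
hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;
}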
I have tried to compile HDF5 with as much error checking as possible
(--enable-debug=all --disable-production), and I do not get any errors
or warnings from the HDF5 library.
I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.
Does anyone have any idea on how to continue debugging this problem?
I currently use HDF5 version 1.8.17.
Best regards,
Håkon Strandenes
<Backtrace HDF5 err.png>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5