(sorry, forgot to cc mailing list in prev. mail)
A standalone test program would be quite an effort, but I will think
about it. I know that at least all simple test cases pass, so I need a
"complicated" problem to generate the error.
One thing I wonder about is:
Are the requirements for collective IO in this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?
The reason I ask is that my filespace is complicated. Each IO process
creates the filespace with MANY calls to select_hyperslab. Hence the
selection is neither regular nor singular, and according to the
above-mentioned document the HDF5 library should not be able to do
collective IO in this case. Still, it seems to hang in some collective
writing routine.
Am I onto something? Could this be a problem?
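To be concrete, in C-API terms the selection is built roughly like this
(a minimal, hypothetical sketch; the one-dimensional layout, the made-up
block offsets and the function name are just for illustration, not my
actual code):

#include "hdf5.h"

/* Build an irregular selection by OR-ing many small hyperslab
 * blocks into one filespace, the way my code does through
 * h5sselect_hyperslab_f. "starts" holds made-up block offsets. */
void select_blocks(hid_t filespace, const hsize_t *starts, int nblocks)
{
    hsize_t count[1] = {1};
    for (int i = 0; i < nblocks; i++) {
        hsize_t start[1] = { starts[i] };
        /* The first call sets the selection, the rest are OR-ed in,
         * so the result is in general neither regular nor singular. */
        H5Sselect_hyperslab(filespace,
                            i == 0 ? H5S_SELECT_SET : H5S_SELECT_OR,
                            start, NULL, count, NULL);
    }
}

With one element per call the final selection is essentially a scattered
set of points, which is about as far from "regular" as it gets.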
Regards,
Håkon
On 05/19/2017 04:46 PM, Quincey Koziol wrote:
Hmm, sounds like you’ve varied a lot of things, which is good. But, the
constant seems to be your code now. :-/ Can you replicate the error with a
small standalone C test program?
Quincey
On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
The behavior is there with both SGI MPT and Intel MPI. I can try OpenMPI
as well, but it is not as well tested on the systems we are using as the
two previously mentioned implementations.
I also tested and can confirm that the problem is present with HDF5
1.10.1 as well.
Regards,
Håkon
On 05/19/2017 04:29 PM, Quincey Koziol wrote:
Hi Håkon,
Actually, given this behavior, it’s reasonably possible that you have
found a bug in the MPI implementation that you have, so I wouldn’t rule that
out. What implementation and version of MPI are you using?
Quincey
On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:
Hi,
I have an MPI application where each process samples some data. Each
process can have an arbitrary number of sampling points (or none at
all). During the simulation each process buffers the sample values in
local memory until the buffer is full. At that point each process sends
its data to designated IO processes, and the IO processes open an HDF5
file, extend a dataset and write the data into the file.
The filespace can be quite complicated, constructed with numerous calls
to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
block of data. The chunk size is equal to the buffer size, i.e. each
time the dataset is extended it is extended by exactly one chunk.
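In C-API terms the write step looks roughly like this (a hedged sketch,
not my real code: the names, the 1-D layout, the chunk size and the
H5T_NATIVE_DOUBLE type are illustrative assumptions, and the single
contiguous filespace slab stands in for the many hyperslab calls):

#include "hdf5.h"

#define CHUNK 4096  /* illustrative; equals the buffer size */

/* Sketch of one flush: extend the 1-D dataset by exactly one chunk,
 * then write this process's nlocal values collectively. In the real
 * code the filespace selection is many OR-ed hyperslabs, not the
 * single contiguous slab used here. */
void flush_buffer(hid_t dset, const double *buf, hsize_t nlocal,
                  hsize_t my_offset)
{
    hsize_t dims[1];
    hid_t fspace = H5Dget_space(dset);
    H5Sget_simple_extent_dims(fspace, dims, NULL);
    H5Sclose(fspace);

    dims[0] += CHUNK;                /* grow by one chunk */
    H5Dset_extent(dset, dims);

    fspace = H5Dget_space(dset);     /* re-fetch after the extend */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET,
                        &my_offset, NULL, &nlocal, NULL);

    hid_t mspace = H5Screate_simple(1, &nlocal, NULL);  /* contiguous */

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
}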
The problem is that in some cases the application hangs in h5dwrite_f
(it is a Fortran application). I cannot see why. It happens on multiple
systems with different MPI implementations, so I believe the problem is
in my application or in the HDF5 library, not in the MPI implementation
or at the system level.
The problem disappears if I turn off collective IO.
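Concretely, "turning off collective IO" just means changing the transfer
mode on the dataset transfer property list (C-API sketch; the helper
name is made up):

#include "hdf5.h"

/* Returns a dataset transfer property list in the requested mode;
 * switching this flag is the only difference between the run that
 * hangs and the run that works. */
hid_t make_dxpl(int collective)
{
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                      : H5FD_MPIO_INDEPENDENT);
    return dxpl;
}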
I have tried to compile HDF5 with as much error checking as possible
(--enable-debug=all --disable-production), and I do not get any errors
or warnings from the HDF5 library.
I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.
Does anyone have any idea on how to continue debugging this problem?
I currently use HDF5 version 1.8.17.
Best regards,
Håkon Strandenes
<Backtrace HDF5 err.png>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5