One correction:

The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210" are another problem with a segmentation fault.

To avoid confusion, I repeat the working/not working cases I tried:

HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault (a different problem, possibly with the HDF5 installation)

HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

I also tested on another cluster with a GPFS parallel file system (instead of Lustre):

Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING

So the common denominator seems to be Intel MPI 2017.

Regards,
Håkon


On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
I have managed to prepare an example program. I stripped away a lot of non-essential code by preparing some data files in advance. The example runs with 20 processes *only*.

I reported earlier that I also found the bug on a system with SGI MPT; this example runs fine on that system, so let's disregard it for the moment.

The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.

I tested for instance:
HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

And the following does not work:
HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

Does anyone have any idea how to proceed with the debugging? Does anyone see any obvious flaws in my example program?

Thanks for all help.

Regards,
Håkon Strandenes


On 05/20/2017 09:12 PM, Quincey Koziol wrote:

On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:

Yes, the issue is still there.

I will try to make a dummy program that demonstrates the error. It is probably the easiest thing to debug in the long run.

    That would be very helpful, thanks,
        Quincey


Regards,
Håkon


On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
Can you try it with 1.10.1 and see if you still have the issue?
Scot
On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:

Hi Håkon,

On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:

(sorry, forgot to cc mailing list in prev. mail)

A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to generate the error.

    Yeah, that’s usually the case with these kinds of issues.  :-/


One thing I wonder about: are the requirements for collective IO in this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?

The reason I ask is that my filespace is complicated. Each IO process creates the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above-mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems to hang in some collective writing routine.

Am I onto something? Could this be a problem?
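
To make the selection pattern concrete, here is a simplified sketch of how such a filespace is built (the subroutine and variable names are made up; only the HDF5 calls are real):

subroutine select_blocks(filespace, offsets, lengths, nblocks, ierr)
   use hdf5
   implicit none
   integer(hid_t),   intent(in)  :: filespace   ! dataspace of the dataset
   integer(hsize_t), intent(in)  :: offsets(:)  ! start of each block
   integer(hsize_t), intent(in)  :: lengths(:)  ! length of each block
   integer,          intent(in)  :: nblocks
   integer,          intent(out) :: ierr
   integer :: i

   ! The first block replaces the default "all" selection ...
   call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, &
        offsets(1:1), lengths(1:1), ierr)

   ! ... and every further block is OR-ed in, so the final selection
   ! is neither regular nor a single hyperslab.
   do i = 2, nblocks
      call h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, &
           offsets(i:i), lengths(i:i), ierr)
   end do
end subroutine select_blocks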

Fortunately, we’ve expanded the feature set for collective I/O now, and it supports arbitrary selections on chunked datasets. There’s always the chance of a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…

        Quincey


Regards,
Håkon


On 05/19/2017 04:46 PM, Quincey Koziol wrote:
Hmm, sounds like you’ve varied a lot of things, which is good. But the constant seems to be your code now. :-/ Can you replicate the error with a small standalone C test program?
    Quincey
On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:

The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but that is not as well tested on the systems we are using as the previously mentioned ones.

I have also tested HDF5 1.10.1 and can confirm that the problem is present there as well.

Regards,
Håkon



On 05/19/2017 04:29 PM, Quincey Koziol wrote:
Hi Håkon,
Actually, given this behavior, it’s quite possible that you have found a bug in your MPI implementation, so I wouldn’t rule that out. What implementation and version of MPI are you using?
    Quincey
On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:

Hi,

I have an MPI application where each process samples some data. Each process can have an arbitrary number of sampling points (or no points at all). During the simulation each process buffers the sample values in local memory until the buffer is full. At that point each process sends its data to designated IO processes, and the IO processes open an HDF5 file, extend a dataset and write the data into the file.

The filespace can be quite complicated, constructed with numerous calls to "h5sselect_hyperslab_f". The memspace is always a simple contiguous block of data. The chunk size is equal to the buffer size, i.e. each time the dataset is extended it is extended by exactly one chunk.
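
Schematically, one flush cycle looks roughly like this (a simplified sketch; the subroutine and variable names are made up, only the HDF5 calls are real, and the real filespace selection is far more scattered than the single range shown here):

subroutine flush_buffer(dset, xferp, buf, oldsize, chunk, myoff, ierr)
   use hdf5
   implicit none
   integer(hid_t),   intent(in)  :: dset     ! open, chunked dataset
   integer(hid_t),   intent(in)  :: xferp    ! collective transfer plist
   real(kind=8),     intent(in)  :: buf(:)   ! this rank's part of the chunk
   integer(hsize_t), intent(in)  :: oldsize  ! current dataset size
   integer(hsize_t), intent(in)  :: chunk    ! chunk size = buffer size
   integer(hsize_t), intent(in)  :: myoff    ! this rank's offset in the chunk
   integer,          intent(out) :: ierr

   integer(hid_t)   :: filespace, memspace
   integer(hsize_t) :: newsize(1), start(1), count(1)

   ! Extend the dataset by exactly one chunk
   newsize(1) = oldsize + chunk
   call h5dset_extent_f(dset, newsize, ierr)

   ! Re-fetch the filespace and select this rank's part of the new chunk
   call h5dget_space_f(dset, filespace, ierr)
   start(1) = oldsize + myoff
   count(1) = int(size(buf), hsize_t)
   call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, ierr)

   ! The memspace is a simple contiguous block
   call h5screate_simple_f(1, count, memspace, ierr)

   ! All IO processes reach this call; this is where it hangs
   call h5dwrite_f(dset, H5T_NATIVE_DOUBLE, buf, count, ierr, &
        mem_space_id=memspace, file_space_id=filespace, xfer_prp=xferp)

   call h5sclose_f(memspace, ierr)
   call h5sclose_f(filespace, ierr)
end subroutine flush_buffer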

The problem is that in some cases the application hangs in h5dwrite_f (it is a Fortran application). I cannot see why. It happens on multiple systems with different MPI implementations, so I believe that the problem is in my application or in the HDF5 library, not in the MPI implementation or at the system level.

The problem disappears if I turn off collective IO.
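
For reference, turning collective IO on or off is only a matter of the dataset-transfer property list passed to h5dwrite_f (a minimal sketch; the names are made up):

subroutine make_xfer_plist(use_collective, xferp, ierr)
   use hdf5
   implicit none
   logical,        intent(in)  :: use_collective
   integer(hid_t), intent(out) :: xferp
   integer,        intent(out) :: ierr

   call h5pcreate_f(H5P_DATASET_XFER_F, xferp, ierr)
   if (use_collective) then
      ! This is the mode that hangs on the affected setups
      call h5pset_dxpl_mpio_f(xferp, H5FD_MPIO_COLLECTIVE_F, ierr)
   else
      ! Independent IO (also the default) works everywhere
      call h5pset_dxpl_mpio_f(xferp, H5FD_MPIO_INDEPENDENT_F, ierr)
   end if
end subroutine make_xfer_plist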

I have tried to compile HDF5 with as much error checking as possible (--enable-debug=all --disable-production) and I do not get any errors or
warnings from the HDF5 library.

I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.

Does anyone have any idea how to continue debugging this problem?

I currently use HDF5 version 1.8.17.

Best regards,
Håkon Strandenes
<Backtrace HDF5 err.png>_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5



_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Reply via email to