One correction:

The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210" are another problem with a segmentation fault.

To avoid confusion, I repeat the working/not working cases I tried:

HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault (a different problem, possibly with the HDF5 installation)

HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

I also tested on another cluster with a GPFS parallel file system (instead of Lustre):

Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING

So the common denominator seems to be Intel MPI 2017.

Regards,
Håkon


On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
I have managed to prepare an example program. I stripped away a lot of non-essential code by preparing some data files in advance. The example runs with 20 processes *only*.

I reported earlier that I also found the bug on a system with SGI MPT; this example runs fine on that system, so let's disregard it for the moment.

The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.

I tested for instance:
HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS

And the following does not work:
HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING

Does anyone have any idea how to proceed with the debugging? Does anyone see any obvious flaws in my example program?

Thanks for all help.

Regards,
Håkon Strandenes


On 05/20/2017 09:12 PM, Quincey Koziol wrote:

On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:

Yes, the issue is still there.

I will try to make a dummy program that demonstrates the error. It is probably the easiest thing to debug in the long run.

    That would be very helpful, thanks,
        Quincey


Regards,
Håkon


On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
Can you try it with 1.10.1 and see if you still have the issue?
Scot
On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:

Hi Håkon,

On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:

(sorry, forgot to cc mailing list in prev. mail)

A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to generate the error.

    Yeah, that’s usually the case with these kinds of issues.  :-/


One thing I wonder about: are the requirements for collective IO in this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?

The reason I ask is that my filespace is complicated. Each IO process creates the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above-mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems to hang in some collective writing routine.

Am I onto something? Could this be a problem?
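
To make the selection pattern concrete, here is a simplified sketch of how such a filespace is built (the subroutine and variable names are made up; only the HDF5 calls are real):

subroutine select_blocks(filespace, offsets, lengths, nblocks, ierr)
   use hdf5
   implicit none
   integer(hid_t),   intent(in)  :: filespace   ! dataspace of the dataset
   integer(hsize_t), intent(in)  :: offsets(:)  ! start of each block
   integer(hsize_t), intent(in)  :: lengths(:)  ! length of each block
   integer,          intent(in)  :: nblocks
   integer,          intent(out) :: ierr
   integer :: i

   ! The first block replaces the default "all" selection ...
   call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, &
        offsets(1:1), lengths(1:1), ierr)

   ! ... and every further block is OR-ed in, so the final selection
   ! is neither regular nor a single hyperslab.
   do i = 2, nblocks
      call h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, &
           offsets(i:i), lengths(i:i), ierr)
   end do
end subroutine select_blocks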

Fortunately, we’ve expanded the feature set for collective I/O now, and it supports arbitrary selections on chunked datasets. There’s always the chance of a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…

        Quincey


Regards,
Håkon


On 05/19/2017 04:46 PM, Quincey Koziol wrote:
Hmm, sounds like you’ve varied a lot of things, which is good. But the constant seems to be your code now. :-/ Can you replicate the error with a small standalone C test program?
    Quincey
On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:

The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but that is not as well tested on the systems we are using as the previously mentioned ones.

I have also tested HDF5 1.10.1 and can confirm that the problem is present there as well.

Regards,
Håkon



On 05/19/2017 04:29 PM, Quincey Koziol wrote:
Hi Håkon,
Actually, given this behavior, it’s quite possible that you have found a bug in your MPI implementation, so I wouldn’t rule that out. What implementation and version of MPI are you using?
    Quincey
On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:

Hi,

I have an MPI application where each process samples some data. Each process can have an arbitrary number of sampling points (or no points at all). During the simulation each process buffers the sample values in local memory until the buffer is full. At that point each process sends its data to designated IO processes, and the IO processes open an HDF5 file, extend a dataset and write the data into the file.

The filespace can be quite complicated, constructed with numerous calls to "h5sselect_hyperslab_f". The memspace is always a simple contiguous block of data. The chunk size is equal to the buffer size, i.e. each time the dataset is extended it is extended by exactly one chunk.
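
Schematically, one flush cycle looks roughly like this (a simplified sketch; the subroutine and variable names are made up, only the HDF5 calls are real, and the real filespace selection is far more scattered than the single range shown here):

subroutine flush_buffer(dset, xferp, buf, oldsize, chunk, myoff, ierr)
   use hdf5
   implicit none
   integer(hid_t),   intent(in)  :: dset     ! open, chunked dataset
   integer(hid_t),   intent(in)  :: xferp    ! collective transfer plist
   real(kind=8),     intent(in)  :: buf(:)   ! this rank's part of the chunk
   integer(hsize_t), intent(in)  :: oldsize  ! current dataset size
   integer(hsize_t), intent(in)  :: chunk    ! chunk size = buffer size
   integer(hsize_t), intent(in)  :: myoff    ! this rank's offset in the chunk
   integer,          intent(out) :: ierr

   integer(hid_t)   :: filespace, memspace
   integer(hsize_t) :: newsize(1), start(1), count(1)

   ! Extend the dataset by exactly one chunk
   newsize(1) = oldsize + chunk
   call h5dset_extent_f(dset, newsize, ierr)

   ! Re-fetch the filespace and select this rank's part of the new chunk
   call h5dget_space_f(dset, filespace, ierr)
   start(1) = oldsize + myoff
   count(1) = int(size(buf), hsize_t)
   call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, ierr)

   ! The memspace is a simple contiguous block
   call h5screate_simple_f(1, count, memspace, ierr)

   ! All IO processes reach this call; this is where it hangs
   call h5dwrite_f(dset, H5T_NATIVE_DOUBLE, buf, count, ierr, &
        mem_space_id=memspace, file_space_id=filespace, xfer_prp=xferp)

   call h5sclose_f(memspace, ierr)
   call h5sclose_f(filespace, ierr)
end subroutine flush_buffer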

The problem is that in some cases the application hangs in h5dwrite_f (it is a Fortran application). I cannot see why. It happens on multiple systems with different MPI implementations, so I believe that the problem is in my application or in the HDF5 library, not in the MPI implementation or at the system level.

The problem disappears if I turn off collective IO.
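
For reference, turning collective IO on or off is only a matter of the dataset-transfer property list passed to h5dwrite_f (a minimal sketch; the names are made up):

subroutine make_xfer_plist(use_collective, xferp, ierr)
   use hdf5
   implicit none
   logical,        intent(in)  :: use_collective
   integer(hid_t), intent(out) :: xferp
   integer,        intent(out) :: ierr

   call h5pcreate_f(H5P_DATASET_XFER_F, xferp, ierr)
   if (use_collective) then
      ! This is the mode that hangs on the affected setups
      call h5pset_dxpl_mpio_f(xferp, H5FD_MPIO_COLLECTIVE_F, ierr)
   else
      ! Independent IO (also the default) works everywhere
      call h5pset_dxpl_mpio_f(xferp, H5FD_MPIO_INDEPENDENT_F, ierr)
   end if
end subroutine make_xfer_plist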

I have tried to compile HDF5 with as much error checking as possible (--enable-debug=all --disable-production) and I do not get any errors or
warnings from the HDF5 library.

I ran the code through TotalView, and got the attached backtrace for the
20 processes that participate in the IO communicator.

Does anyone have any idea how to continue debugging this problem?

I currently use HDF5 version 1.8.17.

Best regards,
Håkon Strandenes
<Backtrace HDF5 err.png>_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5



_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Reply via email to