I am sorry for the delay; my health has not been cooperating with debugging this problem over the last week.

I have now tried HDF5 1.8.18 with Intel MPI 2017.1.132, and all tests pass, both serial and parallel, C and Fortran. The example still fails when run over more than one compute node. When the dataset transfer is H5FD_MPIO_INDEPENDENT_F it succeeds.

My next step will be to try to build HDF5 with CMake instead of configure, to see if this changes anything.

Regards,
Håkon
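For readers following along, here is a minimal, self-contained sketch of how the transfer mode mentioned above is selected through the dataset transfer property list. Everything in it (file name, dataset, buffer, sizes) is illustrative and not taken from the application under discussion; it only demonstrates the H5FD_MPIO_INDEPENDENT_F / H5FD_MPIO_COLLECTIVE_F switch.

    program xfer_mode_sketch
       ! Sketch only: every rank writes the same tiny dataset; a real application
       ! would select per-rank hyperslabs in the file space.
       use mpi
       use hdf5
       implicit none

       integer :: mpierr, hdferr, comm, info
       integer(hid_t) :: fapl_id, dxpl_id, file_id, space_id, dset_id
       integer(hsize_t), dimension(1) :: dims = (/ 10 /)
       double precision, dimension(10) :: buf = 1.0d0

       call mpi_init(mpierr)
       comm = MPI_COMM_WORLD
       info = MPI_INFO_NULL

       call h5open_f(hdferr)

       ! File access property list using the MPI-IO driver
       call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
       call h5pset_fapl_mpio_f(fapl_id, comm, info, hdferr)
       call h5fcreate_f("sketch.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl_id)

       call h5screate_simple_f(1, dims, space_id, hdferr)
       call h5dcreate_f(file_id, "data", H5T_NATIVE_DOUBLE, space_id, dset_id, hdferr)

       ! Dataset transfer property list: independent or collective MPI-IO
       call h5pcreate_f(H5P_DATASET_XFER_F, dxpl_id, hdferr)
       call h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_INDEPENDENT_F, hdferr)
       ! call h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, hdferr)

       call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, hdferr, xfer_prp=dxpl_id)

       call h5pclose_f(dxpl_id, hdferr)
       call h5dclose_f(dset_id, hdferr)
       call h5sclose_f(space_id, hdferr)
       call h5pclose_f(fapl_id, hdferr)
       call h5fclose_f(file_id, hdferr)
       call h5close_f(hdferr)
       call mpi_finalize(mpierr)
    end program xfer_mode_sketch

Compile against a parallel HDF5 build (for example with the h5pfc wrapper) and run under mpirun with a few ranks; swapping the commented line in exercises the collective code path.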
On 05/26/2017 04:03 PM, Håkon Strandenes wrote:
> Thanks for trying my example. I will try the tests.
>
> However, when trying my own example again I realized that the error does not occur when running on one compute node or a single workstation. I tested 20 processes on a single node, both on a node-local filesystem (local scratch) and on a parallel networked filesystem, and that worked. Running five processes each on four nodes leads to the error/hanging condition.
>
> Regards,
> Håkon Strandenes
>
>
> On 05/25/2017 04:30 PM, Scot Breitenfeld wrote:
>> I tried your example using both HDF5 1.8.18 and our develop branch (basically 1.10.1) on a CentOS 7 system, and your program completes successfully using Intel 17.0.4.
>>
>> mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
>> Copyright(C) 2003-2017, Intel Corporation. All rights reserved.
>> ifort version 17.0.4
>>
>> Can you verify whether ‘make test’ passes in testpar and fortran/testpar for your installation?
>>
>> Thanks,
>> Scot
>>
>>> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[email protected]> wrote:
>>>
>>> One correction:
>>>
>>> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210" is a different problem, a segmentation fault.
>>>
>>> To avoid confusion, I repeat the working/not working cases I tried:
>>>
>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>
>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - a different problem, maybe with the HDF5 installation
>>>
>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>
>>> I also tested on another cluster with a GPFS parallel file system (instead of Lustre):
>>>
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>>>
>>> So the common denominator seems to be Intel MPI 2017.
>>>
>>> Regards,
>>> Håkon
>>>
>>>
>>> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>>>> I have managed to prepare an example program. I stripped away a lot of non-essential stuff by preparing some data files in advance.
>>>> The example is for 20 processes *only*.
>>>> I reported earlier that I also found the bug on a system with SGI MPT; this example runs fine on that system, so let's disregard it for the moment.
>>>> The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.
>>>> I tested, for instance:
>>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>> And the following do not work:
>>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>> Does anyone have any idea on how to proceed with the debugging? Does anyone see any obvious flaws in my example program?
>>>> Thanks for all help.
>>>> Regards,
>>>> Håkon Strandenes
>>>>
>>>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>>>
>>>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>>>>>>
>>>>>> Yes, the issue is still there.
>>>>>>
>>>>>> I will try to make a dummy program to demonstrate the error. It might be the easiest thing to debug on in the long run.
>>>>>
>>>>> That would be very helpful, thanks,
>>>>> Quincey
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Håkon
>>>>>>
>>>>>>
>>>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>>>> Can you try it with 1.10.1 and see if you still have the issue?
>>>>>>> Scot
>>>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Håkon,
>>>>>>>>
>>>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> (sorry, forgot to cc the mailing list in the previous mail)
>>>>>>>>>
>>>>>>>>> A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to trigger the error.
>>>>>>>>
>>>>>>>> Yeah, that’s usually the case with these kinds of issues. :-/
>>>>>>>>
>>>>>>>>
>>>>>>>>> One thing I wonder about: are the requirements for collective IO in this document still valid and accurate? https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>>>>
>>>>>>>>> The reason I ask is that my filespace is complicated. Each IO process creates the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above-mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems to hang in some collective writing routine.
>>>>>>>>>
>>>>>>>>> Am I onto something? Could this be a problem?
>>>>>>>>
>>>>>>>> Fortunately, we’ve expanded the feature set for collective I/O now and it supports arbitrary selections on chunked datasets. There’s always the chance of a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…
>>>>>>>>
>>>>>>>> Quincey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Håkon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>>>> Hmm, sounds like you’ve varied a lot of things, which is good. But the constant now seems to be your code. :-/ Can you replicate the error with a small standalone C test program?
>>>>>>>>>> Quincey
>>>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but it is not as well tested on the systems we are using as the previously mentioned ones.
>>>>>>>>>>>
>>>>>>>>>>> I also tested and can confirm that the problem is there with HDF5 1.10.1 as well.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Håkon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>>>> Hi Håkon,
>>>>>>>>>>>> Actually, given this behavior, it’s reasonably possible that you have found a bug in the MPI implementation you are using, so I wouldn’t rule that out. What implementation and version of MPI are you using?
>>>>>>>>>>>> Quincey
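As an aside for readers reconstructing the setup described in the quoted messages: a filespace that is "neither regular nor singular", built from many hyperslab calls, might look roughly like the following sketch, which ORs single-element blocks of a 1-D dataspace together. The subroutine name and its arguments are illustrative placeholders, not code from the application discussed in this thread.

    subroutine select_points(space_id, idx, n_pts, hdferr)
       ! Sketch: build an irregular file selection as a union of many hyperslabs.
       use hdf5
       implicit none
       integer(hid_t), intent(in)  :: space_id   ! dataspace of the (extended) dataset
       integer(hsize_t), intent(in) :: idx(:)    ! file offsets owned by this IO process
       integer, intent(in)  :: n_pts
       integer, intent(out) :: hdferr

       integer :: i
       integer(hsize_t), dimension(1) :: start, cnt

       cnt(1) = 1
       do i = 1, n_pts
          start(1) = idx(i)
          if (i == 1) then
             ! The first block replaces any existing selection
             call h5sselect_hyperslab_f(space_id, H5S_SELECT_SET_F, start, cnt, hdferr)
          else
             ! Subsequent blocks are OR'ed into the selection
             call h5sselect_hyperslab_f(space_id, H5S_SELECT_OR_F, start, cnt, hdferr)
          end if
       end do
    end subroutine select_points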
>>>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an MPI application where each process samples some data. Each process can have an arbitrary number of sampling points (or no points at all). During the simulation each process buffers the sample values in local memory until the buffer is full. At that point each process sends its data to designated IO processes, and the IO processes open an HDF5 file, extend a dataset and write the data into the file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The filespace can be quite complicated, constructed with numerous calls to "h5sselect_hyperslab_f". The memspace is always a simple contiguous block of data. The chunk size is equal to the buffer size, i.e. each time the dataset is extended it is extended by exactly one chunk.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is that in some cases the application hangs in h5dwrite_f (it is a Fortran application). I cannot see why. It happens on multiple systems with different MPI implementations, so I believe that the problem is in my application or in the HDF5 library, not in the MPI implementation or at the system level.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as possible (--enable-debug=all --disable-production) and I do not get any errors or warnings from the HDF5 library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the code through TotalView, and got the attached backtrace for the 20 processes that participate in the IO communicator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Håkon Strandenes
>>>>>>>>>>>>> <Backtrace HDF5 err.png>
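For completeness, the extend-by-one-chunk write step described in the original message above could look roughly like the sketch below. It assumes a 1-D chunked dataset and a transfer property list already set up (as in the earlier transfer-mode sketch); the names, sizes and the simple contiguous file selection are illustrative only, whereas in the application under discussion the file selection is a union of many hyperslabs.

    subroutine append_chunk(dset_id, dxpl_id, buf, old_size, chunk, hdferr)
       ! Sketch: extend a chunked 1-D dataset by one chunk and write the local
       ! buffer using the supplied transfer property list (e.g. collective MPI-IO).
       use hdf5
       implicit none
       integer(hid_t), intent(in)   :: dset_id, dxpl_id
       double precision, intent(in) :: buf(:)          ! one chunk worth of samples
       integer(hsize_t), intent(in) :: old_size, chunk
       integer, intent(out)         :: hdferr

       integer(hid_t) :: filespace, memspace
       integer(hsize_t), dimension(1) :: new_size, start, cnt

       ! Grow the dataset by exactly one chunk
       new_size(1) = old_size + chunk
       call h5dset_extent_f(dset_id, new_size, hdferr)

       ! Select the newly added region in the file ...
       call h5dget_space_f(dset_id, filespace, hdferr)
       start(1) = old_size
       cnt(1)   = chunk
       call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, cnt, hdferr)

       ! ... and a simple contiguous memory space for the buffer
       call h5screate_simple_f(1, cnt, memspace, hdferr)

       call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, cnt, hdferr, &
                       mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl_id)

       call h5sclose_f(memspace, hdferr)
       call h5sclose_f(filespace, hdferr)
    end subroutine append_chunk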
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
