I tried your example using both HDF5 1.8.18 and our develop branch (essentially 1.10.1) on a CentOS 7 system, and your program completes successfully using Intel 17.0.4:

    mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
    Copyright(C) 2003-2017, Intel Corporation. All rights reserved.
    ifort version 17.0.4

Can you verify that 'make test' passes in testpar and fortran/testpar for your installation?

Thanks,
Scot
> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[email protected]> wrote:
>
> One correction:
>
> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210"
> is a different problem, involving a segmentation fault.
>
> To avoid confusion, I repeat the working/not-working cases I tried:
>
> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>
> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - a different problem,
> maybe with the HDF5 installation
>
> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>
> I also tested on another cluster with a GPFS parallel file system (instead
> of Lustre):
>
> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>
> So the common denominator seems to be Intel MPI 2017.
>
> Regards,
> Håkon
>
>
> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>> I have managed to prepare an example program. I stripped away a lot of
>> non-essential stuff by preparing some data files in advance. The example
>> is for 20 processes *only*.
>> I reported earlier that I also found the bug on a system with SGI MPT;
>> this example runs fine on that system, so let's disregard that for the
>> moment.
>> The problem occurs with combinations of "newer" Intel MPI and "newer"
>> HDF5. I tested, for instance:
>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>> And the following do not work:
>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>> Does anyone have any idea how to proceed with the debugging? Does anyone
>> see any obvious flaws in my example program?
>> Thanks for all the help.
>> Regards,
>> Håkon Strandenes
>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>
>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>>>>
>>>> Yes, the issue is still there.
>>>>
>>>> I will try to make a dummy program to demonstrate the error. It might
>>>> be the easiest thing to debug with in the long run.
>>>
>>> That would be very helpful, thanks,
>>> Quincey
>>>
>>>>
>>>> Regards,
>>>> Håkon
>>>>
>>>>
>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>> Can you try it with 1.10.1 and see if you still have the issue?
>>>>> Scot
>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>>>>
>>>>>> Hi Håkon,
>>>>>>
>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> (sorry, forgot to cc the mailing list in my previous mail)
>>>>>>>
>>>>>>> A standalone test program would be quite an effort, but I will think
>>>>>>> about it. I know that at least all the simple test cases pass, so I
>>>>>>> need a "complicated" problem to trigger the error.
>>>>>>
>>>>>> Yeah, that's usually the case with these kinds of issues. :-/
>>>>>>
>>>>>>> One thing I wonder about is:
>>>>>>> Are the requirements for collective IO in this document:
>>>>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>> still valid and accurate?
>>>>>>>
>>>>>>> The reason I ask is that my filespace is complicated. Each IO process
>>>>>>> creates the filespace with MANY calls to select_hyperslab. Hence it
>>>>>>> is neither regular nor singular, and according to the above-mentioned
>>>>>>> document the HDF5 library should not be able to do collective IO in
>>>>>>> this case. Still, it seems like it hangs in some collective writing
>>>>>>> routine.
>>>>>>>
>>>>>>> Am I onto something? Could this be a problem?
>>>>>>
>>>>>> Fortunately, we've expanded the feature set for collective I/O now
>>>>>> and it supports arbitrary selections on chunked datasets. There's
>>>>>> always the chance of a bug, of course, but it would have to be very
>>>>>> unusual, since we are pretty thorough about the regression testing...
>>>>>>
>>>>>> Quincey
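For readers following along: the kind of "neither regular nor singular" selection under discussion is built by OR-ing many hyperslab blocks into one file dataspace. A minimal Fortran sketch of that pattern follows; the subroutine name, the 1-D rank, and the offsets are illustrative only, not taken from Håkon's actual program:

    ! Sketch only: build an irregular file selection by OR-ing many
    ! hyperslab blocks together. Names, rank and offsets are illustrative.
    SUBROUTINE select_sample_points(dset_id, npts, offsets, filespace, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: dset_id
      INTEGER,          INTENT(IN)  :: npts
      INTEGER(HSIZE_T), INTENT(IN)  :: offsets(npts)  ! this rank's point offsets
      INTEGER(HID_T),   INTENT(OUT) :: filespace
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HSIZE_T) :: start(1), count(1)
      INTEGER          :: i

      CALL h5dget_space_f(dset_id, filespace, ierr)

      ! The first call replaces the default selection; every further call
      ! is OR-ed in, so the result is neither regular nor singular.
      count(1) = 1
      DO i = 1, npts
        start(1) = offsets(i)
        IF (i == 1) THEN
          CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, ierr)
        ELSE
          CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, start, count, ierr)
        END IF
      END DO
    END SUBROUTINE select_sample_points

Per Quincey's reply above, recent HDF5 versions should be able to handle such a selection collectively on chunked datasets.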
>>>>>>> Regards,
>>>>>>> Håkon
>>>>>>>
>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>> Hmm, sounds like you've varied a lot of things, which is good. But
>>>>>>>> the constant now seems to be your code. :-/ Can you replicate the
>>>>>>>> error with a small standalone C test program?
>>>>>>>> Quincey
>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> The behavior is there with both SGI MPT and Intel MPI. I can try
>>>>>>>>> OpenMPI as well, but it is not as well tested on the systems we
>>>>>>>>> are using as the previously mentioned ones.
>>>>>>>>>
>>>>>>>>> I also tested and can confirm that the problem is present with
>>>>>>>>> HDF5 1.10.1 as well.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Håkon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>> Hi Håkon,
>>>>>>>>>> Actually, given this behavior, it's reasonably possible that you
>>>>>>>>>> have found a bug in the MPI implementation you are using, so I
>>>>>>>>>> wouldn't rule that out. What implementation and version of MPI
>>>>>>>>>> are you using?
>>>>>>>>>> Quincey
>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have an MPI application where each process samples some data.
>>>>>>>>>>> Each process can have an arbitrary number of sampling points (or
>>>>>>>>>>> no points at all). During the simulation each process buffers
>>>>>>>>>>> the sample values in local memory until the buffer is full. At
>>>>>>>>>>> that point each process sends its data to designated IO
>>>>>>>>>>> processes, and the IO processes open an HDF5 file, extend a
>>>>>>>>>>> dataset and write the data into the file.
>>>>>>>>>>>
>>>>>>>>>>> The filespace can be quite complicated, constructed with
>>>>>>>>>>> numerous calls to "h5sselect_hyperslab_f". The memspace is
>>>>>>>>>>> always a simple contiguous block of data. The chunk size is
>>>>>>>>>>> equal to the buffer size, i.e. each time the dataset is extended
>>>>>>>>>>> it is extended by exactly one chunk.
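The create-and-extend pattern just described (chunk size equal to the buffer size, dataset grown by exactly one chunk per flush) might be sketched like this in Fortran; the dataset name and sizes are illustrative, not from the actual program:

    ! Sketch only: an unlimited 1-D dataset whose chunk size equals the
    ! write buffer, to be grown one chunk at a time with h5dset_extent_f.
    ! Assumes h5open_f has already been called.
    SUBROUTINE create_sample_dset(file_id, bufsize, dset_id, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: file_id  ! an already-open parallel file
      INTEGER(HSIZE_T), INTENT(IN)  :: bufsize  ! buffer size == chunk size
      INTEGER(HID_T),   INTENT(OUT) :: dset_id
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HID_T)   :: space_id, dcpl_id
      INTEGER(HSIZE_T) :: dims(1), maxdims(1), chunk(1)

      dims(1)    = 0                ! start empty
      maxdims(1) = H5S_UNLIMITED_F  ! allow the dataset to be extended
      chunk(1)   = bufsize

      CALL h5screate_simple_f(1, dims, space_id, ierr, maxdims)
      CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
      CALL h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
      CALL h5dcreate_f(file_id, "samples", H5T_NATIVE_DOUBLE, space_id, &
                       dset_id, ierr, dcpl_id)
      CALL h5pclose_f(dcpl_id, ierr)
      CALL h5sclose_f(space_id, ierr)
    END SUBROUTINE create_sample_dset

Each time the buffers fill, h5dset_extent_f would then grow the dataset by bufsize elements, i.e. exactly one chunk, before the write.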
>>>>>>>>>>> The problem is that in some cases the application hangs in
>>>>>>>>>>> h5dwrite_f (this is a Fortran application). I cannot see why. It
>>>>>>>>>>> happens on multiple systems with different MPI implementations,
>>>>>>>>>>> so I believe that the problem is in my application or in the
>>>>>>>>>>> HDF5 library, not in the MPI implementation or at the system
>>>>>>>>>>> level.
>>>>>>>>>>>
>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>
>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as
>>>>>>>>>>> possible (--enable-debug=all --disable-production) and I do not
>>>>>>>>>>> get any errors or warnings from the HDF5 library.
>>>>>>>>>>>
>>>>>>>>>>> I ran the code through TotalView, and got the attached backtrace
>>>>>>>>>>> for the 20 processes that participate in the IO communicator.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have any idea how to continue debugging this
>>>>>>>>>>> problem?
>>>>>>>>>>>
>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Håkon Strandenes
>>>>>>>>>>>
>>>>>>>>>>> <Backtrace HDF5 err.png>
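For anyone trying to reproduce this: the collective/independent switch Håkon mentions is just the dataset transfer property list passed to h5dwrite_f. A minimal sketch follows; the subroutine name and handles are illustrative and assumed to come from the surrounding program:

    ! Sketch only: the collective vs. independent toggle is the transfer
    ! property list handed to h5dwrite_f. Handles and names are illustrative.
    SUBROUTINE write_buffer(dset_id, memspace, filespace, buf, n, collective, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: dset_id, memspace, filespace
      INTEGER(HSIZE_T), INTENT(IN)  :: n        ! number of elements in buf
      DOUBLE PRECISION, INTENT(IN)  :: buf(*)
      LOGICAL,          INTENT(IN)  :: collective
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HID_T)   :: dxpl_id
      INTEGER(HSIZE_T) :: dims(1)

      dims(1) = n
      CALL h5pcreate_f(H5P_DATASET_XFER_F, dxpl_id, ierr)
      IF (collective) THEN
        ! The mode that hangs in the failing combinations reported above.
        CALL h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, ierr)
      ELSE
        ! Reported to work everywhere, at the cost of independent I/O.
        CALL h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_INDEPENDENT_F, ierr)
      END IF

      CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, ierr, &
                      mem_space_id=memspace, file_space_id=filespace, &
                      xfer_prp=dxpl_id)
      CALL h5pclose_f(dxpl_id, ierr)
    END SUBROUTINE write_buffer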
