> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>
> Yes, the issue is still there.
>
> I will try to make a dummy program that demonstrates the error. That will
> probably be the easiest thing to debug in the long run.
That would be very helpful, thanks,
Quincey
>
> Regards,
> Håkon
>
>
> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>> Can you try it with 1.10.1 and see if you still have the issue?
>> Scot
>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>
>>> Hi Håkon,
>>>
>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
>>>>
>>>> (sorry, forgot to cc the mailing list in the previous mail)
>>>>
>>>> A standalone test program would be quite an effort, but I will think about
>>>> it. I know that all the simple test cases pass, so I need a "complicated"
>>>> problem to trigger the error.
>>>
>>> Yeah, that’s usually the case with these kinds of issues. :-/
>>>
>>>
>>>> One thing I wonder about:
>>>> Are the requirements for collective IO in this document:
>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>> still valid and accurate?
>>>>
>>>> The reason I ask is that my filespace is complicated. Each IO process
>>>> creates its filespace with MANY calls to select_hyperslab. Hence the
>>>> selection is neither regular nor singular, and according to the
>>>> above-mentioned document the HDF5 library should not be able to do
>>>> collective IO in this case. Still, it seems to hang in some collective
>>>> writing routine, roughly as in the sketch below.
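>>>>
>>>> For illustration, the selection pattern is roughly the following (a
>>>> simplified C sketch, not my actual Fortran code; the names, extents and
>>>> datatype are made up):
>>>>
>>>> #include <hdf5.h>
>>>>
>>>> /* Build an irregular file selection by OR-ing many single-element
>>>>  * hyperslabs together, then write a contiguous buffer collectively.
>>>>  * 'dset', 'offsets' and 'buf' are assumed to come from the caller. */
>>>> static herr_t write_samples(hid_t dset, int npoints,
>>>>                             const hsize_t *offsets, const double *buf)
>>>> {
>>>>     hid_t filespace = H5Dget_space(dset);
>>>>     H5Sselect_none(filespace);
>>>>     for (int i = 0; i < npoints; i++) {
>>>>         hsize_t start[1] = {offsets[i]};
>>>>         hsize_t count[1] = {1};
>>>>         H5Sselect_hyperslab(filespace, H5S_SELECT_OR, start, NULL,
>>>>                             count, NULL);
>>>>     }
>>>>
>>>>     /* The memory buffer is a simple contiguous block. */
>>>>     hsize_t mdims[1] = {(hsize_t)npoints};
>>>>     hid_t memspace = H5Screate_simple(1, mdims, NULL);
>>>>
>>>>     /* Collective transfer mode. */
>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>
>>>>     herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace,
>>>>                              filespace, dxpl, buf);
>>>>
>>>>     H5Pclose(dxpl);
>>>>     H5Sclose(memspace);
>>>>     H5Sclose(filespace);
>>>>     return status;
>>>> }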
>>>>
>>>> Am I onto something? Could this be a problem?
>>>
>>> Fortunately, we’ve since expanded the feature set for collective I/O, and
>>> it now supports arbitrary selections on chunked datasets. There’s always
>>> the chance of a bug, of course, but it would have to be a very unusual one,
>>> since we are pretty thorough about the regression testing…
>>>
>>> Quincey
>>>
>>>
>>>> Regards,
>>>> Håkon
>>>>
>>>>
>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>> Hmm, it sounds like you’ve varied a lot of things, which is good. But the
>>>>> constant now seems to be your code. :-/ Can you replicate the error with
>>>>> a small standalone C test program?
>>>>> Quincey
>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> The behavior is present with both SGI MPT and Intel MPI. I can try
>>>>>> OpenMPI as well, but it is not as well tested on the systems we are
>>>>>> using as the two mentioned above.
>>>>>>
>>>>>> I have also tested HDF5 1.10.1 and can confirm that the problem is
>>>>>> present there as well.
>>>>>>
>>>>>> Regards,
>>>>>> Håkon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>> Hi Håkon,
>>>>>>> Actually, given this behavior, it’s quite possible that you have
>>>>>>> found a bug in the MPI implementation you are using, so I wouldn’t
>>>>>>> rule that out. Which MPI implementation and version do you use?
>>>>>>> Quincey
>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have an MPI application where each process samples some data. Each
>>>>>>>> process can have an arbitrary number of sampling points (or no points
>>>>>>>> at all). During the simulation each process buffers the sample values
>>>>>>>> in local memory until the buffer is full. At that point each process
>>>>>>>> sends its data to designated IO processes, and the IO processes open
>>>>>>>> an HDF5 file, extend a dataset and write the data into the file.
>>>>>>>>
>>>>>>>> The filespace can be quite complicated, constructed with numerous calls
>>>>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
>>>>>>>> block of data. The chunk size is equal to the buffer size, i.e. each
>>>>>>>> time the dataset is extended it is extended by exactly one chunk,
>>>>>>>> roughly as in the sketch below.
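>>>>>>>>
>>>>>>>> Schematically, one write step looks something like this (a simplified
>>>>>>>> C sketch of the pattern; the real code is Fortran, the names are made
>>>>>>>> up, and the file selection here is a single slab only to keep the
>>>>>>>> sketch short):
>>>>>>>>
>>>>>>>> #include <hdf5.h>
>>>>>>>>
>>>>>>>> /* Extend the dataset by exactly one chunk, then write that chunk
>>>>>>>>  * collectively. 'chunk_len' equals the buffer size and 'cur_len'
>>>>>>>>  * is the current dataset length. */
>>>>>>>> static void append_chunk(hid_t dset, hsize_t cur_len,
>>>>>>>>                          hsize_t chunk_len, const double *buf)
>>>>>>>> {
>>>>>>>>     hsize_t new_dims[1] = {cur_len + chunk_len};
>>>>>>>>     H5Dset_extent(dset, new_dims);          /* grow by one chunk */
>>>>>>>>
>>>>>>>>     hid_t filespace = H5Dget_space(dset);   /* refresh after extend */
>>>>>>>>     hsize_t start[1] = {cur_len};
>>>>>>>>     hsize_t count[1] = {chunk_len};
>>>>>>>>     H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL,
>>>>>>>>                         count, NULL);
>>>>>>>>
>>>>>>>>     hid_t memspace = H5Screate_simple(1, count, NULL);
>>>>>>>>
>>>>>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>>>>>     H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
>>>>>>>>
>>>>>>>>     H5Pclose(dxpl);
>>>>>>>>     H5Sclose(memspace);
>>>>>>>>     H5Sclose(filespace);
>>>>>>>> }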
>>>>>>>>
>>>>>>>> The problem is that in some cases the application hangs in h5dwrite_f
>>>>>>>> (it is a Fortran application). I cannot see why. It happens on multiple
>>>>>>>> systems with different MPI implementations, so I believe the problem is
>>>>>>>> in my application or in the HDF5 library, not in the MPI implementation
>>>>>>>> or at the system level.
>>>>>>>>
>>>>>>>> The problem disappears if I turn off collective IO.
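>>>>>>>>
>>>>>>>> In other words, the only difference between the hanging and the
>>>>>>>> working case is the transfer mode on the dataset-transfer property
>>>>>>>> list (C-style sketch of the equivalent calls):
>>>>>>>>
>>>>>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>>>>>     /* Collective transfer: this is the case that hangs. */
>>>>>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>>>>>     /* Independent transfer: replacing the call above with this
>>>>>>>>      * makes the hang go away. */
>>>>>>>>     /* H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT); */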
>>>>>>>>
>>>>>>>> I have tried compiling HDF5 with as much error checking as possible
>>>>>>>> (--enable-debug=all --disable-production), and I do not get any errors
>>>>>>>> or warnings from the HDF5 library.
>>>>>>>>
>>>>>>>> I ran the code through TotalView and got the attached backtrace for
>>>>>>>> the 20 processes that participate in the IO communicator.
>>>>>>>>
>>>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>>>>
>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Håkon Strandenes
>>>>>>>> <Backtrace HDF5 err.png>
>>>
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5