Can you try it with 1.10.1 and see if you still have an issue?

Scot
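
One classic cause of a hang inside a collective write is that not every rank in the file's communicator makes the call: in collective mode, h5dwrite_f must be called by all participating processes, and a rank with nothing to write still has to call it with an empty selection, or everyone else blocks forever. Purely as an illustration (no MPI or HDF5 involved; a stdlib Barrier stands in for the collective call, and the names here are made up for the sketch), the failure mode looks like this:

```python
import threading

NRANKS = 4  # stand-in for the size of the IO communicator

def run(ranks_that_call):
    """Simulate one collective operation: every rank must reach the barrier."""
    barrier = threading.Barrier(NRANKS)
    done = []

    def rank(r):
        if r in ranks_that_call:
            try:
                barrier.wait(timeout=0.5)  # the "collective write"
                done.append(r)
            except threading.BrokenBarrierError:
                pass  # some rank never called -> deadlock (here: a timeout)

    threads = [threading.Thread(target=rank, args=(r,)) for r in range(NRANKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(done)

# All ranks participate (even those with an empty selection): completes.
assert run(set(range(NRANKS))) == [0, 1, 2, 3]
# Rank 3 skips the collective call: everyone else blocks (times out here).
assert run({0, 1, 2}) == []
```

If the backtrace shows some IO ranks inside the MPI collective and others elsewhere, that mismatch is worth ruling out before suspecting the library itself.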

> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
> 
> Hi Håkon,
> 
>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
>> 
>> (sorry, forgot to cc mailing list in prev. mail)
>> 
>> A standalone test program would be quite an effort, but I will think about 
>> it. I know that at least all simple test cases pass, so I need a 
>> "complicated" problem to generate the error.
> 
>       Yeah, that’s usually the case with these kinds of issues.  :-/
> 
> 
>> One thing I wonder about is:
>> Are the requirements for collective IO in this document:
>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>> still valid and accurate?
>> 
>> The reason I ask is that my filespace is complicated. Each IO process creates 
>> the filespace with MANY calls to select_hyperslab. Hence it is neither 
>> regular nor singular, and according to the above-mentioned document the HDF5 
>> library should not be able to do collective IO in this case. Still, it seems 
>> like it hangs in some collective writing routine.
>> 
>> Am I onto something? Could this be a problem?
> 
>       Fortunately, we’ve expanded the feature set for collective I/O, and it 
> now supports arbitrary selections on chunked datasets.  There’s always the 
> chance of a bug, of course, but it would have to be very unusual, since we 
> are pretty thorough about the regression testing…
> 
>               Quincey
> 
> 
>> Regards,
>> Håkon
>> 
>> 
>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>> Hmm, sounds like you’ve varied a lot of things, which is good.  But, the 
>>> constant seems to be your code now. :-/  Can you replicate the error with a 
>>> small standalone C test program?
>>>     Quincey
>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
>>>> 
>>>> The behavior is there with both SGI MPT and Intel MPI. I can try OpenMPI 
>>>> as well, but that is not as well tested on the systems we are using as the 
>>>> previously mentioned ones.
>>>> 
>>>> I also tested with HDF5 1.10.1 and can confirm that the problem is 
>>>> present there as well.
>>>> 
>>>> Regards,
>>>> Håkon
>>>> 
>>>> 
>>>> 
>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>> Hi Håkon,
>>>>>   Actually, given this behavior, it’s reasonably possible that you have 
>>>>> found a bug in your MPI implementation, so I wouldn’t rule that out.  
>>>>> What implementation and version of MPI are you using?
>>>>>   Quincey
>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> 
>>>>>> wrote:
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I have an MPI application where each process samples some data. Each
>>>>>> process can have an arbitrary number of sampling points (or no points at
>>>>>> all). During the simulation each process buffers the sample values in
>>>>>> local memory until the buffer is full. At that point each process sends
>>>>>> its data to designated IO processes, and the IO processes open an HDF5
>>>>>> file, extend a dataset and write the data into the file.
>>>>>> 
>>>>>> The filespace can be quite complicated, constructed with numerous calls
>>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
>>>>>> block of data. The chunk size is equal to the buffer size, i.e. each
>>>>>> time the dataset is extended it is extended by exactly one chunk.
>>>>>> 
>>>>>> The problem is that in some cases the application hangs in h5dwrite_f
>>>>>> (Fortran application). I cannot see why. It happens on multiple systems
>>>>>> with different MPI implementations, so I believe the problem is in
>>>>>> my application or in the HDF5 library, not in the MPI implementation or
>>>>>> at the system level.
>>>>>> 
>>>>>> The problem disappears if I turn off collective IO.
>>>>>> 
>>>>>> I have tried to compile HDF5 with as much error checking as possible
>>>>>> (--enable-debug=all --disable-production) and I do not get any errors or
>>>>>> warnings from the HDF5 library.
>>>>>> 
>>>>>> I ran the code through TotalView, and got the attached backtrace for the
>>>>>> 20 processes that participate in the IO communicator.
>>>>>> 
>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>> 
>>>>>> I currently use HDF5 version 1.8.17.
>>>>>> 
>>>>>> Best regards,
>>>>>> Håkon Strandenes
>>>>>> <Backtrace HDF5 err.png>
>>>>>> _______________________________________________
>>>>>> Hdf-forum is for HDF software users discussion.
>>>>>> [email protected]
>>>>>> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
>>>>>> Twitter: https://twitter.com/hdf5
> 
> 
