Hi Håkon,

> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
> 
> (sorry, forgot to cc mailing list in prev. mail)
> 
> A standalone test program would be quite an effort, but I will think about 
> it. I know that at least all simple test cases pass, so I need a 
> "complicated" problem to generate the error.

        Yeah, that’s usually the case with these kinds of issues.  :-/


> One thing I wonder about is:
> Are the requirements for collective IO in this document:
> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
> still valid and accurate?
> 
> The reason I ask is that my filespace is complicated. Each IO process creates 
> the filespace with MANY calls to select_hyperslab. Hence it is neither 
> regular nor singular, and according to the above-mentioned document the HDF5 
> library should not be able to do collective IO in this case. Still, it seems 
> like it hangs in some collective writing routine.
> 
> Am I onto something? Could this be a problem?

        Fortunately, we’ve expanded the feature set for collective I/O now and 
it supports arbitrary selections on chunked datasets.  There’s always the 
chance of a bug, of course, but it would have to be very unusual, since we are 
pretty thorough about the regression testing…

                Quincey
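
For concreteness, an irregular selection like the one described above is typically built by OR-ing several hyperslab blocks into the dataset's file dataspace, with collective transfer requested on the dataset transfer property list. Below is a minimal C sketch along those lines; the dataset name, 1-D layout, helper function, and block coordinates are invented for illustration and are not Håkon's actual code.

    /* Hypothetical helper: OR several 1-D blocks into the file dataspace of an
     * existing dataset and write them collectively.
     * blocks[i][0] = start, blocks[i][1] = count (1-D dataset assumed). */
    #include <hdf5.h>

    static void write_union_selection(hid_t file, const double *buf,
                                      const hsize_t (*blocks)[2], int nblocks)
    {
        hid_t  dset   = H5Dopen2(file, "/samples", H5P_DEFAULT);
        hid_t  fspace = H5Dget_space(dset);

        /* First block replaces the default selection ... */
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET,
                            &blocks[0][0], NULL, &blocks[0][1], NULL);
        /* ... the rest are OR-ed in, giving a non-regular selection. */
        for (int i = 1; i < nblocks; i++)
            H5Sselect_hyperslab(fspace, H5S_SELECT_OR,
                                &blocks[i][0], NULL, &blocks[i][1], NULL);

        /* Memory side: one contiguous block with the same number of points. */
        hsize_t npoints = (hsize_t)H5Sget_select_npoints(fspace);
        hid_t   mspace  = H5Screate_simple(1, &npoints, NULL);

        /* Request collective MPI-IO for this write.  Every rank sharing the
         * file must call H5Dwrite (ranks with no data can select none). */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    }

The file is assumed to have been opened with the MPI-IO file driver (H5Pset_fapl_mpio); the standalone sketch further down shows that part.
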


> Regards,
> Håkon
> 
> 
> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>> Hmm, sounds like you’ve varied a lot of things, which is good.  But the 
>> constant seems to be your code now. :-/  Can you replicate the error with a 
>> small standalone C test program?
>>      Quincey
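
A small standalone test of the kind being asked for might start from something like the sketch below: one extendible, chunked dataset, extended by exactly one chunk, with each rank writing its own row collectively. It is a hypothetical skeleton, not Håkon's application; the names, sizes, and layout are made up.

    /* Hypothetical standalone test: collective write into one new chunk of an
     * extendible dataset.  Build with a parallel HDF5, e.g. h5pcc test.c. */
    #include <mpi.h>
    #include <hdf5.h>

    #define NCOLS 16

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Open the file with the MPI-IO driver. */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
        hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* Extendible, chunked dataset: one chunk == one "buffer flush". */
        hsize_t dims[2]  = {0, NCOLS};
        hsize_t maxd[2]  = {H5S_UNLIMITED, NCOLS};
        hsize_t chunk[2] = {(hsize_t)nprocs, NCOLS};
        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 2, chunk);
        hid_t fspace = H5Screate_simple(2, dims, maxd);
        hid_t dset = H5Dcreate2(file, "samples", H5T_NATIVE_DOUBLE, fspace,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Sclose(fspace);

        /* Extend by exactly one chunk, as in the original report. */
        hsize_t newdims[2] = {(hsize_t)nprocs, NCOLS};
        H5Dset_extent(dset, newdims);

        /* Each rank selects its own row in the file and a matching
         * contiguous block in memory. */
        fspace = H5Dget_space(dset);
        hsize_t start[2] = {(hsize_t)rank, 0}, count[2] = {1, NCOLS};
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hsize_t mdims[1] = {NCOLS};
        hid_t mspace = H5Screate_simple(1, mdims, NULL);
        double buf[NCOLS];
        for (int j = 0; j < NCOLS; j++) buf[j] = rank + 0.01 * j;

        /* Collective write; H5FD_MPIO_INDEPENDENT here would correspond to
         * "turning off collective IO". */
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

        H5Pclose(dxpl); H5Sclose(mspace); H5Sclose(fspace);
        H5Pclose(dcpl); H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
        MPI_Finalize();
        return 0;
    }

From there, the per-rank row could be replaced with a union of many irregular blocks, as in the earlier snippet, to get closer to the failing pattern.
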
>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
>>> 
>>> The behavior is present with both SGI MPT and Intel MPI. I can try OpenMPI as 
>>> well, but it is not as well tested on the systems we are using as the two 
>>> previously mentioned implementations.
>>> 
>>> I also tested HDF5 1.10.1 and can confirm that the problem is present there 
>>> as well.
>>> 
>>> Regards,
>>> Håkon
>>> 
>>> 
>>> 
>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>> Hi Håkon,
>>>>    Actually, given this behavior, it’s reasonably possible that you have 
>>>> found a bug in your MPI implementation, so I wouldn’t rule that out.  
>>>> Which MPI implementation and version are you using?
>>>>    Quincey
>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have an MPI application where each process samples some data. Each
>>>>> process can have an arbitrary number of sampling points (or no points at
>>>>> all). During the simulation each process buffers the sample values in
>>>>> local memory until the buffer is full. At that point each process sends
>>>>> its data to designated IO processes, and the IO processes open an HDF5
>>>>> file, extend a dataset and write the data into the file.
>>>>> 
>>>>> The filespace can be quite complicated, constructed with numerous calls
>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
>>>>> block of data. The chunk size is equal to the buffer size, i.e. each
>>>>> time the dataset is extended it is extended by exactly one chunk.
>>>>> 
>>>>> The problem is that in some cases, the application hangs in h5dwrite_f
>>>>> (Fortran application). I cannot see why. It happens on multiple systems
>>>>> with different MPI implementations, so I believe that the problem is in
>>>>> my application or in the HDF5 library, not in the MPI implementation or
>>>>> at the system level.
>>>>> 
>>>>> The problem disappears if I turn off collective IO.
>>>>> 
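
For reference, that collective/independent switch normally lives on the dataset transfer property list. A minimal C sketch of the two settings follows (the helper name is invented for illustration; the Fortran counterpart of H5Pset_dxpl_mpio is h5pset_dxpl_mpio_f):

    #include <hdf5.h>

    /* Hypothetical helper: build a transfer property list requesting either
     * collective or independent MPI-IO; it is passed as the dxpl / xfer_prp
     * argument of H5Dwrite / h5dwrite_f. */
    static hid_t make_xfer_plist(int collective)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, collective ? H5FD_MPIO_COLLECTIVE
                                          : H5FD_MPIO_INDEPENDENT);
        return dxpl;
    }
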
>>>>> I have tried to compile HDF5 with as much error checking as possible
>>>>> (--enable-debug=all --disable-production) and I do not get any errors or
>>>>> warnings from the HDF5 library.
>>>>> 
>>>>> I ran the code through TotalView, and got the attached backtrace for the
>>>>> 20 processes that participate in the IO communicator.
>>>>> 
>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>> 
>>>>> I currently use HDF5 version 1.8.17.
>>>>> 
>>>>> Best regards,
>>>>> Håkon Strandenes
>>>>> <Backtrace HDF5 err.png>


