> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>
> Yes, the issue is still there.
>
> I will try to make a dummy program that demonstrates the error. That will
> probably be the easiest thing to debug in the long run.
That would be very helpful, thanks,
Quincey
>
> Regards,
> Håkon
>
>
> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>> Can you try it with 1.10.1 and see if you still have the issue?
>> Scot
>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>
>>> Hi Håkon,
>>>
>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
>>>>
>>>> (sorry, forgot to cc the mailing list in the previous mail)
>>>>
>>>> A standalone test program would be quite an effort, but I will think about
>>>> it. I know that all the simple test cases pass, so I need a "complicated"
>>>> problem to trigger the error.
>>>
>>> Yeah, that’s usually the case with these kinds of issues. :-/
>>>
>>>
>>>> One thing I wonder about:
>>>> Are the requirements for collective IO in this document:
>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>> still valid and accurate?
>>>>
>>>> The reason I ask is that my filespace is complicated. Each IO process
>>>> creates its filespace with MANY calls to select_hyperslab. Hence the
>>>> selection is neither regular nor singular, and according to the
>>>> above-mentioned document the HDF5 library should not be able to do
>>>> collective IO in this case. Still, it seems to hang in some collective
>>>> writing routine, roughly as in the sketch below.
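>>>>
>>>> For illustration, the selection pattern is roughly the following (a
>>>> simplified C sketch, not my actual Fortran code; the names, extents and
>>>> datatype are made up):
>>>>
>>>> #include <hdf5.h>
>>>>
>>>> /* Build an irregular file selection by OR-ing many single-element
>>>>  * hyperslabs together, then write a contiguous buffer collectively.
>>>>  * 'dset', 'offsets' and 'buf' are assumed to come from the caller. */
>>>> static herr_t write_samples(hid_t dset, int npoints,
>>>>                             const hsize_t *offsets, const double *buf)
>>>> {
>>>>     hid_t filespace = H5Dget_space(dset);
>>>>     H5Sselect_none(filespace);
>>>>     for (int i = 0; i < npoints; i++) {
>>>>         hsize_t start[1] = {offsets[i]};
>>>>         hsize_t count[1] = {1};
>>>>         H5Sselect_hyperslab(filespace, H5S_SELECT_OR, start, NULL,
>>>>                             count, NULL);
>>>>     }
>>>>
>>>>     /* The memory buffer is a simple contiguous block. */
>>>>     hsize_t mdims[1] = {(hsize_t)npoints};
>>>>     hid_t memspace = H5Screate_simple(1, mdims, NULL);
>>>>
>>>>     /* Collective transfer mode. */
>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>
>>>>     herr_t status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace,
>>>>                              filespace, dxpl, buf);
>>>>
>>>>     H5Pclose(dxpl);
>>>>     H5Sclose(memspace);
>>>>     H5Sclose(filespace);
>>>>     return status;
>>>> }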
>>>>
>>>> Am I onto something? Could this be a problem?
>>>
>>> Fortunately, we’ve since expanded the feature set for collective I/O, and
>>> it now supports arbitrary selections on chunked datasets. There’s always
>>> the chance of a bug, of course, but it would have to be a very unusual one,
>>> since we are pretty thorough about the regression testing…
>>>
>>> Quincey
>>>
>>>
>>>> Regards,
>>>> Håkon
>>>>
>>>>
>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>> Hmm, it sounds like you’ve varied a lot of things, which is good. But the
>>>>> constant now seems to be your code. :-/ Can you replicate the error with
>>>>> a small standalone C test program?
>>>>> Quincey
>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> The behavior is present with both SGI MPT and Intel MPI. I can try
>>>>>> OpenMPI as well, but it is not as well tested on the systems we are
>>>>>> using as the two mentioned above.
>>>>>>
>>>>>> I have also tested HDF5 1.10.1 and can confirm that the problem is
>>>>>> present there as well.
>>>>>>
>>>>>> Regards,
>>>>>> Håkon
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>> Hi Håkon,
>>>>>>> Actually, given this behavior, it’s quite possible that you have
>>>>>>> found a bug in the MPI implementation you are using, so I wouldn’t
>>>>>>> rule that out. Which MPI implementation and version do you use?
>>>>>>> Quincey
>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I have an MPI application where each process samples some data. Each
>>>>>>>> process can have an arbitrary number of sampling points (or no points
>>>>>>>> at all). During the simulation each process buffers the sample values
>>>>>>>> in local memory until the buffer is full. At that point each process
>>>>>>>> sends its data to designated IO processes, and the IO processes open
>>>>>>>> an HDF5 file, extend a dataset and write the data into the file.
>>>>>>>>
>>>>>>>> The filespace can be quite complicated, constructed with numerous calls
>>>>>>>> to "h5sselect_hyperslab_f". The memspace is always a simple contiguous
>>>>>>>> block of data. The chunk size is equal to the buffer size, i.e. each
>>>>>>>> time the dataset is extended it is extended by exactly one chunk,
>>>>>>>> roughly as in the sketch below.
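>>>>>>>>
>>>>>>>> Schematically, one write step looks something like this (a simplified
>>>>>>>> C sketch of the pattern; the real code is Fortran, the names are made
>>>>>>>> up, and the file selection here is a single slab only to keep the
>>>>>>>> sketch short):
>>>>>>>>
>>>>>>>> #include <hdf5.h>
>>>>>>>>
>>>>>>>> /* Extend the dataset by exactly one chunk, then write that chunk
>>>>>>>>  * collectively. 'chunk_len' equals the buffer size and 'cur_len'
>>>>>>>>  * is the current dataset length. */
>>>>>>>> static void append_chunk(hid_t dset, hsize_t cur_len,
>>>>>>>>                          hsize_t chunk_len, const double *buf)
>>>>>>>> {
>>>>>>>>     hsize_t new_dims[1] = {cur_len + chunk_len};
>>>>>>>>     H5Dset_extent(dset, new_dims);          /* grow by one chunk */
>>>>>>>>
>>>>>>>>     hid_t filespace = H5Dget_space(dset);   /* refresh after extend */
>>>>>>>>     hsize_t start[1] = {cur_len};
>>>>>>>>     hsize_t count[1] = {chunk_len};
>>>>>>>>     H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL,
>>>>>>>>                         count, NULL);
>>>>>>>>
>>>>>>>>     hid_t memspace = H5Screate_simple(1, count, NULL);
>>>>>>>>
>>>>>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>>>>>     H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
>>>>>>>>
>>>>>>>>     H5Pclose(dxpl);
>>>>>>>>     H5Sclose(memspace);
>>>>>>>>     H5Sclose(filespace);
>>>>>>>> }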
>>>>>>>>
>>>>>>>> The problem is that in some cases the application hangs in h5dwrite_f
>>>>>>>> (it is a Fortran application). I cannot see why. It happens on multiple
>>>>>>>> systems with different MPI implementations, so I believe the problem is
>>>>>>>> in my application or in the HDF5 library, not in the MPI implementation
>>>>>>>> or at the system level.
>>>>>>>>
>>>>>>>> The problem disappears if I turn off collective IO.
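>>>>>>>>
>>>>>>>> In other words, the only difference between the hanging and the
>>>>>>>> working case is the transfer mode on the dataset-transfer property
>>>>>>>> list (C-style sketch of the equivalent calls):
>>>>>>>>
>>>>>>>>     hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
>>>>>>>>     /* Collective transfer: this is the case that hangs. */
>>>>>>>>     H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
>>>>>>>>     /* Independent transfer: replacing the call above with this
>>>>>>>>      * makes the hang go away. */
>>>>>>>>     /* H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT); */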
>>>>>>>>
>>>>>>>> I have tried compiling HDF5 with as much error checking as possible
>>>>>>>> (--enable-debug=all --disable-production), and I do not get any errors
>>>>>>>> or warnings from the HDF5 library.
>>>>>>>>
>>>>>>>> I ran the code through TotalView and got the attached backtrace for
>>>>>>>> the 20 processes that participate in the IO communicator.
>>>>>>>>
>>>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>>>>
>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Håkon Strandenes
>>>>>>>> <Backtrace HDF5 err.png>
>>>
>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5