I tried your example using both HDF5 1.8.18 and our develop branch (essentially 1.10.1) on a CentOS 7 system, and your program completes successfully using Intel 17.0.4:

    mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
    Copyright(C) 2003-2017, Intel Corporation. All rights reserved.
    ifort version 17.0.4

Can you verify that 'make test' passes in testpar and fortran/testpar for your installation?

Thanks,
Scot
> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[email protected]> wrote:
>
> One correction:
>
> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210"
> is a different problem, involving a segmentation fault.
>
> To avoid confusion, I repeat the working/not-working cases I tried:
>
> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>
> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - a different problem,
> maybe with the HDF5 installation
>
> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>
> I also tested on another cluster with a GPFS parallel file system (instead
> of Lustre):
>
> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>
> So the common denominator seems to be Intel MPI 2017.
>
> Regards,
> Håkon
>
>
> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>> I have managed to prepare an example program. I stripped away a lot of
>> non-essential stuff by preparing some data files in advance. The example
>> is for 20 processes *only*.
>> I reported earlier that I also found the bug on a system with SGI MPT;
>> this example runs fine on that system, so let's disregard that for the
>> moment.
>> The problem occurs with combinations of "newer" Intel MPI and "newer"
>> HDF5. I tested, for instance:
>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>> And the following do not work:
>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>> Does anyone have any idea how to proceed with the debugging? Does anyone
>> see any obvious flaws in my example program?
>> Thanks for all the help.
>> Regards,
>> Håkon Strandenes
>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>
>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>>>>
>>>> Yes, the issue is still there.
>>>>
>>>> I will try to make a dummy program to demonstrate the error. It might
>>>> be the easiest thing to debug with in the long run.
>>>
>>> That would be very helpful, thanks,
>>> Quincey
>>>
>>>>
>>>> Regards,
>>>> Håkon
>>>>
>>>>
>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>> Can you try it with 1.10.1 and see if you still have the issue?
>>>>> Scot
>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>>>>
>>>>>> Hi Håkon,
>>>>>>
>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> (sorry, forgot to cc the mailing list in my previous mail)
>>>>>>>
>>>>>>> A standalone test program would be quite an effort, but I will think
>>>>>>> about it. I know that at least all the simple test cases pass, so I
>>>>>>> need a "complicated" problem to trigger the error.
>>>>>>
>>>>>> Yeah, that's usually the case with these kinds of issues. :-/
>>>>>>
>>>>>>> One thing I wonder about is:
>>>>>>> Are the requirements for collective IO in this document:
>>>>>>> https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>> still valid and accurate?
>>>>>>>
>>>>>>> The reason I ask is that my filespace is complicated. Each IO process
>>>>>>> creates the filespace with MANY calls to select_hyperslab. Hence it
>>>>>>> is neither regular nor singular, and according to the above-mentioned
>>>>>>> document the HDF5 library should not be able to do collective IO in
>>>>>>> this case. Still, it seems like it hangs in some collective writing
>>>>>>> routine.
>>>>>>>
>>>>>>> Am I onto something? Could this be a problem?
>>>>>>
>>>>>> Fortunately, we've expanded the feature set for collective I/O now
>>>>>> and it supports arbitrary selections on chunked datasets. There's
>>>>>> always the chance of a bug, of course, but it would have to be very
>>>>>> unusual, since we are pretty thorough about the regression testing...
>>>>>>
>>>>>> Quincey
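For readers following along: the kind of "neither regular nor singular" selection under discussion is built by OR-ing many hyperslab blocks into one file dataspace. A minimal Fortran sketch of that pattern follows; the subroutine name, the 1-D rank, and the offsets are illustrative only, not taken from Håkon's actual program:

    ! Sketch only: build an irregular file selection by OR-ing many
    ! hyperslab blocks together. Names, rank and offsets are illustrative.
    SUBROUTINE select_sample_points(dset_id, npts, offsets, filespace, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: dset_id
      INTEGER,          INTENT(IN)  :: npts
      INTEGER(HSIZE_T), INTENT(IN)  :: offsets(npts)  ! this rank's point offsets
      INTEGER(HID_T),   INTENT(OUT) :: filespace
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HSIZE_T) :: start(1), count(1)
      INTEGER          :: i

      CALL h5dget_space_f(dset_id, filespace, ierr)

      ! The first call replaces the default selection; every further call
      ! is OR-ed in, so the result is neither regular nor singular.
      count(1) = 1
      DO i = 1, npts
        start(1) = offsets(i)
        IF (i == 1) THEN
          CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, count, ierr)
        ELSE
          CALL h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, start, count, ierr)
        END IF
      END DO
    END SUBROUTINE select_sample_points

Per Quincey's reply above, recent HDF5 versions should be able to handle such a selection collectively on chunked datasets.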
>>>>>>> Regards,
>>>>>>> Håkon
>>>>>>>
>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>> Hmm, sounds like you've varied a lot of things, which is good. But
>>>>>>>> the constant now seems to be your code. :-/ Can you replicate the
>>>>>>>> error with a small standalone C test program?
>>>>>>>> Quincey
>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> The behavior is there with both SGI MPT and Intel MPI. I can try
>>>>>>>>> OpenMPI as well, but it is not as well tested on the systems we
>>>>>>>>> are using as the previously mentioned ones.
>>>>>>>>>
>>>>>>>>> I also tested and can confirm that the problem is present with
>>>>>>>>> HDF5 1.10.1 as well.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Håkon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>> Hi Håkon,
>>>>>>>>>> Actually, given this behavior, it's reasonably possible that you
>>>>>>>>>> have found a bug in the MPI implementation you are using, so I
>>>>>>>>>> wouldn't rule that out. What implementation and version of MPI
>>>>>>>>>> are you using?
>>>>>>>>>> Quincey
>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I have an MPI application where each process samples some data.
>>>>>>>>>>> Each process can have an arbitrary number of sampling points (or
>>>>>>>>>>> no points at all). During the simulation each process buffers
>>>>>>>>>>> the sample values in local memory until the buffer is full. At
>>>>>>>>>>> that point each process sends its data to designated IO
>>>>>>>>>>> processes, and the IO processes open an HDF5 file, extend a
>>>>>>>>>>> dataset and write the data into the file.
>>>>>>>>>>>
>>>>>>>>>>> The filespace can be quite complicated, constructed with
>>>>>>>>>>> numerous calls to "h5sselect_hyperslab_f". The memspace is
>>>>>>>>>>> always a simple contiguous block of data. The chunk size is
>>>>>>>>>>> equal to the buffer size, i.e. each time the dataset is extended
>>>>>>>>>>> it is extended by exactly one chunk.
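The create-and-extend pattern just described (chunk size equal to the buffer size, dataset grown by exactly one chunk per flush) might be sketched like this in Fortran; the dataset name and sizes are illustrative, not from the actual program:

    ! Sketch only: an unlimited 1-D dataset whose chunk size equals the
    ! write buffer, to be grown one chunk at a time with h5dset_extent_f.
    ! Assumes h5open_f has already been called.
    SUBROUTINE create_sample_dset(file_id, bufsize, dset_id, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: file_id  ! an already-open parallel file
      INTEGER(HSIZE_T), INTENT(IN)  :: bufsize  ! buffer size == chunk size
      INTEGER(HID_T),   INTENT(OUT) :: dset_id
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HID_T)   :: space_id, dcpl_id
      INTEGER(HSIZE_T) :: dims(1), maxdims(1), chunk(1)

      dims(1)    = 0                ! start empty
      maxdims(1) = H5S_UNLIMITED_F  ! allow the dataset to be extended
      chunk(1)   = bufsize

      CALL h5screate_simple_f(1, dims, space_id, ierr, maxdims)
      CALL h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
      CALL h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
      CALL h5dcreate_f(file_id, "samples", H5T_NATIVE_DOUBLE, space_id, &
                       dset_id, ierr, dcpl_id)
      CALL h5pclose_f(dcpl_id, ierr)
      CALL h5sclose_f(space_id, ierr)
    END SUBROUTINE create_sample_dset

Each time the buffers fill, h5dset_extent_f would then grow the dataset by bufsize elements, i.e. exactly one chunk, before the write.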
>>>>>>>>>>> The problem is that in some cases the application hangs in
>>>>>>>>>>> h5dwrite_f (this is a Fortran application). I cannot see why. It
>>>>>>>>>>> happens on multiple systems with different MPI implementations,
>>>>>>>>>>> so I believe that the problem is in my application or in the
>>>>>>>>>>> HDF5 library, not in the MPI implementation or at the system
>>>>>>>>>>> level.
>>>>>>>>>>>
>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>
>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as
>>>>>>>>>>> possible (--enable-debug=all --disable-production) and I do not
>>>>>>>>>>> get any errors or warnings from the HDF5 library.
>>>>>>>>>>>
>>>>>>>>>>> I ran the code through TotalView, and got the attached backtrace
>>>>>>>>>>> for the 20 processes that participate in the IO communicator.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have any idea how to continue debugging this
>>>>>>>>>>> problem?
>>>>>>>>>>>
>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Håkon Strandenes
>>>>>>>>>>>
>>>>>>>>>>> <Backtrace HDF5 err.png>
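For anyone trying to reproduce this: the collective/independent switch Håkon mentions is just the dataset transfer property list passed to h5dwrite_f. A minimal sketch follows; the subroutine name and handles are illustrative and assumed to come from the surrounding program:

    ! Sketch only: the collective vs. independent toggle is the transfer
    ! property list handed to h5dwrite_f. Handles and names are illustrative.
    SUBROUTINE write_buffer(dset_id, memspace, filespace, buf, n, collective, ierr)
      USE HDF5
      IMPLICIT NONE
      INTEGER(HID_T),   INTENT(IN)  :: dset_id, memspace, filespace
      INTEGER(HSIZE_T), INTENT(IN)  :: n        ! number of elements in buf
      DOUBLE PRECISION, INTENT(IN)  :: buf(*)
      LOGICAL,          INTENT(IN)  :: collective
      INTEGER,          INTENT(OUT) :: ierr

      INTEGER(HID_T)   :: dxpl_id
      INTEGER(HSIZE_T) :: dims(1)

      dims(1) = n
      CALL h5pcreate_f(H5P_DATASET_XFER_F, dxpl_id, ierr)
      IF (collective) THEN
        ! The mode that hangs in the failing combinations reported above.
        CALL h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, ierr)
      ELSE
        ! Reported to work everywhere, at the cost of independent I/O.
        CALL h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_INDEPENDENT_F, ierr)
      END IF

      CALL h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, ierr, &
                      mem_space_id=memspace, file_space_id=filespace, &
                      xfer_prp=dxpl_id)
      CALL h5pclose_f(dxpl_id, ierr)
    END SUBROUTINE write_buffer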
