I am sorry for the delay; my health has not been cooperating with debugging this problem over the last week.

I have now tried HDF5 1.8.18 with Intel MPI 2017.1.132, and all tests pass, both serial and parallel, C and Fortran. The example still fails when run over more than one compute node. When the dataset transfer is H5FD_MPIO_INDEPENDENT_F it succeeds.

My next step will be to try to build HDF5 with CMake instead of configure, to see if this changes anything.

Regards,
Håkon
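For readers following along, here is a minimal, self-contained sketch of how the transfer mode mentioned above is selected through the dataset transfer property list. Everything in it (file name, dataset, buffer, sizes) is illustrative and not taken from the application under discussion; it only demonstrates the H5FD_MPIO_INDEPENDENT_F / H5FD_MPIO_COLLECTIVE_F switch.

    program xfer_mode_sketch
       ! Sketch only: every rank writes the same tiny dataset; a real application
       ! would select per-rank hyperslabs in the file space.
       use mpi
       use hdf5
       implicit none

       integer :: mpierr, hdferr, comm, info
       integer(hid_t) :: fapl_id, dxpl_id, file_id, space_id, dset_id
       integer(hsize_t), dimension(1) :: dims = (/ 10 /)
       double precision, dimension(10) :: buf = 1.0d0

       call mpi_init(mpierr)
       comm = MPI_COMM_WORLD
       info = MPI_INFO_NULL

       call h5open_f(hdferr)

       ! File access property list using the MPI-IO driver
       call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
       call h5pset_fapl_mpio_f(fapl_id, comm, info, hdferr)
       call h5fcreate_f("sketch.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl_id)

       call h5screate_simple_f(1, dims, space_id, hdferr)
       call h5dcreate_f(file_id, "data", H5T_NATIVE_DOUBLE, space_id, dset_id, hdferr)

       ! Dataset transfer property list: independent or collective MPI-IO
       call h5pcreate_f(H5P_DATASET_XFER_F, dxpl_id, hdferr)
       call h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_INDEPENDENT_F, hdferr)
       ! call h5pset_dxpl_mpio_f(dxpl_id, H5FD_MPIO_COLLECTIVE_F, hdferr)

       call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, dims, hdferr, xfer_prp=dxpl_id)

       call h5pclose_f(dxpl_id, hdferr)
       call h5dclose_f(dset_id, hdferr)
       call h5sclose_f(space_id, hdferr)
       call h5pclose_f(fapl_id, hdferr)
       call h5fclose_f(file_id, hdferr)
       call h5close_f(hdferr)
       call mpi_finalize(mpierr)
    end program xfer_mode_sketch

Compile against a parallel HDF5 build (for example with the h5pfc wrapper) and run under mpirun with a few ranks; swapping the commented line in exercises the collective code path.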
On 05/26/2017 04:03 PM, Håkon Strandenes wrote:
> Thanks for trying my example. I will try the tests.
>
> However, when trying my own example again I realized that the error does not occur when running on one compute node or a single workstation. I tested 20 processes on a single node, both on a node-local filesystem (local scratch) and on a parallel networked filesystem, and that worked. Running five processes each on four nodes leads to the error/hanging condition.
>
> Regards,
> Håkon Strandenes
>
>
> On 05/25/2017 04:30 PM, Scot Breitenfeld wrote:
>> I tried your example using both HDF5 1.8.18 and our develop branch (basically 1.10.1) on a CentOS 7 system, and your program completes successfully using Intel 17.0.4.
>>
>> mpiifort for the Intel(R) MPI Library 2017 Update 3 for Linux*
>> Copyright(C) 2003-2017, Intel Corporation. All rights reserved.
>> ifort version 17.0.4
>>
>> Can you verify whether ‘make test’ passes in testpar and fortran/testpar for your installation?
>>
>> Thanks,
>> Scot
>>
>>> On May 22, 2017, at 2:38 PM, Håkon Strandenes <[email protected]> wrote:
>>>
>>> One correction:
>>>
>>> The "NOT WORKING" reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210" is a different problem, a segmentation fault.
>>>
>>> To avoid confusion, I repeat the working/not working cases I tried:
>>>
>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>
>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault - a different problem, maybe with the HDF5 installation
>>>
>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>
>>> I also tested on another cluster with a GPFS parallel file system (instead of Lustre):
>>>
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
>>> Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
>>> Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
>>>
>>> So the common denominator seems to be Intel MPI 2017.
>>>
>>> Regards,
>>> Håkon
>>>
>>>
>>> On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
>>>> I have managed to prepare an example program. I stripped away a lot of non-essential stuff by preparing some data files in advance.
>>>> The example is for 20 processes *only*.
>>>> I reported earlier that I also found the bug on a system with SGI MPT; this example runs fine on that system, so let's disregard it for the moment.
>>>> The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.
>>>> I tested, for instance:
>>>> HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
>>>> HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
>>>> HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
>>>> And the following do not work:
>>>> HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
>>>> HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
>>>> Does anyone have any idea on how to proceed with the debugging? Does anyone see any obvious flaws in my example program?
>>>> Thanks for all help.
>>>> Regards,
>>>> Håkon Strandenes
>>>>
>>>> On 05/20/2017 09:12 PM, Quincey Koziol wrote:
>>>>>
>>>>>> On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]> wrote:
>>>>>>
>>>>>> Yes, the issue is still there.
>>>>>>
>>>>>> I will try to make a dummy program to demonstrate the error. It might be the easiest thing to debug on in the long run.
>>>>>
>>>>> That would be very helpful, thanks,
>>>>> Quincey
>>>>>
>>>>>>
>>>>>> Regards,
>>>>>> Håkon
>>>>>>
>>>>>>
>>>>>> On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
>>>>>>> Can you try it with 1.10.1 and see if you still have the issue?
>>>>>>> Scot
>>>>>>>> On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Håkon,
>>>>>>>>
>>>>>>>>> On May 19, 2017, at 10:01 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> (sorry, forgot to cc the mailing list in the previous mail)
>>>>>>>>>
>>>>>>>>> A standalone test program would be quite an effort, but I will think about it. I know that at least all simple test cases pass, so I need a "complicated" problem to trigger the error.
>>>>>>>>
>>>>>>>> Yeah, that’s usually the case with these kinds of issues. :-/
>>>>>>>>
>>>>>>>>
>>>>>>>>> One thing I wonder about: are the requirements for collective IO in this document still valid and accurate? https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
>>>>>>>>>
>>>>>>>>> The reason I ask is that my filespace is complicated. Each IO process creates the filespace with MANY calls to select_hyperslab. Hence it is neither regular nor singular, and according to the above-mentioned document the HDF5 library should not be able to do collective IO in this case. Still, it seems to hang in some collective writing routine.
>>>>>>>>>
>>>>>>>>> Am I onto something? Could this be a problem?
>>>>>>>>
>>>>>>>> Fortunately, we’ve expanded the feature set for collective I/O now and it supports arbitrary selections on chunked datasets. There’s always the chance of a bug of course, but it would have to be very unusual, since we are pretty thorough about the regression testing…
>>>>>>>>
>>>>>>>> Quincey
>>>>>>>>
>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Håkon
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/19/2017 04:46 PM, Quincey Koziol wrote:
>>>>>>>>>> Hmm, sounds like you’ve varied a lot of things, which is good. But the constant now seems to be your code. :-/ Can you replicate the error with a small standalone C test program?
>>>>>>>>>> Quincey
>>>>>>>>>>> On May 19, 2017, at 7:43 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> The behavior is there both with SGI MPT and Intel MPI. I can try OpenMPI as well, but it is not as well tested on the systems we are using as the previously mentioned ones.
>>>>>>>>>>>
>>>>>>>>>>> I also tested and can confirm that the problem is there with HDF5 1.10.1 as well.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Håkon
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/19/2017 04:29 PM, Quincey Koziol wrote:
>>>>>>>>>>>> Hi Håkon,
>>>>>>>>>>>> Actually, given this behavior, it’s reasonably possible that you have found a bug in the MPI implementation you are using, so I wouldn’t rule that out. What implementation and version of MPI are you using?
>>>>>>>>>>>> Quincey
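As an aside for readers reconstructing the setup described in the quoted messages: a filespace that is "neither regular nor singular", built from many hyperslab calls, might look roughly like the following sketch, which ORs single-element blocks of a 1-D dataspace together. The subroutine name and its arguments are illustrative placeholders, not code from the application discussed in this thread.

    subroutine select_points(space_id, idx, n_pts, hdferr)
       ! Sketch: build an irregular file selection as a union of many hyperslabs.
       use hdf5
       implicit none
       integer(hid_t), intent(in)  :: space_id   ! dataspace of the (extended) dataset
       integer(hsize_t), intent(in) :: idx(:)    ! file offsets owned by this IO process
       integer, intent(in)  :: n_pts
       integer, intent(out) :: hdferr

       integer :: i
       integer(hsize_t), dimension(1) :: start, cnt

       cnt(1) = 1
       do i = 1, n_pts
          start(1) = idx(i)
          if (i == 1) then
             ! The first block replaces any existing selection
             call h5sselect_hyperslab_f(space_id, H5S_SELECT_SET_F, start, cnt, hdferr)
          else
             ! Subsequent blocks are OR'ed into the selection
             call h5sselect_hyperslab_f(space_id, H5S_SELECT_OR_F, start, cnt, hdferr)
          end if
       end do
    end subroutine select_points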
>>>>>>>>>>>>> On May 19, 2017, at 4:14 AM, Håkon Strandenes <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have an MPI application where each process samples some data. Each process can have an arbitrary number of sampling points (or no points at all). During the simulation each process buffers the sample values in local memory until the buffer is full. At that point each process sends its data to designated IO processes, and the IO processes open an HDF5 file, extend a dataset and write the data into the file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The filespace can be quite complicated, constructed with numerous calls to "h5sselect_hyperslab_f". The memspace is always a simple contiguous block of data. The chunk size is equal to the buffer size, i.e. each time the dataset is extended it is extended by exactly one chunk.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem is that in some cases the application hangs in h5dwrite_f (it is a Fortran application). I cannot see why. It happens on multiple systems with different MPI implementations, so I believe that the problem is in my application or in the HDF5 library, not in the MPI implementation or at the system level.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem disappears if I turn off collective IO.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have tried to compile HDF5 with as much error checking as possible (--enable-debug=all --disable-production) and I do not get any errors or warnings from the HDF5 library.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I ran the code through TotalView, and got the attached backtrace for the 20 processes that participate in the IO communicator.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Does anyone have any idea on how to continue debugging this problem?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I currently use HDF5 version 1.8.17.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Håkon Strandenes
>>>>>>>>>>>>> <Backtrace HDF5 err.png>
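For completeness, the extend-by-one-chunk write step described in the original message above could look roughly like the sketch below. It assumes a 1-D chunked dataset and a transfer property list already set up (as in the earlier transfer-mode sketch); the names, sizes and the simple contiguous file selection are illustrative only, whereas in the application under discussion the file selection is a union of many hyperslabs.

    subroutine append_chunk(dset_id, dxpl_id, buf, old_size, chunk, hdferr)
       ! Sketch: extend a chunked 1-D dataset by one chunk and write the local
       ! buffer using the supplied transfer property list (e.g. collective MPI-IO).
       use hdf5
       implicit none
       integer(hid_t), intent(in)   :: dset_id, dxpl_id
       double precision, intent(in) :: buf(:)          ! one chunk worth of samples
       integer(hsize_t), intent(in) :: old_size, chunk
       integer, intent(out)         :: hdferr

       integer(hid_t) :: filespace, memspace
       integer(hsize_t), dimension(1) :: new_size, start, cnt

       ! Grow the dataset by exactly one chunk
       new_size(1) = old_size + chunk
       call h5dset_extent_f(dset_id, new_size, hdferr)

       ! Select the newly added region in the file ...
       call h5dget_space_f(dset_id, filespace, hdferr)
       start(1) = old_size
       cnt(1)   = chunk
       call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, start, cnt, hdferr)

       ! ... and a simple contiguous memory space for the buffer
       call h5screate_simple_f(1, cnt, memspace, hdferr)

       call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buf, cnt, hdferr, &
                       mem_space_id=memspace, file_space_id=filespace, xfer_prp=dxpl_id)

       call h5sclose_f(memspace, hdferr)
       call h5sclose_f(filespace, hdferr)
    end subroutine append_chunk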
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
