One correction:
The "NOT WORKING" I reported for "HDF5 1.8.17, IMPI 5.1.3, Ifort
2016.3.210" is a different problem, a segmentation fault.
To avoid confusion, I repeat the working/not-working cases I tried:
HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: segfault (a different
problem, possibly with the HDF5 installation)
HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
I also tested on another cluster with a GPFS parallel file system
(instead of Lustre):
Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.15: OK
Intel 17.0, IMPI 5.1.3.181, HDF5 1.8.18: OK
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.17: NOT WORKING
Intel 17.0, IMPI 2017.2.174, HDF5 1.8.18: NOT WORKING
So the common denominator seems to be Intel MPI 2017.
Regards,
Håkon
On 05/22/2017 05:13 PM, Håkon Strandenes wrote:
I have managed to prepare an example program. I stripped away a lot of
non-essential code by preparing some data files in advance. Note that
the example is for 20 processes *only*.
I reported earlier that I also saw the bug on a system with SGI MPT,
but this example runs fine on that system, so let's disregard that for
the moment.
The problem occurs with combinations of "newer" Intel MPI and "newer" HDF5.
I tested for instance:
HDF5 1.8.16, IMPI 5.0.3 and Ifort 2015.3.187: WORKS
HDF5 1.8.16, IMPI 5.1.2 and Ifort 2016.1.150: WORKS
HDF5 1.8.17, OpenMPI 1.10.3, GFortran 5.4.0: WORKS
And the following does not work:
HDF5 1.8.17, IMPI 5.1.3, Ifort 2016.3.210: NOT WORKING
HDF5 1.8.17, IMPI 2017.1.132, Ifort 2017.1.132: NOT WORKING
Does anyone have any idea on how to proceed in the debugging? Does
anyone see any obvious flaws in my example program?
Thanks for all help.
Regards,
Håkon Strandenes
On 05/20/2017 09:12 PM, Quincey Koziol wrote:
On May 19, 2017, at 12:32 PM, Håkon Strandenes <[email protected]>
wrote:
Yes, the issue is still there.
I will try to make a dummy program to demonstrate the error. That will
probably be the easiest thing to debug in the long run.
That would be very helpful, thanks,
Quincey
Regards,
Håkon
On 05/19/2017 08:26 PM, Scot Breitenfeld wrote:
Can you try it with 1.10.1 and see if you still have the issue?
Scot
On May 19, 2017, at 1:11 PM, Quincey Koziol <[email protected]> wrote:
Hi Håkon,
On May 19, 2017, at 10:01 AM, Håkon Strandenes
<[email protected]> wrote:
(sorry, forgot to cc mailing list in prev. mail)
A standalone test program would be quite an effort, but I will
think about it. I know that at least all simple test cases pass,
so I need a "complicated" problem to generate the error.
Yeah, that’s usually the case with these kind of issues. :-/
One thing I wonder about is:
Are the requirements for collective I/O in this document:
https://support.hdfgroup.org/HDF5/PHDF5/parallelhdf5hints.pdf
still valid and accurate?
The reason I ask is that my filespace is complicated. Each IO process
creates the filespace with MANY calls to select_hyperslab. Hence it is
neither regular nor singular, and according to the above-mentioned
document the HDF5 library should not be able to do collective I/O in
this case. Still, it seems to hang in some collective writing routine.
Am I onto something? Could this be a problem?
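To make it concrete, the filespace construction I mean looks roughly
like this (a sketch only; the loop bounds, offsets and variable names
are hypothetical, not the actual code):

```fortran
! Start from an empty selection on the dataset's dataspace, then
! OR in one small hyperslab per sampling point.
call h5dget_space_f(dset_id, filespace, hdferr)
call h5sselect_none_f(filespace, hdferr)
do i = 1, npoints
   start(1) = point_offset(i)   ! hypothetical per-point file offset
   count(1) = 1
   call h5sselect_hyperslab_f(filespace, H5S_SELECT_OR_F, &
                              start, count, hdferr)
end do
```

The resulting selection is an irregular union of many point-sized
blocks, which is why I ask whether the regular/singular requirements in
the document still apply.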
Fortunately, we’ve expanded the feature set for collective I/O
now and it supports arbitrary selections on chunked datasets.
There’s always the chance for a bug of course, but it would have to
be very unusual, since we are pretty thorough about the regression
testing…
Quincey
Regards,
Håkon
On 05/19/2017 04:46 PM, Quincey Koziol wrote:
Hmm, sounds like you’ve varied a lot of things, which is good.
But, the constant seems to be your code now. :-/ Can you
replicate the error with a small standalone C test program?
Quincey
On May 19, 2017, at 7:43 AM, Håkon Strandenes
<[email protected]> wrote:
The behavior is there both with SGI MPT and Intel MPI. I can try
OpenMPI as well, but that is not as well tested on the systems
we are using as the previously mentioned ones.
I also tested and can confirm that the problem is there as well
with HDF5 1.10.1.
Regards,
Håkon
On 05/19/2017 04:29 PM, Quincey Koziol wrote:
Hi Håkon,
Actually, given this behavior, it’s reasonably possible
that you have found a bug in the MPI implementation that you
have, so I wouldn’t rule that out. What implementation and
version of MPI are you using?
Quincey
On May 19, 2017, at 4:14 AM, Håkon Strandenes
<[email protected]> wrote:
Hi,
I have an MPI application where each process samples some data. Each
process can have an arbitrary number of sampling points (or no points
at all). During the simulation, each process buffers the sampled
values in local memory until the buffer is full. At that point each
process sends its data to designated IO processes, and the IO
processes open an HDF5 file, extend a dataset and write the data into
the file.
The filespace can be quite complicated, constructed with numerous
calls to "h5sselect_hyperslab_f". The memspace is always a simple
contiguous block of data. The chunk size is equal to the buffer size,
i.e. each time the dataset is extended, it is extended by exactly one
chunk.
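In outline, the extend-and-write step on each IO process looks like
this (a sketch only; the variable names, rank and sizes are
hypothetical, not the actual code):

```fortran
! Extend the dataset by exactly one chunk along the unlimited dimension.
dims(1)    = nvalues_per_chunk          ! hypothetical buffer/chunk size
newsize(1) = cursize(1) + nvalues_per_chunk
call h5dset_extent_f(dset_id, newsize, hdferr)

! Memspace: a simple contiguous block matching the local buffer.
call h5screate_simple_f(1, dims, memspace, hdferr)

! Collective transfer property list (the mode that triggers the hang).
call h5pcreate_f(H5P_DATASET_XFER_F, plist_id, hdferr)
call h5pset_dxpl_mpio_f(plist_id, H5FD_MPIO_COLLECTIVE_F, hdferr)

! The write that hangs; filespace is the complicated hyperslab union.
call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, buffer, dims, hdferr, &
                mem_space_id=memspace, file_space_id=filespace,   &
                xfer_prp=plist_id)
```

Replacing H5FD_MPIO_COLLECTIVE_F with H5FD_MPIO_INDEPENDENT_F in the
transfer property list is what I mean by "turning off collective IO"
below.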
The problem is that in some cases the application hangs in h5dwrite_f
(it is a Fortran application). I cannot see why. It happens on
multiple systems with different MPI implementations, so I believe the
problem is in my application or in the HDF5 library, not in the MPI
implementation or at the system level.
The problem disappears if I turn off collective IO.
I have tried to compile HDF5 with as much error checking as possible
(--enable-debug=all --disable-production), and I do not get any errors
or warnings from the HDF5 library.
I ran the code through TotalView and got the attached backtrace for
the 20 processes that participate in the IO communicator.
Does anyone have any idea on how to continue debugging this
problem?
I currently use HDF5 version 1.8.17.
Best regards,
Håkon Strandenes
<Backtrace HDF5 err.png>
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5