Hi Wolf,
I found the problem in your program. Note that the hang versus the error stack
(from Tim's email) just reflects different behavior across MPI implementations
or versions. One implementation hangs when MPI_File_set_size() is called from
inside HDF5 with different arguments on different processes, and the other
actually reports the error.
Now, on to the mistake in your program. HDF5 requires the call to H5Dcreate to
be collective. That means not only that all processes have to call it, but also
that all processes have to call it with the same arguments. You are creating a
chunked dataset with the same chunk dimensions on every process except the last
one, where you change the first dimension (nxLocal). This happens here:
    if ((nx%iNumOfProc) != 0) {
        nxLocal += 1;
        ixStart = myID*nxLocal;
        if (myID == iNumOfProc-1)
            nxLocal -= (nxLocal*iNumOfProc-nx); // last proc has less elements
    }
You pass nxLocal to the chunk dimensions here:
    chunk_dims[0] = nxLocal;
As long as 32 % numofprocesses is 0 (that is, nx % iNumOfProc == 0), you don't
modify nxLocal on the last process, so every process passes the same chunk
dimensions. That explains why it works in those situations.
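
To make the arithmetic concrete, here is a small sketch that replays the quoted
logic for each rank without calling MPI or HDF5. The helper names
(chunk_dim_as_posted, chunk_dims_agree) are made up for illustration, and it
assumes nxLocal starts out as nx / iNumOfProc (integer division), which is a
guess about the part of the program not quoted here.

```c
#include <assert.h>

/* Hypothetical helper: replays the quoted logic for one rank and returns the
 * value that would end up in chunk_dims[0].
 * Assumption: nxLocal starts as nx / iNumOfProc (integer division). */
int chunk_dim_as_posted(int nx, int iNumOfProc, int myID)
{
    assert(nx > 0 && iNumOfProc > 0 && myID >= 0 && myID < iNumOfProc);

    int nxLocal = nx / iNumOfProc;            /* assumed starting value */
    if ((nx % iNumOfProc) != 0) {
        nxLocal += 1;
        if (myID == iNumOfProc - 1)
            nxLocal -= (nxLocal * iNumOfProc - nx); /* last rank shrinks */
    }
    return nxLocal;
}

/* Hypothetical helper: returns 1 if every rank would pass the same
 * chunk_dims[0] to the collective create, 0 otherwise. */
int chunk_dims_agree(int nx, int iNumOfProc)
{
    int first = chunk_dim_as_posted(nx, iNumOfProc, 0);
    for (int r = 1; r < iNumOfProc; ++r) {
        if (chunk_dim_as_posted(nx, iNumOfProc, r) != first)
            return 0;
    }
    return 1;
}
```

Under that assumption, nx = 32 on 4 processes gives 8 on every rank, while
nx = 32 on 3 processes gives 11 on ranks 0 and 1 but 10 on rank 2, so the last
rank creates the dataset with a different chunk size.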
Note that it is OK to read and write datasets collectively with different
arguments, but you have to create the dataset with the same arguments on every
process, including the same chunk dimensions. What you do above causes one
process to see a dataset with different chunk sizes in its metadata cache. At
file close time, when the processes flush their metadata caches, that process
has a different idea of the file size than the others, and this is what causes
the problem.
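
One possible way to restructure this, sketched below, is to compute the chunk
extent from nx and the number of processes only, so it cannot depend on the
rank, and to keep the smaller last-rank size as a separate per-rank count. The
helper names (uniform_chunk_dim, local_count) are hypothetical and the sketch
does not call HDF5 itself; it only shows the index arithmetic.

```c
#include <assert.h>

/* Hypothetical helper: chunk extent that every rank passes to the collective
 * dataset creation. It depends only on nx and nproc, never on the rank. */
int uniform_chunk_dim(int nx, int nproc)
{
    assert(nx > 0 && nproc > 0);
    return (nx + nproc - 1) / nproc;   /* ceil(nx / nproc) */
}

/* Hypothetical helper: number of elements along the first dimension that
 * this rank actually owns. Only this value may differ between ranks. */
int local_count(int nx, int nproc, int rank)
{
    assert(rank >= 0 && rank < nproc);

    int chunk = uniform_chunk_dim(nx, nproc);
    int start = rank * chunk;          /* plays the role of ixStart */
    if (start >= nx)
        return 0;                      /* nothing left for this rank */

    int remaining = nx - start;
    return remaining < chunk ? remaining : chunk;
}
```

With something like this, chunk_dims[0] would be set from uniform_chunk_dim()
on every process before H5Pset_chunk and H5Dcreate, and the per-rank value from
local_count() would only be used for the hyperslab selection
(H5Sselect_hyperslab) and the memory dataspace when writing.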
Makes sense?
Thanks,
Mohamad
-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Wolf Dapp
Sent: Tuesday, April 07, 2015 11:30 AM
To: [email protected]
Subject: [Hdf-forum] parallel HDF5: H5Fclose hangs when not using a power of 2 number of processes
Dear hdf-forum members,
I have a problem I am hoping someone can help me with. I have a program that
outputs a 2D array (contiguous, indexed linearly) using parallel HDF5. When I
choose a number of processes that is not a power of 2 (1, 2, 4, 8, ...),
H5Fclose() hangs inexplicably. I'm using HDF5 v1.8.14 and OpenMPI 1.7.2, on top
of GCC 4.8 on Linux.
Can someone help me pinpoint my mistake?
I have searched the forum, and the first hit [searching for "h5fclose hangs"]
was a user mistake that I didn't make (to the best of my knowledge). The second
didn't go on beyond the initial problem description, and didn't offer a
solution.
Attached is a demonstrator program (maybe insufficiently bare-bones,
apologies). Strangely, the hang only happens if nx >= 32. The code is adapted
from an HDF5 example program.
The demonstrator is compiled with
h5pcc test.hangs.cpp -DVERBOSE -lstdc++
( on my system, for some strange reason, MPI has been compiled with the
deprecated C++ bindings. I need to include -lmpi_cxx also, but that shouldn't
be necessary for anyone else. I hope that's not the reason for the hang-ups. )
Thanks in advance for your help!
Wolf Dapp