Hi Wolf,


I found the problem in your program. Note that the hang versus the error stack 
(from Tim's email) is just different behavior of different MPI implementations 
or versions: when HDF5 calls MPI_File_set_size() with different arguments on 
different processes, one implementation hangs and the other actually reports 
the error.



On to the mistake in your program now. HDF5 requires the call to H5Dcreate to be 
collective. That means not only that all processes have to call it, but also 
that they all have to call it with the same arguments. You are creating a 
chunked dataset with the same chunk dimensions on every process except the last 
one, where you modify the first dimension (nxLocal). This happens here:

if ((nx%iNumOfProc) != 0) {
    nxLocal += 1;
    ixStart = myID*nxLocal;
    if (myID == iNumOfProc-1)
        nxLocal -= (nxLocal*iNumOfProc-nx); // last proc has less elements
}



You pass nxLocal to the chunk dimensions here:

chunk_dims[0] = nxLocal;



As long as nx % iNumOfProc is 0 (with nx = 32, that is the case for a power-of-2 
number of processes), you don't modify nxLocal on the last process, which 
explains why it works in those situations.
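
For example, assuming nxLocal starts out as nx/iNumOfProc (I'm going by the 
attached program here): with nx = 32 and 5 processes, nxLocal = 6, the branch 
above bumps it to 7, and the last process then reduces it to 7 - (7*5 - 32) = 4. 
So processes 0 through 3 pass chunk_dims[0] = 7 to H5Dcreate while process 4 
passes 4.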



Note that it is OK to read from and write to a dataset collectively with 
different arguments, but you have to create the dataset with the same arguments, 
including the same chunk dimensions. What you do above causes one process to see 
a dataset with different chunk sizes in its metadata cache, so at file close 
time, when the processes flush their metadata caches, that process has a 
different size for the file than the other processes, and this is what causes 
the problem.
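
To illustrate the pattern, here is a rough, self-contained sketch (not your 
program; write_block, nxChunk and the other names are just illustrative): every 
process creates the file and dataset with identical dataspace and chunk 
dimensions, and only the hyperslab it selects for H5Dwrite differs.

#include <hdf5.h>
#include <mpi.h>

/* All arguments except ixStart, nxLocal and buf must be identical on every rank. */
void write_block(MPI_Comm comm, const char *fname,
                 hsize_t nx, hsize_t ny, hsize_t nxChunk,  /* identical on all ranks */
                 hsize_t ixStart, hsize_t nxLocal,         /* may differ per rank    */
                 const double *buf)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    /* Collective creation: same dataspace and same chunk dims everywhere. */
    hsize_t dims[2]  = { nx, ny };
    hsize_t chunk[2] = { nxChunk, ny };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);

    /* Per-rank hyperslab: here nxLocal and ixStart may legitimately differ. */
    hsize_t start[2] = { ixStart, 0 };
    hsize_t count[2] = { nxLocal, ny };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(2, count, NULL);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);   /* closes cleanly once the creation arguments match */
}

In your case, computing the chunk size from nx and iNumOfProc alone (or using a 
fixed value) keeps the creation collective; nxLocal then only appears in the 
hyperslab selection.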



Does that make sense?



Thanks,

Mohamad





-----Original Message-----
From: Hdf-forum [mailto:[email protected]] On Behalf Of Wolf 
Dapp
Sent: Tuesday, April 07, 2015 11:30 AM
To: [email protected]
Subject: [Hdf-forum] parallel HDF5: H5Fclose hangs when not using a power of 2 
number of processes



Dear hdf-forum members,



I have a problem I am hoping someone can help me with. I have a program that 
outputs a 2D array (contiguous, indexed linearly) using parallel HDF5. When I 
choose a number of processes that is not a power of 2 (1, 2, 4, 8, ...), 
H5Fclose() hangs, inexplicably. I'm using HDF5 v1.8.14 and OpenMPI 1.7.2, on top 
of GCC 4.8 on Linux.



Can someone help me pinpoint my mistake?



I have searched the forum; the first hit [searching for "h5fclose hangs"] was a 
user mistake that I didn't make (to the best of my knowledge), and the second 
didn't go beyond the initial problem description and didn't offer a solution.



Attached is a (maybe insufficiently bare-boned, apologies) demonstrator 
program. Strangely, the hang only happens if nx >= 32. The code is adapted from 
an HDF5 example program.



The demonstrator is compiled with

h5pcc test.hangs.cpp -DVERBOSE -lstdc++



(On my system, for some strange reason, MPI was compiled with the deprecated C++ 
bindings, so I also need to link -lmpi_cxx; that shouldn't be necessary for 
anyone else. I hope that's not the reason for the hangs.)



Thanks in advance for your help!



Wolf Dapp





--




_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
