Hi Mohamad,

Thanks for your reply, and for pointing that out. I appreciate that each
processor has a different chunk size, but I wasn't aware that this was a
problem.

What would you suggest as a workaround or solution? The number of
elements per processor /is/ objectively different, and if I simply give
the last process the /same/ chunk_size (without adjusting the file
space), the program crashes with:

#001: H5Dio.c line 342 in H5D__pre_write(): file selection+offset not
within extent
    major: Dataspace
    minor: Out of range
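
(To put numbers on it with an assumed example of 3 processes and nx = 32:
nxLocal gets rounded up to 11, so the last rank's selection would start at
offset 2*11 = 22 and, with 11 rows, run out to 33, past the 32-row extent.)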

If I set both dimsf[0] and chunk_dims[0] such that the (padded) data
fits and each process writes the same chunk (i.e., if I pad the
filespace), then it works. But the rest of my workflow then breaks down,
because the file is no longer 32x32 but 33x32, with the last column just
zeroes, and in addition the file differs depending on how many processes
write to it. I suppose I could somehow resize the filespace afterwards
to get it back to the proper dimensions? I'd probably hit the same
problem reading the data back in, though.
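
For the record, this is roughly what I mean by resizing afterwards (a
sketch only, assuming the dataset handle is called dset and the dataset
is chunked so that shrinking is allowed; I gather it would have to be
called collectively):

#include "hdf5.h"

/* Sketch only: shrink a padded, chunked dataset back to its true size.
 * In parallel this would have to be called by every rank. */
static herr_t shrink_to_true_size(hid_t dset)
{
    hsize_t true_dims[2] = {32, 32};   /* the unpadded dimensions */
    return H5Dset_extent(dset, true_dims);
}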

Or should I write to the file independently? I suppose I'd pay a hefty
performance price... (in a production run, ~1000 processes write ~20 GB
collectively, repeatedly).
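
If independent I/O is the way to go, I imagine it would look roughly
like this (a sketch only; dset, memspace, filespace and buf are assumed
to be set up as in the collective case):

#include "hdf5.h"

/* Sketch: write this rank's hyperslab with independent (non-collective)
 * raw-data I/O instead of collective I/O. */
static herr_t write_independent(hid_t dset, hid_t memspace, hid_t filespace,
                                const double *buf)
{
    hid_t  xfer = H5Pcreate(H5P_DATASET_XFER);
    herr_t status;

    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_INDEPENDENT);
    status = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                      xfer, buf);
    H5Pclose(xfer);
    return status;
}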

Is there a recommended way to handle this? The only worked example I
can find for collectively writing different numbers of elements writes
/nothing at all/ on one of the processes (as opposed to writing a
smaller number of elements than the others):
http://www.hdfgroup.org/ftp/HDF5/examples/misc-examples/coll_test.c
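
In case it makes my question clearer, this is the pattern I think I am
supposed to end up with (a sketch only, untested; write_rows, my_rows,
my_offset and the 4-row chunk size are my own names and choices): every
rank creates the dataset with identical dimensions and chunk dimensions,
and only the hyperslab selection passed to H5Dwrite differs per rank.

#include "hdf5.h"

/* Sketch (untested): every rank calls H5Dcreate with identical dimsf and
 * chunk_dims; only the hyperslab selection differs per rank. */
void write_rows(hid_t file, const double *buf,
                hsize_t nx, hsize_t ny,             /* global size, e.g. 32 x 32 */
                hsize_t my_rows, hsize_t my_offset) /* this rank's share         */
{
    hsize_t dimsf[2]      = {nx, ny};
    hsize_t chunk_dims[2] = {4, ny};   /* arbitrary, but the same on every rank */

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk_dims);

    hid_t filespace = H5Screate_simple(2, dimsf, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* Per-rank hyperslab: my_rows may differ between ranks.
     * A rank with no rows at all would presumably call H5Sselect_none()
     * here instead. */
    hsize_t start[2] = {my_offset, 0};
    hsize_t count[2] = {my_rows, ny};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hsize_t memdims[2] = {my_rows, ny};
    hid_t memspace = H5Screate_simple(2, memdims, NULL);

    hid_t xfer = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(xfer, H5FD_MPIO_COLLECTIVE);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, xfer, buf);

    H5Pclose(xfer);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(dcpl);
}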

Thanks again for your help!
Wolf


On 04/08/15 16:23, Mohamad Chaarawi wrote:
> Hi Wolf,
> 
> I found the problem in your program. Note that the hang vs. the error
> stack (from Tim's email) is just a difference in behavior between MPI
> implementations or versions: one implementation hangs when HDF5 calls
> MPI_File_set_size() with different arguments on different processes,
> while the other actually reports the error.
> 
> On to the mistake in your program now. HDF5 requires the call to
> H5Dcreate to be collective. That doesn't just mean that all processes
> have to call it; they all have to call it with the same arguments. You
> are creating a chunked dataset with the same chunk dimensions on every
> process except the last one, where you adjust the first dimension
> (nxLocal). This happens here:
> 
> if ((nx%iNumOfProc) != 0) {
>    nxLocal += 1;
>    ixStart = myID*nxLocal;
>    if (myID == iNumOfProc-1)
>      nxLocal -= (nxLocal*iNumOfProc-nx); // last proc has less elements
> }
> 
> You pass nxLocal to the chunk dimensions here:
> chunk_dims[0] = nxLocal;
> 
> As long as 32 % numofprocesses is 0, you don't modify nxLocal on the
> last process, which explains why it works in those situations.
> 
> Note that it is OK to read from and write to datasets collectively with
> different arguments, but you have to create the dataset with the same
> arguments, including the same chunk dimensions. What you do above causes
> one process to see a dataset with different chunk dimensions in its
> metadata cache, so at file close time, when the processes flush their
> metadata caches, that process computes a different file size than the
> others, and this is what causes the problem.
> 
> Makes sense?
> 
> Thanks,
> 
> Mohamad

