Based on this comment in the HDF5 docs, I would think it would be acceptable for a hyperslab selection to go beyond the extent of what has been allocated (written to) in the dataset, at least if chunking is used: "Fill-values are only used for chunked storage datasets when an unallocated chunk is read from."
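For reference, this is roughly how the fill value gets set at dataset creation time in my test, as a simplified sketch rather than the exact code in perfectNumbers.c; the long type and 0 value are placeholders for whatever big_int_h5 really is, and file_id, DATASETNAME, and filespace are the same names as in the excerpt quoted further down:

    /* Simplified sketch: fill value on a chunked dataset at creation time
     * (not verbatim from perfectNumbers.c; fill type and value are placeholders). */
    long    fill_value  = 0;        /* returned for unallocated chunks on read */
    hsize_t chunkdims[] = {1};

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunkdims);                       /* chunked layout */
    H5Pset_fill_value(dcpl, H5T_NATIVE_LONG, &fill_value);  /* fill value     */

    hid_t dset_id = H5Dcreate(file_id, DATASETNAME, big_int_h5, filespace,
                              H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);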
I specified a fill value now, but this didn't seem to make a difference; do hyperslabs have some additional conditions that prevent fill values from working, or am I doing something else wrong?

I've tested stride, count, etc. with H5Dwrite - this seems to work fine. I use the same values for H5Dread. H5Dread also works if mpi_size doesn't change between runs. But it would be nice if I could find out how to make this more flexible, so that mpi_size wouldn't have to be fixed between runs.
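For context, here is roughly the shape of what the restore side attempts, as a simplified and partly hypothetical sketch rather than my exact code (the rounding step below stands in for get_restore_chunk_counter from the example, and error checking is omitted):

    /* Sketch of the restore-side sizing (simplified, not verbatim from
     * perfectNumbers.c). Assumes dset_id is an open 1-D chunked dataset and
     * that MPI_CHUNK_SIZE, mpi_rank, and mpi_size are already set. */
    hsize_t dims_on_disk[1];
    hid_t   filespace = H5Dget_space(dset_id);
    H5Sget_simple_extent_dims(filespace, dims_on_disk, NULL);
    H5Sclose(filespace);

    /* Round the number of blocks up so they divide evenly among mpi_size ranks. */
    hsize_t blocks_on_disk = dims_on_disk[0] / MPI_CHUNK_SIZE;
    hsize_t chunk_counter  = (blocks_on_disk + mpi_size - 1) / mpi_size;

    hsize_t dimsm[1] = {chunk_counter * MPI_CHUNK_SIZE};
    hsize_t dimsf[1] = {dimsm[0] * mpi_size};

    /* Grow the dataset so every rank's strided selection fits; the padding
     * should come back as fill values when read (chunked storage only). */
    H5Dset_extent(dset_id, dimsf);

    /* Re-fetch the file dataspace after changing the extent, so the hyperslab
     * selection is checked against the new size rather than the old one. */
    filespace = H5Dget_space(dset_id);

    hsize_t start[]  = {mpi_rank * MPI_CHUNK_SIZE};
    hsize_t stride[] = {MPI_CHUNK_SIZE * mpi_size};
    hsize_t count[]  = {chunk_counter};
    hsize_t block[]  = {MPI_CHUNK_SIZE};
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count, block);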
Thanks,

On Fri, May 29, 2015 at 11:17 AM, Brandon Barker <[email protected]> wrote:
> In the above, I assumed I can't change the arguments to
> H5Sselect_hyperslab (at least not easily), so I tried to fix the issue
> by changing the extent size using a call to H5Dset_extent, with the
> further assumption that a fill value would be used if I try to read
> beyond the end of the data stored in the dataset ... is this wrong?
>
> On Thu, May 28, 2015 at 4:18 PM, Brandon Barker
> <[email protected]> wrote:
>> Thanks Elena,
>>
>> Apologies below for using "chunk" in a different way (e.g. chunk_counter,
>> MPI_CHUNK_SIZE) than it is used in HDF5; perhaps I should call them "slabs".
>>
>> Code from the checkpoint procedure (seems to work):
>>
>>   // dataset and memoryset dimensions (just 1-D here)
>>   hsize_t dimsm[] = {chunk_counter * MPI_CHUNK_SIZE};
>>   hsize_t dimsf[] = {dimsm[0] * mpi_size};
>>   hsize_t maxdims[] = {H5S_UNLIMITED};
>>   hsize_t chunkdims[] = {1};
>>   // hyperslab offset and size info
>>   hsize_t start[] = {mpi_rank * MPI_CHUNK_SIZE};
>>   hsize_t count[] = {chunk_counter};
>>   hsize_t block[] = {MPI_CHUNK_SIZE};
>>   hsize_t stride[] = {MPI_CHUNK_SIZE * mpi_size};
>>
>>   dset_plist_create_id = H5Pcreate (H5P_DATASET_CREATE);
>>   status = H5Pset_chunk (dset_plist_create_id, RANK, chunkdims);
>>   dset_id = H5Dcreate (file_id, DATASETNAME, big_int_h5, filespace,
>>                        H5P_DEFAULT, dset_plist_create_id, H5P_DEFAULT);
>>   assert(dset_id != HDF_FAIL);
>>
>>   H5Sselect_hyperslab(filespace, H5S_SELECT_SET,
>>                       start, stride, count, block);
>>
>> Code from the restore procedure (this is where the problem is):
>>
>>   // dataset and memoryset dimensions (just 1-D here)
>>   hsize_t dimsm[1];
>>   hsize_t dimsf[1];
>>   // hyperslab offset and size info
>>   hsize_t start[] = {mpi_rank * MPI_CHUNK_SIZE};
>>   hsize_t count[1];
>>   hsize_t block[] = {MPI_CHUNK_SIZE};
>>   hsize_t stride[] = {MPI_CHUNK_SIZE * mpi_size};
>>
>>   //
>>   // Update dimensions and dataspaces as appropriate
>>   //
>>   // Number of chunks previously used plus enough new chunks to be
>>   // divisible by mpi_size.
>>   chunk_counter = get_restore_chunk_counter(dimsf[0]);
>>   count[0] = chunk_counter;
>>   dimsm[0] = chunk_counter * MPI_CHUNK_SIZE;
>>   dimsf[0] = dimsm[0] * mpi_size;
>>   status = H5Dset_extent(dset_id, dimsf);
>>   assert(status != HDF_FAIL);
>>
>>   //
>>   // Create the memspace for the dataset and allocate data for it
>>   //
>>   memspace = H5Screate_simple(RANK, dimsm, NULL);
>>   perf_diffs = alloc_and_init(perf_diffs, dimsm[0]);
>>
>>   H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, stride, count,
>>                       block);
>>
>> Complete example code:
>> https://github.com/cornell-comp-internal/CR-demos/blob/bc507264fe4040d817a2e9603dace0dc06585015/demos/pHDF5/perfectNumbers.c
>>
>> Best,
>>
>> On Thu, May 28, 2015 at 3:43 PM, Elena Pourmal <[email protected]> wrote:
>>> Hi Brandon,
>>>
>>> The error message indicates that a hyperslab selection goes beyond dataset
>>> extent.
>>>
>>> Please make sure that you are using the correct values for the start,
>>> stride, count and block parameters in the H5Sselect_hyperslab call (if you
>>> use it!). It will help if you provide an excerpt from your code that
>>> selects hyperslabs for each process.
>>>
>>> Elena
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Elena Pourmal  The HDF Group  http://hdfgroup.org
>>> 1800 So. Oak St., Suite 203, Champaign IL 61820
>>> 217.531.6112
>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>
>>> On May 28, 2015, at 1:46 PM, Brandon Barker <[email protected]> wrote:
>>>
>>> I believe I've gotten a bit closer by using chunked datasets, but I'm now
>>> not sure how to get past this:
>>>
>>>   [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 ./perfectNumbers
>>>   m, f, count,: 840, 1680, 84
>>>   m, f, count,: 840, 1680, 84
>>>   HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 1:
>>>     #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not within extent
>>>       major: Dataspace
>>>       minor: Out of range
>>>   perfectNumbers: perfectNumbers.c:399: restore: Assertion `status != -1' failed.
>>>   --------------------------------------------------------------------------
>>>   mpirun noticed that process rank 1 with PID 28420 on node
>>>   euca-128-84-11-180 exited on signal 11 (Segmentation fault).
>>>   --------------------------------------------------------------------------
>>>
>>> (m, f, count) represent the memory space and dataspace lengths and the count
>>> of strided segments to be read in; prior to using set extents as follows, I
>>> would get the error when f was not a multiple of m:
>>>
>>>   dimsf[0] = dimsm[0] * mpi_size;
>>>   H5Dset_extent(dset_id, dimsf);
>>>
>>> Now that I am using these, I note that it doesn't seem to have helped the
>>> issue, so there must be something else I still need to do.
>>>
>>> Incidentally, I was looking at this example and am not sure what the point
>>> of the following code is, since rank_chunk is never used:
>>>
>>>   if (H5D_CHUNKED == H5Pget_layout (prop))
>>>      rank_chunk = H5Pget_chunk (prop, rank, chunk_dimsr);
>>>
>>> I guess it is just to demonstrate the function call of H5Pget_chunk?
>>>
>>> On Thu, May 28, 2015 at 10:27 AM, Brandon Barker
>>> <[email protected]> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> I have fixed (and pushed the fix for) one bug that related to an
>>>> improperly defined count in the restore function.
>>>> I still have issues for m != n:
>>>>
>>>>   #000: ../../src/H5Dio.c line 158 in H5Dread(): selection+offset not within extent
>>>>     major: Dataspace
>>>>     minor: Out of range
>>>>
>>>> I believe this is indicative of me needing to use chunked datasets so
>>>> that my dataset can grow in size dynamically.
>>>>
>>>> On Wed, May 27, 2015 at 5:03 PM, Brandon Barker
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I've been learning pHDF5 by way of developing a toy application that
>>>>> checkpoints and restores its state. The restore function was the last to be
>>>>> implemented, but I realized after doing so that I have an issue: since each
>>>>> process has strided blocks of data that it is responsible for, the number of
>>>>> blocks of data saved during one run may not be evenly distributed among
>>>>> processes in another run, as the mpi_size of the latter run may not evenly
>>>>> divide the total number of blocks.
>>>>>
>>>>> I was hoping that a fill value might save me here, and just read in 0s
>>>>> if I try reading beyond the end of the dataset. Although, I believe I did
>>>>> see a page noting that this isn't possible for contiguous datasets.
>>>>>
>>>>> The good news is that since I'm working with 1-dimensional data, it is
>>>>> fairly easy to refactor the relevant code.
>>>>>
>>>>> The error I get emits this message:
>>>>>
>>>>>   [brandon@euca-128-84-11-180 pHDF5]$ mpirun -n 2 perfectNumbers
>>>>>   HDF5-DIAG: Error detected in HDF5 (1.8.12) MPI-process 0:
>>>>>     #000: ../../src/H5Dio.c line 179 in H5Dread(): can't read data
>>>>>       major: Dataset
>>>>>       minor: Read failed
>>>>>     #001: ../../src/H5Dio.c line 446 in H5D__read(): src and dest data spaces have different sizes
>>>>>       major: Invalid arguments to routine
>>>>>       minor: Bad value
>>>>>   perfectNumbers: perfectNumbers.c:382: restore: Assertion `status != -1' failed.
>>>>>   --------------------------------------------------------------------------
>>>>>   mpirun noticed that process rank 0 with PID 3717 on node
>>>>>   euca-128-84-11-180 exited on signal 11 (Segmentation fault).
>>>>>
>>>>> Here is the offending line in the restore function; you can observe the
>>>>> checkpoint function to see how things are written out to disk.
>>>>>
>>>>> General pointers are appreciated as well - to paraphrase the problem
>>>>> more simply: I have a distributed (strided) array that I write out to disk as
>>>>> a dataset among n processes, and when I restart the program, I may want to
>>>>> divvy up the data among m processes in similar data structures as before,
>>>>> but now m != n. Actually, my problem may be different than just this, since
>>>>> I seem to get the same issue even when m == n ... hmm.
>>>>>
>>>>> Thanks,
>>>>> --
>>>>> Brandon E. Barker
>>>>> http://www.cac.cornell.edu/barker/

--
Brandon E. Barker
http://www.cac.cornell.edu/barker/
