Hi,
Thanks for continuing the conversation. I believe files will be *either*
read from, *or* written to, but not both simultaneously, at least in the
scenario I'm working on right now. I'd like to be able to write to the
same file from different ranks simultaneously, but only to different
datasets. If that's not possible without propagating dataset extension
operations collectively to ranks not writing to that dataset, then I
will start looking at the virtual dataset solution you suggested in your
first reply.
Thanks again,
Chris.
On 7/25/16 7:41 PM, Nelson, Jarom wrote:
If you want to have multiple ranks write to the same file, you’ll need
to open the file in read-write mode and use parallel HDF5, with the
associated overhead and complexity of collective calls. I think the
only way to avoid that overhead is to open a separate file for each
rank.
If you are going to take a multi-file approach and read from files
that are open in write mode by another process, you’ll need some way
to get the metadata updated in the reading processes. It sounds like
you might try another 1.10.x addition: single-writer/multiple-reader
(SWMR). If each rank can open its own output file in read-write and
all the other ranks’ files in read-only, you can avoid the parallel
overhead. I haven’t tried this approach, and you’ll have to be careful
of race conditions and keep the file metadata correct in all the
ranks, but it sounds like it might fit your parallel I/O model.
https://www.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesSwmrDocs.html
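To make that concrete, here is a minimal h5py sketch of the SWMR split
(untested on my end; the file and dataset names are just placeholders):

    # Writer rank: create your own output file with the latest file format,
    # set up any extendable, chunked datasets, then enable SWMR so readers
    # can attach while you keep appending.
    import h5py
    import numpy as np

    f = h5py.File('rank_0_out.h5', 'w', libver='latest')
    dset = f.create_dataset('values', shape=(0,), maxshape=(None,),
                            dtype='f8', chunks=(1024,))
    f.swmr_mode = True              # readers may open the file from now on

    block = np.arange(100, dtype='f8')
    n = dset.shape[0]
    dset.resize((n + block.size,))  # writer alone resizes its own dataset
    dset[n:] = block
    dset.flush()                    # make the new data visible to readers

    # Reader rank: open another rank's file read-only in SWMR mode.
    fr = h5py.File('rank_1_out.h5', 'r', libver='latest', swmr=True)
    rdset = fr['values']
    rdset.refresh()                 # pick up the writer's latest extent
    data = rdset[...]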
Jarom
*From:* Hdf-forum [mailto:[email protected]] *On Behalf Of* Chris Green
*Sent:* Monday, July 25, 2016 3:41 PM
*To:* HDF Users Discussion List
*Subject:* Re: [Hdf-forum] Parallel dataset resizing strategies
Hi,
Thanks for this. Comments inline.
On 7/22/16 12:13 PM, Nelson, Jarom wrote:
If you can move to HDF5 1.10, I would recommend independent files for
each MPI rank, and then create a master file (created independently
perhaps by rank 0) with Virtual Datasets linking in the data from each
rank in the format you need. Virtual Datasets can be created with file
matching patterns for dynamically increasing datasets, so you might
look into using that feature.
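As a rough sketch of what the master file could look like (this assumes
an h5py build with virtual-dataset support; the per-rank file and
dataset names are made up):

    # Rank 0, or a post-processing step, stitches the per-rank files into
    # a master file via a Virtual Dataset; no parallel HDF5 required.
    import h5py

    n_ranks = 4
    per_rank_len = 1000   # assumed fixed length of each rank's dataset

    layout = h5py.VirtualLayout(shape=(n_ranks, per_rank_len), dtype='f8')
    for rank in range(n_ranks):
        src = h5py.VirtualSource('rank_%d_out.h5' % rank, 'values',
                                 shape=(per_rank_len,))
        layout[rank] = src

    with h5py.File('master.h5', 'w', libver='latest') as f:
        f.create_virtual_dataset('values', layout, fillvalue=-1.0)

The fillvalue is what readers see wherever a source file or dataset is
missing.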
We don't have existing tools relying on a particular version, so we
are nominally free to move to HDF5 1.10.x. However, it won't be
completely straightforward because for now I have been relying on the
Homebrew version, which is currently 1.8.16. I'd have to tweak the
recipe to use 1.10.x, but that's not a showstopper.
I found this approach much faster than creating a collective file
(~5-10x speedup on a Lustre filesystem). You don’t need to do any
collective reads or writes, and I think we could even bypass using
parallel HDF5 altogether. Note that this only works if every
parallel access to the Virtual Dataset (i.e. by more than one
process) is non-collective and read-only. If you need read-write
access to the master file, you can’t access a Virtual Dataset
using collective operations. You can, however, have as many
processes as you like read from a virtual dataset in a file
opened read-only.
If you have other tools that use your data but can’t move to HDF5
1.10, you can h5repack a file with Virtual Datasets to remove the
Virtual Datasets, and it should be compatible with HDF5 1.8 (use
h5repack from HDF5 1.10 patch 1 or later). This also worked well
for us, and I was able to load a repacked file in IDL under a 1.8
HDF5 library. However, h5repack is not a parallel application, so
repacking a very large file can be slow, on the order of minutes
per GB.
After having thought a little more about likely parallel models, I
think now we can arrange that:
* Only one rank will write to a particular dataset.
* A dataset will not be read from in the same job in which it was written.
* A dataset may be read by one or more ranks.
I *think* if that's the case, we could use a hierarchical multi-file
format without resorting to virtual datasets, no? I still have some
reading and experimenting to do, but if you have particular
information that would speak to the likely success of this approach,
I'd be happy to hear it.
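For what it's worth, the sort of thing I'm picturing is roughly the
following (a sketch only, with purely illustrative names):

    # Each rank writes its own file independently, with no parallel HDF5.
    # Afterwards, rank 0 builds a master file of external links so the
    # whole job's output is browsable as a single hierarchy.
    import h5py

    n_ranks = 4
    with h5py.File('master.h5', 'w') as master:
        for rank in range(n_ranks):
            master['rank_%d' % rank] = h5py.ExternalLink(
                'rank_%d_out.h5' % rank, '/')

    # Any number of readers can then open master.h5 read-only and follow
    # the links, e.g. h5py.File('master.h5', 'r')['rank_2/values'][...]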
Thanks,
Chris.
Jarom
*From:* Hdf-forum [mailto:[email protected]] *On Behalf Of* Chris Green
*Sent:* Friday, July 22, 2016 9:32 AM
*To:* [email protected]
*Subject:* [Hdf-forum] Parallel dataset resizing strategies
Hi,
I am relatively new to HDF5 and HDF5/parallel, and although I have
experience with MPI it is not extensive. We are exploring ways of
saving data in parallel using HDF5 in a field in which it is
practically unknown up to now.
Our paradigm is "parallel modular event processing:"
* A typical job processes many "events."
* An event contains all of the interesting data (raw and
processed) associated with some time interval.
* Each event can be processed independently of all other events.
* Each event's data can be subdivided into internal components,
"data products."
* "Modules" are processing subunits which read or generate one
or more data products for each event.
* One can calculate a data dependency graph specifying the
allowed ordering and/or parallelism of modules processing one
or more events simultaneously for a given job configuration
and event structure.
We have been using h5py with HDF5 and OpenMPI to explore different
strategies for parallel I/O in a future parallel event-processing
framework. One of the approaches we have come up with so far is to
have one HDF5 dataset per unique data product / writer module
combination, keeping track of the different relevant sections of
each dataset via (for now) an external database. This works well
in serial tests, but in parallel tests we are running up against
the constraint that dataset resizing is a collective operation,
meaning that all ranks including non-writers will have to become
aware of and duplicate dataset resizing operations required by
other writers. The problem seems to get even worse if there's a
possibility that two or more instances of a module would need to
extend and write to the same dataset at the same time (while
processing different events, say), since they will have to
coordinate and agree on the new size of the dataset and their
respective sections thereof.
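To illustrate the constraint (a simplified sketch with made-up names,
not our actual code):

    # All ranks open the shared file collectively with the MPI driver.
    from mpi4py import MPI
    import h5py

    comm = MPI.COMM_WORLD
    f = h5py.File('events.h5', 'w', driver='mpio', comm=comm)
    dset = f.create_dataset('productA', shape=(0,), maxshape=(None,),
                            dtype='f8', chunks=(1024,))

    # Only rank 0 has data for this dataset, but the resize is collective
    # (H5Dset_extent), so every rank must call it with the same new size;
    # hence the writer's count has to be communicated to everyone.
    my_count = 100 if comm.rank == 0 else 0
    total = comm.allreduce(my_count, op=MPI.SUM)
    dset.resize((dset.shape[0] + total,))   # all ranks, same arguments

    if comm.rank == 0:                      # the write itself can be
        dset[-my_count:] = 42.0             # independent (non-collective)
    f.close()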
Are we misunderstanding the problem, or is it really this hard?
Has anyone else hit upon a reasonable strategy for handling this
or something like it?
Any pointers appreciated.
Thanks,
Chris Green.
--
Chris Green<[email protected]> <mailto:[email protected]>, FNAL
CS/SCD/ADSS/SSI/TAC;
'phone (630) 840-2167; Skype: chris.h.green;
IM:[email protected] <mailto:[email protected]>, chissgreen (AIM),
chris.h.green (Google Talk).
--
Chris Green<[email protected]> <mailto:[email protected]>, FNAL CS/SCD/ADSS/SSI/TAC;
'phone (630) 840-2167; Skype: chris.h.green;
IM:[email protected] <mailto:[email protected]>, chissgreen (AIM),
chris.h.green (Google Talk).
--
Chris Green <[email protected]>, FNAL CS/SCD/ADSS/SSI/TAC;
'phone (630) 840-2167; Skype: chris.h.green;
IM: [email protected], chissgreen (AIM),
chris.h.green (Google Talk).
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5