From: Hdf-forum <[email protected]> on behalf of "Rowe, Jim" <[email protected]>
Reply-To: HDF Users Discussion List <[email protected]>
Date: Monday, February 22, 2016 11:45 AM
To: HDF Users Discussion List <[email protected]>
Subject: [Hdf-forum] multi-/split- file examples or advice for controlling file layout

Hello- we are using some block-level deduping infrastructure that allows us to 
synchronize files around our enterprise.  To make this most effective, we need 
the beginning of files to be as stable as possible.

We have HDF5 files that range from 0.5 to 20 GB, and we generally alter only 5% of 
the data in specific datasets after the initial creation.  We would like to 
structure these files so that we can take the most advantage of the aforementioned 
deduping.   Questions:


1)      It appears that H5Pset_fapl_split() is the direction to look in to 
separate data from metadata.  Is this fully supported?

Yes. Note that there is a similar driver called 'multi' that will be 
discontinued. That is NOT relevant to your use of the split driver, however. 
The HDF Group will continue to support the split driver.
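
For reference, selecting the split driver is just a file access property list 
(FAPL) setting. A minimal sketch (the base name "data" and the "-m.h5"/"-r.h5" 
extensions below are only example choices, not required values):

    #include "hdf5.h"

    int main(void)
    {
        /* Enable the split driver on a file access property list.
         * With these extensions, creating base name "data" produces two
         * files on disk: "data-m.h5" (metadata) and "data-r.h5" (raw data). */
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);

        hid_t file = H5Fcreate("data", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

        /* ... create groups and datasets exactly as with a single file ... */

        H5Fclose(file);
        H5Pclose(fapl);
        return 0;
    }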

Any performance issues with these drivers over the single-file type?

Not in the cases I have tested. In fact, it *can* lead to improved performance in 
many cases. That said, there is a logistical issue to keep in mind. Every 
'file' is really two files on disk, the meta file and the raw file. So, all the 
software (and users) in your workflows need to be 'hip' to this. Which file do 
users click on? Which file do they pass in an open call? You have to make sure 
that your workflows call H5Fopen on the correct filesystem object. If a user 
wants to give some file(s) to another user (say, by tar'ing them up), does that 
user know to grab *both* the raw and meta files? Worse, at present the HDF5 
library is not smart enough to figure out on its own that H5Pset_fapl_split is 
needed to open such a file, so your software needs to have the smarts to make 
that happen.
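
To make that concrete: reopening such a file means recreating an equivalent FAPL 
in your own code; handing H5Fopen either on-disk file name directly will not do 
the right thing. A sketch, using the same placeholder names as above:

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);

    /* Open with the base name, not with "data-m.h5" or "data-r.h5". */
    hid_t file = H5Fopen("data", H5F_ACC_RDWR, fapl);
    /* ... */
    H5Fclose(file);
    H5Pclose(fapl);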


2)      Is there a way to specify where a particular dataset is stored?  E.g., 
in my ideal scenario, I would have 3 files: 1) for my metadata which is 
potentially most volatile as blocks change; 2) for data sets where I am 
altering data which would be somewhat volatile; 3) the last file for my most 
static data.

Hmmm. I am confused. You say 'dataset' here, but we're talking about the split 
file setting, and all datasets go into the 'raw' file. Well, that isn't entirely 
true: datasets with 'compact' storage will probably go into the meta file. 
However, compact datasets are limited to 64 KB in size and are probably not 
relevant to your case. It sounds like you might really be looking for a swizzle 
on the 'family' file driver, where you could steer particular datasets into 
particular member files.

But I think there are two ways you could go here while still using the split 
driver for raw/meta files. First, you could define a 3rd HDF5 file that you 
'mount' into the raw/meta file after you open it (see H5Fmount()). Maybe you 
use the mounted file for your class-3 stuff, and the raw/meta split files for 
your class-2/1 stuff, respectively.
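
A rough sketch of the mount approach (the names "data" and "static.h5", the 
"/static" mount-point group, and the split extensions are all placeholders):

    /* Open the split (class 1/2) file and a separate, ordinary HDF5 file
     * that holds the static (class 3) data.                              */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_split(fapl, "-m.h5", H5P_DEFAULT, "-r.h5", H5P_DEFAULT);

    hid_t parent = H5Fopen("data", H5F_ACC_RDWR, fapl);
    hid_t child  = H5Fopen("static.h5", H5F_ACC_RDONLY, H5P_DEFAULT);

    /* The mount point must be an existing group in the parent file. */
    hid_t gid = H5Gcreate2(parent, "/static", H5P_DEFAULT, H5P_DEFAULT,
                           H5P_DEFAULT);
    H5Gclose(gid);

    H5Fmount(parent, "/static", child, H5P_DEFAULT);

    /* Objects in static.h5 are now reachable as /static/<name> via 'parent'. */

    H5Funmount(parent, "/static");
    H5Fclose(child);
    H5Fclose(parent);
    H5Pclose(fapl);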

Another option is to store your non-volatile stuff as 'external' datasets in 
external (non-HDF5) files. See H5Pset_external() for that.
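
For completeness, a sketch of the external-dataset route (the dataset name, 
external file name, and size are made up, and 'file' is assumed to be an 
already-open file handle):

    /* A fixed-size 1-D double dataset whose raw bytes live in a flat
     * binary file outside HDF5.                                        */
    hsize_t dims[1] = {1000000};
    hid_t   space   = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_external(dcpl, "static_block.bin", 0, dims[0] * sizeof(double));

    hid_t dset = H5Dcreate2(file, "static_data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);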

I'd go with the mount option because the 3rd file would still be a valid HDF5 
file and you can put any number of datasets in any organization you desire into 
that 3rd file. The external dataset option is very limited in functionality.

Hope that helps.


Any other advice or practical experience in this regard?

Best regards,
--Jim