Hello, I am chiming in because of the chunk size. Every chunk carries metadata, so a chunk should contain a non-negligible amount of data to avoid inefficiency and large file sizes. The guideline in the HDF5 documentation is a chunk size on the order of 1 MB.
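For illustration, here is a minimal Fortran sketch of an extendible dataset created with a roughly 1 MB chunk (131072 values of 8 bytes). The file and dataset names and the chunk length are only examples, not a recommendation for your particular data:

! Hedged sketch: an extendible, double-precision 1-D dataset whose chunk
! holds roughly 1 MB of data (131072 values * 8 bytes). All names and the
! chunk length are illustrative.
program chunk_demo
  use hdf5
  implicit none
  integer(hid_t)   :: file_id, space_id, dcpl_id, dset_id
  integer(hsize_t) :: dims(1), maxd(1), chunk(1)
  integer          :: ierr

  call h5open_f(ierr)
  dims  = (/ 0_hsize_t /)            ! start empty
  maxd  = (/ H5S_UNLIMITED_F /)      ! extendible dataset
  chunk = (/ 131072_hsize_t /)       ! ~1 MB per chunk, per the guideline above

  call h5fcreate_f("chunk_demo.h5", H5F_ACC_TRUNC_F, file_id, ierr)
  call h5screate_simple_f(1, dims, space_id, ierr, maxd)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  call h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
  call h5dcreate_f(file_id, "signal", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, ierr, dcpl_id)

  call h5dclose_f(dset_id, ierr)
  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program chunk_demo

For datasets that stay very small, a chunk this large makes no sense; the point is only that each chunk should hold enough data to amortize its metadata.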
Regards,
Pierre

On Tue, May 23, 2017 at 07:12:47PM +0200, Guillaume Jacquenot wrote:
> Hello Quincey
>
> I am using version 1.8.16.
> I am using chunks of size 1.
> I have tried a contiguous dataset, but I get an error at runtime.
>
> I have written a test program that creates 3000 datasets filled with 64-bit
> floating-point numbers.
> I can specify a number n, which controls how many times I save my data
> (the number of timesteps of a simulation in my case).
>
> To sum up, the test program does:
>
>     call hdf5_init(filename)
>     do i = 1, n
>         call hdf5_write(datatosave)
>     end do
>     call hdf5_close()
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
> 370 bytes per empty dataset (totally reasonable).
> With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why
> such an increase?
> With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB,
> which is logical: 3000*8*1/1e6 = 0.024 MB.
>
> When setting the chunk size to 10, I obtain the following results:
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
> 370 bytes per empty dataset.
> With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
> With n = 2, I get an HDF5 file of size 7.15 MB, which leads to an increase
> of 3000*8*10/1e6 MB, which is logical.
>
> I don't understand the first increase in size. It does not make this data
> storage very efficient.
> Do you think a compound dataset with 3000 columns would show the same
> behaviour? I have not tried, since I don't know how to map the content of
> an array when calling the h5dwrite_f function for a compound dataset.
>
> If I ask for 30000 datasets, I observe the same behaviour:
> n=0 -> 10.9 MB
> n=1 -> 73.2 MB
>
> Thanks
>
> Here is the error I get with a contiguous dataset:
>
>   #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert link
>     major: Symbol table
>     minor: Unable to insert object
>   #004: hdf5-1.8.16/src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create object
>     major: Object header
>     minor: Unable to initialize object
>   #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open object
>     major: Object header
>     minor: Can't open object
>   #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct(): extendible contiguous non-external dataset
>     major: Dataset
>     minor: Feature is unsupported
> HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C
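The last entry of the trace above ("extendible contiguous non-external dataset", "Feature is unsupported") is HDF5 refusing a contiguous layout for a dataset with an unlimited maximum dimension: only chunked datasets can be extended. As a point of reference, here is a hedged Fortran sketch of the extend-then-write pattern the test program seems to use, shown for a single dataset and with an illustrative chunk of 1024 elements rather than 1:

! Hedged sketch: append one real(8) value per timestep to an extendible,
! chunked 1-D dataset. Only one dataset is shown; the original test program
! repeats this over 3000 datasets. Names and the chunk size are illustrative.
program append_demo
  use hdf5
  implicit none
  integer(hid_t)     :: file_id, space_id, dcpl_id, dset_id, fspace, mspace
  integer(hsize_t)   :: dims(1), maxd(1), chunk(1), offset(1), count(1)
  real(kind=8)       :: value(1)
  integer            :: ierr, step
  integer, parameter :: nsteps = 3

  call h5open_f(ierr)
  call h5fcreate_f("append_demo.h5", H5F_ACC_TRUNC_F, file_id, ierr)

  dims  = (/ 0_hsize_t /)            ! start empty
  maxd  = (/ H5S_UNLIMITED_F /)      ! extendible, so chunked layout is mandatory
  chunk = (/ 1024_hsize_t /)         ! much larger than 1 to limit per-chunk overhead
  call h5screate_simple_f(1, dims, space_id, ierr, maxd)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  call h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
  call h5dcreate_f(file_id, "dataset_0001", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, ierr, dcpl_id)

  count = (/ 1_hsize_t /)
  call h5screate_simple_f(1, count, mspace, ierr)
  do step = 1, nsteps
     value(1) = real(step, kind=8)
     dims(1)  = int(step, hsize_t)
     call h5dset_extent_f(dset_id, dims, ierr)          ! grow by one element
     call h5dget_space_f(dset_id, fspace, ierr)
     offset = (/ int(step - 1, hsize_t) /)
     call h5sselect_hyperslab_f(fspace, H5S_SELECT_SET_F, offset, count, ierr)
     call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, count, ierr, &
                     mem_space_id=mspace, file_space_id=fspace)
     call h5sclose_f(fspace, ierr)
  end do

  call h5sclose_f(mspace, ierr)
  call h5dclose_f(dset_id, ierr)
  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program append_demo

With a chunk length of 1, every value gets its own chunk plus its own index entry, which would be consistent with the jump you observe at n = 1.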
>
> 2017-05-23 19:00 GMT+02:00 <[email protected]>:
>
> > Today's Topics:
> >
> >    1. Re: Questions about size of generated Hdf5 files (Quincey Koziol)
> >    2. Re: Parallel file access recommendation (Aaron Friesz)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 23 May 2017 08:22:59 -0700
> > From: Quincey Koziol <[email protected]>
> > To: HDF Users Discussion List <[email protected]>
> > Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
> >
> > Hi Guillaume,
> >         Are you using chunked or contiguous datasets? If chunked, what
> > size are you using? Also, can you use the "latest" version of the format,
> > which should be smaller, but is only compatible with HDF5 1.10.x or later?
> > (i.e. H5Pset_libver_bounds with "latest" for the low and high bounds,
> > https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm)
> >
> >         Quincey
> >
> > > On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <[email protected]> wrote:
> > >
> > > Hello everyone!
> > >
> > > I am creating an HDF5 file from a Fortran program, and I am confused
> > > about the size of my generated HDF5 file.
> > >
> > > I am writing 19000 datasets of 21 values of 64 bits each (real numbers).
> > > I write one value at a time, extending each of the 19000 datasets by one
> > > element every time.
> > > All data are correctly written.
> > > But the generated file is more than 48 MB.
> > > I expected the total size of the file to be a little bigger than the raw
> > > data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
> > > If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file, which
> > > means each empty dataset is about 400 bytes.
> > > I guess I could create a ~10 MB (6 MB + 3.2 MB) HDF5 file that contains
> > > everything.
> > >
> > > For comparison, if I write everything to a text file, where each real
> > > number is written with 15 characters, I obtain a 6 MB CSV file.
> > >
> > > Question 1)
> > > Is this behaviour normal?
> > >
> > > Question 2)
> > > Can extending a dataset each time we write data into it significantly
> > > increase the total required disk space?
> > > Can preallocating the dataset and writing by hyperslab save some space?
> > > Can the chunk parameters impact the size of the generated HDF5 file?
> > >
> > > Question 3)
> > > If I pack everything into a compound dataset with 19000 columns, will the
> > > resulting file be smaller?
> > >
> > > N.B:
> > > Looking at the example that generates 100000 groups (grplots.c), the
> > > size of the generated HDF5 file is 78 MB for 100000 empty groups.
> > > That means each group is about 780 bytes.
> > > https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
> > >
> > > Guillaume Jacquenot
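For what it is worth, the "latest format" suggestion above goes through a file access property list. A minimal Fortran sketch, assuming the wrapper h5pset_libver_bounds_f is available in your build (the file name is illustrative, and files written this way need HDF5 1.10.x or later to be read):

! Hedged sketch: request the most recent file format, which uses more compact
! object metadata. The file name is illustrative; readers need HDF5 >= 1.10.
program libver_demo
  use hdf5
  implicit none
  integer(hid_t) :: fapl_id, file_id
  integer        :: ierr

  call h5open_f(ierr)
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierr)
  call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, &
                              H5F_LIBVER_LATEST_F, ierr)
  call h5fcreate_f("latest_format.h5", H5F_ACC_TRUNC_F, file_id, ierr, &
                   access_prp=fapl_id)
  ! ... create datasets as usual ...
  call h5fclose_f(file_id, ierr)
  call h5pclose_f(fapl_id, ierr)
  call h5close_f(ierr)
end program libver_demo

The newer format mainly shrinks the per-object metadata, which is presumably the overhead being measured here.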
> > > [email protected] > > > http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > > > Twitter: https://twitter.com/hdf5 > > > > -------------- next part -------------- > > An HTML attachment was scrubbed... > > URL: <http://lists.hdfgroup.org/pipermail/hdf-forum_lists. > > hdfgroup.org/attachments/20170523/b7107007/attachment-0001.html> > > > > ------------------------------ > > > > Message: 2 > > Date: Tue, 23 May 2017 08:46:07 -0700 > > From: Aaron Friesz <[email protected]> > > To: HDF Users Discussion List <[email protected]> > > Subject: Re: [Hdf-forum] Parallel file access recommendation > > Message-ID: > > <CAC4OLecz_6xCPWfXcvkJCjRm+DF+uttMP72VyxFKqPqGNOa2dg@mail. > > gmail.com> > > Content-Type: text/plain; charset="utf-8" > > > > A year or so back, we changed to BeeGFS as well. There were some issues > > getting parrallel I/O setup. First thing you want to do is run the > > parrallel mpio test. I believe they can be found here: > > https://support.hdfgroup.org/HDF5/Tutor/pprog.html. > > > > This will help you verify if your cluster has mpio setup correctly. If > > that doesn't work, you'll need to get in touch with the management group to > > fix that. > > > > Then you need to make sure you are using an HDF5 library that is configured > > to do parrallel I/O. > > > > I know there aren't a lot of specifics here, but it took me about two weeks > > of convincing to get my cluster management group to realize that things > > weren't working quite right. Once everything was setup, I was able to > > generate and write about 40 GB of data in around two minutes. > > > > On Tue, May 23, 2017 at 8:18 AM, Quincey Koziol <[email protected]> wrote: > > > > > Hi Jan, > > > > > > > On May 23, 2017, at 2:46 AM, Jan Oliver Oelerich < > > > [email protected]> wrote: > > > > > > > > Hello HDF users, > > > > > > > > I am using HDF5 through NetCDF and I recently changed my program so > > that > > > each MPI process writes its data directly to the output file as opposed > > to > > > the master process gathering the results and being the only one who does > > > I/O. > > > > > > > > Now I see that my program slows down file systems a lot (of the whole > > > HPC cluster) and I don't really know how to handle I/O. The file system > > is > > > a high throughput Beegfs system. > > > > > > > > My program uses a hybrid parallelization approach, i.e. work is split > > > into N MPI processes, each of which spawns M worker threads. Currently, I > > > write to the output file from each of the M*N threads, but the writing is > > > guarded by a mutex, so thread-safety shouldn't be a problem. Each writing > > > process is a complete `open file, write, close file` cycle. > > > > > > > > Each write is at a separate region of the HDF5 file, so no chunks are > > > shared among any two processes. The amount of data to be written per > > > process is 1/(M*N) times the size of the whole file. > > > > > > > > Shouldn't this be exactly how HDF5 + MPI is supposed to be used? What > > is > > > the `best practice` regarding parallel file access with HDF5? > > > > > > Yes, this is probably the correct way to operate, but generally > > > things are much better for this case when collective I/O operations are > > > used. Are you using collective or independent I/O? (Independent is the > > > default) > > > > > > Quincey > > > _______________________________________________ Hdf-forum is for HDF software users discussion. 
