Hello Quincey,

I am using HDF5 version 1.8.16.
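I have not tried the "latest" format bounds you mention. If I understand the reference manual, the call would look roughly like this from Fortran (an untested sketch; h5pset_libver_bounds_f and H5F_LIBVER_LATEST_F are my reading of the wrapper names):

  integer(hid_t) :: fapl_id, file_id
  integer        :: hdferr

  ! Ask for the newest file-format features for both bounds
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, hdferr)
  call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, H5F_LIBVER_LATEST_F, hdferr)
  call h5fcreate_f("test.h5", H5F_ACC_TRUNC_F, file_id, hdferr, access_prp=fapl_id)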
I am using chunked datasets with a chunk size of 1. I have also tried a contiguous dataset, but I get an error at runtime (the full error stack is at the end of this message).
I have written a test program that creates 3000 datasets filled with 64-bit floating point numbers.
I can specify a number n, which controls the number of times I save my data (the number of timesteps of a simulation, in my case).
To summarize, the test program does:
call hdf5_init(filename)
do i = 1, n
   call hdf5_write(datatosave)
end do
call hdf5_close()
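Inside hdf5_write, the pattern I follow for each dataset is roughly the following (a simplified sketch rather than my exact code; names are illustrative and error handling is omitted):

  integer(hid_t)   :: file_id, space_id, dcpl_id, dset_id
  integer(hid_t)   :: fspace_id, mspace_id
  integer(hsize_t) :: dims(1), maxdims(1), chunk(1), start(1), count(1)
  real(kind=8)     :: value(1)
  integer          :: hdferr

  ! Creation (after h5open_f): empty, extendible dataset;
  ! unlimited dimensions require a chunked layout
  dims    = (/ 0_hsize_t /)
  maxdims = (/ H5S_UNLIMITED_F /)
  chunk   = (/ 1_hsize_t /)
  call h5screate_simple_f(1, dims, space_id, hdferr, maxdims)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, hdferr)
  call h5pset_chunk_f(dcpl_id, 1, chunk, hdferr)
  call h5dcreate_f(file_id, "dset_0001", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, hdferr, dcpl_id)

  ! Each timestep: grow the dataset by one element and write the new value
  dims(1) = dims(1) + 1
  call h5dset_extent_f(dset_id, dims, hdferr)
  call h5dget_space_f(dset_id, fspace_id, hdferr)
  start = (/ dims(1) - 1 /)
  count = (/ 1_hsize_t /)
  call h5sselect_hyperslab_f(fspace_id, H5S_SELECT_SET_F, start, count, hdferr)
  call h5screate_simple_f(1, count, mspace_id, hdferr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, count, hdferr, &
                  mspace_id, fspace_id)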
With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about 370 bytes per empty dataset (totally reasonable).
With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why such an increase?
With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB, which is logical: 3000 * 8 * 1 / 1e6 = 0.024 MB.
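Doing the arithmetic on the first jump: (7.13 - 1.11) * 1e6 / 3000 ≈ 2000 bytes per dataset, so it looks as if each dataset pays a fixed indexing cost the first time a chunk is written (presumably the chunk B-tree), on top of its 8 bytes of actual data.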
When setting the chunk size to 10, I obtain the following results.
With n = 0, I get an HDF5 file of size 1.11 MB, which again corresponds to about 370 bytes per empty dataset.
With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
With n = 2, I get an HDF5 file of size 7.15 MB, which leads to an increase of 3000 * 8 * 10 / 1e6 MB, which is logical.
I don't understand the first increase in size; it does not make this data storage very efficient.
Do you think a compound dataset with 3000 columns would show the same behaviour? I have not tried it, since I don't know how to map the contents of an array when calling the h5dwrite_f function for a compound dataset.
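My guess, based on the compound.f90 example shipped with HDF5, is that one creates the full compound type for the file and a one-member compound type per column for writing, something like the sketch below (untested; column names and sizes are illustrative):

  integer(hid_t)   :: file_id, dtype_id, dt_col_id, dset_id, space_id
  integer(size_t)  :: dsize, offset
  integer(hsize_t) :: dims(1)
  real(kind=8)     :: col1(21)
  integer          :: hdferr

  call h5tget_size_f(H5T_NATIVE_DOUBLE, dsize, hdferr)

  ! File type: one compound with a member per column
  call h5tcreate_f(H5T_COMPOUND_F, 3000*dsize, dtype_id, hdferr)
  offset = 0
  call h5tinsert_f(dtype_id, "col0001", offset, H5T_NATIVE_DOUBLE, hdferr)
  ! ... repeat for the other columns, advancing offset by dsize each time

  dims = (/ 21_hsize_t /)
  call h5screate_simple_f(1, dims, space_id, hdferr)
  call h5dcreate_f(file_id, "table", dtype_id, space_id, dset_id, hdferr)

  ! Memory type for writing one column: a compound holding just that member
  call h5tcreate_f(H5T_COMPOUND_F, dsize, dt_col_id, hdferr)
  offset = 0
  call h5tinsert_f(dt_col_id, "col0001", offset, H5T_NATIVE_DOUBLE, hdferr)
  call h5dwrite_f(dset_id, dt_col_id, col1, dims, hdferr)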
If I create 30000 datasets instead, I observe the same behaviour:
n = 0 -> 10.9 MB
n = 1 -> 73.2 MB
Thanks
Here is the error I get with a contiguous dataset:
#001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to
create and link to dataset
major: Dataset
minor: Unable to initialize object
#002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to
create new link to object
major: Links
minor: Unable to initialize object
#003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert
link
major: Symbol table
minor: Unable to insert object
#004: hdf5-1.8.16/src/H5Gtraverse.c line 861 in H5G_traverse(): internal
path traversal failed
major: Symbol table
minor: Object not found
#005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real():
traversal operator failed
major: Symbol table
minor: Callback failed
#006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create
object
major: Object header
minor: Unable to initialize object
#007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open
object
major: Object header
minor: Can't open object
#008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to
create dataset
major: Dataset
minor: Unable to initialize object
#009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to
construct layout information
major: Dataset
minor: Unable to initialize object
#010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct():
extendible contiguous non-external dataset
major: Dataset
minor: Feature is unsupported
HDF5-DIAG: Error detected in HDF5 (1.8.16)
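As far as I can tell from the last entry of the stack ("extendible contiguous non-external dataset" / "Feature is unsupported"), the combination that triggers it is asking for an unlimited dimension while keeping the default contiguous layout, i.e. something like this sketch:

  maxdims = (/ H5S_UNLIMITED_F /)
  call h5screate_simple_f(1, dims, space_id, hdferr, maxdims)
  ! No chunked creation property list is passed, so the layout stays
  ! contiguous, and H5D__contig_construct rejects extendible datasets
  call h5dcreate_f(file_id, "dset_0001", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, hdferr)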
2017-05-23 19:00 GMT+02:00 <[email protected]>:
> Date: Tue, 23 May 2017 08:22:59 -0700
> From: Quincey Koziol <[email protected]>
> Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
>
> Hi Guillaume,
>         Are you using chunked or contiguous datasets? If chunked, what
> size are you using? Also, can you use the "latest" version of the format,
> which should be smaller, but is only compatible with HDF5 1.10.x or later?
> (i.e. H5Pset_libver_bounds with "latest" for low and high bounds,
> https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm )
>
> Quincey
>
>
> > On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <[email protected]> wrote:
> >
> > Hello everyone!
> >
> > I am creating an HDF5 file from a Fortran program, and I am confused
> > about the size of my generated HDF5 file.
> >
> > I am writing 19000 datasets with 21 values of 64 bits (real numbers).
> > I write one value at a time, extending each of the 19000 datasets by
> > one every time.
> > All data are correctly written.
> > But the generated file is more than 48 MB.
> > I expected the total size of the file to be a little bigger than the
> > raw data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
> > If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file,
> > which means each empty dataset is about 400 bytes.
> > I guess I could create a ~10 MB (6 MB + 3.2 MB) HDF5 file that can
> > contain everything.
> >
> > For comparison, if I write everything in a text file, where each real
> > number is written with 15 characters, I obtain a 6 MB CSV file.
> >
> > Question 1)
> > Is this behaviour normal?
> >
> > Question 2)
> > Can extending a dataset each time we write data into it significantly
> > increase the total required disk space?
> > Can preallocating datasets and using hyperslabs save some space?
> > Can the chunk parameters impact the size of the generated HDF5 file?
> >
> > Question 3)
> > If I pack everything into a compound dataset with 19000 columns, will
> > the resulting file be smaller?
> >
> > N.B.:
> > When looking at the example that generates 100000 groups (grplots.c),
> > the size of the generated HDF5 file is 78 MB for 100000 empty groups.
> > That means each group is about 780 bytes.
> > https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
> >
> > Guillaume Jacquenot
> >
> >
> >
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5