Hi Guillaume,
        As Pierre mentioned, a chunk size of 1 element is not reasonable and will 
generate a lot of metadata overhead.  Something closer to 1 MB of data per chunk 
would be much better.
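
For example, here is a minimal sketch in Fortran of what a larger chunk could 
look like (the value is only illustrative; pick a chunk that matches how much 
data each dataset will eventually hold, and this assumes the usual 'use hdf5' 
module as in your bench program):

    ! Illustrative only: dataset creation property list with a larger chunk.
    ! For 8-byte reals, 131072 elements per chunk is about 1 MB.
    integer(hid_t)   :: dcpl_id
    integer(hsize_t) :: chunk_dims(1)
    integer          :: error

    chunk_dims(1) = 131072
    call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
    call h5pset_chunk_f(dcpl_id, 1, chunk_dims, error)
    ! ... then pass dcpl_id to h5dcreate_f when creating each dataset ...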

        Quincey

> On May 24, 2017, at 12:23 AM, Guillaume Jacquenot 
> <[email protected]> wrote:
> 
> Hello HDF5 community, Quincey
> 
> I have tested versions 1.8.16 and 1.10.1, also with the h5pset_libver_bounds_f 
> subroutine.
> 
> I have inserted these calls in my bench program:
> 
>     call h5open_f(error)
>     call h5pcreate_f( H5P_FILE_ACCESS_F, fapl_id, error)
>     call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, 
> H5F_LIBVER_LATEST_F, error)
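> 
> For completeness, the libver bounds only take effect when the fapl is then 
> passed at file creation (or open); a minimal sketch of that call (file name 
> illustrative):
> 
>     call h5fcreate_f("results.h5", H5F_ACC_TRUNC_F, file_id, error, &
>                      access_prp = fapl_id)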
> 
> 
> However, I can't see any difference in the size of the generated HDF5 files.
> Below are the sizes and md5sums of the generated HDF5 files, for the two HDF5 
> library versions and different numbers of elements (0, 1 and 2) in each dataset:
> 
> 
> 
> Version 1.8.16
> $ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
> ee8157f1ce74936021b1958fb796741e *results.h5
> -rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:17 results.h5
> 
> $ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
> 1790a5650bb945b17c0f8a4e59adec85 *results.h5
> -rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:17 results.h5
> 
> $ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
> 7d3dff2c6a1c29fa0fe827e4bd5ba79e *results.h5
> -rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:17 results.h5
> 
> 
> Version 1.10.1
> $ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
> ec8169773b9ea015c81fc4cb2205d727 *results.h5
> -rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:12 results.h5
> 
> $ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
> fae64160fe79f4af0ef382fd1790bf76 *results.h5
> -rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:14 results.h5
> 
> $ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
> 20aaf160b3d8ab794ab8c14a604dacc5 *results.h5
> -rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:14 results.h5
> 
> 
> 
> 
> 
> 2017-05-23 19:12 GMT+02:00 Guillaume Jacquenot <[email protected]>:
> Hello Quincey
> 
> I am using version 1.8.16
> 
> I am using chunks of size 1.
> I have tried contiguous datasets, but I get an error at runtime (see the trace 
> below).
> 
> I have written a test program that creates 3000 datasets of 64-bit 
> floating-point numbers.
> I can specify a number n, which controls the number of times I save my data 
> (the number of timesteps of a simulation in my case).
> 
> To summarize, the test program does:
> 
>     call hdf5_init(filename)
>     do i = 1, n
>         call hdf5_write(datatosave)
>     end do
>     call hdf5_close()
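> 
> Each call to hdf5_write extends every dataset by one element and writes the 
> new value. For a single 1-D dataset, that step typically looks like this 
> (a sketch with illustrative names, not necessarily the actual wrapper code):
> 
>     integer(hid_t)   :: dset_id, filespace, memspace
>     integer(hsize_t) :: new_size(1), offset(1), count(1)
>     double precision :: val(1)
>     integer          :: error
> 
>     new_size(1) = i                   ! the dataset now holds i elements
>     call h5dset_extent_f(dset_id, new_size, error)
>     call h5dget_space_f(dset_id, filespace, error)
>     offset(1) = i - 1                 ! select the newly added last element
>     count(1)  = 1
>     call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, error)
>     call h5screate_simple_f(1, count, memspace, error)
>     call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, val, count, error, &
>                     memspace, filespace)
>     ! close memspace and filespace with h5sclose_f after each write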
> 
> 
> 
> With n = 0, I have an HDF5 file of size 1.11 Mo, which corresponds to about 
> 370 bytes per empty dataset (totally reasonable).
> With n = 1, I have an HDF5 file of size 7.13 Mo, which surprises me. Why such 
> an increase?
> With n = 2, I have an HDF5 file of size 7.15 Mo, an increase of 0.02 Mo, which 
> is logical (3000*8*1/1e6 = 0.024 Mo).
> 
> When setting the chunk size to 10, I obtain the following results:
> 
> With n = 0, I have an HDF5 file of size 1.11 Mo, which corresponds to about 
> 370 bytes per empty dataset.
> With n = 1, I have an HDF5 file of size 7.34 Mo, which surprises me.
> With n = 2, I have an HDF5 file of size 7.15 Mo, which leads to an increase of 
> 3000*8*10/1e6 Mo, which is logical.
> 
> I don't understand the first increase in size. It does not make this data 
> storage very efficient.
> Do you think a compound dataset with 3000 columns would present the same 
> behaviour? I have not tried, since I don't know how to map the content of an 
> array when calling the h5dwrite_f function for a compound dataset.
> 
> 
> If I ask for 30000 datasets, I observe the same behaviour:
> n=0 -> 10.9 Mo
> n=1 -> 73.2 Mo
> 
> Thanks
> 
> 
> 
> Here is the error I get with a contiguous dataset:
> 
> 
>   #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to 
> create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to 
> create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert 
> link
>     major: Symbol table
>     minor: Unable to insert object
>   #004: hdf5-1.8.16/src/H5Gtraverse.c line 861  in H5G_traverse(): internal 
> path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real(): 
> traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create 
> object
>     major: Object header
>     minor: Unable to initialize object
>   #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open 
> object
>     major: Object header
>     minor: Can't open object
>   #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to 
> create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to 
> construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct(): 
> extendible contiguous non-external dataset
>     major: Dataset
>     minor: Feature is unsupported
> HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C
> 
> 2017-05-23 19:00 GMT+02:00 <[email protected]>:
> 
> 
> Date: Tue, 23 May 2017 08:22:59 -0700
> From: Quincey Koziol <[email protected]>
> To: HDF Users Discussion List <[email protected]>
> Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Guillaume,
>         Are you using chunked or contiguous datasets?  If chunked, what size 
> are you using?  Also, can you use the "latest" version of the format, which 
> should be smaller, but is only compatible with HDF5 1.10.x or later?  (i.e. 
> H5Pset_libver_bounds with "latest" for low and high bounds, 
> https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm )
> 
>         Quincey
> 
> 
> > On May 23, 2017, at 3:02 AM, Guillaume Jacquenot 
> > <[email protected]> wrote:
> >
> > Hello everyone!
> >
> > I am creating an HDF5 file from a Fortran program, and I am confused about 
> > the size of my generated HDF5 file.
> >
> > I am writing 19000 datasets with 21 values of 64-bit real numbers each.
> > I write one value at a time, extending each of the 19000 datasets by one 
> > every time.
> > All data are correctly written.
> > But the generated file is more than 48 Mo.
> > I expected the total size of the file to be a little bigger than the raw 
> > data, about 3.2 Mo (21*19000*8 / 1e6 = 3.192 Mo).
> > If I only create the 19000 empty datasets, I obtain a 6 Mo HDF5 file, which 
> > means each empty dataset is about 400 bytes.
> > I guess I could create a ~10 Mo (6 Mo + 3.2 Mo) HDF5 file that can contain 
> > everything.
> >
> > For comparison, if I write everything to a text file, where each real 
> > number is written with 15 characters, I obtain a 6 Mo CSV file.
> >
> > Question 1)
> > Is this behaviour normal?
> >
> > Question 2)
> > Does extending dataset each time we write data inside can significantly 
> > increase the total required space disk size?
> > Does preallocating dataset and using hyperslab can save some space?
> > Does chunk parameters can impact the size of generated hdf5 file
> >
> > Question 3)
> > If I pack everything into a compound dataset with 19000 columns, will the 
> > resulting file be smaller?
> >
> > N.B:
> > When looking at the example of generating 100000 groups (grplots.c), the 
> > size of the generated HDF5 file is 78 Mo for 100000 empty groups.
> > That means each group is about 780 bytes.
> > https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
> >
> > Guillaume Jacquenot
> 

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
