Hello HDF5 community, hello Quincey,

I have tested versions 1.8.16 and 1.10.1, in both cases with the
h5pset_libver_bounds_f subroutine.

I have inserted these calls in my bench program:

    call h5open_f(error)
    call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, error)
    call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, &
                                H5F_LIBVER_LATEST_F, error)
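
As far as I understand, the bounds only take effect for files that are then
created or opened with that fapl, i.e. the fapl also has to reach
h5fcreate_f. A minimal sketch (the file name and variable names here are
illustrative):

    ! file_id is an integer(hid_t); the fapl is passed as access_prp
    call h5fcreate_f("results.h5", H5F_ACC_TRUNC_F, file_id, error, &
                     access_prp=fapl_id)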


However, I can't see any difference in the size of the generated HDF5 files.
Below are the size and md5sum of the files produced with the two HDF5
libraries and different numbers of elements (0, 1 and 2) in each dataset.



Version 1.8.16
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ee8157f1ce74936021b1958fb796741e *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:17 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
1790a5650bb945b17c0f8a4e59adec85 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:17 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
7d3dff2c6a1c29fa0fe827e4bd5ba79e *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:17 results.h5


Version 1.10.1
$ ./bench.exe 0 && md5sum results.h5 && ls -altr results.h5
ec8169773b9ea015c81fc4cb2205d727 *results.h5
-rw-r--r-- 1 xxxxx 1049089 1169632 May 24 09:12 results.h5

$ ./bench.exe 1 && md5sum results.h5 && ls -altr results.h5
fae64160fe79f4af0ef382fd1790bf76 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7481632 May 24 09:14 results.h5

$ ./bench.exe 2 && md5sum results.h5 && ls -altr results.h5
20aaf160b3d8ab794ab8c14a604dacc5 *results.h5
-rw-r--r-- 1 xxxxx 1049089 7505632 May 24 09:14 results.h5





2017-05-23 19:12 GMT+02:00 Guillaume Jacquenot <
[email protected]>:

> Hello Quincey
>
> I am using version 1.8.16
>
> I am using chunks of size 1.
> I have tried contiguous datasets, but I get an error at runtime (see below).
>
> I have written a test program that creates 3000 datasets filled with 64-bit
> floating point numbers.
> I can specify a number n, which controls the number of times I save my data
> (the number of timesteps of a simulation in my case).
>
> To summarize, the test program does:
>
>     call hdf5_init(filename)
>     do i = 1, n
>         call hdf5_write(datatosave)
>     end do
>     call hdf5_close()
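>
> Schematically, each hdf5_write call extends every (chunked, unlimited-maxdim)
> dataset by one element and writes the new value through a hyperslab, roughly
> like this for one dataset (simplified sketch, illustrative names):
>
>     ! dims, offset, count: integer(hsize_t), dimension(1)
>     ! newval:              real(kind=8),     dimension(1)
>     dims(1)   = i        ! new total length after timestep i
>     offset(1) = i - 1    ! 0-based position of the new element
>     count(1)  = 1
>     call h5dset_extent_f(dset_id, dims, error)
>     call h5dget_space_f(dset_id, filespace, error)
>     call h5sselect_hyperslab_f(filespace, H5S_SELECT_SET_F, offset, count, error)
>     call h5screate_simple_f(1, count, memspace, error)
>     call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, newval, count, error, &
>                     mem_space_id=memspace, file_space_id=filespace)
>     call h5sclose_f(memspace, error)
>     call h5sclose_f(filespace, error)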
>
>
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
> 370 bytes per empty dataset (totally reasonable).
> With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why such
> an increase?
> With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB,
> which is logical (3000*8*1/1e6 = 0.024 MB).
>
> When setting chunk size to 10, I obtain the following results
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which again corresponds to
> about 370 bytes per empty dataset.
> With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
> With n = 2, I get an HDF5 file of size 7.15 MB, which would correspond to an
> increase of 3000*8*10/1e6 MB (0.24 MB), which is logical.
>
> I don't understand the first increase in size. It does not make this data
> storage very efficient.
> Do you think a compound dataset with 3000 columns would show the same
> behaviour? I have not tried it, since I don't know how to map the content of
> an array when calling the h5dwrite_f function for a compound dataset.
>
>
> If I request 30000 datasets instead, I observe the same behaviour:
> n = 0 -> 10.9 MB
> n = 1 -> 73.2 MB
>
> Thanks
>
>
>
> Here is the error I get with contiguous datasets:
>
>
>   #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable
> to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to
> create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert
> link
>     major: Symbol table
>     minor: Unable to insert object
>   #004: hdf5-1.8.16/src/H5Gtraverse.c line 861  in H5G_traverse():
> internal path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real():
> traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create
> object
>     major: Object header
>     minor: Unable to initialize object
>   #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to
> open object
>     major: Object header
>     minor: Can't open object
>   #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to
> create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to
> construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct():
> extendible contiguous non-external dataset
>     major: Dataset
>     minor: Feature is unsupported
> HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C
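>
> Looking at frame #010 (H5D__contig_construct: extendible contiguous
> non-external dataset / Feature is unsupported), I suppose the problem is
> that I kept an unlimited maxdims while asking for the contiguous layout. A
> fixed-size contiguous dataset would presumably be created more like this
> (untested sketch, illustrative names):
>
>     ! fixed dims, no H5S_UNLIMITED_F maxdims, so the contiguous layout is
>     ! allowed -- but the dataset can then no longer be extended
>     dims(1) = nsteps
>     call h5screate_simple_f(1, dims, space_id, error)
>     call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, error)
>     call h5pset_layout_f(dcpl_id, H5D_CONTIGUOUS_F, error)
>     call h5dcreate_f(file_id, "dset_0001", H5T_NATIVE_DOUBLE, space_id, &
>                      dset_id, error, dcpl_id)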
>
> 2017-05-23 19:00 GMT+02:00 <[email protected]>:
>
>>
>>
>> Date: Tue, 23 May 2017 08:22:59 -0700
>> From: Quincey Koziol <[email protected]>
>> To: HDF Users Discussion List <[email protected]>
>> Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
>>
>> Hi Guillaume,
>>         Are you using chunked or contiguous datasets?  If chunked, what
>> size are you using?  Also, can you use the 'latest' version of the format,
>> which should be smaller, but is only compatible with HDF5 1.10.x or later?
>> (i.e. H5Pset_libver_bounds with 'latest' for low and high bounds,
>> https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm )
>>
>>         Quincey
>>
>>
>> > On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <
>> [email protected]> wrote:
>> >
>> > Hello everyone!
>> >
>> > I am creating an HDF5 file from a Fortran program, and I am confused
>> > about the size of the generated file.
>> >
>> > I am writing 19000 datasets, each holding 21 values of 64-bit reals.
>> > I write one value at a time, extending each of the 19000 datasets by one
>> > element every time.
>> > All data are correctly written.
>> > But the generated file is more than 48 MB.
>> > I expected the total size of the file to be a little bigger than the raw
>> > data, about 3.2 MB (21*19000*8/1e6 = 3.192 MB).
>> > If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file, which
>> > means each empty dataset is about 400 bytes.
>> > I would therefore expect to be able to produce a ~10 MB (6 MB + 3.2 MB)
>> > HDF5 file that contains everything.
>> >
>> > For comparison, if I write everything to a text file, where each real
>> > number is written with 15 characters, I obtain a 6 MB CSV file.
>> >
>> > Question 1)
>> > Is this behaviour normal?
>> >
>> > Question 2)
>> > Can extending a dataset each time data is written into it significantly
>> > increase the total disk space required?
>> > Can preallocating the datasets and writing through hyperslabs save some
>> > space? (See the sketch below.)
>> > Can the chunk parameters affect the size of the generated HDF5 file?
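>> > Concretely, by preallocating I mean something like this rough sketch
>> > (illustrative names): create each dataset at its final size up front and
>> > write each timestep into it, instead of extending it:
>> >
>> >     ! allocate the full size up front (fixed dims, default layout)
>> >     dims(1) = 21   ! 21 values per dataset in my case
>> >     call h5screate_simple_f(1, dims, space_id, error)
>> >     call h5dcreate_f(file_id, "dset_0001", H5T_NATIVE_DOUBLE, space_id, &
>> >                      dset_id, error)
>> >     ! each timestep is then written into element i via h5dget_space_f,
>> >     ! h5sselect_hyperslab_f and h5dwrite_f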
>> >
>> > Question 3)
>> > If I pack everything into a compound dataset with 19000 columns, will
>> > the resulting file be smaller?
>> >
>> > N.B.:
>> > Looking at the example that generates 100000 groups (grplots.c), the size
>> > of the generated HDF5 file is 78 MB for 100000 empty groups.
>> > That means each group takes about 780 bytes.
>> > https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
>> >
>> > Guillaume Jacquenot
>>
>
