Hello, I am chiming in because of the chunk size. Every chunk carries metadata, so a chunk should contain a non-negligible amount of data to avoid inefficiency and large file sizes. The guideline in the HDF5 documentation is a chunk size on the order of 1 MB.
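For illustration, here is a minimal Fortran sketch of an extendible dataset created with a roughly 1 MB chunk (131072 values of 8 bytes). The file and dataset names and the chunk length are only examples, not a recommendation for your particular data:

! Hedged sketch: an extendible, double-precision 1-D dataset whose chunk
! holds roughly 1 MB of data (131072 values * 8 bytes). All names and the
! chunk length are illustrative.
program chunk_demo
  use hdf5
  implicit none
  integer(hid_t)   :: file_id, space_id, dcpl_id, dset_id
  integer(hsize_t) :: dims(1), maxd(1), chunk(1)
  integer          :: ierr

  call h5open_f(ierr)
  dims  = (/ 0_hsize_t /)            ! start empty
  maxd  = (/ H5S_UNLIMITED_F /)      ! extendible dataset
  chunk = (/ 131072_hsize_t /)       ! ~1 MB per chunk, per the guideline above

  call h5fcreate_f("chunk_demo.h5", H5F_ACC_TRUNC_F, file_id, ierr)
  call h5screate_simple_f(1, dims, space_id, ierr, maxd)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  call h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
  call h5dcreate_f(file_id, "signal", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, ierr, dcpl_id)

  call h5dclose_f(dset_id, ierr)
  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program chunk_demo

For datasets that stay very small, a chunk this large makes no sense; the point is only that each chunk should hold enough data to amortize its metadata.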
Regards,
Pierre

On Tue, May 23, 2017 at 07:12:47PM +0200, Guillaume Jacquenot wrote:
> Hello Quincey
>
> I am using version 1.8.16.
> I am using chunks of size 1.
> I have tried a contiguous dataset, but I get an error at runtime.
>
> I have written a test program that creates 3000 datasets filled with 64-bit
> floating-point numbers.
> I can specify a number n, which controls how many times I save my data
> (the number of timesteps of a simulation in my case).
>
> To sum up, the test program does:
>
>     call hdf5_init(filename)
>     do i = 1, n
>         call hdf5_write(datatosave)
>     end do
>     call hdf5_close()
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
> 370 bytes per empty dataset (totally reasonable).
> With n = 1, I get an HDF5 file of size 7.13 MB, which surprises me. Why
> such an increase?
> With n = 2, I get an HDF5 file of size 7.15 MB, an increase of 0.02 MB,
> which is logical: 3000*8*1/1e6 = 0.024 MB.
>
> When setting the chunk size to 10, I obtain the following results:
>
> With n = 0, I get an HDF5 file of size 1.11 MB, which corresponds to about
> 370 bytes per empty dataset.
> With n = 1, I get an HDF5 file of size 7.34 MB, which surprises me.
> With n = 2, I get an HDF5 file of size 7.15 MB, which leads to an increase
> of 3000*8*10/1e6 MB, which is logical.
>
> I don't understand the first increase in size. It does not make this data
> storage very efficient.
> Do you think a compound dataset with 3000 columns would show the same
> behaviour? I have not tried, since I don't know how to map the content of
> an array when calling the h5dwrite_f function for a compound dataset.
>
> If I ask for 30000 datasets, I observe the same behaviour:
> n=0 -> 10.9 MB
> n=1 -> 73.2 MB
>
> Thanks
>
> Here is the error I get with a contiguous dataset:
>
>   #001: hdf5-1.8.16/src/H5Dint.c line 453 in H5D__create_named(): unable to create and link to dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #002: hdf5-1.8.16/src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
>     major: Links
>     minor: Unable to initialize object
>   #003: hdf5-1.8.16/src/H5L.c line 1882 in H5L_create_real(): can't insert link
>     major: Symbol table
>     minor: Unable to insert object
>   #004: hdf5-1.8.16/src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
>     major: Symbol table
>     minor: Object not found
>   #005: hdf5-1.8.16/src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
>     major: Symbol table
>     minor: Callback failed
>   #006: hdf5-1.8.16/src/H5L.c line 1685 in H5L_link_cb(): unable to create object
>     major: Object header
>     minor: Unable to initialize object
>   #007: hdf5-1.8.16/src/H5O.c line 3016 in H5O_obj_create(): unable to open object
>     major: Object header
>     minor: Can't open object
>   #008: hdf5-1.8.16/src/H5Doh.c line 293 in H5O__dset_create(): unable to create dataset
>     major: Dataset
>     minor: Unable to initialize object
>   #009: hdf5-1.8.16/src/H5Dint.c line 1056 in H5D__create(): unable to construct layout information
>     major: Dataset
>     minor: Unable to initialize object
>   #010: hdf5-1.8.16/src/H5Dcontig.c line 422 in H5D__contig_construct(): extendible contiguous non-external dataset
>     major: Dataset
>     minor: Feature is unsupported
> HDF5-DIAG: Error detected in HDF5 (1.8.16) t^C
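The last entry of the trace above ("extendible contiguous non-external dataset", "Feature is unsupported") is HDF5 refusing a contiguous layout for a dataset with an unlimited maximum dimension: only chunked datasets can be extended. As a point of reference, here is a hedged Fortran sketch of the extend-then-write pattern the test program seems to use, shown for a single dataset and with an illustrative chunk of 1024 elements rather than 1:

! Hedged sketch: append one real(8) value per timestep to an extendible,
! chunked 1-D dataset. Only one dataset is shown; the original test program
! repeats this over 3000 datasets. Names and the chunk size are illustrative.
program append_demo
  use hdf5
  implicit none
  integer(hid_t)     :: file_id, space_id, dcpl_id, dset_id, fspace, mspace
  integer(hsize_t)   :: dims(1), maxd(1), chunk(1), offset(1), count(1)
  real(kind=8)       :: value(1)
  integer            :: ierr, step
  integer, parameter :: nsteps = 3

  call h5open_f(ierr)
  call h5fcreate_f("append_demo.h5", H5F_ACC_TRUNC_F, file_id, ierr)

  dims  = (/ 0_hsize_t /)            ! start empty
  maxd  = (/ H5S_UNLIMITED_F /)      ! extendible, so chunked layout is mandatory
  chunk = (/ 1024_hsize_t /)         ! much larger than 1 to limit per-chunk overhead
  call h5screate_simple_f(1, dims, space_id, ierr, maxd)
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  call h5pset_chunk_f(dcpl_id, 1, chunk, ierr)
  call h5dcreate_f(file_id, "dataset_0001", H5T_NATIVE_DOUBLE, space_id, &
                   dset_id, ierr, dcpl_id)

  count = (/ 1_hsize_t /)
  call h5screate_simple_f(1, count, mspace, ierr)
  do step = 1, nsteps
     value(1) = real(step, kind=8)
     dims(1)  = int(step, hsize_t)
     call h5dset_extent_f(dset_id, dims, ierr)          ! grow by one element
     call h5dget_space_f(dset_id, fspace, ierr)
     offset = (/ int(step - 1, hsize_t) /)
     call h5sselect_hyperslab_f(fspace, H5S_SELECT_SET_F, offset, count, ierr)
     call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, value, count, ierr, &
                     mem_space_id=mspace, file_space_id=fspace)
     call h5sclose_f(fspace, ierr)
  end do

  call h5sclose_f(mspace, ierr)
  call h5dclose_f(dset_id, ierr)
  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program append_demo

With a chunk length of 1, every value gets its own chunk plus its own index entry, which would be consistent with the jump you observe at n = 1.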
>
> 2017-05-23 19:00 GMT+02:00 <[email protected]>:
>
> > Today's Topics:
> >
> >    1. Re: Questions about size of generated Hdf5 files (Quincey Koziol)
> >    2. Re: Parallel file access recommendation (Aaron Friesz)
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 23 May 2017 08:22:59 -0700
> > From: Quincey Koziol <[email protected]>
> > To: HDF Users Discussion List <[email protected]>
> > Subject: Re: [Hdf-forum] Questions about size of generated Hdf5 files
> >
> > Hi Guillaume,
> >         Are you using chunked or contiguous datasets? If chunked, what
> > size are you using? Also, can you use the "latest" version of the format,
> > which should be smaller, but is only compatible with HDF5 1.10.x or later?
> > (i.e. H5Pset_libver_bounds with "latest" for the low and high bounds,
> > https://support.hdfgroup.org/HDF5/doc/RM/H5P/H5Pset_libver_bounds.htm)
> >
> >         Quincey
> >
> > > On May 23, 2017, at 3:02 AM, Guillaume Jacquenot <[email protected]> wrote:
> > >
> > > Hello everyone!
> > >
> > > I am creating an HDF5 file from a Fortran program, and I am confused
> > > about the size of my generated HDF5 file.
> > >
> > > I am writing 19000 datasets of 21 values of 64 bits each (real numbers).
> > > I write one value at a time, extending each of the 19000 datasets by one
> > > element every time.
> > > All data are correctly written.
> > > But the generated file is more than 48 MB.
> > > I expected the total size of the file to be a little bigger than the raw
> > > data, about 3.2 MB (21*19000*8 / 1e6 = 3.192 MB).
> > > If I only create 19000 empty datasets, I obtain a 6 MB HDF5 file, which
> > > means each empty dataset is about 400 bytes.
> > > I guess I could create a ~10 MB (6 MB + 3.2 MB) HDF5 file that contains
> > > everything.
> > >
> > > For comparison, if I write everything to a text file, where each real
> > > number is written with 15 characters, I obtain a 6 MB CSV file.
> > >
> > > Question 1)
> > > Is this behaviour normal?
> > >
> > > Question 2)
> > > Can extending a dataset each time we write data into it significantly
> > > increase the total required disk space?
> > > Can preallocating the dataset and writing by hyperslab save some space?
> > > Can the chunk parameters impact the size of the generated HDF5 file?
> > >
> > > Question 3)
> > > If I pack everything into a compound dataset with 19000 columns, will the
> > > resulting file be smaller?
> > >
> > > N.B:
> > > Looking at the example that generates 100000 groups (grplots.c), the
> > > size of the generated HDF5 file is 78 MB for 100000 empty groups.
> > > That means each group is about 780 bytes.
> > > https://support.hdfgroup.org/ftp/HDF5/examples/howto/crtmany/grplots.c
> > >
> > > Guillaume Jacquenot
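For what it is worth, the "latest format" suggestion above goes through a file access property list. A minimal Fortran sketch, assuming the wrapper h5pset_libver_bounds_f is available in your build (the file name is illustrative, and files written this way need HDF5 1.10.x or later to be read):

! Hedged sketch: request the most recent file format, which uses more compact
! object metadata. The file name is illustrative; readers need HDF5 >= 1.10.
program libver_demo
  use hdf5
  implicit none
  integer(hid_t) :: fapl_id, file_id
  integer        :: ierr

  call h5open_f(ierr)
  call h5pcreate_f(H5P_FILE_ACCESS_F, fapl_id, ierr)
  call h5pset_libver_bounds_f(fapl_id, H5F_LIBVER_LATEST_F, &
                              H5F_LIBVER_LATEST_F, ierr)
  call h5fcreate_f("latest_format.h5", H5F_ACC_TRUNC_F, file_id, ierr, &
                   access_prp=fapl_id)
  ! ... create datasets as usual ...
  call h5fclose_f(file_id, ierr)
  call h5pclose_f(fapl_id, ierr)
  call h5close_f(ierr)
end program libver_demo

The newer format mainly shrinks the per-object metadata, which is presumably the overhead being measured here.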
> > > [email protected] > > > http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org > > > Twitter: https://twitter.com/hdf5 > > > > -------------- next part -------------- > > An HTML attachment was scrubbed... > > URL: <http://lists.hdfgroup.org/pipermail/hdf-forum_lists. > > hdfgroup.org/attachments/20170523/b7107007/attachment-0001.html> > > > > ------------------------------ > > > > Message: 2 > > Date: Tue, 23 May 2017 08:46:07 -0700 > > From: Aaron Friesz <[email protected]> > > To: HDF Users Discussion List <[email protected]> > > Subject: Re: [Hdf-forum] Parallel file access recommendation > > Message-ID: > > <CAC4OLecz_6xCPWfXcvkJCjRm+DF+uttMP72VyxFKqPqGNOa2dg@mail. > > gmail.com> > > Content-Type: text/plain; charset="utf-8" > > > > A year or so back, we changed to BeeGFS as well. There were some issues > > getting parrallel I/O setup. First thing you want to do is run the > > parrallel mpio test. I believe they can be found here: > > https://support.hdfgroup.org/HDF5/Tutor/pprog.html. > > > > This will help you verify if your cluster has mpio setup correctly. If > > that doesn't work, you'll need to get in touch with the management group to > > fix that. > > > > Then you need to make sure you are using an HDF5 library that is configured > > to do parrallel I/O. > > > > I know there aren't a lot of specifics here, but it took me about two weeks > > of convincing to get my cluster management group to realize that things > > weren't working quite right. Once everything was setup, I was able to > > generate and write about 40 GB of data in around two minutes. > > > > On Tue, May 23, 2017 at 8:18 AM, Quincey Koziol <[email protected]> wrote: > > > > > Hi Jan, > > > > > > > On May 23, 2017, at 2:46 AM, Jan Oliver Oelerich < > > > [email protected]> wrote: > > > > > > > > Hello HDF users, > > > > > > > > I am using HDF5 through NetCDF and I recently changed my program so > > that > > > each MPI process writes its data directly to the output file as opposed > > to > > > the master process gathering the results and being the only one who does > > > I/O. > > > > > > > > Now I see that my program slows down file systems a lot (of the whole > > > HPC cluster) and I don't really know how to handle I/O. The file system > > is > > > a high throughput Beegfs system. > > > > > > > > My program uses a hybrid parallelization approach, i.e. work is split > > > into N MPI processes, each of which spawns M worker threads. Currently, I > > > write to the output file from each of the M*N threads, but the writing is > > > guarded by a mutex, so thread-safety shouldn't be a problem. Each writing > > > process is a complete `open file, write, close file` cycle. > > > > > > > > Each write is at a separate region of the HDF5 file, so no chunks are > > > shared among any two processes. The amount of data to be written per > > > process is 1/(M*N) times the size of the whole file. > > > > > > > > Shouldn't this be exactly how HDF5 + MPI is supposed to be used? What > > is > > > the `best practice` regarding parallel file access with HDF5? > > > > > > Yes, this is probably the correct way to operate, but generally > > > things are much better for this case when collective I/O operations are > > > used. Are you using collective or independent I/O? (Independent is the > > > default) > > > > > > Quincey > > > _______________________________________________ Hdf-forum is for HDF software users discussion. 
