Thanks Elena,

After reading the comments at the end, I think I should try writing a bunch of small 1 MB chunks and see what the read performance looks like. However, suppose this leads to 100 times as many chunks: my understanding is that having too many chunks degrades read performance in other ways, but maybe it will still be a win.
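
For the test I have in mind, I would create the dataset along these lines (a
rough sketch with made-up dimensions; 256 x 512 doubles works out to exactly
1 MiB per chunk):

    #include "hdf5.h"   /* error checking omitted for brevity */

    hsize_t dims[2]  = {100000, 512};   /* made-up dataset shape             */
    hsize_t chunk[2] = {256, 512};      /* 256 * 512 * 8 bytes = 1 MiB/chunk */

    hid_t file  = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);       /* chunked layout, ~1 MiB chunks */
    hid_t dset  = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    /* ... write the data, then time reads with different access patterns ... */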

Those are good points about leaving the parameters for optimal performance to the applications, but it would be nice if there were a mechanism that let the writing application be responsible for this, or at least provide hints that the HDF5 library could decide whether it can honor. Then, if I am producing an h5 file that a scientist will use through a high-level h5 interface, the scientist can communicate the read access pattern to me, and I can translate it into a chunk layout for writing and dataset chunk cache parameters for reading.
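
On the reading side, the hint could then be turned into a per-dataset chunk
cache at open time, something like this sketch (the 16 MiB figure is just an
example for a pattern that touches up to 16 of those 1 MiB chunks per read):

    /* assumes #include "hdf5.h"; error checking omitted */
    hid_t file = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    /* rdcc_nslots: a prime roughly 100x the number of chunks that fit in the
       cache; rdcc_nbytes: 16 MiB; w0 = 1.0 evicts fully read chunks first. */
    H5Pset_chunk_cache(dapl, 1601, 16 * 1024 * 1024, 1.0);
    hid_t dset = H5Dopen2(file, "data", dapl);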

best,

David

On 02/14/16 16:55, Elena Pourmal wrote:
Hi David and Filipe,

Chunking and compression are powerful features that boost performance and save
space, but, as you rightly noted, they lead to performance issues when not used
correctly.

We did discuss the solution you proposed and voted against it. While it is
reasonable to increase the current default chunk cache size from 1 MB to ???, it
would be unwise for the HDF5 library to use a chunk cache size equal to the
dataset's chunk size. We decided to leave it to applications to determine the
appropriate chunk cache size and strategy (for example, use
H5Pset_chunk_cache instead of H5Pset_cache, or disable the chunk cache completely!)
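
For example (a small sketch, not from the attached paper), the cache can be set
file-wide on the file access property list, overridden per dataset, or switched
off for a particular dataset:

    /* File-wide default (mdc_nelmts is unused in HDF5 1.8+, so pass 0). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl, 0, 521, 4 * 1024 * 1024, 0.75);         /* 4 MiB  */
    hid_t file = H5Fopen("example.h5", H5F_ACC_RDONLY, fapl);

    /* Per-dataset override on the dataset access property list. */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 10007, 64 * 1024 * 1024, 0.75);   /* 64 MiB */
    hid_t dset = H5Dopen2(file, "big_dataset", dapl);

    /* Setting rdcc_nbytes to 0 effectively disables the chunk cache for
       datasets subsequently opened with this property list. */
    H5Pset_chunk_cache(dapl, 521, 0, 0.75);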


Here are several reasons:

1. The chunk size can be quite big because it worked well when the data was
written, but it may not work well for reading applications. An HDF5 application
will use a lot of memory when working with such files, especially if many files
and datasets are open. We see this scenario very often when users work with
collections of HDF5 files (for example, NPP satellite data; the attached paper
discusses one such use case).

2. Making the chunk cache size the same as the chunk size will only solve the
performance problem when the data that is written or read belongs to a single
chunk. This is not usually the case. Suppose you have a row that spans several
chunks. When an application reads one row at a time, not only will it use a lot
of memory because the chunk cache is now big, but there will also be the same
performance problem you described in your email: the same chunk will be read
and discarded many times.
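
To put rough numbers on it (an illustration only): take a 10000 x 10000 dataset
of doubles stored in 1000 x 1000 chunks, so each row crosses 10 chunks of about
8 MB each. A cache sized to hold a single chunk forces each of those 10 chunks
to be read and discarded once per row, so every chunk is fetched roughly 1000
times instead of once, while the cache itself already occupies 8 MB.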

The way to deal with the performance problem is to adjust the access pattern or
to use a chunk cache big enough to hold as many of the chunks touched by an I/O
operation as possible. The HDF5 library doesn't know this a priori, and that is
why we left it to applications. At this point we don't see how we can help
except by educating our users.
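
In the row-by-row example above, for instance, an application that knows its
access pattern could size the per-dataset cache to hold the whole row of chunks
(a sketch with the same illustrative numbers):

    /* file is an already-open hid_t file handle */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    /* 10 chunks x 1000 x 1000 doubles x 8 bytes = 80 MB; rdcc_nslots is a
       prime comfortably above 100 x 10 chunks. */
    H5Pset_chunk_cache(dapl, 1009, (size_t)10 * 1000 * 1000 * 8, 0.75);
    hid_t dset = H5Dopen2(file, "data", dapl);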

I am attaching a white paper that will be posted on our Website; see section 4. 
Comments are highly appreciated.

Thank you!

Elena
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal  The HDF Group  http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

