I also agree that the library should not enforce any set chunk size, but
that was never in question. The issue is finding the best chunk cache
size when the user has not defined one.
It seems we all agree that the current value of 1MB is outdated.
I also understand that we need to weigh this against the concern of using
too much memory.

Taking the hyperslab size into account, together with the chunk size, is a
good idea; it would give us more information with which to calculate a
better chunk cache value (e.g. taking the maximum of the chunk size and the
hyperslab size in each dimension).
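
To make the idea concrete, below is a minimal sketch of such a heuristic.
The helper is my own illustration, not a proposed API, and the 521 slots
and 0.75 eviction policy are just placeholder values:

  #include <hdf5.h>

  /* Size the raw chunk cache from the per-dimension maximum of the chunk
   * shape and the hyperslab shape, so roughly the data touched by one
   * hyperslab access fits in the cache. */
  static herr_t guess_chunk_cache(hid_t dapl, int rank,
                                  const hsize_t *chunk_dims,
                                  const hsize_t *slab_dims,
                                  size_t elem_size)
  {
      size_t nbytes = elem_size;
      for (int i = 0; i < rank; i++) {
          hsize_t d = (chunk_dims[i] > slab_dims[i]) ? chunk_dims[i]
                                                     : slab_dims[i];
          nbytes *= (size_t)d;
      }
      return H5Pset_chunk_cache(dapl, 521, nbytes, 0.75);
  }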

Another possibility would be to give compression filters some more
information about the dataspace and allow them to set the chunk cache, but
that is a discussion for another thread.

The scenario of multiple user reads per chunk is not uncommon. For example,
my datasets contain many images, and to compress them efficiently I need to
chunk them with multiple images per chunk (as the images share common
features). The user usually looks at one image at a time, resulting in
multiple reads per chunk. I don't think such situations are atypical.
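
In that situation, the workaround on the reading side is to give the
dataset a cache that holds at least one full chunk, something like this
(the sizes and dataset name here are hypothetical):

  /* Open the dataset with a chunk cache big enough for one whole chunk,
   * so reading the images one at a time hits the cache instead of
   * decompressing the same chunk again for every image. */
  hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
  size_t chunk_bytes = images_per_chunk * image_bytes;
  H5Pset_chunk_cache(dapl, 521, chunk_bytes, H5D_CHUNK_CACHE_W0_DEFAULT);
  hid_t dset = H5Dopen2(file_id, "/images", dapl);
  H5Pclose(dapl);

But of course that only helps users who already know the chunk layout,
which is exactly the point of this discussion about defaults.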

Cheers,
Filipe

On 17 February 2016 at 08:32, Ger van Diepen <[email protected]> wrote:

> I fully agree with Elena that in general you cannot and should not set a
> predefined chunk cache size.
>
> However, I do believe that HDF5 can guess the chunk cache size based on
> the access pattern, provided the user has not already set it. Usually the
> access pattern is regular, so based on the hyperslab being accessed, it can
> assume that the next accesses will be for the next similar hyperslabs. A
> hint parameter could probably be used to tell the library that the next
> hyperslabs will be accessed. When the hyperslab shape changes, the user is
> probably starting another access pattern.
>
> Of course, the system can never cater for fully random access, but I
> believe that is not used very often. In such a case the user should always
> set the cache size.
>
> One can also think of some higher-level functionality where the user
> defines the cursor shape and access pattern, making it possible to size the
> cache automatically. Thereafter one can step through the dataset using a
> simple next function. Maybe it also makes optimizations in HDF5 possible,
> since the cursor shape and access pattern are known a priori (for instance,
> when the cursor shape equals the chunk shape while finding, say, the peak
> value in a dataset).
>
> Cheers,
>
> Ger
>
> >>> "David A. Schneider" <[email protected]> 2/16/2016 9:15 PM >>>
>
> Thanks Elena,
>
> After reading the comments at the end, I think I should try to write a
> bunch of small 1MB chunks and see what the read performance is. However,
> suppose this leads to 100 times as many chunks: I had the understanding
> that too many chunks degrade read performance in other ways, but maybe
> it will still be a win.
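>
> (Concretely, I suppose that just means picking smaller chunk dimensions on
> the dataset creation property list and measuring; the dims below are made
> up so that a chunk of 4-byte elements is about 1MB:
>
>   hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
>   hsize_t chunk_dims[2] = {512, 512};   /* 512 * 512 * 4 bytes = 1 MiB */
>   H5Pset_chunk(dcpl, 2, chunk_dims);
>
> and then comparing the read times.)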
>
> Those are good points about leaving the parameters for optimal
> performance to the applications, but it would be nice if there were a
> mechanism to allow the writing applications to be responsible for this,
> or at least to provide hints that the HDF5 library could decide whether it
> can support. Then if I am producing an h5 file that a scientist will use
> through a high-level h5 interface, the scientist can communicate the
> reading access pattern, and I can translate it into a chunk layout for
> writing and dataset chunk cache parameters for reading.
>
> best,
>
> David
>
> On 02/14/16 16:55, Elena Pourmal wrote:
> > Hi David and Filipe,
> >
> > Chunking with compression is a powerful feature that boosts performance
> > and saves space but, if not used correctly (as you rightfully noted),
> > leads to performance issues.
> >
> > We did discuss the solution you proposed and voted against it. While it
> > is reasonable to increase the current default chunk cache size from 1 MB
> > to ???, it would be unwise for the HDF5 library to use a chunk cache size
> > equal to a dataset's chunk size. We decided to leave it to applications to
> > determine the appropriate chunk cache size and strategies (for example,
> > use H5Pset_chunk_cache instead of H5Pset_cache, or disable the chunk
> > cache completely!).
> >
> >
> > Here are several reasons:
> >
> > 1. The chunk size can be pretty big because it worked well when the data
> > was written, but it may not work well for reading applications. An HDF5
> > application will use a lot of memory when working with such files,
> > especially if many files and datasets are open. We see this scenario very
> > often when users work with collections of HDF5 files (for example,
> > NPP satellite data; the attached paper discusses one of those use cases).
> >
> > 2. Making the chunk cache size the same as the chunk size will only solve
> > the performance problem when the data being written or read belongs to one
> > chunk. This is usually not the case. Suppose you have a row that spans
> > several chunks. When the application reads one row at a time, it will not
> > only use a lot of memory because the chunk cache is now big, but there will
> > also be the same performance problem you described in your email: the same
> > chunk will be read and discarded many times.
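> >
> > To put illustrative numbers on it: with 8-byte elements, 100 x 100
> > chunks (80 KB each), and rows of 10,000 elements, one row touches 100
> > chunks, so avoiding re-reads would take roughly 100 * 80 KB = 8 MB of
> > cache for that one dataset.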
> >
> > The way to deal with the performance problem is to adjust the access
> > pattern or to have a chunk cache that contains as many of the chunks
> > touched by the I/O operation as possible. The HDF5 library doesn't know
> > this a priori, and that is why we left it to applications. At this point
> > we don't see how we can help except by educating our users.
> >
> > I am attaching a white paper that will be posted on our Website; see
> section 4. Comments are highly appreciated.
> >
> > Thank you!
> >
> > Elena
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Elena Pourmal  The HDF Group  http://hdfgroup.org
> > 1800 So. Oak St., Suite 203, Champaign IL 61820
> > 217.531.6112
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> >
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
