Random access is common in use cases where statistics are generated for a random sample of a dataset.
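For that kind of random sampling the chunk cache mostly just costs memory: each chunk is typically touched once, so nothing is ever reused. The cache can be disabled per dataset. A minimal sketch using the standard H5Pset_chunk_cache call (the dataset name is hypothetical):

    #include <hdf5.h>

    /* Open a dataset with the raw chunk cache disabled.  With a
     * zero-size cache no chunk ever fits, so chunks are read (and
     * decompressed) directly and discarded instead of being cached.
     * Only the dataset opened with this access plist is affected. */
    hid_t open_for_random_access(hid_t file_id)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 0 /* nslots */, 0 /* nbytes */,
                           H5D_CHUNK_CACHE_W0_DEFAULT);
        hid_t dset = H5Dopen2(file_id, "/samples", dapl); /* hypothetical name */
        H5Pclose(dapl); /* safe: the plist is copied by H5Dopen2 */
        return dset;
    }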
On Wed, Feb 17, 2016 at 3:32 AM, Ger van Diepen <[email protected]> wrote:

> I fully agree with Elena that in general you cannot and should not set a
> predefined chunk cache size.
>
> However, I do believe that HDF5 could guess the chunk cache size from the
> access pattern, provided the user has not already set it. Usually the
> access pattern is regular, so based on the hyperslab being accessed, it
> can assume that the next accesses will be for the next similar
> hyperslabs. A hint parameter could probably be used to tell how the next
> hyperslabs will be accessed. When the hyperslab shape changes, the user
> is probably starting another access pattern.
>
> Of course, the system can never cater for fully random access, but I
> believe that is not used very often. In such a case the user should
> always set the cache size.
>
> One can also think of some higher-level functionality where the user
> defines the cursor shape and access pattern, making it possible to size
> the cache automatically. Thereafter one can step through the dataset
> using a simple next function. It might also make optimizations in HDF5
> possible, since the cursor shape and access pattern are known a priori
> (for instance, if the cursor shape is the chunk shape when finding, say,
> the peak value in a dataset).
>
> Cheers,
>
> Ger
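To make Ger's suggestion concrete, the sizing such a cursor could drive is simple arithmetic: from the cursor (hyperslab) shape and the chunk shape, one can compute how many chunks a single access touches, and hence how large the cache must be to hold them all. The helper below is purely illustrative; nothing like it exists in the HDF5 API today:

    #include <stddef.h>

    /* Chunk cache size (in bytes) needed so that every chunk touched
     * by one hyperslab access fits in the cache at the same time.
     * Assumes hyperslabs aligned to chunk boundaries; an unaligned
     * slab can touch one extra chunk per dimension. */
    size_t cache_bytes_for_cursor(int rank, const size_t *slab_dims,
                                  const size_t *chunk_dims, size_t elem_size)
    {
        size_t chunks_touched = 1, chunk_bytes = elem_size;
        for (int d = 0; d < rank; d++) {
            chunks_touched *= (slab_dims[d] + chunk_dims[d] - 1) / chunk_dims[d];
            chunk_bytes *= chunk_dims[d];
        }
        return chunks_touched * chunk_bytes;
    }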
> >>> "David A. Schneider" <[email protected]> 2/16/2016 9:15 PM >>>
>
> Thanks Elena,
>
> After reading the comments at the end, I think I should try writing a
> bunch of small 1 MB chunks and see what the read performance is. Suppose,
> however, that this leads to 100 times as many chunks: my understanding is
> that too many chunks degrades read performance in other ways, but maybe
> it will still be a win.
>
> Those are good points about leaving the parameters for optimal
> performance to the applications, but it would be nice if there were a
> mechanism that made the writing applications responsible for this, or at
> least let them provide hints that the HDF5 library could decide whether
> to honor. Then, if I am producing an h5 file that a scientist will use
> through a high-level h5 interface, the scientist can communicate the
> reading access pattern, and I can translate it into a chunk layout for
> writing and dataset chunk cache parameters for reading.
>
> best,
>
> David
>
> On 02/14/16 16:55, Elena Pourmal wrote:
> > Hi David and Filipe,
> >
> > Chunking and compression are powerful features that boost performance
> > and save space but, if not used correctly (as you rightfully noted),
> > lead to performance issues.
> >
> > We did discuss the solution you proposed and voted against it. While
> > it is reasonable to increase the current default chunk cache size from
> > 1 MB to ???, it would be unwise for the HDF5 library to use a chunk
> > cache size equal to a dataset's chunk size. We decided to leave it to
> > applications to determine the appropriate chunk cache size and
> > strategies (for example, use H5Pset_chunk_cache instead of
> > H5Pset_cache, or disable the chunk cache completely!)
> >
> > Here are several reasons:
> >
> > 1. Chunk size can be pretty big because it worked well when the data
> > was written, but it may not work well for reading applications. An
> > HDF5 application will use a lot of memory when working with such
> > files, especially if many files and datasets are open. We see this
> > scenario very often when users work with collections of HDF5 files
> > (for example, NPP satellite data; the attached paper discusses one of
> > those use cases).
> >
> > 2. Making the chunk cache size the same as the chunk size will only
> > solve the performance problem when the data being written or read
> > belongs to one chunk. This is not usually the case. Suppose you have a
> > row that spans several chunks. When an application reads one row at a
> > time, it will not only use a lot of memory because the chunk cache is
> > now big, but there will also be the same performance problem you
> > described in your email: the same chunk will be read and discarded
> > many times.
> >
> > The way to deal with the performance problem is to adjust the access
> > pattern or use a chunk cache that holds as many chunks as possible for
> > the I/O operation. The HDF5 library doesn't know this a priori, and
> > that is why we left it to applications. At this point we don't see how
> > we can help except by educating our users.
> >
> > I am attaching a white paper that will be posted on our website; see
> > section 4. Comments are highly appreciated.
> >
> > Thank you!
> >
> > Elena
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Elena Pourmal  The HDF Group  http://hdfgroup.org
> > 1800 So. Oak St., Suite 203, Champaign IL 61820
> > 217.531.6112
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
George N. White III <[email protected]>
Head of St. Margarets Bay, Nova Scotia
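A worked example of Elena's point 2 may help. Take a hypothetical 10000 x 8000 dataset of doubles stored in 100 x 100 chunks and read one row at a time: each row touches 80 chunks of ~80 KB each, so the default 1 MB cache holds only about 12 of them, and every chunk ends up being read and discarded roughly 100 times. Sizing the per-dataset cache to hold one full row of chunks avoids that. The calls below are the real HDF5 1.8+ API, but all names and sizes are made up for illustration:

    #include <hdf5.h>

    int main(void)
    {
        const size_t ncols = 8000, chunk_rows = 100, chunk_cols = 100;
        const size_t elem_size = sizeof(double);

        /* One row touches ceil(ncols / chunk_cols) chunks; all of them
         * must stay resident, or each is evicted and re-read
         * chunk_rows times while a block of rows is traversed. */
        size_t chunks_per_row = (ncols + chunk_cols - 1) / chunk_cols;  /* 80 */
        size_t chunk_bytes = chunk_rows * chunk_cols * elem_size;   /* ~80 KB */
        size_t cache_bytes = chunks_per_row * chunk_bytes;          /* ~6.4 MB */

        /* nslots should be a prime well above the number of cached
         * chunks to limit hash collisions; 8191 suits ~80 chunks. */
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 8191, cache_bytes, H5D_CHUNK_CACHE_W0_DEFAULT);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/images", dapl);  /* hypothetical names */

        /* ... row-by-row H5Dread loop goes here ... */

        H5Dclose(dset);
        H5Fclose(file);
        H5Pclose(dapl);
        return 0;
    }

Unlike H5Pset_cache, which sets one size for every dataset in the file, H5Pset_chunk_cache applies only to the dataset opened with this access property list, which is exactly the distinction Elena draws above.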
