Random access is common in use cases where statistics are generated for a random sample of a dataset.
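For that kind of random sampling the chunk cache mostly just costs memory: each chunk is typically touched once, so nothing is ever reused. The cache can be disabled per dataset. A minimal sketch using the standard H5Pset_chunk_cache call (the dataset name is hypothetical):

    #include <hdf5.h>

    /* Open a dataset with the raw chunk cache disabled.  With a
     * zero-size cache no chunk ever fits, so chunks are read (and
     * decompressed) directly and discarded instead of being cached.
     * Only the dataset opened with this access plist is affected. */
    hid_t open_for_random_access(hid_t file_id)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 0 /* nslots */, 0 /* nbytes */,
                           H5D_CHUNK_CACHE_W0_DEFAULT);
        hid_t dset = H5Dopen2(file_id, "/samples", dapl); /* hypothetical name */
        H5Pclose(dapl); /* safe: the plist is copied by H5Dopen2 */
        return dset;
    }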
On Wed, Feb 17, 2016 at 3:32 AM, Ger van Diepen <[email protected]> wrote:

> I fully agree with Elena that in general you cannot and should not set a
> predefined chunk cache size.
>
> However, I do believe that HDF5 could guess the chunk cache size from the
> access pattern, provided the user has not already set it. Usually the
> access pattern is regular, so based on the hyperslab being accessed, it
> can assume that the next accesses will be for the next similar
> hyperslabs. A hint parameter could probably be used to tell how the next
> hyperslabs will be accessed. When the hyperslab shape changes, the user
> is probably starting another access pattern.
>
> Of course, the system can never cater for fully random access, but I
> believe that is not used very often. In such a case the user should
> always set the cache size.
>
> One can also think of some higher-level functionality where the user
> defines the cursor shape and access pattern, making it possible to size
> the cache automatically. Thereafter one can step through the dataset
> using a simple next function. It might also make optimizations in HDF5
> possible, since the cursor shape and access pattern are known a priori
> (for instance, if the cursor shape is the chunk shape when finding, say,
> the peak value in a dataset).
>
> Cheers,
>
> Ger
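To make Ger's suggestion concrete, the sizing such a cursor could drive is simple arithmetic: from the cursor (hyperslab) shape and the chunk shape, one can compute how many chunks a single access touches, and hence how large the cache must be to hold them all. The helper below is purely illustrative; nothing like it exists in the HDF5 API today:

    #include <stddef.h>

    /* Chunk cache size (in bytes) needed so that every chunk touched
     * by one hyperslab access fits in the cache at the same time.
     * Assumes hyperslabs aligned to chunk boundaries; an unaligned
     * slab can touch one extra chunk per dimension. */
    size_t cache_bytes_for_cursor(int rank, const size_t *slab_dims,
                                  const size_t *chunk_dims, size_t elem_size)
    {
        size_t chunks_touched = 1, chunk_bytes = elem_size;
        for (int d = 0; d < rank; d++) {
            chunks_touched *= (slab_dims[d] + chunk_dims[d] - 1) / chunk_dims[d];
            chunk_bytes *= chunk_dims[d];
        }
        return chunks_touched * chunk_bytes;
    }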
> >>> "David A. Schneider" <[email protected]> 2/16/2016 9:15 PM >>>
>
> Thanks Elena,
>
> After reading the comments at the end, I think I should try writing a
> bunch of small 1 MB chunks and see what the read performance is. Suppose,
> however, that this leads to 100 times as many chunks: my understanding is
> that too many chunks degrades read performance in other ways, but maybe
> it will still be a win.
>
> Those are good points about leaving the parameters for optimal
> performance to the applications, but it would be nice if there were a
> mechanism that made the writing applications responsible for this, or at
> least let them provide hints that the HDF5 library could decide whether
> to honor. Then, if I am producing an h5 file that a scientist will use
> through a high-level h5 interface, the scientist can communicate the
> reading access pattern, and I can translate it into a chunk layout for
> writing and dataset chunk cache parameters for reading.
>
> best,
>
> David
>
> On 02/14/16 16:55, Elena Pourmal wrote:
> > Hi David and Filipe,
> >
> > Chunking and compression are powerful features that boost performance
> > and save space but, if not used correctly (as you rightfully noted),
> > lead to performance issues.
> >
> > We did discuss the solution you proposed and voted against it. While
> > it is reasonable to increase the current default chunk cache size from
> > 1 MB to ???, it would be unwise for the HDF5 library to use a chunk
> > cache size equal to a dataset's chunk size. We decided to leave it to
> > applications to determine the appropriate chunk cache size and
> > strategies (for example, use H5Pset_chunk_cache instead of
> > H5Pset_cache, or disable the chunk cache completely!)
> >
> > Here are several reasons:
> >
> > 1. Chunk size can be pretty big because it worked well when the data
> > was written, but it may not work well for reading applications. An
> > HDF5 application will use a lot of memory when working with such
> > files, especially if many files and datasets are open. We see this
> > scenario very often when users work with collections of HDF5 files
> > (for example, NPP satellite data; the attached paper discusses one of
> > those use cases).
> >
> > 2. Making the chunk cache size the same as the chunk size will only
> > solve the performance problem when the data being written or read
> > belongs to one chunk. This is not usually the case. Suppose you have a
> > row that spans several chunks. When an application reads one row at a
> > time, it will not only use a lot of memory because the chunk cache is
> > now big, but there will also be the same performance problem you
> > described in your email: the same chunk will be read and discarded
> > many times.
> >
> > The way to deal with the performance problem is to adjust the access
> > pattern or use a chunk cache that holds as many chunks as possible for
> > the I/O operation. The HDF5 library doesn't know this a priori, and
> > that is why we left it to applications. At this point we don't see how
> > we can help except by educating our users.
> >
> > I am attaching a white paper that will be posted on our website; see
> > section 4. Comments are highly appreciated.
> >
> > Thank you!
> >
> > Elena
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > Elena Pourmal  The HDF Group  http://hdfgroup.org
> > 1800 So. Oak St., Suite 203, Champaign IL 61820
> > 217.531.6112
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

--
George N. White III <[email protected]>
Head of St. Margarets Bay, Nova Scotia
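A worked example of Elena's point 2 may help. Take a hypothetical 10000 x 8000 dataset of doubles stored in 100 x 100 chunks and read one row at a time: each row touches 80 chunks of ~80 KB each, so the default 1 MB cache holds only about 12 of them, and every chunk ends up being read and discarded roughly 100 times. Sizing the per-dataset cache to hold one full row of chunks avoids that. The calls below are the real HDF5 1.8+ API, but all names and sizes are made up for illustration:

    #include <hdf5.h>

    int main(void)
    {
        const size_t ncols = 8000, chunk_rows = 100, chunk_cols = 100;
        const size_t elem_size = sizeof(double);

        /* One row touches ceil(ncols / chunk_cols) chunks; all of them
         * must stay resident, or each is evicted and re-read
         * chunk_rows times while a block of rows is traversed. */
        size_t chunks_per_row = (ncols + chunk_cols - 1) / chunk_cols;  /* 80 */
        size_t chunk_bytes = chunk_rows * chunk_cols * elem_size;   /* ~80 KB */
        size_t cache_bytes = chunks_per_row * chunk_bytes;          /* ~6.4 MB */

        /* nslots should be a prime well above the number of cached
         * chunks to limit hash collisions; 8191 suits ~80 chunks. */
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        H5Pset_chunk_cache(dapl, 8191, cache_bytes, H5D_CHUNK_CACHE_W0_DEFAULT);

        hid_t file = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
        hid_t dset = H5Dopen2(file, "/images", dapl);  /* hypothetical names */

        /* ... row-by-row H5Dread loop goes here ... */

        H5Dclose(dset);
        H5Fclose(file);
        H5Pclose(dapl);
        return 0;
    }

Unlike H5Pset_cache, which sets one size for every dataset in the file, H5Pset_chunk_cache applies only to the dataset opened with this access property list, which is exactly the distinction Elena draws above.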
