Hi Neil,

just reviving this old thread to see if there has been any progress on this feature. Do you have an update on its status for the upcoming 1.8.8 release?

Thanks,
Andy

Neil Fortner wrote on 2011-03-21:
> Andy,
>
> On 03/20/2011 02:28 AM, Salnikov, Andrei A. wrote:
>> Neil Fortner wrote on 2011-03-14:
>>> Andy,
>>>
>>> On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:
>>>> Quincey Koziol wrote on 2011-03-10:
>>>>> Hi Andy,
>>>>>
>>>>> On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
>>>>>
>>>>>> Quincey Koziol wrote on 2011-03-09:
>>>>>>> Hi Andy,
>>>>>>>
>>>>>>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I'm trying to understand a performance hit that we are
>>>>>>>> experiencing trying to examine the tree structure of
>>>>>>>> our HDF5 files. Originally we observed problem when
>>>>>>>> using h5py but it could be reproduced even with h5ls
>>>>>>>> command. I tracked it down to a significant delay in
>>>>>>>> the call to H5Oget_info_by_name function on a dataset
>>>>>>>> with a large number of chunks. It looks like when the
>>>>>>>> number of chunks in dataset increases (in our case
>>>>>>>> we have 1-10k chunks) the performance of the H5Oget_info
>>>>>>>> drops significantly. Looking at the IO statistics it
>>>>>>>> seems that HDF5 library does very many small IO operations
>>>>>>>> in this case. There is very little CPU spent, but real
>>>>>>>> time is measured in tens of seconds.
>>>>>>>>
>>>>>>>> Is this an expected behavior? Can it be improved somehow
>>>>>>>> without reducing the number of chunks drastically?
>>>>>>>>
>>>>>>>> One more comment about H5Oget_info - it returns a
>>>>>>>> structure that contains a lot of different info.
>>>>>>>> In the case of h5py code the only member of the
>>>>>>>> structure used in the code is "type". could there be
>>>>>>>> more efficient way to determine just the type of the
>>>>>>>> object without requiring every other piece of info?
>>>>>>> Ah, yes, we've noticed that in some of the applications we've
>>>>>>> worked with also (including some of the main HDF5 tools, like
>>>>>>> h5ls, etc).
>>>>>>> As you say, H5Oget_info() is fairly heavyweight,
>>>>>>> getting all sorts of information about each object. I do think a
>>>>>>> lighter-weight call like "H5Oget_type" would be useful. Is there
>>>>>>> other "lightweight" information that people would like back for
>>>>>>> each object?
>>>>>>>
>>>>>>> Quincey
>>>>>>>
>>>>>> Hi Quincey,
>>>>>>
>>>>>> thanks for confirming this. Could you explain briefly what is
>>>>>> going on there and which part of H5O_info_t needs so many reads?
>>>>> The H5Oget_info() call is gathering information about the amount of
>>>>> space that the metadata for the dataset is using. When there's a
>>>>> large B-tree for indexing the chunks, that can take a fair bit of
>>>>> time to walk the B-tree.
>>>>>
>>>>>> Maybe removing heavyweight info from H5O_info_t is the right
>>>>>> thing to do, or creating another version of H5O_info_t structure
>>>>>> which has only light-weight info?
>>>>> I'm leaning toward another light-weight version. I'm asking the
>>>>> HDF5 community to help me decide what goes into that structure
>>>>> besides the object type.
>>>>>
>>>> Hi Quincey,
>>>>
>>>> is there a chance we can get this new version in the next release?
>>> We actually already have an experimental branch with a similar feature
>>> mostly implemented. It allows you to specify the fields you want
>>> filled in by H5Oget_info.
>>> The branch can be found at:
>>>
>>> http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/
>>>
>>> The new functions are:
>>>
>>> herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
>>> herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name,
>>>     H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);
>>>
>>> The "fields" parameter can contain the following bitflags (combined
>>> with "|"):
>>>
>>> H5O_INFO_TIME
>>> H5O_INFO_NUM_ATTRS
>>> H5O_INFO_HDR
>>> H5O_INFO_META_SIZE
>>> H5O_INFO_ALL (==H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
>>>     H5O_INFO_META_SIZE)
>>>
>>> Passing these flags tells the library to fill in the corresponding
>>> fields in oinfo. Other fields are always filled in because there is
>>> no performance penalty. In your case, since you only need the type,
>>> you can just pass "0". h5ls has also been modified to use these, so
>>> it should be faster.
>>>
>>> Of course, this is experimental code and should not be used in
>>> production, but if you're curious how much a lightweight H5Oget_info
>>> would help your performance you're welcome to try it. If you do, we'd
>>> love to hear about your results, and also your thoughts on the
>>> interface. For maximum performance, you should configure the library
>>> with "--enable-production" (for this branch, not necessary for
>>> releases).
>>>
>>> Thanks,
>>> -Neil
>>>
>> Hi Neil,
>>
>> I managed to build this branch and test it. It has indeed improved
>> performance dramatically. As you suggest I only use zero value for the
>> fields argument, other values have not been included in my test. With
>> that value and checking only the "type" field in H5O_info_t it runs
>> much faster than previous version. 'h5ls' also works better on our files.
>>
>> What I find interesting is a missing version for H5Oget_info_by_idx
>> which would take "fields" argument. Is this function so much different
>> from H5Oget_info and H5Oget_info_by_name so it cannot be optimized?
>>
>> Even without H5Oget_info_by_idx2 I'd be happy to see this branch
>> included into next release.
>
> Glad to hear it improved your performance! It would be easy to add
> H5Oget_info_by_idx2, we just didn't do that because we only did the
> minimum needed to test the performance in the case we were looking at,
> and stopped after reaching that point. We shelved the work because it
> didn't make a huge difference in the case we were looking at, but with
> your report I will look into getting it scheduled sooner rather than
> later. There is a chance we may change the interface to something like
> what Quincey suggested. Thanks for taking the time to test this!
>
> -Neil
>
>> Cheers,
>> Andy
>>
>>
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> Hdf-forum@hdfgroup.org
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
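
For anyone following the thread, here is a rough sketch of the "fields" bitmask idea discussed above. It is a toy model in plain C and not HDF5 code: the names object_t, obj_info_t, get_info, walk_chunk_index, g_reads and the INFO_* flag values are all hypothetical, and the "one small read per chunk" cost is only an assumption made to mimic the B-tree walk Quincey describes. It just shows why passing 0 can skip the expensive part while the object type is still returned.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flags modelled on the H5O_INFO_* names quoted in the thread.
 * The values are invented for this sketch; they are NOT the HDF5 definitions. */
#define INFO_TIME      0x01u
#define INFO_NUM_ATTRS 0x02u
#define INFO_HDR       0x04u
#define INFO_META_SIZE 0x08u
#define INFO_ALL (INFO_TIME | INFO_NUM_ATTRS | INFO_HDR | INFO_META_SIZE)

typedef enum { OBJ_GROUP, OBJ_DATASET } obj_type_t;

/* Toy stand-in for an object stored in a file. */
typedef struct {
    obj_type_t type;
    unsigned   num_attrs;
    size_t     num_chunks;  /* assumed: one index node per chunk */
} object_t;

/* Toy stand-in for H5O_info_t, reduced to three members. */
typedef struct {
    obj_type_t type;        /* always filled in, it costs nothing */
    unsigned   num_attrs;   /* filled in only with INFO_NUM_ATTRS */
    size_t     meta_size;   /* filled in only with INFO_META_SIZE */
} obj_info_t;

/* Counts simulated small reads so the cost of a call is visible. */
unsigned long g_reads = 0;

/* Simulates walking the chunk index: one small read per chunk. */
size_t walk_chunk_index(const object_t *obj)
{
    size_t total = 0;
    for (size_t i = 0; i < obj->num_chunks; i++) {
        g_reads++;
        total += 32;  /* arbitrary per-node size for the toy model */
    }
    return total;
}

/* Fills in only the members selected by the "fields" bitmask.
 * Returns 0 on success, mirroring the herr_t convention loosely. */
int get_info(const object_t *obj, obj_info_t *out, unsigned fields)
{
    assert(obj != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    out->type = obj->type;                       /* cheap, always set */
    if (fields & INFO_NUM_ATTRS)
        out->num_attrs = obj->num_attrs;
    if (fields & INFO_META_SIZE)
        out->meta_size = walk_chunk_index(obj);  /* the expensive part */
    return 0;
}
```

In this model, calling get_info(&ds, &info, 0) on a dataset with 10000 chunks returns the type without touching the chunk index, while INFO_ALL triggers one simulated read per chunk. Whether the real library behaves the same way depends on the branch and release in question.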