Hi Neil,

just reviving this old thread to see if there has been any progress
on this feature. Do you have an update on its status for
the upcoming 1.8.8 release?

Thanks,
Andy


Neil Fortner wrote on 2011-03-21:
> Andy,
> 
> On 03/20/2011 02:28 AM, Salnikov, Andrei A. wrote:
>> Neil Fortner wrote on 2011-03-14:
>>> Andy,
>>> 
>>> On 03/11/2011 06:48 PM, Salnikov, Andrei A. wrote:
>>>> Quincey Koziol wrote on 2011-03-10:
>>>>> Hi Andy,
>>>>> 
>>>>> On Mar 9, 2011, at 11:15 AM, Salnikov, Andrei A. wrote:
>>>>> 
>>>>>> Quincey Koziol wrote on 2011-03-09:
>>>>>>> Hi Andy,
>>>>>>> 
>>>>>>> On Mar 8, 2011, at 7:09 PM, Salnikov, Andrei A. wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I'm trying to understand a performance hit that we are
>>>>>>>> experiencing trying to examine the tree structure of
>>>>>>>> our HDF5 files. Originally we observed the problem when
>>>>>>>> using h5py, but it can be reproduced even with the h5ls
>>>>>>>> command. I tracked it down to a significant delay in
>>>>>>>> the call to the H5Oget_info_by_name function on a dataset
>>>>>>>> with a large number of chunks. It looks like when the
>>>>>>>> number of chunks in a dataset increases (in our case
>>>>>>>> we have 1-10k chunks) the performance of H5Oget_info
>>>>>>>> drops significantly. Looking at the I/O statistics, it
>>>>>>>> seems that the HDF5 library performs very many small I/O
>>>>>>>> operations in this case. Very little CPU time is spent, but
>>>>>>>> real time is measured in tens of seconds.
>>>>>>>> 
>>>>>>>> Is this expected behavior? Can it be improved somehow
>>>>>>>> without drastically reducing the number of chunks?
>>>>>>>> 
>>>>>>>> One more comment about H5Oget_info: it returns a
>>>>>>>> structure that contains a lot of different info.
>>>>>>>> In the h5py code, the only member of the
>>>>>>>> structure that is used is "type". Could there be
>>>>>>>> a more efficient way to determine just the type of the
>>>>>>>> object without requiring every other piece of info?
>>>>>>>         Ah, yes, we've noticed that in some of the applications we've
>>>>>>> worked with also (including some of the main HDF5 tools, like
>>>>>>> h5ls, etc). As you say, H5Oget_info() is fairly heavyweight,
>>>>>>> getting all sorts of information about each object.  I do think a
>>>>>>> lighter-weight call like "H5Oget_type" would be useful.  Is there
>>>>>>> other "lightweight" information that people would like back for
>>>>>>> each object?
>>>>>>> 
>>>>>>>         Quincey
>>>>>>> 
>>>>>> Hi Quincey,
>>>>>> 
>>>>>> thanks for confirming this. Could you explain briefly what is
>>>>>> going on there and which part of H5O_info_t needs so many reads?
>>>>>   The H5Oget_info() call is gathering information about the amount of
>>>>> space that the metadata for the dataset is using.  When there's a
>>>>> large B-tree for indexing the chunks, that can take a fair bit of
>>>>> time to walk the B-tree.
>>>>> 
>>>>>>    Maybe removing heavyweight info from H5O_info_t is the right
>>>>>> thing to do, or creating another version of H5O_info_t structure
>>>>>> which has only light-weight info?
>>>>>   I'm leaning toward another light-weight version.  I'm asking the
>>>>> HDF5 community to help me decide what goes into that structure
>>>>> besides the object type.
>>>>> 
>>>> Hi Quincey,
>>>> 
>>>> is there a chance we can get this new version in the next release?
>>> We actually already have an experimental branch with a similar feature
>>> mostly implemented.  It allows you to specify the fields you want
>>> filled in by H5Oget_info.  The branch can be found at:
>>> 
>>> http://svn.hdfgroup.uiuc.edu/hdf5/branches/h5oget_info_by_field/
>>> 
>>> The new functions are:
>>> 
>>> herr_t H5Oget_info2(hid_t loc_id, H5O_info_t *oinfo, unsigned fields);
>>> herr_t H5Oget_info_by_name2(hid_t loc_id, const char *name,
>>>     H5O_info_t *oinfo, unsigned fields, hid_t lapl_id);
>>> 
>>> The "fields" parameter can contain the following bitflags (combined
>>> with "|"):
>>> 
>>> H5O_INFO_TIME
>>> H5O_INFO_NUM_ATTRS
>>> H5O_INFO_HDR
>>> H5O_INFO_META_SIZE
>>> H5O_INFO_ALL (== H5O_INFO_TIME | H5O_INFO_NUM_ATTRS | H5O_INFO_HDR |
>>>               H5O_INFO_META_SIZE)
>>> 
>>> Passing these flags tells the library to fill in the corresponding
>>> fields in oinfo.  Other fields are always filled in because there is
>>> no performance penalty.  In your case, since you only need the type,
>>> you can just pass "0".  h5ls has also been modified to use these, so
>>> it should be faster.
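
To illustrate the flag-selected fields described above, here is a minimal
standalone C sketch. It does not call HDF5 and is not the code from the
branch: the INFO_* flags, obj_info_t, mock_object_t, walk_index and
get_info_fields are hypothetical names made up for illustration, loosely
modeled on the H5O_INFO_* flags quoted above. It only shows why passing 0
can skip the index walk that makes the metadata size expensive to compute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flags, modeled on the quoted H5O_INFO_* names.
 * These are NOT the HDF5 definitions. */
enum {
    INFO_NUM_ATTRS = 0x1,
    INFO_META_SIZE = 0x2,
    INFO_ALL       = INFO_NUM_ATTRS | INFO_META_SIZE
};

typedef enum { OBJ_GROUP, OBJ_DATASET } obj_type_t;

/* Result structure: "type" is cheap and always filled in. */
typedef struct {
    obj_type_t    type;
    unsigned      num_attrs;  /* valid only if INFO_NUM_ATTRS was requested */
    unsigned long meta_size;  /* valid only if INFO_META_SIZE was requested */
} obj_info_t;

/* Mock object standing in for a chunked dataset and its chunk index. */
typedef struct {
    obj_type_t    type;
    unsigned      num_attrs;
    unsigned      n_index_nodes;  /* nodes in the (mock) chunk index */
    unsigned long node_size;      /* bytes per index node */
} mock_object_t;

/* Counts simulated small reads, one per index node visited. */
static unsigned long nodes_visited = 0;

/* Simulates walking the whole chunk index to sum up its size. */
static unsigned long walk_index(const mock_object_t *obj)
{
    unsigned long total = 0;
    for (unsigned i = 0; i < obj->n_index_nodes; i++) {
        nodes_visited++;
        total += obj->node_size;
    }
    return total;
}

/* Fills in only the fields selected by the "fields" bitmask. */
int get_info_fields(const mock_object_t *obj, obj_info_t *out, unsigned fields)
{
    assert(obj != NULL && out != NULL);

    out->type      = obj->type;  /* no extra cost, always returned */
    out->num_attrs = 0;
    out->meta_size = 0;

    if (fields & INFO_NUM_ATTRS)
        out->num_attrs = obj->num_attrs;
    if (fields & INFO_META_SIZE)
        out->meta_size = walk_index(obj);  /* the expensive part */

    return 0;
}
```

With fields == 0 the mock never enters walk_index, so nodes_visited stays
unchanged; with INFO_META_SIZE set it visits every index node once.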
>>> 
>>> Of course, this is experimental code and should not be used in
>>> production, but if you're curious how much a lightweight H5Oget_info
>>> would help your performance you're welcome to try it.  If you do, we'd
>>> love to hear about your results, and also your thoughts on the
>>> interface.  For maximum performance, you should configure the library
>>> with "--enable-production" (for this branch, not necessary for
>>> releases).
>>> 
>>> Thanks,
>>> -Neil
>>> 
>> Hi Neil,
>> 
>> I managed to build this branch and test it. It has indeed improved
>> performance dramatically. As you suggested, I only used zero for the
>> fields argument; other values were not included in my test. With
>> that value and checking only the "type" field in H5O_info_t it runs
>> much faster than the previous version. 'h5ls' also works better on
>> our files.
>> 
>> One thing I noticed is that there is no version of H5Oget_info_by_idx
>> that takes a "fields" argument. Is this function so different
>> from H5Oget_info and H5Oget_info_by_name that it cannot be optimized?
>> 
>> Even without H5Oget_info_by_idx2 I'd be happy to see this branch
>> included in the next release.
> 
> Glad to hear it improved your performance!  It would be easy to add
> H5Oget_info_by_idx2; we just didn't do that because we only did the
> minimum needed to test the performance in the case we were looking at,
> and stopped after reaching that point.  We shelved the work because it
> didn't make a huge difference in the case we were looking at, but with
> your report I will look into getting it scheduled sooner rather than
> later.  There is a chance we may change the interface to something like
> what Quincey suggested.  Thanks for taking the time to test this!
> 
> -Neil
> 
>> Cheers,
>> Andy
>> 
>> 
>> _______________________________________________
>> Hdf-forum is for HDF software users discussion.
>> Hdf-forum@hdfgroup.org
>> http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org
> 