Dear Tamas,

> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
It is a pleasure to see some HEP people here.

> The slowness is compared to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.

If I remember correctly, ROOT can only read in parallel, not write. Does that matter for you?

> Approach #2: using a hierarchical structure to store the events to group
> them. The events can then be accessed by reading "/hits/event_id", like
> "/hits/23", which is a similar table used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a b-tree, like
> pandas the index table.

This approach would create a large number of datasets (one per id), which is, from my experience, a bad idea in HDF5.

I would use Approach #1 and store all your events in a "column" fashion, similar to what ROOT does. For the fast querying problem, you can post-process your file and add a separate column acting as an ordered index / associative array with a layout of the type "event_id" -> "row range".

Best Regards,
Adrien

On 30.03.17 at 21:33, Tamas Gal wrote:
> Dear all,
>
> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
>
> The data I want to store are basically events, which are simply groups of
> hits.
> A hit is a simple structure with the following fields:
>
> Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id
> (int16)
>
> As already mentioned, an event is simply a list of a few thousand hits and
> the number of hits changes from event to event.
>
> I tried different approaches to store information of a few thousand events
> (thus a couple of million hits) and the final two structures which kind of
> work but still have poor performance are:
>
> Approach #1: a single "table" to store all hits (basically one array for each
> hit-field) with an additional "column" (again, an array) to store the
> event_id they belong to.
>
> This is of course nice if I want to do analysis on the whole file, including
> all the events, but is slow when I want to iterate through each event_id,
> since I need to select the corresponding hits by looking at the event_ids. In
> pytables or the Pandas framework, this works using binary search index trees,
> but it's still a bit slow.
>
> Approach #2: using a hierarchical structure to store the events to group
> them. The events can then be accessed by reading "/hits/event_id", like
> "/hits/23", which is a similar table used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a b-tree, like
> pandas the index table.
>
> The slowness is compared to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.
>
> I also tried variable length arrays but I ran into compression issues.
> Some other approaches were creating meta tables to keep track of the indices
> of the hits for faster lookup, but this was kind of awkward and not
> self-explanatory enough in my opinion.
>
> So my question is: how would an experienced HDF5 user structure this simple
> data to maximise the performance of the event-by-event readout?
>
> Best regards,
> Tamas
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5
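[A minimal sketch of Adrien's suggestion — Approach #1 with a post-processed "event_id" -> row-range index — assuming h5py and NumPy. The file name, group layout ("/hits", "/index"), and data sizes are made up for illustration; the real schema would use the collaboration's own names.]

```python
import h5py
import numpy as np

N_HITS = 100_000   # hypothetical sizes for illustration
N_EVENTS = 1_000

rng = np.random.default_rng(0)
# Sort by event_id so each event occupies a contiguous row range.
event_id = np.sort(rng.integers(0, N_EVENTS, N_HITS).astype(np.int32))

with h5py.File("hits.h5", "w") as f:
    hits = f.create_group("hits")
    # Approach #1: one "column" (dataset) per hit field.
    hits.create_dataset("event_id", data=event_id)
    hits.create_dataset("dom_id", data=rng.integers(0, 600, N_HITS).astype(np.int32))
    hits.create_dataset("time", data=rng.integers(0, 10**6, N_HITS).astype(np.int32))
    hits.create_dataset("tot", data=rng.integers(0, 256, N_HITS).astype(np.int16))
    hits.create_dataset("triggered", data=rng.integers(0, 2, N_HITS).astype(bool))
    hits.create_dataset("pmt_id", data=rng.integers(0, 31, N_HITS).astype(np.int16))

    # Post-processing step: build the ordered index "event_id" -> row range.
    # event_id is sorted, so the first occurrence of each id is its start row.
    ids, starts, counts = np.unique(event_id, return_index=True, return_counts=True)
    index = f.create_group("index")
    index.create_dataset("event_id", data=ids)
    index.create_dataset("start", data=starts.astype(np.int64))
    index.create_dataset("stop", data=(starts + counts).astype(np.int64))

# Event-by-event readout: one binary search in the small index, then one
# contiguous slice per column — no walk over thousands of per-event nodes.
with h5py.File("hits.h5", "r") as f:
    ids = f["index/event_id"][:]
    starts = f["index/start"][:]
    stops = f["index/stop"][:]
    pos = np.searchsorted(ids, 23)   # look up event 23
    lo, hi = int(starts[pos]), int(stops[pos])
    times = f["hits/time"][lo:hi]    # contiguous read of one event's hits
```

[Since the index datasets are tiny (one entry per event), they can be read into memory once and reused while iterating over all events.]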
