Dear all,

we are using HDF5 in our collaboration to store large event data of neutrino interactions. The data itself has a very simple structure, but I still could not find an acceptable way to design the layout of the HDF5 file. It would be great if some HDF5 experts could give me a hint on how to optimise it.
The data I want to store are basically events, which are simply groups of hits. A hit is a small structure with the following fields:

Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)

As already mentioned, an event is simply a list of a few thousand hits, and the number of hits changes from event to event. I tried different approaches to store a few thousand events (thus a couple of million hits). The two structures which more or less work, but still show poor performance, are the following (minimal sketches of all approaches are attached at the end of this mail):

Approach #1: a single "table" holding all hits (basically one array per hit field), with an additional "column" (again, an array) storing the event_id each hit belongs to. This is of course nice if I want to run an analysis over the whole file, including all events, but it is slow when I want to iterate event by event, since I have to select the corresponding hits by looking at their event_ids. In PyTables or the Pandas framework this works via binary-search index trees, but it is still a bit slow.

Approach #2: a hierarchical structure which groups the hits by event. An event is then accessed by reading "/hits/event_id", e.g. "/hits/23", which holds a table similar to the one used in the first approach. To iterate through the events, I either create a list of nodes and walk over them, or I store the number of events as an attribute and simply count up. Accessing a specific event turns out to be only a tiny bit faster, which may be related to the fact that HDF5 stores the nodes in a B-tree, much like the Pandas index table.

The slowness is relative to a ROOT structure which we also use in parallel: the same basic event-by-event analysis runs almost an order of magnitude faster on a ROOT file. I also tried variable-length arrays, but I ran into compression issues. Another approach was a meta table which keeps track of the hit indices for faster lookup (also sketched below), but that felt awkward and not self-explanatory enough in my opinion.

So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?

Best regards,
Tamas
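PS: To make the approaches concrete, here are minimal h5py sketches. The placeholder data, the file names and the use of a single compound dtype (instead of one array per field, as we actually do it) are just for illustration. Approach #1, the flat hit table with an event_id column:

    import numpy as np
    import h5py

    # Hit fields plus the extra event_id column used for lookup.
    hit_dtype = np.dtype([
        ("event_id", np.int32),
        ("dom_id", np.int32),
        ("time", np.int32),
        ("tot", np.int16),
        ("triggered", np.bool_),
        ("pmt_id", np.int16),
    ])

    # Placeholder data: three tiny events with 5, 3 and 7 hits each.
    events = [np.zeros(n, dtype=hit_dtype) for n in (5, 3, 7)]
    for event_id, hits in enumerate(events):
        hits["event_id"] = event_id

    with h5py.File("events_flat.h5", "w") as f:
        f.create_dataset("hits", data=np.concatenate(events),
                         chunks=True, compression="gzip")

    # Event-by-event readout: select the hits via their event_id
    # (this selection is the slow part).
    with h5py.File("events_flat.h5", "r") as f:
        hits = f["hits"][:]
        one_event = hits[hits["event_id"] == 1]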
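Approach #2, one table per event under a common group:

    import numpy as np
    import h5py

    hit_dtype = np.dtype([
        ("dom_id", np.int32),
        ("time", np.int32),
        ("tot", np.int16),
        ("triggered", np.bool_),
        ("pmt_id", np.int16),
    ])

    # Placeholder data: three tiny events.
    events = [np.zeros(n, dtype=hit_dtype) for n in (5, 3, 7)]

    with h5py.File("events_grouped.h5", "w") as f:
        grp = f.create_group("hits")
        grp.attrs["n_events"] = len(events)
        for event_id, hits in enumerate(events):
            grp.create_dataset(str(event_id), data=hits)  # /hits/0, /hits/1, ...

    # Event-by-event readout: one node lookup and one read per event.
    with h5py.File("events_grouped.h5", "r") as f:
        grp = f["hits"]
        for event_id in range(grp.attrs["n_events"]):
            hits = grp[str(event_id)][:]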
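And this is roughly what I meant by the meta-table approach: the hits are stored contiguously, sorted by event, and a small index table maps each event to its slice, so a single contiguous read suffices per event:

    import numpy as np
    import h5py

    hit_dtype = np.dtype([
        ("dom_id", np.int32),
        ("time", np.int32),
        ("tot", np.int16),
        ("triggered", np.bool_),
        ("pmt_id", np.int16),
    ])

    # Placeholder data: three tiny events.
    events = [np.zeros(n, dtype=hit_dtype) for n in (5, 3, 7)]

    with h5py.File("events_indexed.h5", "w") as f:
        f.create_dataset("hits", data=np.concatenate(events))
        # Meta table: (start, count) of each event's hits in the flat table.
        counts = np.array([len(e) for e in events], dtype=np.int64)
        starts = np.concatenate(([0], np.cumsum(counts)[:-1]))
        index = np.rec.fromarrays([starts, counts], names="start,count")
        f.create_dataset("event_index", data=index)

    # Event-by-event readout: one contiguous slice, no searching.
    with h5py.File("events_indexed.h5", "r") as f:
        idx = f["event_index"][:]
        start, count = idx[1]["start"], idx[1]["count"]
        one_event = f["hits"][start:start + count]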
