Dear all,

We are using HDF5 in our collaboration to store large amounts of event data 
from neutrino interactions. The data itself has a very simple structure, but I 
still could not find an acceptable way to design the HDF5 layout. It would be 
great if some HDF5 experts could give me a hint on how to optimise it.

The data I want to store are basically events, which are simply groups of hits. 
A hit is a simple structure with the following fields:

Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)

As already mentioned, an event is simply a list of a few thousand hits, and 
the number of hits varies from event to event.
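To make the hit layout concrete, here is a minimal sketch of the record as a 
NumPy structured dtype (the field names and widths are taken from the 
description above; the sample values are made up):

```python
import numpy as np

# Structured dtype mirroring the Hit fields described above.
hit_dtype = np.dtype([
    ("dom_id", np.int32),
    ("time", np.int32),
    ("tot", np.int16),
    ("triggered", np.bool_),
    ("pmt_id", np.int16),
])

# A toy "event": a short, variable-length list of hits.
event = np.zeros(3, dtype=hit_dtype)
event["dom_id"] = [101, 102, 103]
event["triggered"] = [True, False, True]
```

Such a structured array maps directly onto an HDF5 compound-type dataset (or 
onto one plain dataset per field).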

I tried different approaches to store the information of a few thousand events 
(thus a couple of million hits). The final two structures, which kind of work 
but still perform poorly, are:

Approach #1: a single "table" to store all hits (basically one array for each 
hit-field) with an additional "column" (again, an array) to store the event_id 
they belong to.

This is of course nice if I want to run an analysis on the whole file, 
including all events, but it is slow when I want to iterate event by event, 
since I need to select the corresponding hits by their event_id. In pytables 
or the Pandas framework this works using binary search index trees, but it's 
still a bit slow.
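A minimal in-memory sketch of this first approach (column names and values are 
illustrative; in the file, each column would be its own HDF5 dataset, or one 
compound dataset):

```python
import numpy as np

# Flat "table": one column per hit field, plus the event_id column.
event_id = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2], dtype=np.int32)
time = np.arange(10, dtype=np.int32)  # stand-in for the real hit times

# Selecting one event means scanning (or index-searching) the whole
# event_id column -- this is the per-event cost of the flat layout.
mask = event_id == 1
hits_of_event_1 = time[mask]
```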

Approach #2: using a hierarchical structure to group the events. Each event 
can then be accessed by reading "/hits/event_id" (e.g. "/hits/23"), which is a 
table like the one used in the first approach. To iterate through the events, 
I either create a list of nodes and walk over them, or I store the number of 
events as an attribute and simply use an iterator.
It is only a tiny bit faster to access a specific event, which may be related 
to the fact that HDF5 stores its nodes in a B-tree, much like pandas stores 
its index table.
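For reference, here is a minimal h5py sketch of this second layout, using the 
attribute-based iteration (paths follow the "/hits/<event_id>" scheme above; 
the data and the `n_events` attribute name are made up for illustration):

```python
import os
import tempfile

import h5py
import numpy as np

# One dataset per event under /hits/<event_id>.
path = os.path.join(tempfile.mkdtemp(), "events.h5")
with h5py.File(path, "w") as f:
    hits = f.create_group("hits")
    for event_id in range(3):
        times = np.arange(event_id + 2, dtype=np.int32)  # fake per-event hits
        hits.create_dataset(str(event_id), data=times)
    hits.attrs["n_events"] = 3  # illustrative attribute for iteration

# Iterating event by event via the stored event count.
with h5py.File(path, "r") as f:
    n = int(f["hits"].attrs["n_events"])
    sizes = [f["hits"][str(i)].shape[0] for i in range(n)]
```

With thousands of events, each read pays the per-node lookup and dataset-open 
overhead, which is why this is barely faster than the flat table.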

The slowness is relative to a ROOT structure which we also use in parallel: 
the same basic event-by-event analysis code runs almost an order of magnitude 
faster on a ROOT file.

I also tried variable-length arrays, but I ran into compression issues. 
Another approach was creating meta tables to keep track of the hit indices for 
faster lookup, but this was kind of awkward and not self-explanatory enough in 
my opinion.
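For completeness, here is roughly what I mean by a meta table, in miniature 
(array and function names are illustrative): per event, store the start offset 
of its hits in the flat table, so an event becomes a contiguous slice instead 
of a search over the event_id column.

```python
import numpy as np

# Flat hit column plus a per-event hit count (the "meta table").
hits_time = np.arange(10, dtype=np.int32)  # stand-in for the real hit times
n_hits_per_event = np.array([3, 2, 5])     # one entry per event
offsets = np.concatenate(([0], np.cumsum(n_hits_per_event)))

def hits_of_event(i):
    # A contiguous slice maps directly onto a fast HDF5 hyperslab read,
    # with no per-event search required.
    return hits_time[offsets[i]:offsets[i + 1]]
```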

So my question is: how would an experienced HDF5 user structure this simple 
data to maximise the performance of the event-by-event readout?

Best regards,
Tamas
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5