Dear Tamas,

> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
It is a pleasure to see some HEP people here.

> The slowness is compared to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.

If I remember correctly, ROOT can only read in parallel, not write. Does that matter for you?

> Approach #2: using a hierarchical structure to store the events to group
> them. The events can then be accessed by reading "/hits/event_id", like
> "/hits/23", which is a similar table used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a b-tree, like
> pandas the index table.

This approach would create a large number of datasets (one per id), which is, from my experience, a bad idea in HDF5.

I would use Approach #1 and store all your events in a "column" fashion, similar to what ROOT does. For the fast querying problem, you can post-process your file and add a separate column acting as an ordered index / associative array with a layout of the type "event_id" -> "row range".

Best Regards,
Adrien

On 30.03.17 at 21:33, Tamas Gal wrote:
> Dear all,
>
> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
>
> The data I want to store are basically events, which are simply groups of
> hits.
> A hit is a simple structure with the following fields:
>
> Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id
> (int16)
>
> As already mentioned, an event is simply a list of a few thousand hits and
> the number of hits changes from event to event.
>
> I tried different approaches to store information of a few thousand events
> (thus a couple of million hits) and the final two structures which kind of
> work but still have poor performance are:
>
> Approach #1: a single "table" to store all hits (basically one array for each
> hit-field) with an additional "column" (again, an array) to store the
> event_id they belong to.
>
> This is of course nice if I want to do analysis on the whole file, including
> all the events, but is slow when I want to iterate through each event_id,
> since I need to select the corresponding hits by looking at the event_ids. In
> pytables or the Pandas framework, this works using binary search index trees,
> but it's still a bit slow.
>
> Approach #2: using a hierarchical structure to store the events to group
> them. The events can then be accessed by reading "/hits/event_id", like
> "/hits/23", which is a similar table used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a b-tree, like
> pandas the index table.
>
> The slowness is compared to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.
>
> I also tried variable length arrays but I ran into compression issues.
> Some other approaches were creating meta tables to keep track of the indices
> of the hits for faster lookup, but this was kind of awkward and not
> self-explanatory enough in my opinion.
>
> So my question is: how would an experienced HDF5 user structure this simple
> data to maximise the performance of the event-by-event readout?
>
> Best regards,
> Tamas
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5
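[A minimal sketch of Adrien's suggestion — Approach #1 with a post-processed "event_id" -> row-range index — assuming h5py and NumPy. The file name, group layout ("/hits", "/index"), and data sizes are made up for illustration; the real schema would use the collaboration's own names.]

```python
import h5py
import numpy as np

N_HITS = 100_000   # hypothetical sizes for illustration
N_EVENTS = 1_000

rng = np.random.default_rng(0)
# Sort by event_id so each event occupies a contiguous row range.
event_id = np.sort(rng.integers(0, N_EVENTS, N_HITS).astype(np.int32))

with h5py.File("hits.h5", "w") as f:
    hits = f.create_group("hits")
    # Approach #1: one "column" (dataset) per hit field.
    hits.create_dataset("event_id", data=event_id)
    hits.create_dataset("dom_id", data=rng.integers(0, 600, N_HITS).astype(np.int32))
    hits.create_dataset("time", data=rng.integers(0, 10**6, N_HITS).astype(np.int32))
    hits.create_dataset("tot", data=rng.integers(0, 256, N_HITS).astype(np.int16))
    hits.create_dataset("triggered", data=rng.integers(0, 2, N_HITS).astype(bool))
    hits.create_dataset("pmt_id", data=rng.integers(0, 31, N_HITS).astype(np.int16))

    # Post-processing step: build the ordered index "event_id" -> row range.
    # event_id is sorted, so the first occurrence of each id is its start row.
    ids, starts, counts = np.unique(event_id, return_index=True, return_counts=True)
    index = f.create_group("index")
    index.create_dataset("event_id", data=ids)
    index.create_dataset("start", data=starts.astype(np.int64))
    index.create_dataset("stop", data=(starts + counts).astype(np.int64))

# Event-by-event readout: one binary search in the small index, then one
# contiguous slice per column — no walk over thousands of per-event nodes.
with h5py.File("hits.h5", "r") as f:
    ids = f["index/event_id"][:]
    starts = f["index/start"][:]
    stops = f["index/stop"][:]
    pos = np.searchsorted(ids, 23)   # look up event 23
    lo, hi = int(starts[pos]), int(stops[pos])
    times = f["hits/time"][lo:hi]    # contiguous read of one event's hits
```

[Since the index datasets are tiny (one entry per event), they can be read into memory once and reused while iterating over all events.]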
