Hello Tamas,

I use HDF5 to store streams of irregular time series (IRTS) from the financial sector. The events are organised per day into a dataset, where each dataset is a variable-length stream/vector with a custom datatype. The custom record type is created to increase density, and an iterator in C/C++ walks the event stream; the iterator is linked against Julia, R, and Python code. Because the custom datatype is saved into the file, it is readily accessible through HDFView.
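For illustration, a minimal sketch of such a compound record type in the HDF5 C API. The struct layout, the field names (event_id, asset, price), and the helper name make_event_type are assumptions for the example only; the actual schema is not shown in this post.

    #include <hdf5.h>
    #include <stdint.h>

    typedef struct {
        int64_t event_id;   /* event identifier                */
        int32_t asset;      /* assumed integer instrument code */
        double  price;      /* example payload field           */
    } event_t;

    hid_t make_event_type(void)
    {
        /* The compound type mirrors the C struct, and because it is
           written into the file, HDFView (and the Julia/R/Python
           bindings) can decode records without external metadata. */
        hid_t t = H5Tcreate(H5T_COMPOUND, sizeof(event_t));
        H5Tinsert(t, "event_id", HOFFSET(event_t, event_id), H5T_NATIVE_INT64);
        H5Tinsert(t, "asset",    HOFFSET(event_t, asset),    H5T_NATIVE_INT32);
        H5Tinsert(t, "price",    HOFFSET(event_t, price),    H5T_NATIVE_DOUBLE);
        return t;
    }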
The access pattern to this database is write once/read many, sequential, and I have had good results with it over the past 5 years. I use it in an MPI cluster environment from C++/Julia/Rcpp. The custom datatype in my case is [event id, asset, ....]. Note that optimising access both for reading all events sequentially and for reading only some events sequentially is a bi-objective problem, which you can mitigate by spending more space to gain time. As others pointed out, chunk size matters (see the sketch after the quoted message below).

hope it helps,
steve

On Thu, Mar 30, 2017 at 3:33 PM, Tamas Gal <[email protected]> wrote:

> Dear all,
>
> we are using HDF5 in our collaboration to store large event data of
> neutrino interactions. The data itself has a very simple structure, but I
> still could not find an acceptable way to design the structure of the
> HDF5 format. It would be great if some HDF5 experts could give me a hint
> on how to optimise it.
>
> The data I want to store are basically events, which are simply groups
> of hits. A hit is a simple structure with the following fields:
>
> Hit: dom_id (int32), time (int32), tot (int16), triggered (bool),
> pmt_id (int16)
>
> As already mentioned, an event is simply a list of a few thousand hits,
> and the number of hits changes from event to event.
>
> I tried different approaches to store the information of a few thousand
> events (thus a couple of million hits), and the final two structures
> which kind of work but still have poor performance are:
>
> Approach #1: a single "table" to store all hits (basically one array for
> each hit field) with an additional "column" (again, an array) to store
> the event_id they belong to.
>
> This is of course nice if I want to do analysis on the whole file,
> including all the events, but it is slow when I want to iterate through
> each event_id, since I need to select the corresponding hits by looking
> at the event_ids. In pytables or the pandas framework, this works using
> binary search index trees, but it's still a bit slow.
>
> Approach #2: using a hierarchical structure to group the events. An
> event can then be accessed by reading "/hits/event_id", like "/hits/23",
> which is a table similar to the one used in the first approach. To
> iterate through the events, I need to create a list of nodes and walk
> over them, or I store the number of events as an attribute and simply
> use an iterator. It seems that this is only a tiny bit faster for
> accessing a specific event, which may be related to the fact that HDF5
> stores the nodes in a b-tree, like pandas stores its index table.
>
> The slowness is relative to a ROOT structure which is also used in
> parallel. If I compare some basic event-by-event analysis, the same code
> run on a ROOT file is almost an order of magnitude faster.
>
> I also tried variable-length arrays, but I ran into compression issues.
> Some other approaches were creating meta tables to keep track of the
> indices of the hits for faster lookup, but this was kind of awkward and
> not self-explanatory enough in my opinion.
>
> So my question is: how would an experienced HDF5 user structure this
> simple data to maximise the performance of the event-by-event readout?
>
> Best regards,
> Tamas
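As promised above, a minimal sketch of a per-day, appendable, chunked dataset using the compound type from the earlier sketch. The chunk size of 4096 records and the helper name make_day_dataset are placeholders to be tuned, not the setup actually used.

    #include <hdf5.h>

    hid_t make_day_dataset(hid_t file, const char *name, hid_t event_type)
    {
        hsize_t dims[1]    = {0};               /* start with zero records  */
        hsize_t maxdims[1] = {H5S_UNLIMITED};   /* allow appending          */
        hsize_t chunk[1]   = {4096};            /* records per chunk        */

        hid_t space = H5Screate_simple(1, dims, maxdims);
        hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 1, chunk);           /* chunking is required for
                                                   an extensible dataset    */

        hid_t dset = H5Dcreate2(file, name, event_type, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);
        H5Pclose(dcpl);
        H5Sclose(space);
        return dset;
    }

Appending then amounts to H5Dset_extent followed by a hyperslab write. Roughly speaking, larger chunks favour the full sequential scan while smaller chunks favour selective per-event reads, which is the space/time trade-off mentioned above.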
