On 30.03.2017 22:33, Tamas Gal writes:
Dear all,
we are using HDF5 in our collaboration to store large event data of neutrino
interactions. The data itself has a very simple structure, but I still could
not find an acceptable way to design the structure of the HDF5 format. It
would be great if some HDF5 experts could give me a hint on how to optimise it.
The data I want to store are basically events, which are simply groups of hits.
A hit is a simple structure with the following fields:
Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
As already mentioned, an event is simply a list of a few thousand hits, and
the number of hits varies from event to event.
I tried different approaches to store the information of a few thousand events
(thus a couple of million hits), and the final two structures which kind of
work but still have poor performance are:
Approach #1: a single "table" to store all hits (basically one array for each
hit field) with an additional "column" (again, an array) to store the event_id
each hit belongs to. This is of course nice if I want to do analysis on the
whole file, including all the events, but it is slow when I want to iterate
event by event, since I need to select the corresponding hits by looking at
the event_ids. In pytables or the Pandas framework this works using binary
search index trees, but it's still a bit slow.
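
A minimal PyTables sketch of this layout (simplified; the file name, variable
names, and the create_index() step are just for illustration):

    import tables as tb

    class Hit(tb.IsDescription):
        event_id = tb.Int32Col()   # extra "column" linking each hit to its event
        dom_id = tb.Int32Col()
        time = tb.Int32Col()
        tot = tb.Int16Col()
        triggered = tb.BoolCol()
        pmt_id = tb.Int16Col()

    with tb.open_file("events.h5", "w") as f:
        hits = f.create_table("/", "hits", Hit, "all hits")
        # ... append a few million hit rows here ...
        hits.cols.event_id.create_index()  # binary-search index on event_id

    with tb.open_file("events.h5", "r") as f:
        # selecting one event still means an indexed lookup over all hits
        event = f.root.hits.read_where("event_id == 23")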
Approach #2: using a hierarchical structure to group the hits by event. An
event can then be accessed by reading "/hits/event_id", like "/hits/23", which
is a table similar to the one used in the first approach. To iterate through
the events, I need to create a list of nodes and walk over them, or I store
the number of events as an attribute and simply use an iterator. It seems that
accessing a specific event is only a tiny bit faster, which may be related to
the fact that HDF5 stores the nodes in a b-tree, just like pandas stores its
index table.
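
A sketch of this variant, again with made-up names (events is assumed to be a
list of NumPy record arrays, one per event):

    import tables as tb

    with tb.open_file("events.h5", "w") as f:
        group = f.create_group("/", "hits")
        for event_id, hits in enumerate(events):
            # one table per event, e.g. /hits/23 (PyTables warns about
            # purely numeric node names but accepts them)
            f.create_table(group, str(event_id), obj=hits)
        group._v_attrs.n_events = len(events)

    with tb.open_file("events.h5", "r") as f:
        # walking over the child nodes costs a b-tree lookup per event
        for node in f.root.hits:
            hits = node.read()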
The slowness is relative to a ROOT structure which we also use in parallel: if
I compare some basic event-by-event analysis, the same code run on a ROOT file
is almost an order of magnitude faster.
Hello Tamas!
My experience suggests that simply indexing the data is not enough to
achieve top performance. The actual layout of the information on disk
(the primary index) should be well suited to your typical queries. For
example, if you need to query by event_id, all values with the same
event_id have to be located close together to minimize the number of disk
seeks.
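
To illustrate (a hypothetical sketch, not code from your setup): if the hit
table is sorted by event_id once, every event becomes a contiguous row range,
and reading it is a single slice instead of an indexed selection:

    import numpy as np
    import tables as tb

    # hits_array: a NumPy record array of all hits (assumed to exist)
    order = np.argsort(hits_array["event_id"], kind="stable")
    sorted_hits = hits_array[order]
    # first row of each event; searchsorted works because the column is sorted
    ids = np.unique(sorted_hits["event_id"])
    offsets = np.searchsorted(sorted_hits["event_id"], ids)

    with tb.open_file("events.h5", "w") as f:
        f.create_table("/", "hits", obj=sorted_hits)
        f.create_array("/", "offsets", offsets)

    with tb.open_file("events.h5", "r") as f:
        offsets = f.root.offsets.read()
        i = 23  # event number 23, assuming event_ids run 0..N-1
        stop = offsets[i + 1] if i + 1 < len(offsets) else f.root.hits.nrows
        event = f.root.hits[offsets[i]:stop]  # one contiguous read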
If you have several types of typical queries, it might be worth duplicating
the information using different physical layouts. This philosophy is used
to great effect in, e.g., Apache Cassandra:
http://cassandra.apache.org/
In my experience, HDF5 is almost as fast as a direct disk read, and even
*faster* when fast compression (LZ4, Blosc) is used. On my data, HDF5
proved to be much faster than SQLite and a local PostgreSQL database.
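
For reference, a sketch of how to enable Blosc with the LZ4 codec in PyTables
(the Hit description is the one from the earlier sketch):

    import tables as tb

    # Blosc+LZ4 with bit shuffling; shuffle often helps on integer columns
    filters = tb.Filters(complib="blosc:lz4", complevel=5, shuffle=True)
    with tb.open_file("events.h5", "w") as f:
        f.create_table("/", "hits", Hit, "all hits", filters=filters)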
Best wishes,
Andrey Paramonov