On 30.03.2017 22:33, Tamas Gal wrote:
Dear all,

we are using HDF5 in our collaboration to store large event data of neutrino
interactions. The data itself has a very simple structure, but I still could
not find an acceptable way to design the layout of the HDF5 file. It would be
great if some HDF5 experts could give me a hint on how to optimise it.

The data I want to store are basically events, which are simply groups of hits. 
A hit is a simple structure with the following fields:

Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
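
(As a NumPy structured dtype, which maps one-to-one onto an HDF5 compound
type, the record looks roughly like this; a minimal sketch:)

    import numpy as np

    # The hit record from above as a structured dtype; it translates
    # directly into an HDF5 compound type.
    hit_dtype = np.dtype([
        ('dom_id',    np.int32),
        ('time',      np.int32),
        ('tot',       np.int16),
        ('triggered', np.bool_),
        ('pmt_id',    np.int16),
    ])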

As already mentioned, an event is simply a list of a few thousand hits, and
the number of hits varies from event to event.

I tried different approaches to store the information of a few thousand
events (thus a couple of million hits), and the final two structures which
kind of work but still have poor performance are:

Approach #1: a single "table" to store all hits (basically one array for each
hit field), with an additional "column" (again, an array) to store the
event_id they belong to.

This is of course nice if I want to do analysis on the whole file, including
all the events, but it is slow when I want to iterate event by event, since I
need to select the corresponding hits by looking up their event_ids. In
pytables and the pandas framework this works via binary-search index trees,
but it's still a bit slow.
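
(What I'm doing is roughly the following; a minimal sketch with PyTables,
file and variable names are placeholders:)

    import numpy as np
    import tables

    hit_dtype = np.dtype([
        ('event_id',  np.int32),
        ('dom_id',    np.int32),
        ('time',      np.int32),
        ('tot',       np.int16),
        ('triggered', np.bool_),
        ('pmt_id',    np.int16),
    ])

    with tables.open_file('events.h5', 'w') as f:
        table = f.create_table('/', 'hits', description=hit_dtype)
        table.append(np.zeros(1000, dtype=hit_dtype))   # placeholder hits
        table.cols.event_id.create_csindex()            # completely sorted index

    with tables.open_file('events.h5', 'r') as f:
        # index-accelerated selection of one event's hits
        event_23 = f.root.hits.read_where('event_id == 23')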

Approach #2: a hierarchical structure to group the events. An event can then
be accessed by reading "/hits/event_id", e.g. "/hits/23", which holds a table
similar to the one in the first approach. To iterate through the events, I
either create a list of the nodes and walk over them, or I store the number
of events as an attribute and simply use an iterator. It seems only a tiny
bit faster to access a specific event, which may be related to the fact that
HDF5 stores the nodes in a B-tree, much like pandas does with its index table.
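
(Roughly like this; a minimal sketch, where I use a hypothetical "event_<n>"
naming scheme, since node names starting with a digit defeat PyTables'
natural naming:)

    import numpy as np
    import tables

    hit_dtype = np.dtype([
        ('dom_id',    np.int32),
        ('time',      np.int32),
        ('tot',       np.int16),
        ('triggered', np.bool_),
        ('pmt_id',    np.int16),
    ])

    with tables.open_file('events.h5', 'w') as f:
        group = f.create_group('/', 'hits')
        for event_id in range(100):                     # placeholder events
            table = f.create_table(group, 'event_%d' % event_id,
                                   description=hit_dtype)
            table.append(np.zeros(1000, dtype=hit_dtype))

    with tables.open_file('events.h5', 'r') as f:
        for table in f.root.hits:                       # walk the per-event tables
            hits = table.read()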

The slowness is relative to an equivalent ROOT structure which we also use in
parallel: if I compare some basic event-by-event analysis, the same code run
on a ROOT file is almost an order of magnitude faster.

Hello Tamas!

My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of the information on disk (the primary index) should be well suited to your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be located close together to minimize the number of disk seeks.
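
To illustrate (a minimal sketch with h5py; the dataset names and the offset
index are my own convention, not an HDF5 feature): sort the hits by event_id
once, so that each event occupies one contiguous slice, and store per-event
(start, stop) offsets alongside.

    import numpy as np
    import h5py

    hit_dtype = np.dtype([('event_id', np.int32), ('dom_id', np.int32),
                          ('time', np.int32), ('tot', np.int16),
                          ('triggered', np.bool_), ('pmt_id', np.int16)])

    hits = np.zeros(10000, dtype=hit_dtype)             # placeholder hits
    hits['event_id'] = np.random.randint(0, 100, len(hits))
    hits.sort(order='event_id')                         # contiguous per event

    event_ids, starts = np.unique(hits['event_id'], return_index=True)
    stops = np.append(starts[1:], len(hits))

    with h5py.File('events_sorted.h5', 'w') as f:
        f.create_dataset('hits', data=hits)
        f.create_dataset('index/event_id', data=event_ids)
        f.create_dataset('index/start', data=starts)
        f.create_dataset('index/stop', data=stops)

    with h5py.File('events_sorted.h5', 'r') as f:
        # one event is now a single contiguous read
        i = int(np.searchsorted(f['index/event_id'][:], 23))
        event = f['hits'][int(f['index/start'][i]):int(f['index/stop'][i])]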

If you have several types of typical queries, it might be worth duplicating the information using different physical layouts. This philosophy is used to great success in, e.g., Apache Cassandra:
http://cassandra.apache.org/
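
For example (hypothetical names, and only two fields for brevity), one could
keep two copies of the same hits, each sorted for a different access pattern:

    import numpy as np
    import h5py

    hit_dtype = np.dtype([('event_id', np.int32), ('time', np.int32)])
    hits = np.zeros(10000, dtype=hit_dtype)   # placeholder data

    with h5py.File('events_dual.h5', 'w') as f:
        # one copy per typical query: per-event access vs. time-ordered scans
        f.create_dataset('hits_by_event', data=np.sort(hits, order='event_id'))
        f.create_dataset('hits_by_time',  data=np.sort(hits, order='time'))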

In my experience HDF5 is almost as fast as a direct disk read, and even *faster* when using fast compression (LZ4, Blosc). On my data, HDF5 proved to be much faster than SQLite and a local PostgreSQL database.
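
(Enabling Blosc/LZ4 in PyTables looks roughly like this; the filter settings
are just an example, tune them for your data:)

    import numpy as np
    import tables

    # shuffle + Blosc/LZ4: fast enough that reads can beat uncompressed I/O
    filters = tables.Filters(complevel=5, complib='blosc:lz4', shuffle=True)

    with tables.open_file('events_compressed.h5', 'w') as f:
        data = np.random.randint(0, 1000, size=1000000).astype(np.int32)
        f.create_carray('/', 'times', obj=data, filters=filters)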

Best wishes,
Andrey Paramonov

