Dear Tamas,

My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event. You could use the event ID for the dataset name, and any other event metadata could be stored as attributes on the dataset.
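For concreteness, here is a minimal sketch of that layout with h5py (the file name, the run_id attribute and the toy values are made up for illustration):

    import numpy as np
    import h5py

    # Compound type describing one hit, mirroring the fields you listed.
    hit_dtype = np.dtype([
        ("dom_id", np.int32),
        ("time", np.int32),
        ("tot", np.int16),
        ("triggered", np.bool_),
        ("pmt_id", np.int16),
    ])

    def write_event(h5file, event_id, hits, **metadata):
        """Write one event's hits as its own dataset, named by event ID."""
        ds = h5file.create_dataset(f"/hits/{event_id}", data=hits)
        for key, value in metadata.items():
            ds.attrs[key] = value            # per-event metadata as attributes

    with h5py.File("events.h5", "w") as f:
        hits = np.zeros(2500, dtype=hit_dtype)   # stand-in for real hits
        write_event(f, 23, hits, run_id=42)

    # Reading one event back is then a single dataset read:
    with h5py.File("events.h5", "r") as f:
        event_23 = f["/hits/23"][:]
        n_hits = len(f["/hits/23"])          # row count, no attribute needed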
>> The events can then be accessed by reading "/hits/event_id", like
>> "/hits/23", which is a table similar to the one used in the first approach.

It sounds as if you have already tried this approach.

>> To iterate through the events, I need to create a list of nodes and walk
>> over them, or I store the number of events as an attribute and simply use
>> an iterator.

I believe you can get the number of rows in each dataset directly, so I am confused by the attribute suggestion. It seems performance was still an issue?

Generally I find that performance is all about the chunk size: HDF5 will generally read a whole chunk at a time and cache those chunks. Have you tried different chunk sizes?
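To make the chunking point concrete, a rough sketch follows; the chunk length and cache size are only guesses to experiment with, and the file and dataset names are made up:

    import numpy as np
    import h5py

    times = np.zeros(1000000, dtype=np.int32)       # stand-in for one hit column

    # The chunk shape is fixed at creation time; every read pulls whole chunks.
    with h5py.File("tuning.h5", "w") as f:
        f.create_dataset("time", data=times,
                         chunks=(16384,),           # try e.g. 1024, 16384, 131072
                         compression="gzip")

    # The chunk cache can be enlarged at open time; HDF5 keeps recently read
    # chunks there, which helps when an access pattern revisits them.
    with h5py.File("tuning.h5", "r", rdcc_nbytes=32 * 1024**2) as f:
        block = f["time"][100000:105000]            # reads the chunks covering this slice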
rgds
Ewan

> On 31 Mar 2017, at 9:30 AM, [email protected] wrote:
>
> Message: 1
> Date: Thu, 30 Mar 2017 21:33:11 +0200
> From: Tamas Gal <[email protected]>
> Subject: [Hdf-forum] Optimising HDF5 data structure
>
> Dear all,
>
> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure, but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
>
> The data I want to store are basically events, which are simply groups of
> hits. A hit is a simple structure with the following fields:
>
> Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
>
> As already mentioned, an event is simply a list of a few thousand hits, and
> the number of hits changes from event to event.
>
> I tried different approaches to store information of a few thousand events
> (thus a couple of million hits), and the final two structures which kind of
> work but still have poor performance are:
>
> Approach #1: a single "table" to store all hits (basically one array for each
> hit field) with an additional "column" (again, an array) to store the
> event_id they belong to.
>
> This is of course nice if I want to do analysis on the whole file, including
> all the events, but it is slow when I want to iterate through each event_id,
> since I need to select the corresponding hits by looking at the event_ids. In
> pytables or the Pandas framework, this works using binary search index trees,
> but it's still a bit slow.
>
> Approach #2: using a hierarchical structure to store the events in groups.
> The events can then be accessed by reading "/hits/event_id", like "/hits/23",
> which is a table similar to the one used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
>
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a B-tree, much as
> pandas stores its index table.
>
> The slowness is relative to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.
>
> I also tried variable-length arrays, but I ran into compression issues. Some
> other approaches were creating meta tables to keep track of the indices of
> the hits for faster lookup, but this was kind of awkward and not
> self-explanatory enough in my opinion.
>
> So my question is: how would an experienced HDF5 user structure this simple
> data to maximise the performance of the event-by-event readout?
>
> Best regards,
> Tamas
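As a point of reference for approach #1: with pytables the per-event selection would typically go through an indexed column and an in-kernel query. A hypothetical sketch with toy values (not the collaboration's actual code):

    import tables as tb

    class Hit(tb.IsDescription):
        event_id  = tb.Int32Col()
        dom_id    = tb.Int32Col()
        time      = tb.Int32Col()
        tot       = tb.Int16Col()
        triggered = tb.BoolCol()
        pmt_id    = tb.Int16Col()

    with tb.open_file("hits.h5", "w") as f:
        table = f.create_table("/", "hits", Hit,
                               filters=tb.Filters(complevel=5, complib="blosc"))
        row = table.row
        for event_id in (23, 24):              # toy data: two events, three hits each
            for _ in range(3):
                row["event_id"] = event_id
                row["dom_id"] = 806
                row["time"] = 1234
                row["tot"] = 25
                row["triggered"] = True
                row["pmt_id"] = 3
                row.append()
        table.flush()
        table.cols.event_id.create_csindex()   # completely sorted index on event_id

    with tb.open_file("hits.h5", "r") as f:
        hits_23 = f.root.hits.read_where("event_id == 23")   # index-assisted query

read_where returns a NumPy structured array containing only the matching rows.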
> ------------------------------
>
> Message: 2
> Date: Fri, 31 Mar 2017 09:52:55 +0200
> From: Rafal Lichwala <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Hi Tamas,
>
>> So my question is: how would an experienced HDF5 user structure this simple
>> data to maximise the performance of the event-by-event readout?
>
> I see two solutions for your purposes.
> First - try to switch from Python to C++ - it's much faster.
>
> http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=python3&lang2=gpp
>
> Second - I know this is an HDF5 forum, but for such a huge but simple set
> of data, I would suggest using some SQL engine as a backend.
> MySQL or PostgreSQL would be a good choice if you need a full set of
> relational database engine features for your data analysis, but
> file-based solutions (SQLite) could also be taken into consideration.
> In your case the data would be stored in two tables (hits and events) with
> a proper key-based join between them.
>
> Regards,
> Rafal
>
> ------------------------------
>
> Message: 3
> Date: Fri, 31 Mar 2017 10:20:37 +0200
> From: Tamas Gal <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Dear Rafal,
>
> thanks for your reply.
>
>> On 31. Mar 2017, at 09:52, Rafal Lichwala <[email protected]> wrote:
>>
>> I see two solutions for your purposes.
>> First - try to switch from Python to C++ - it's much faster.
>
> I am of course aware of the fact that Python is in general much slower than a
> statically typed compiled language; however, pytables (http://www.pytables.org)
> and h5py (http://www.h5py.org) are thin wrappers and are tightly bound to the
> numpy library (http://www.numpy.org), which is totally competitive. I also use
> Julia to access HDF5 content and I did not notice better performance. So I am
> not sure if this is a real bottleneck in our case...
>
>> Second - I know this is an HDF5 forum, but for such a huge but simple set of
>> data, I would suggest using some SQL engine as a backend.
>
> We definitely need a file-based approach, so a centralised database engine is
> not an option. I also tried SQLite; however, the performance is very poor
> compared to our HDF5 solution.
>
> So maybe our data structure is not that bad overall, yet our expectations
> might be a bit too high?
>
> Cheers,
> Tamas
>
> ------------------------------
>
> Message: 4
> Date: Fri, 31 Mar 2017 08:29:50 +0000
> From: Francesc Altet <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Hi Tamas,
>
> I'd say that there should be a layout in which you can store your data in
> HDF5 that is competitive with ROOT; it is just that finding it may require
> some more experimentation. Things like the compressor used, the chunk sizes
> and the index level that you are using might be critical for achieving more
> performance. Could you send us some links to your codebases and perhaps
> elaborate more on the performance figures that you are getting for each of
> your approaches?
>
> Best,
>
> Francesc Alted
