Dear Tamas,

My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event. You could use the event ID for the dataset name, and any other event metadata could be stored as attributes on the dataset.
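For concreteness, here is a minimal sketch of that layout with h5py (the file name, the run_id attribute and the toy values are made up for illustration):

    import numpy as np
    import h5py

    # Compound type describing one hit, mirroring the fields you listed.
    hit_dtype = np.dtype([
        ("dom_id", np.int32),
        ("time", np.int32),
        ("tot", np.int16),
        ("triggered", np.bool_),
        ("pmt_id", np.int16),
    ])

    def write_event(h5file, event_id, hits, **metadata):
        """Write one event's hits as its own dataset, named by event ID."""
        ds = h5file.create_dataset(f"/hits/{event_id}", data=hits)
        for key, value in metadata.items():
            ds.attrs[key] = value            # per-event metadata as attributes

    with h5py.File("events.h5", "w") as f:
        hits = np.zeros(2500, dtype=hit_dtype)   # stand-in for real hits
        write_event(f, 23, hits, run_id=42)

    # Reading one event back is then a single dataset read:
    with h5py.File("events.h5", "r") as f:
        event_23 = f["/hits/23"][:]
        n_hits = len(f["/hits/23"])          # row count, no attribute needed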
>> The events can then be accessed by reading "/hits/event_id", like
>> "/hits/23", which is a table similar to the one used in the first approach.

It sounds as if you have already tried this approach.

>> To iterate through the events, I need to create a list of nodes and walk
>> over them, or I store the number of events as an attribute and simply use
>> an iterator.

I believe you can get the number of rows in each dataset directly, so I am confused by the attribute suggestion. It seems performance was still an issue?

Generally I find that performance is all about the chunk size: HDF5 will generally read a whole chunk at a time and cache those chunks. Have you tried different chunk sizes?
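To make the chunking point concrete, a rough sketch follows; the chunk length and cache size are only guesses to experiment with, and the file and dataset names are made up:

    import numpy as np
    import h5py

    times = np.zeros(1000000, dtype=np.int32)       # stand-in for one hit column

    # The chunk shape is fixed at creation time; every read pulls whole chunks.
    with h5py.File("tuning.h5", "w") as f:
        f.create_dataset("time", data=times,
                         chunks=(16384,),           # try e.g. 1024, 16384, 131072
                         compression="gzip")

    # The chunk cache can be enlarged at open time; HDF5 keeps recently read
    # chunks there, which helps when an access pattern revisits them.
    with h5py.File("tuning.h5", "r", rdcc_nbytes=32 * 1024**2) as f:
        block = f["time"][100000:105000]            # reads the chunks covering this slice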
rgds
Ewan

> On 31 Mar 2017, at 9:30 AM, [email protected] wrote:
>
> Message: 1
> Date: Thu, 30 Mar 2017 21:33:11 +0200
> From: Tamas Gal <[email protected]>
> Subject: [Hdf-forum] Optimising HDF5 data structure
>
> Dear all,
>
> we are using HDF5 in our collaboration to store large event data of neutrino
> interactions. The data itself has a very simple structure, but I still could
> not find an acceptable way to design the structure of the HDF5 format. It
> would be great if some HDF5 experts could give me a hint how to optimise it.
>
> The data I want to store are basically events, which are simply groups of
> hits. A hit is a simple structure with the following fields:
>
> Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
>
> As already mentioned, an event is simply a list of a few thousand hits, and
> the number of hits changes from event to event.
>
> I tried different approaches to store information of a few thousand events
> (thus a couple of million hits), and the final two structures which kind of
> work but still have poor performance are:
>
> Approach #1: a single "table" to store all hits (basically one array for each
> hit field) with an additional "column" (again, an array) to store the
> event_id they belong to.
>
> This is of course nice if I want to do analysis on the whole file, including
> all the events, but it is slow when I want to iterate through each event_id,
> since I need to select the corresponding hits by looking at the event_ids. In
> pytables or the Pandas framework, this works using binary search index trees,
> but it's still a bit slow.
>
> Approach #2: using a hierarchical structure to store the events in groups.
> The events can then be accessed by reading "/hits/event_id", like "/hits/23",
> which is a table similar to the one used in the first approach. To iterate
> through the events, I need to create a list of nodes and walk over them, or I
> store the number of events as an attribute and simply use an iterator.
>
> It seems that it is only a tiny bit faster to access a specific event, which
> may be related to the fact that HDF5 stores the nodes in a B-tree, much as
> pandas stores its index table.
>
> The slowness is relative to a ROOT structure which is also used in parallel.
> If I compare some basic event-by-event analysis, the same code run on a ROOT
> file is almost an order of magnitude faster.
>
> I also tried variable-length arrays, but I ran into compression issues. Some
> other approaches were creating meta tables to keep track of the indices of
> the hits for faster lookup, but this was kind of awkward and not
> self-explanatory enough in my opinion.
>
> So my question is: how would an experienced HDF5 user structure this simple
> data to maximise the performance of the event-by-event readout?
>
> Best regards,
> Tamas
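As a point of reference for approach #1: with pytables the per-event selection would typically go through an indexed column and an in-kernel query. A hypothetical sketch with toy values (not the collaboration's actual code):

    import tables as tb

    class Hit(tb.IsDescription):
        event_id  = tb.Int32Col()
        dom_id    = tb.Int32Col()
        time      = tb.Int32Col()
        tot       = tb.Int16Col()
        triggered = tb.BoolCol()
        pmt_id    = tb.Int16Col()

    with tb.open_file("hits.h5", "w") as f:
        table = f.create_table("/", "hits", Hit,
                               filters=tb.Filters(complevel=5, complib="blosc"))
        row = table.row
        for event_id in (23, 24):              # toy data: two events, three hits each
            for _ in range(3):
                row["event_id"] = event_id
                row["dom_id"] = 806
                row["time"] = 1234
                row["tot"] = 25
                row["triggered"] = True
                row["pmt_id"] = 3
                row.append()
        table.flush()
        table.cols.event_id.create_csindex()   # completely sorted index on event_id

    with tb.open_file("hits.h5", "r") as f:
        hits_23 = f.root.hits.read_where("event_id == 23")   # index-assisted query

read_where returns a NumPy structured array containing only the matching rows.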
> ------------------------------
>
> Message: 2
> Date: Fri, 31 Mar 2017 09:52:55 +0200
> From: Rafal Lichwala <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Hi Tamas,
>
>> So my question is: how would an experienced HDF5 user structure this simple
>> data to maximise the performance of the event-by-event readout?
>
> I see two solutions for your purposes.
> First - try to switch from Python to C++ - it's much faster.
>
> http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=python3&lang2=gpp
>
> Second - I know this is an HDF5 forum, but for such a huge but simple set
> of data, I would suggest using some SQL engine as a backend.
> MySQL or PostgreSQL would be a good choice if you need a full set of
> relational database engine features for your data analysis, but
> file-based solutions (SQLite) could also be taken into consideration.
> In your case the data would be stored in two tables (hits and events) with
> a proper key-based join between them.
>
> Regards,
> Rafal
>
> ------------------------------
>
> Message: 3
> Date: Fri, 31 Mar 2017 10:20:37 +0200
> From: Tamas Gal <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Dear Rafal,
>
> thanks for your reply.
>
>> On 31. Mar 2017, at 09:52, Rafal Lichwala <[email protected]> wrote:
>>
>> I see two solutions for your purposes.
>> First - try to switch from Python to C++ - it's much faster.
>
> I am of course aware of the fact that Python is in general much slower than a
> statically typed compiled language; however, pytables (http://www.pytables.org)
> and h5py (http://www.h5py.org) are thin wrappers and are tightly bound to the
> numpy library (http://www.numpy.org), which is totally competitive. I also use
> Julia to access HDF5 content and I did not notice better performance. So I am
> not sure if this is a real bottleneck in our case...
>
>> Second - I know this is an HDF5 forum, but for such a huge but simple set of
>> data, I would suggest using some SQL engine as a backend.
>
> We definitely need a file-based approach, so a centralised database engine is
> not an option. I also tried SQLite; however, the performance is very poor
> compared to our HDF5 solution.
>
> So maybe our data structure is not that bad overall, yet our expectations
> might be a bit too high?
>
> Cheers,
> Tamas
>
> ------------------------------
>
> Message: 4
> Date: Fri, 31 Mar 2017 08:29:50 +0000
> From: Francesc Altet <[email protected]>
> Subject: Re: [Hdf-forum] Optimising HDF5 data structure
>
> Hi Tamas,
>
> I'd say that there should be a layout in which you can store your data in
> HDF5 that is competitive with ROOT; it is just that finding it may require
> some more experimentation. Things like the compressor used, the chunk sizes
> and the index level that you are using might be critical for achieving more
> performance. Could you send us some links to your codebases and perhaps
> elaborate more on the performance figures that you are getting for each of
> your approaches?
>
> Best,
>
> Francesc Alted
