Re: [Hdf-forum] Optimising HDF5 data structure

Tamas Gal Fri, 31 Mar 2017 02:15:42 -0700

> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> I'd say that there should be a layout in which you can store your data in 
> HDF5 that is competitive with ROOT; it is just that finding it may require 
> some more experimentation.


Alright, there is hope ;)

> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> Things like the compressor used, the chunksizes and the index level that you 
> are using might be critical for achieving more performance. 

We experimented with compression levels and libraries and ended up using blosc. 
These are the filter settings we used:

tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

We also pass the expected number of rows when creating the tables. This is a 
PyTables feature, though, so there is some magic going on in the background.
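
To make that concrete, here is a minimal sketch of how the filters and the 
expected row count are passed to PyTables. The Hit description, its columns, 
the file name and the write_hits helper are made up for illustration and are 
not our real schema; only the Filters settings are the ones quoted above.

```python
import os
import tempfile

import tables as tb


class Hit(tb.IsDescription):
    # Hypothetical columns, only for illustration.
    event_id = tb.UInt32Col(pos=0)
    time = tb.Float64Col(pos=1)


def write_hits(path, n_events, hits_per_event):
    """Write a compressed table; return its row count and chunk shape."""
    filters = tb.Filters(complevel=5, shuffle=True, fletcher32=True,
                         complib='blosc')
    n_rows = n_events * hits_per_event
    with tb.open_file(path, mode='w') as h5:
        # expectedrows is only a hint that PyTables uses to choose
        # a chunk shape; it does not limit the table size.
        table = h5.create_table('/', 'hits', Hit, filters=filters,
                                expectedrows=n_rows)
        row = table.row
        for event_id in range(n_events):
            for i in range(hits_per_event):
                row['event_id'] = event_id
                row['time'] = 0.1 * i
                row.append()
        table.flush()
        return table.nrows, table.chunkshape


# Hypothetical output location in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'hits.h5')
nrows, chunkshape = write_hits(path, n_events=100, hits_per_event=10)
print(nrows, chunkshape)
```

Printing table.chunkshape is a cheap way to see what the "magic" decided for 
a given expectedrows value.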

> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> Could you send us some links to your codebases and perhaps elaborate more on 
> the performance figures that you are getting on each of your approaches? 


The chunk sizes had no significant impact on performance, but I admit I need 
to rerun all the performance scripts to show some actual values. The index 
level is new to me and I need to read up on it, but I think PyTables takes 
care of it.

Here are some examples comparing the ROOT and HDF5 file formats, reading both 
with thin C wrappers in Python:

ROOT_readout.py  5.27s user 3.33s system 153% cpu 5.609 total
HDF5_big_table_readout.py  17.88s user 4.29s system 105% cpu 21.585 total

> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> My experience suggests that simply indexing the data is not enough to achieve 
> top performance. The actual layout of information on disk (primary index) 
> should be well-suited for your typical queries. For example, if you need to 
> query by event_id, all values with the same event_id have to be closely 
> located to minimize the number of disk seeks.

OK, this was also my thought. It seems we went in the wrong direction with the 
indexing and the single big table approach.
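
Just to check that I understand the idea, here is a rough sketch in plain 
Python (no HDF5 involved): sort the hits by event_id once and keep a small 
lookup of row ranges, so that reading one event becomes a single contiguous 
slice instead of a scattered query. The function names and the toy data are 
made up for illustration.

```python
from itertools import groupby
from operator import itemgetter


def build_event_ranges(hits):
    """Sort hits by event_id and return (sorted_hits, ranges).

    ranges maps event_id -> (start, stop), so that
    sorted_hits[start:stop] holds every hit of that event.
    """
    # sorted() is stable, so hits of one event keep their original order.
    sorted_hits = sorted(hits, key=itemgetter(0))
    ranges = {}
    start = 0
    for event_id, group in groupby(sorted_hits, key=itemgetter(0)):
        n = sum(1 for _ in group)
        ranges[event_id] = (start, start + n)
        start += n
    return sorted_hits, ranges


def read_event(sorted_hits, ranges, event_id):
    """Return all hits of one event with a single contiguous slice."""
    start, stop = ranges[event_id]
    return sorted_hits[start:stop]


# Hypothetical toy data: (event_id, time) pairs in arrival order.
hits = [(2, 0.5), (1, 0.1), (2, 0.7), (3, 0.2), (1, 0.3)]
sorted_hits, ranges = build_event_ranges(hits)
print(read_event(sorted_hits, ranges, 2))
```

In an HDF5 file the equivalent would presumably be a table written in 
event_id order plus a small side table of (start, stop) row ranges, read 
back with table.read(start, stop).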

> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> If you have several types of typical queries, it might be worth duplicating 
> the information using different physical layouts. This philosophy is utilized 
> to great success in e.g.
> http://cassandra.apache.org/

Thanks, I will have a look!

> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> From my experience HDF5 is almost as fast as direct disk read, and even 
> *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to 
> be much faster compared to SQLite and local PostgreSQL databases.


Sounds good ;-)

Cheers,
Tamas
