Oh sure, there is always hope [😉]

Ok, so based on your report, I'd suggest trying other codecs inside Blosc.  By 
default the "blosc" filter translates to the "blosc:blosclz" codec internally, 
but you can also specify "blosc:lz4", "blosc:snappy", "blosc:zlib" and 
"blosc:zstd".  Each codec has its strong and weak points, so my first advice is 
to experiment with them (especially with "blosc:zstd", which is a 
surprisingly good newcomer).
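In PyTables, for instance, the sub-codec goes straight into the `complib` string of `tb.Filters`.  A minimal sketch (the in-memory H5FD_CORE driver is just to keep the experiment off disk; the compressed sizes you see will of course depend on your data):

```python
import numpy as np
import tables as tb

data = np.arange(1_000_000, dtype=np.float64)

# "blosc" alone means blosc:blosclz; the others select a different inner codec.
for codec in ("blosc", "blosc:lz4", "blosc:zstd"):
    filters = tb.Filters(complevel=5, shuffle=True, complib=codec)
    with tb.open_file("codec_test.h5", "w", driver="H5FD_CORE",
                      driver_core_backing_store=0) as f:
        arr = f.create_carray("/", "data", obj=data, filters=filters)
        print(codec, "->", arr.size_on_disk, "bytes on disk")
```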


Then, you should experiment with different chunksizes.  If you are using 
PyTables, make sure to pass them via the `chunkshape` parameter, whereas 
h5py uses `chunks`.
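Concretely, the same chunk layout is spelled differently in the two libraries.  A sketch with a made-up 2-D dataset (again using in-memory files so nothing lands on disk):

```python
import numpy as np
import tables as tb
import h5py

data = np.zeros((10_000, 16), dtype=np.float32)

# PyTables: the chunk size goes in via `chunkshape`.
with tb.open_file("pt.h5", "w", driver="H5FD_CORE",
                  driver_core_backing_store=0) as f:
    carr = f.create_carray("/", "data", obj=data, chunkshape=(1024, 16))
    print("PyTables chunkshape:", carr.chunkshape)

# h5py: the equivalent keyword is `chunks`.
with h5py.File("hp.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("data", data=data, chunks=(1024, 16))
    print("h5py chunks:", dset.chunks)
```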


Indexing can make lookups much faster too.  Make sure that you create the index 
with maximum optimization (see 
http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex)
before deciding that it is not for you.  Also, using blosc (+ a suitable 
codec) when creating the index usually accelerates things quite a bit.
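A sketch of what "maximum optimization" looks like in code: `create_csindex()` builds a completely sorted index (CSI), and it accepts its own `Filters` so the index itself gets blosc-compressed.  The `Hit` description and column names here are hypothetical stand-ins:

```python
import numpy as np
import tables as tb

class Hit(tb.IsDescription):
    event_id = tb.Int64Col()
    time = tb.Float64Col()

with tb.open_file("indexed.h5", "w", driver="H5FD_CORE",
                  driver_core_backing_store=0) as f:
    table = f.create_table("/", "hits", Hit)
    rng = np.random.default_rng(0)
    table.append(list(zip(rng.integers(0, 100, 10_000),
                          rng.random(10_000))))
    # Completely sorted index (CSI), itself compressed with blosc:zstd.
    table.cols.event_id.create_csindex(
        filters=tb.Filters(complevel=5, complib="blosc:zstd"))
    hits = table.read_where("event_id == 42")  # this query now uses the index
```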


In general, to understand how chunksizes, compression and indexing can 
affect your lookup performance, it is worth having a look at the _Optimization 
Tips_ chapter of the PyTables User's Guide: 
http://www.pytables.org/usersguide/optimization.html .


Good luck,

Francesc Alted


________________________________
From: Hdf-forum <[email protected]> on behalf of Tamas Gal 
<[email protected]>
Sent: Friday, March 31, 2017 11:14 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Optimising HDF5 data structure

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
I'd say that there should be a layout in which you can store your data in HDF5 
that is competitive with ROOT; it is just that finding it may require some more 
experimentation.

alright, there is hope ;)

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
Things like the compressor used, the chunksizes and the index level that you 
are using might be critical for achieving more performance.

We experimented with compression levels and libs and ended up using blosc. 
This is what we used:


import tables as tb

tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

We also pass the number of expected rows when creating the tables; however, this 
is a PyTables feature, so there is some magic in the background.
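For reference, that looks roughly like this on our side (the Hit description is a simplified, hypothetical stand-in for our actual one):

```python
import tables as tb

class Hit(tb.IsDescription):
    event_id = tb.Int64Col()
    time = tb.Float64Col()

filters = tb.Filters(complevel=5, shuffle=True, fletcher32=True,
                     complib="blosc")

with tb.open_file("hits.h5", "w", driver="H5FD_CORE",
                  driver_core_backing_store=0) as f:
    # expectedrows lets PyTables pick a sensible chunkshape automatically.
    table = f.create_table("/", "hits", Hit, filters=filters,
                           expectedrows=10_000_000)
```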

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
Could you send us some links to your codebases and perhaps elaborate more on 
the performance figures that you are getting on each of your approaches?

The chunksizes had no significant impact on the performance, but I admit I need 
to rerun all the performance scripts to show some actual values.  The index 
level is new to me; I need to read up on that, but I think PyTables takes care 
of it.

Here are some examples comparing the ROOT and HDF5 file formats, reading both 
with thin C wrappers in Python:

ROOT_readout.py  5.27s user 3.33s system 153% cpu 5.609 total
HDF5_big_table_readout.py  17.88s user 4.29s system 105% cpu 21.585 total

On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
My experience suggests that simply indexing the data is not enough to achieve 
top performance.  The actual layout of information on disk (primary index) 
should be well suited to your typical queries.  For example, if you need to 
query by event_id, all values with the same event_id have to be stored close 
together to minimize the number of disk seeks.

OK, this was also my thought. It seems we went in the wrong direction with this 
indexing and big table thing.
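If we go for a sorted layout instead, it seems PyTables can rewrite a table physically ordered by a column: `Table.copy()` accepts a `sortby` argument once a completely sorted index exists on that column.  A sketch under that assumption (hypothetical Hit description again):

```python
import numpy as np
import tables as tb

class Hit(tb.IsDescription):
    event_id = tb.Int64Col()
    time = tb.Float64Col()

with tb.open_file("sorted.h5", "w", driver="H5FD_CORE",
                  driver_core_backing_store=0) as f:
    table = f.create_table("/", "hits", Hit)
    rng = np.random.default_rng(0)
    table.append(list(zip(rng.integers(0, 1000, 100_000),
                          rng.random(100_000))))
    # sortby requires a completely sorted index on the sort column.
    table.cols.event_id.create_csindex()
    # Rows with the same event_id now sit contiguously on disk.
    table.copy(newname="hits_sorted", sortby="event_id")
```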

On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
If you have several types of typical queries, it might be worth duplicating 
the information using different physical layouts.  This philosophy is used 
to great success in e.g. 
http://cassandra.apache.org/

Thanks, I will have a look!

On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
In my experience, HDF5 is almost as fast as a direct disk read, and even 
*faster* when using fast compression (LZ4, blosc).  On my data, HDF5 proved to 
be much faster than SQLite and local PostgreSQL databases.

Sounds good ;-)

Cheers,
Tamas
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
