> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> I'd say that there should be a layout in which you can store your data in
> HDF5 that is competitive with ROOT; it is just that finding it may require
> some more experimentation.
alright, there is hope ;)
> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> Things like the compressor used, the chunksizes and the index level that you
> are using might be critical for achieving more performance.
We experimented with compression levels and libraries and ended up using blosc.
And this is what we used:
tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')
We also pass the expected number of rows when creating the tables. PyTables
uses that value to choose the chunk size, so there is some magic happening in
the background.
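For reference, a minimal sketch of how these filters and the expected row count
are passed to PyTables. The `Hit` columns and the `write_hits` helper are made
up for illustration; our real tables look different.

```python
import numpy as np
import tables as tb


# Hypothetical row layout, for illustration only.
class Hit(tb.IsDescription):
    event_id = tb.Int64Col(pos=0)
    dom_id = tb.Int32Col(pos=1)
    time = tb.Float64Col(pos=2)


# The filter settings quoted above.
FILTERS = tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')


def write_hits(path, n_rows):
    """Write n_rows synthetic hits and return the chunkshape PyTables chose."""
    with tb.open_file(path, mode='w') as h5:
        # expectedrows is only a hint; PyTables derives the chunkshape from it.
        table = h5.create_table('/', 'hits', Hit,
                                filters=FILTERS, expectedrows=n_rows)
        rows = np.zeros(n_rows, dtype=table.dtype)
        rows['event_id'] = np.arange(n_rows) // 100  # 100 hits per event
        table.append(rows)
        table.flush()
        return table.chunkshape
```

Printing the returned chunkshape for a few values of `n_rows` is a quick way to
see what the "magic" actually decides.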
> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
> Could you send us some links to your codebases and perhaps elaborate more on
> the performance figures that you are getting on each of your approaches?
The chunk sizes had no significant impact on performance, but I admit I need
to rerun all the performance scripts to show some actual values. The index
level is new to me and I need to read up on it, but I think PyTables takes care
of it.
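As far as I understand it, the index level in PyTables is the `optlevel`
argument of `Column.create_index`. A minimal sketch, assuming a made-up
two-column table; the values `optlevel=9` and `kind='full'` are just one
possible choice, not something I have benchmarked:

```python
import numpy as np
import tables as tb


def build_indexed_table(path, n_rows=10_000):
    """Write a small synthetic table and index its event_id column."""
    # Hypothetical layout, for illustration only.
    dtype = np.dtype([('event_id', 'i8'), ('time', 'f8')])
    rows = np.zeros(n_rows, dtype=dtype)
    rows['event_id'] = np.arange(n_rows) // 100  # 100 rows per event
    rows['time'] = np.arange(n_rows, dtype='f8')
    with tb.open_file(path, mode='w') as h5:
        table = h5.create_table('/', 'hits', description=dtype,
                                expectedrows=n_rows)
        table.append(rows)
        table.flush()
        # optlevel goes from 0 to 9; kind='full' builds a completely sorted index.
        table.cols.event_id.create_index(optlevel=9, kind='full')


def read_event(path, event_id):
    """Return all rows of one event via an in-kernel query."""
    with tb.open_file(path, mode='r') as h5:
        return h5.root.hits.read_where('event_id == wanted',
                                       condvars={'wanted': event_id})
```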
Here are some examples comparing the ROOT and HDF5 file formats, reading both
with thin C wrappers in Python:
    ROOT_readout.py            5.27s user 3.33s system 153% cpu  5.609 total
    HDF5_big_table_readout.py 17.88s user 4.29s system 105% cpu 21.585 total
> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> My experience suggests that simply indexing the data is not enough to achieve
> top performance. The actual layout of information on disk (primary index)
> should be well-suited for your typical queries. For example, if you need to
> query by event_id, all values with the same event_id have to be closely
> located to minimize the number of disk seeks.
OK, this was also my thought. It seems we went in the wrong direction with this
indexing and big table thing.
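A minimal sketch of what such a layout could look like in PyTables: rows are
sorted by event_id before writing, and a small offset table says where each
event starts, so reading one event is a single contiguous slice. The node names
(`hits`, `event_ids`, `starts`, `counts`) and both helpers are hypothetical, and
I have not timed this against our current files.

```python
import numpy as np
import tables as tb


def write_sorted(path, rows):
    """Write rows grouped by event_id, plus per-event start offsets and counts."""
    rows = rows[np.argsort(rows['event_id'], kind='stable')]
    # On sorted input, return_index gives the first row of each event.
    event_ids, starts, counts = np.unique(
        rows['event_id'], return_index=True, return_counts=True)
    with tb.open_file(path, mode='w') as h5:
        h5.create_table('/', 'hits', obj=rows)
        h5.create_array('/', 'event_ids', obj=event_ids)
        h5.create_array('/', 'starts', obj=starts)
        h5.create_array('/', 'counts', obj=counts)


def read_event(path, event_id):
    """Read all rows of one event as one contiguous slice."""
    with tb.open_file(path, mode='r') as h5:
        hits = h5.root.hits
        event_ids = h5.root.event_ids.read()
        i = np.searchsorted(event_ids, event_id)
        if i == len(event_ids) or event_ids[i] != event_id:
            return np.empty(0, dtype=hits.dtype)  # unknown event
        start = int(h5.root.starts[i])
        stop = start + int(h5.root.counts[i])
        return hits.read(start, stop)
```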
> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> If you have several types of typical queries, it might be worth to duplicate
> the information using different physical layouts. This philosophy is utilized
> to great success in e.g.
> http://cassandra.apache.org/
Thanks, I will have a look!
> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> From my experience HDF5 is almost as fast as direct disk read, and even
> *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to
> be much faster compared to SQLite and local PostgreSQL databases.
Sounds good ;-)
Cheers,
Tamas