Re: [Hdf-forum] Optimising HDF5 data structure

Elvis Stansvik Fri, 31 Mar 2017 09:31:30 -0700

On 31 March 2017 at 11:15 AM, "Tamas Gal" <[email protected]> wrote:
>>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> I'd say that there should be a layout in which you can store your data
>> in HDF5 that is competitive with ROOT; it is just that finding it may
>> require some more experimentation.
>
>
> alright, there is hope ;)
>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> Things like the compressor used, the chunksizes and the index level that
>> you are using might be critical for achieving better performance.
>
>
> We experimented with compression levels and libraries and ended up using
> blosc. This is what we used:
>
> tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')


Typing on my phone, so I can't say much. I just wanted to react to this. I
haven't used pytables, but if the shuffle parameter here refers to the HDF5
library's built-in shuffle filter, I think you want to turn it off when
using blosc, since the blosc compressor does its own shuffling, and I think
the two may interfere.
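
For anyone unfamiliar with what a byte shuffle does, here is a minimal
pure-Python sketch. It is only an illustration: it uses zlib from the
standard library as a stand-in compressor, it is not the HDF5 or blosc
implementation, and the helper names shuffle/unshuffle are made up for this
example. It shows that shuffling regroups the bytes of fixed-size items, and
that running the shuffle a second time produces a different layout from
running it once. Whether a second pass helps or hurts depends on the data,
so the sketch just prints the sizes.

```python
import struct
import zlib


def shuffle(buf: bytes, itemsize: int) -> bytes:
    """Byte shuffle: byte 0 of every item, then byte 1 of every item, etc.

    Assumes len(buf) is a multiple of itemsize.
    """
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle(): interleave the byte planes back into items."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical sample data: 10000 slowly increasing little-endian int32s.
raw = struct.pack("<10000i", *range(0, 40000, 4))

once = shuffle(raw, 4)    # what a single shuffle stage would hand on
twice = shuffle(once, 4)  # what two stacked shuffle stages would hand on

print("raw      :", len(zlib.compress(raw)))
print("shuffled :", len(zlib.compress(once)))
print("2x shuf. :", len(zlib.compress(twice)))
```

On this kind of smoothly varying integer data the single shuffle makes the
compressor's job much easier, which is the whole point of the filter.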

Cheers,
Elvis

>
>
> We also pass the number of expected rows when creating the tables;
> however, this is a pytables feature, so there is some magic in the
> background.
>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> Could you send us some links to your codebases and perhaps elaborate
>> more on the performance figures that you are getting on each of your
>> approaches?
>
>
> The chunksizes had no significant impact on the performance, but I admit
> I need to rerun all the performance scripts to show some actual values. The
> index level is new to me and I need to read up on that, but I think
> pytables takes care of it.
>
> Here are some examples comparing the ROOT and HDF5 file formats, reading
> both with thin C wrappers in Python:
>
> ROOT_readout.py  5.27s user 3.33s system 153% cpu 5.609 total
> HDF5_big_table_readout.py  17.88s user 4.29s system 105% cpu 21.585 total
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> My experience suggests that simply indexing the data is not enough to
>> achieve top performance. The actual layout of information on disk (primary
>> index) should be well-suited for your typical queries. For example, if you
>> need to query by event_id, all values with the same event_id have to be
>> closely located to minimize the number of disk seeks.
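
The layout idea quoted above can be sketched in a few lines of plain Python.
This is only an in-memory illustration under made-up assumptions (the rows,
the field order and the helper name rows_for_event are all hypothetical, and
nothing here touches HDF5): once rows are stored sorted by event_id, every
event occupies one contiguous range, and a binary search finds that range
without scanning the rest.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows of (event_id, payload), in arbitrary arrival order.
rows = [(3, "hit_a"), (1, "hit_b"), (3, "hit_c"), (2, "hit_d"), (1, "hit_e")]

# "Primary index": keep the rows physically ordered by event_id.
rows_sorted = sorted(rows, key=lambda r: r[0])
event_ids = [r[0] for r in rows_sorted]


def rows_for_event(eid: int) -> list:
    """Return the contiguous slice of rows belonging to one event."""
    lo = bisect_left(event_ids, eid)
    hi = bisect_right(event_ids, eid)
    return rows_sorted[lo:hi]


print(rows_for_event(3))
```

On disk the same ordering would mean one sequential read per event instead
of many scattered ones.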
>
>
> OK, this was also my thought. It seems we went in the wrong direction
> with this indexing and big table thing.
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> If you have several types of typical queries, it might be worth
>> duplicating the information using different physical layouts. This
>> philosophy is used to great success in e.g.
>> http://cassandra.apache.org/
>
>
> Thanks, I will have a look!
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> In my experience HDF5 is almost as fast as direct disk read, and even
>> *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved
>> to be much faster compared to SQLite and local PostgreSQL databases.
>
>
> Sounds good ;-)
>
> Cheers,
> Tamas
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5
