On 31 March 2017 at 11:15 AM, "Tamas Gal" <[email protected]> wrote:
>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation.
>
> alright, there is hope ;)
>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance.
>
> We experimented with compression levels and libs and ended up using the blosc. And this is what we used:
>
> tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

Typing on my phone so can't say much. Just wanted to react to this. I haven't used PyTables, but if the shuffle parameter here refers to the HDF5 library's built-in shuffle filter, I think you want to turn it off when using blosc, since the blosc compressor does its own shuffling, and I think the two may interfere.

Cheers,
Elvis

> We also pass the number of expected rows when creating the tables, however this is a pytables feature, so there is some magic in the background.
>
>> On 31. Mar 2017, at 10:29, Francesc Altet <[email protected]> wrote:
>>
>> Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches?
>
> The chunksizes had no significant impact on the performance, but I admit I need to rerun all the performance scripts to show some actual values. The index level is new to me, I need to read up on that, but I think pytables takes care of it.
>
> Here are some examples comparing the ROOT and HDF5 file formats, reading both with thin C wrappers in Python:
>
> ROOT_readout.py  5.27s user 3.33s system 153% cpu 5.609 total
> HDF5_big_table_readout.py  17.88s user 4.29s system 105% cpu 21.585 total
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of information on disk (primary index) should be well-suited for your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be closely located to minimize the number of disk seeks.
>
> OK, this was also my thought. It seems we went in the wrong direction with this indexing and big table thing.
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> If you have several types of typical queries, it might be worth to duplicate the information using different physical layouts.
>> This philosophy is utilized to great success in e.g. http://cassandra.apache.org/
>
> Thanks, I will have a look!
>
>> On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
>>
>> From my experience HDF5 is almost as fast as direct disk read, and even *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to be much faster compared to SQLite and local PostgreSQL databases.
>
> Sounds good ;-)
>
> Cheers,
> Tamas
>
> _______________________________________________
> Hdf-forum is for HDF software users discussion.
> [email protected]
> http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
> Twitter: https://twitter.com/hdf5
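
For anyone unsure what a shuffle filter does before compression, here is a minimal sketch of the idea using only the Python standard library. It is an illustration of byte shuffling in general, not the HDF5 or blosc implementation, and it says nothing about which of the two the PyTables `shuffle` flag actually selects. The function names and the test data are made up for the example, and zlib stands in for the real compressor.

```python
import struct
import zlib


def byte_shuffle(buf: bytes, itemsize: int) -> bytes:
    """Regroup bytes: byte 0 of every element first, then byte 1, and so on."""
    if len(buf) % itemsize:
        raise ValueError("buffer length must be a multiple of itemsize")
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def byte_unshuffle(buf: bytes, itemsize: int) -> bytes:
    """Inverse of byte_shuffle: restore the original element-wise byte order."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# Hypothetical data: 10,000 slowly increasing 32-bit integers (little-endian).
values = list(range(100_000, 110_000))
raw = struct.pack(f"<{len(values)}i", *values)
shuffled = byte_shuffle(raw, 4)

# zlib is only a stand-in compressor here; level 5 mirrors complevel=5 above.
plain_size = len(zlib.compress(raw, 5))
shuffled_size = len(zlib.compress(shuffled, 5))
print(f"compressed without shuffle: {plain_size} bytes")
print(f"compressed with shuffle:    {shuffled_size} bytes")
```

On data like this the shuffled buffer compresses noticeably better, because the high-order bytes of neighbouring values are nearly identical and end up next to each other. That is also why applying the same regrouping twice (once by one library and again by another) would be expected to undo the benefit rather than add to it.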
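
The point about physical layout quoted above (keeping all rows with the same event_id close together) can be sketched with a toy model. This is only an assumption-laden illustration: a contiguous row range is used as a rough stand-in for one disk seek, the table shape (1000 events with 20 rows each) is invented, and real chunked HDF5 I/O is more nuanced than this.

```python
import random
from itertools import groupby


def contiguous_runs(event_ids, wanted):
    """Count the separate contiguous row ranges that hold `wanted`."""
    return sum(1 for key, _ in groupby(event_ids) if key == wanted)


random.seed(0)  # fixed seed so the example is reproducible

# Hypothetical table: 1000 events with 20 rows each, written in arrival order,
# which is modelled here as a random interleaving.
event_ids = [event for event in range(1000) for _ in range(20)]
random.shuffle(event_ids)

scattered = contiguous_runs(event_ids, 42)          # rows as they arrived
clustered = contiguous_runs(sorted(event_ids), 42)  # rows sorted by event_id

print(f"row ranges to read for event 42, arrival order: {scattered}")
print(f"row ranges to read for event 42, sorted layout: {clustered}")
```

With the sorted layout the query touches a single row range, whereas in arrival order the same rows are spread over many separate ranges. An index can tell you where those ranges are, but it does not reduce how many of them have to be read.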
