Oh sure, there is always hope 😉
Ok, so based on your report, I'd suggest trying other codecs inside Blosc. By default the "blosc" filter translates to the "blosc:blosclz" codec internally, but you can also specify "blosc:lz4", "blosc:snappy", "blosc:zlib" and "blosc:zstd". Each codec has its strong and weak points, so my first advice is to experiment with them (especially with "blosc:zstd", which is a surprisingly good newcomer).

Then you should experiment with different chunksizes. If you are using PyTables, make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.

Indexing can make lookups much faster too. Make sure that you create the index with maximum optimization (see http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you. Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.

In general, for understanding how chunksizes, compression and indexing can affect your lookup performance, it is worth having a look at the _Optimization Tips_ chapter of the PyTables User's Guide: http://www.pytables.org/usersguide/optimization.html
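To make that concrete, here is a minimal sketch of how those knobs fit together in PyTables. The file name, table description and chunkshape below are made up purely for illustration, not a recommendation:

    import tables as tb

    # Hypothetical table layout, purely for illustration.
    class Hit(tb.IsDescription):
        event_id = tb.Int64Col()
        time = tb.Float64Col()

    # Pick a Blosc codec explicitly, e.g. zstd.
    filters = tb.Filters(complevel=5, shuffle=True, complib='blosc:zstd')

    with tb.open_file('hits.h5', 'w') as f:
        table = f.create_table('/', 'hits', Hit,
                               filters=filters,
                               chunkshape=(16384,),  # worth experimenting with
                               expectedrows=10_000_000)
        # ... append your rows here ...
        table.flush()
        # Completely sorted index (CSI) on the lookup column,
        # itself compressed with blosc.
        table.cols.event_id.create_csindex(filters=filters)

In h5py, the chunk size would go into the `chunks` argument of `create_dataset` instead.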
Good luck,

Francesc Alted

________________________________
From: Hdf-forum <[email protected]> on behalf of Tamas Gal <[email protected]>
Sent: Friday, March 31, 2017 11:14 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Optimising HDF5 data structure

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
> I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation.

alright, there is hope ;)

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
> Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance.

We experimented with compression levels and libraries and ended up using blosc. This is what we used:

    tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

We also pass the number of expected rows when creating the tables; however, this is a PyTables feature, so there is some magic in the background.

On 31. Mar 2017, at 10:29, Francesc Alted <[email protected]> wrote:
> Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches?

The chunksizes had no significant impact on the performance, but I admit I need to rerun all the performance scripts to show some actual values. The index level is new to me and I need to read up on that, but I think PyTables takes care of it.

Here are some examples comparing the ROOT and HDF5 file formats, reading both with thin C wrappers in Python:

    ROOT_readout.py             5.27s user  3.33s system  153% cpu   5.609 total
    HDF5_big_table_readout.py  17.88s user  4.29s system  105% cpu  21.585 total

On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of information on disk (primary index) should be well-suited for your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be closely located to minimize the number of disk seeks.

OK, this was also my thought. It seems we went in the wrong direction with this indexing and big table thing.
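Maybe the right direction is to keep a second, physically sorted copy of the table, so that rows with the same event_id sit in neighbouring chunks. A quick sketch of what I have in mind in PyTables (untested; it assumes a table `/hits` with an `event_id` column, and `Table.copy(sortby=...)` needs a CSI index on the sort column):

    import tables as tb

    filters = tb.Filters(complevel=5, shuffle=True, complib='blosc')

    with tb.open_file('hits.h5', 'a') as f:
        hits = f.root.hits
        # sortby needs a completely sorted index (CSI) on the column.
        if not hits.cols.event_id.is_indexed:
            hits.cols.event_id.create_csindex(filters=filters)
        # Write a physically sorted copy; queries on event_id then
        # touch far fewer chunks on disk.
        hits.copy(newparent='/', newname='hits_by_event',
                  sortby='event_id', checkCSI=True,
                  filters=filters, overwrite=True)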
On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> If you have several types of typical queries, it might be worth duplicating the information using different physical layouts. This philosophy is utilized to great success in e.g. http://cassandra.apache.org/

Thanks, I will have a look!

On 31. Mar 2017, at 10:36, Андрей Парамонов <[email protected]> wrote:
> From my experience HDF5 is almost as fast as direct disk read, and even *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to be much faster compared to SQLite and local PostgreSQL databases.

Sounds good ;-)

Cheers,
Tamas
