Indeed, indexing is a PyTables feature, so if you want to use HDF5 with other interfaces, then it is better not to rely on it.
Francesc Alted

________________________________
From: Hdf-forum on behalf of Tamas Gal
Sent: Friday, March 31, 2017 5:10:45 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Optimising HDF5 data structure

Thanks for all the feedback so far!

On 31. Mar 2017, at 13:49, Francesc Altet wrote:
> Then, you should experiment with different chunk sizes. If you are using PyTables, then make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.
> [...]
> Indexing can make lookups much faster too. Make sure that you create the index with maximum optimization (look for http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you. Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.

Alright, I will study that extensively. :)

I am just curious whether I am tying the HDF5 format too much to the pytables framework. We also use other languages and, as far as I understand, pytables creates some hidden tables to do all the magic behind the scenes. Or are these commonly supported HDF5 features? (sorry for the dumb question)

On 31. Mar 2017, at 10:47, Adrien Devresse wrote:
> It is a pleasure to see some HEP people here.

Thanks, glad to hear ;)

On 31. Mar 2017, at 10:47, Adrien Devresse wrote:
>> The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.
> If I remember properly, ROOT can only read in parallel, no write. Does it matter for you?

Ahm, with "used in parallel" I was referring to the fact that we use both ROOT and HDF5 files in our collaboration. There is a generation conflict between the two "frameworks" as you may imagine. Younger people refuse to use ROOT (for good reasons, but that's another story). That's why I maintain this branch in parallel.

On 31. Mar 2017, at 10:47, Adrien Devresse wrote:
> This approach would create a large number of datasets (one per id), which is, from my experience, a bad idea in HDF5

Yes, this is kind of the problem with the second approach. h5py is extremely fast when iterating whereas pytables takes 50 times longer using the very same code (a for loop and direct access to the nodes). And there are people using other frameworks, so I fear there might be some huge performance variations, which of course is not user friendly at all.

On 31. Mar 2017, at 10:47, Adrien Devresse wrote:
> I would use Approach #1 and store all your events in a "column" fashion similar to what ROOT does. For the fast querying problem, you can post-process your file and add a separate column acting as an ordered index / associative array with a layout of the type "event_id" -> "range row"

I see... So there might be a well suited set of chunk/index parameters which could improve the speed of that structure. I need to dig deeper then.

Cheers,
Tamas
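
For illustration, the "event_id" -> "range row" layout suggested in the quoted message could look roughly like the sketch below. It is pure Python with no HDF5 library involved, and it assumes that all rows of one event are stored contiguously. The names `build_event_index`, `event_id_column` and `time_column` are made up for this example and do not come from the thread or from any library.

```python
def build_event_index(event_ids):
    """Map each event_id to its half-open row range (start, stop).

    Assumption: rows belonging to the same event are contiguous.
    If an id reappears later, its last run overwrites the earlier one.
    """
    index = {}
    start = 0
    for row, event_id in enumerate(event_ids):
        # A change of id closes the run of the previous event.
        if row > 0 and event_id != event_ids[row - 1]:
            index[event_ids[row - 1]] = (start, row)
            start = row
    if event_ids:
        # Close the final run.
        index[event_ids[-1]] = (start, len(event_ids))
    return index


# Hypothetical columns: one entry per hit, stored event after event.
event_id_column = [0, 0, 0, 1, 1, 4, 4, 4, 4]
time_column = [10.0, 11.5, 12.1, 3.2, 3.9, 7.0, 7.1, 7.4, 8.8]

index = build_event_index(event_id_column)

# One dictionary lookup, then one contiguous slice per column.
start, stop = index[4]
hits_of_event_4 = time_column[start:stop]
```

In a real file the index would presumably be written once, in a post-processing step, as an ordinary dataset with (event_id, start, stop) entries. Any HDF5 binding could then read it and slice the data columns, without depending on PyTables-specific index structures.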
_______________________________________________ Hdf-forum is for HDF software users discussion. [email protected] http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org Twitter: https://twitter.com/hdf5
