Re: [Hdf-forum] Optimising HDF5 data structure

Francesc Altet Fri, 31 Mar 2017 08:38:24 -0700

Indeed, indexing is a PyTables feature, so if you want to use HDF5 with other 
interfaces, it is better not to rely on it.

Francesc Alted

________________________________
From: Hdf-forum <[email protected]> on behalf of Tamas Gal 
<[email protected]>
Sent: Friday, March 31, 2017 5:10:45 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] Optimising HDF5 data structure

Thanks for all the feedback so far!

On 31. Mar 2017, at 13:49, Francesc Altet 
<[email protected]<mailto:[email protected]>> wrote:
Then, you should experiment with different chunksizes.  If you are using 
PyTables, then make sure to pass them in the `chunkshape` parameter, whereas 
h5py uses `chunks`.
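
A minimal sketch of the two parameter names mentioned above. The record
layout, the file names and the chunk size of 4096 rows are assumptions chosen
for illustration, not recommendations; both files are kept in memory so
nothing is written to disk.

```python
import numpy as np
import h5py
import tables

# Hypothetical event record; field names are illustrative only.
dtype = np.dtype([("event_id", "i8"), ("time", "f8")])
rows_per_chunk = 4096  # a value to experiment with, not a recommendation

# h5py: the chunk shape is passed as `chunks`.
with h5py.File("chunks_h5py.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "hits",
        shape=(0,),
        maxshape=(None,),  # resizable, so the dataset must be chunked
        dtype=dtype,
        chunks=(rows_per_chunk,),
    )
    h5py_chunks = dset.chunks

# PyTables: the chunk shape is passed as `chunkshape`.
with tables.open_file(
    "chunks_pytables.h5", "w", driver="H5FD_CORE", driver_core_backing_store=0
) as f:
    table = f.create_table(
        "/", "hits", description=dtype, chunkshape=(rows_per_chunk,)
    )
    pytables_chunks = table.chunkshape

print(h5py_chunks, pytables_chunks)
```

Timing the same read pattern for a few different values of `rows_per_chunk`
is one way to run the experiment suggested above.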
[...]
Indexing can make lookups much faster too.  Make sure that you create the index 
with maximum optimization (see 
http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex)
before deciding that it is not for you.  Also, using blosc (+ a suitable 
codec) when creating the index can usually accelerate things quite a bit.
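
A minimal sketch of the suggestion above, assuming a hypothetical `hits` table
with an `event_id` column. The compression level and the `blosc:lz4` codec are
illustrative choices, not tuned values, and the file is kept in memory.

```python
import numpy as np
import tables

# Hypothetical data: 100 events with 10 rows each.
dtype = np.dtype([("event_id", "i8"), ("time", "f8")])
data = np.zeros(1000, dtype=dtype)
data["event_id"] = np.repeat(np.arange(100), 10)
data["time"] = np.arange(1000, dtype="f8")

# Assumed filter settings for the index: blosc with the lz4 codec.
index_filters = tables.Filters(complevel=5, complib="blosc:lz4")

with tables.open_file(
    "indexed.h5", "w", driver="H5FD_CORE", driver_core_backing_store=0
) as f:
    table = f.create_table("/", "hits", description=dtype)
    table.append(data)
    table.flush()

    # Completely sorted index, i.e. the maximum optimization level.
    table.cols.event_id.create_csindex(filters=index_filters)
    is_indexed = table.cols.event_id.is_indexed

    # In-kernel query on the indexed column.
    rows = table.read_where("event_id == 42")

print(is_indexed, len(rows))
```

Whether the index pays off depends on the table size and the query pattern,
so it is worth timing `read_where` with and without it on real data.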

Alright, I will study that extensively. :)

I am just curious whether I am tying the HDF5 format too much to the PyTables 
framework. We also use other languages and, as far as I understand, PyTables 
creates some hidden tables to do all the magic behind the scenes. Or are these 
commonly supported HDF5 features? (sorry for the dumb question)

On 31. Mar 2017, at 10:47, Adrien Devresse 
<[email protected]<mailto:[email protected]>> wrote:
It is a pleasure to see some HEP people here.

Thanks, glad to hear ;)

On 31. Mar 2017, at 10:47, Adrien Devresse 
<[email protected]<mailto:[email protected]>> wrote:
The slowness is compared to a ROOT structure which is also used in parallel. If 
I compare some basic event-by-event analysis, the same code run on a ROOT file 
is almost an order of magnitude faster.

If I remember correctly, ROOT can only read in parallel, not write. Does
it matter for you?

Ahm, with "used in parallel" I was referring to the fact that we use both ROOT 
and HDF5 files in our collaboration. There is a generation conflict between the 
two "frameworks" as you may imagine. Younger people refuse to use ROOT (for 
good reasons, but that's another story). That's why I maintain this branch in 
parallel.

On 31. Mar 2017, at 10:47, Adrien Devresse 
<[email protected]<mailto:[email protected]>> wrote:
This approach would create a large number of datasets (one per id),
which is, in my experience, a bad idea in HDF5.

Yes, this is kind of the problem with the second approach. h5py is extremely 
fast when iterating, whereas PyTables takes 50 times longer using the very same 
code (a for loop and direct access to the nodes). People also use other 
frameworks, so I fear there might be huge performance variations, which of 
course is not user-friendly at all.

On 31. Mar 2017, at 10:47, Adrien Devresse 
<[email protected]<mailto:[email protected]>> wrote:
I would use Approach #1 and store all your events in a "column" fashion
similar to what ROOT does.

For the fast querying problem, you can post-process your file and add a
separate column acting as an ordered index / associative array with a
layout of the type "event_id" -> "range row"
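
The "event_id" -> "range row" layout described above could be sketched as
follows. This is only an illustration in plain Python: the function name
`build_row_ranges` and the sample columns are hypothetical, and it assumes the
rows of each event are stored contiguously.

```python
from itertools import groupby


def build_row_ranges(event_ids):
    """Map each event_id to its (start, stop) row range.

    Assumes all rows of one event are contiguous in the column.
    """
    ranges = {}
    start = 0
    for event_id, group in groupby(event_ids):
        stop = start + sum(1 for _ in group)
        if event_id in ranges:
            raise ValueError(f"rows of event {event_id} are not contiguous")
        ranges[event_id] = (start, stop)
        start = stop
    return ranges


# Hypothetical columns, stored side by side in a "column" fashion.
event_id_column = [7, 7, 7, 9, 12, 12]
time_column = [0.1, 0.2, 0.3, 1.5, 2.0, 2.1]

# Post-processing step: build the index once.
row_ranges = build_row_ranges(event_id_column)

# Lookup: one contiguous slice per event instead of a scan over all rows.
start, stop = row_ranges[12]
hits_of_event_12 = time_column[start:stop]
print(row_ranges, hits_of_event_12)
```

In an HDF5 file the same mapping could be stored as a small extra dataset
holding the event id, start and stop of each event, so that any interface can
read an event with a single contiguous slice.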

I see... So there might be a well-suited set of chunk/index parameters which 
could improve the speed of that structure. I need to dig deeper then.

Cheers,
Tamas

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
