Hi Joe,

You are out of luck here…

The HDF5 TB interface treats a table as a 1-dimensional array of elements that have a 
compound datatype (i.e., the table fields are the fields of the HDF5 compound 
datatype). One cannot have a chunk that is a column of the table; a chunk 
always contains several whole rows of the table. If you want to use PyTables, you may 
take a look at its array object instead of the table and store each column 
independently.
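
To illustrate that alternative: if column-contiguous storage is what matters, each 
column can be written from your producer as its own 1-dimensional chunked dataset 
with the plain C API. A rough sketch (the dataset name, chunk size, and datatype 
below are illustrative, and error checking is omitted):

#include "hdf5.h"

/* Sketch: store one table column as its own 1-D chunked dataset so that
 * the consumer can read the whole column contiguously.
 * The dataset name, chunk size, and datatype are illustrative. */
herr_t write_column_f64(hid_t file_id, const char *col_name,
                        const double *col_data, hsize_t nrows)
{
    hsize_t dims[1]  = { nrows };
    hsize_t chunk[1] = { 1000 };         /* 1000 values per chunk; tune this */
    herr_t  status   = -1;

    hid_t space_id = H5Screate_simple(1, dims, NULL);
    hid_t dcpl_id  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 1, chunk);     /* chunked layout along the row axis */

    hid_t dset_id = H5Dcreate2(file_id, col_name, H5T_NATIVE_DOUBLE, space_id,
                               H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    if (dset_id >= 0) {
        status = H5Dwrite(dset_id, H5T_NATIVE_DOUBLE,
                          H5S_ALL, H5S_ALL, H5P_DEFAULT, col_data);
        H5Dclose(dset_id);
    }

    H5Pclose(dcpl_id);
    H5Sclose(space_id);
    return status;
}

You would call something like this once per column, with the appropriate native 
datatype for each field; the 80-column case then simply becomes 80 small 1-D 
datasets, which PyTables can read as array objects.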

Do you have any data suggesting that reading a column from the table is 
really slow? If so, you should experiment with the chunk size and the chunk cache 
size to tune the performance.
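
For example, if you stay with the table layout, you can enlarge the raw data chunk 
cache on the file access property list and read one field at a time with 
H5TBread_fields_name(). A rough sketch; the table name, field name, and cache 
sizes are only placeholders to start from, not recommendations:

#include "hdf5.h"
#include "hdf5_hl.h"

/* Sketch: open the file with a larger raw-data chunk cache, then read a
 * single field of the table into a packed buffer.
 * "my_table" and "price" are placeholder names. */
int read_one_column(const char *file_name, hsize_t nrecords, double *out)
{
    /* Enlarge the chunk cache: 12421 hash slots, 64 MiB, default preemption.
     * (The second argument is ignored in HDF5 1.8 and later.) */
    hid_t fapl_id = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl_id, 0, 12421, 64 * 1024 * 1024, 0.75);

    hid_t file_id = H5Fopen(file_name, H5F_ACC_RDONLY, fapl_id);

    size_t dst_offset[1] = { 0 };               /* offset of the field in out[i] */
    size_t dst_size[1]   = { sizeof(double) };  /* size of the field in out[i]   */

    herr_t status = H5TBread_fields_name(file_id, "my_table", "price",
                                         0, nrecords, sizeof(double),
                                         dst_offset, dst_size, out);

    H5Fclose(file_id);
    H5Pclose(fapl_id);
    return (int)status;
}

The useful cache size depends directly on your chunk size and access pattern, so 
treat these numbers only as a starting point for experiments.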

Elena
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal  The HDF Group  http://hdfgroup.org
1800 So. Oak St., Suite 203, Champaign IL 61820
217.531.6112
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~




On Sep 2, 2014, at 3:22 PM, J. Lee <[email protected]> wrote:

Hello,

My program that produces the data is written in C++, and I should be able to 
link against the high-level C API library (H5TB).
The table has fields (or columns) of mixed types (e.g., uint64, float, etc.).  
The table can have up to 80 columns and 10000 rows, but it should be able to 
scale to larger dimensions.  The dimensions of the table are fixed for each 
production of the table.

On the consumer side, I’m considering pandas or PyTables.  The program on the 
consumer side needs to apply numeric functions along each column, so storing 
the columns rather than the rows in contiguous space is much more efficient for 
the consumer-side program.   Performance is more critical on the consumer side, as 
it aggregates output from multiple producer programs.

I’m considering the H5TB high-level API, along with the block-write approach 
(i.e., writing a fixed number of rows at a time) proposed by Darryl on this forum: 
http://hdf-forum.184993.n3.nabble.com/hdf-forum-Efficient-Way-to-Write-Compound-Data-td193448.html#a193447.
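
Concretely, what I have in mind on the producer side looks roughly like the sketch 
below (the record layout, names, and block size are only illustrative, not my real 
schema, and error checking is omitted):

#include "hdf5.h"
#include "hdf5_hl.h"

/* Illustrative record layout; the real table has more fields. */
typedef struct {
    unsigned long long id;   /* uint64 field */
    float              val;  /* float field  */
} record_t;

int write_blocks(hid_t file_id, const record_t *rows, hsize_t nrows)
{
    const char *field_names[2]  = { "id", "val" };
    size_t      field_offset[2] = { HOFFSET(record_t, id), HOFFSET(record_t, val) };
    size_t      field_sizes[2]  = { sizeof(rows[0].id), sizeof(rows[0].val) };
    hid_t       field_types[2]  = { H5T_NATIVE_ULLONG, H5T_NATIVE_FLOAT };
    hsize_t     chunk_size      = 1000;   /* records per chunk */
    hsize_t     block           = 1000;   /* records appended per call */

    /* Create an empty table, then append fixed-size blocks of rows. */
    H5TBmake_table("my table", file_id, "my_table", 2, 0, sizeof(record_t),
                   field_names, field_offset, field_types, chunk_size,
                   NULL, 0, NULL);

    for (hsize_t start = 0; start < nrows; start += block) {
        hsize_t n = (nrows - start < block) ? (nrows - start) : block;
        H5TBappend_records(file_id, "my_table", n, sizeof(record_t),
                           field_offset, field_sizes, &rows[start]);
    }
    return 0;
}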

When writing to the file, I would like each block’s columns, rather than its rows, 
to be stored contiguously in the HDF5 file, assuming that this will 
help with performance when PyTables on the consumer side accesses the table by 
column (not by row).

The example code on chunking at 
http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/ shows that chunk_dims[2] 
has two elements.   For example, if a block has 1000 rows, I would use 
chunk_dims[2] = {1000, 1} so that the 1000 rows of each column are stored in a 
contiguous piece of memory.
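
In other words, with the plain dataset API (on a homogeneous 2-D dataset, not a 
compound table) the intent would look roughly like this; the sizes and names are 
illustrative:

#include "hdf5.h"

/* Sketch of the intent: a 2-D dataset chunked as 1000 rows x 1 column, so
 * each chunk holds a column-contiguous block. This works for a homogeneous
 * datatype; the question is how to get the same effect for a table. */
hid_t create_column_chunked_dataset(hid_t file_id)
{
    hsize_t dims[2]       = { 10000, 80 };  /* rows x columns, illustrative */
    hsize_t chunk_dims[2] = { 1000, 1 };    /* 1000 rows of a single column */

    hid_t space_id = H5Screate_simple(2, dims, NULL);
    hid_t dcpl_id  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl_id, 2, chunk_dims);

    hid_t dset_id = H5Dcreate2(file_id, "matrix", H5T_NATIVE_DOUBLE, space_id,
                               H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
    H5Pclose(dcpl_id);
    H5Sclose(space_id);
    return dset_id;
}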

Does H5TBmake_table() support such a chunking dimension, and if so, what syntax 
would I use?

Thanks!
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
