Some background information about HDF5 and chunking:

HDF5 supports chunking of datasets. If you look at an HDF5 file via h5ls or
h5dump, you should see the chunk size. "Chunking" means that the data are
not stored contiguously, but in chunks of a certain size (e.g. 100x100
elements); each chunk is stored contiguously, but the chunks themselves are
stored (I think) in a B-tree. To create chunks, one needs to request this
explicitly when the dataset is created; Julia's HDF5 module supports this
(I don't know whether this is passed through to JLD). There are also ways
to post-process HDF5 files via command-line tools such as "h5repack" to add
chunking or compression. If data are not chunked, then it matters a great
deal whether you read the data by rows or by columns. (I assume you know
this since you asked about chunking.)

I don't think there is a built-in way to iterate over a matrix in an HDF5
file chunk by chunk, so you'll probably have to roll your own. After
obtaining the chunk size of a dataset, you would use the "getindex"
function on the HDF5 dataset to read each chunk; something along the lines
of the sketch below should work.
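Here is a rough, untested sketch of such a blocked matrix-vector product
over an HDF5 dataset. "chunked_matvec" and "blocksize" are names I made up,
and blocksize should ideally equal the dataset's chunk width (which you can
read off h5ls/h5dump as described above):

    using HDF5

    # Multiply the on-disk matrix stored in dataset `name` by the in-memory
    # vector v, reading the matrix in column blocks of width `blocksize`
    # (ideally the chunk width, so each read maps onto whole chunks).
    function chunked_matvec(filename::AbstractString, name::AbstractString,
                            v::AbstractVector, blocksize::Integer)
        h5open(filename, "r") do file
            dset = file[name]
            m, n = size(dset)
            length(v) == n ||
                throw(DimensionMismatch("matrix is $m x $n, vector has length $(length(v))"))
            y = zeros(m)
            for j in 1:blocksize:n
                cols = j:min(j + blocksize - 1, n)
                # getindex on the dataset reads only this block from disk
                block = dset[:, cols]
                y .+= block * v[cols]
            end
            return y
        end
    end

If you have several vectors v[1]..v[n], it would be better to put them into
the columns of a matrix V and accumulate block * V[cols, :] inside the same
loop, so that each block is read from disk only once instead of once per
vector.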

-erik


On Fri, Sep 25, 2015 at 5:30 PM, Jim Garrison <j...@garrison.cc> wrote:

> I have a very large, dense matrix (too large to fit in RAM) "A" saved in
> an HDF5/JLD file.  I also have several vectors of the same dimension
> v[1]..v[n], which *are* able to fit in memory, and I would like to know
> the result of multiplying each vector by my matrix A.
>
> My current idea is to do this manually -- simply loading each row or
> column from the matrix dataset, one at a time, and implementing
> matrix-vector multiplication myself.  However, it is possible that my
> large matrix is already stored in some sort of "chunked"/block form in
> the HDF5 file, and it would be nice to choose my blocks accordingly so
> the calculation happens as efficiently as possible.  Is there a way to
> load all blocks of an HDF5 dataset in the most efficient way?
>
> In terms of the bigger picture, I've also considered that it might be
> nice to implement in JLD all general matrix-matrix and matrix-vector
> operations for JldDataset (which would throw an error when the
> dimensions of the objects do not match, or when the data types do not
> make sense for multiplication).  But I could also see this being an
> unwelcome "feature," as it would be quite easy to accidentally call this
> even when it is possible to load the matrix into memory.  (Also, it
> would not be the most efficient way to handle my problem, as it would
> involve loading the dataset n times, once for each vector v[i].)
>
>


-- 
Erik Schnetter <schnet...@gmail.com>
http://www.perimeterinstitute.ca/personal/eschnetter/
