Re: [Hdf-forum] Fast Sparse Matrix Products by Finding Allocated Chunks

Miller, Mark C. Wed, 12 Aug 2015 11:42:10 -0700

Ok, you might have a look at the implementation of the h5dump.c or h5ls.c tools,
because I have some vague recollection that those tools are able to tease out
of the file *some* information regarding block storage of datasets. I don't
know if they get at the key information you want (which blocks are empty), but
some of what those tools do might get you closer.


Mark


From: Hdf-forum 
<[email protected]<mailto:[email protected]>>
 on behalf of Aidan Macdonald 
<[email protected]<mailto:[email protected]>>
Reply-To: HDF Users Discussion List 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, August 12, 2015 11:35 AM
To: HDF Users Discussion List 
<[email protected]<mailto:[email protected]>>
Subject: Re: [Hdf-forum] Fast Sparse Matrix Products by Finding Allocated Chunks

> Is this what you mean by a 'sparse format'?

Yes, exactly.

> However, I am not sure why you need to know how HDF5 has handled the chunks
> *in*the*file, unless you are attempting to write an out-of-core matrix multiply.

Yes, I am trying to write an out-of-core matrix multiply.

> I think you can easily determine which blocks are 'empty' by examining a block
> you've read into memory for all fill value or not. Any block which consists
> entirely of fill-value is, of course, an empty block. And, then you can use
> that information to help bootstrap your sparse matrix multiply. So, you could
> maybe read the matrix several blocks at a time, rather than all at once,
> examining returned blocks for all-fill-value or not and then building up your
> sparse in memory representation from that. If you read the matrix in one
> H5Dread call, however, then you'd wind up with a fully instantiated matrix with
> many fill values in memory *before* you could begin to reduce that storage
> to a sparse format.

I think, but can't prove, that if I did the check, I would spend more CPU
cycles than I would save, because I need to read each block, check it, and
then either discard it or multiply it out. I was hoping for something that
could give me a list of the allocated chunks; then I could do a "dictionary of
keys"<https://en.wikipedia.org/wiki/Sparse_matrix#Dictionary_of_keys_.28DOK.29>
block matrix multiplication.

According to Table 15 
here<https://www.hdfgroup.org/HDF5/doc/UG/10_Datasets.html>, if the space is 
not allocated, then an error is thrown. So perhaps this error will be faster 
than actually reading the data from disk. But to do that, I need the fill_value 
undefined. I was hoping for a better way to see if the chunk is actually 
allocated.

The best way in my mind is to get some sort of list of all the chunks that are
allocated, or an iterator that goes through them; then I can do the block
matrix multiplication.
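
To make the "dictionary of keys" idea concrete, here is a rough sketch of a
block matrix product that only touches blocks present in the dictionaries. It
is plain Python with no HDF5 calls; the name block_matmul and the
{(block_row, block_col): block} layout are assumptions for illustration, and
the keys stand in for whatever list of allocated chunks could be obtained.

```python
from collections import defaultdict


def block_matmul(a_blocks, b_blocks):
    """Multiply block-sparse matrices stored as {(block_row, block_col): block}.

    Each block is a dense list of rows; a missing key means an all-zero block.
    """
    # Group B's blocks by block row so each A block only meets its partners.
    b_by_row = defaultdict(list)
    for (k, j), b in b_blocks.items():
        b_by_row[k].append((j, b))

    c_blocks = {}
    for (i, k), a in a_blocks.items():
        for j, b in b_by_row.get(k, ()):
            n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
            # Create the output block on first use, then accumulate into it.
            c = c_blocks.setdefault((i, j), [[0] * n_cols for _ in range(n_rows)])
            for r in range(n_rows):
                for t in range(n_inner):
                    a_rt = a[r][t]
                    if a_rt == 0:
                        continue  # skip zero entries inside a stored block
                    for s in range(n_cols):
                        c[r][s] += a_rt * b[t][s]
    return c_blocks


# Two 4x4 matrices split into 2x2 blocks, each with two stored blocks.
A = {(0, 0): [[1, 2], [3, 4]], (1, 1): [[1, 0], [0, 1]]}
B = {(0, 1): [[5, 6], [7, 8]], (1, 0): [[2, 0], [0, 2]]}
C = block_matmul(A, B)
```

Only block pairs whose inner block index matches are multiplied, so the work
scales with the number of stored blocks rather than with the full matrix size.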

Aidan Plenert Macdonald
Website<http://acsweb.ucsd.edu/~amacdona/>

On Wed, Aug 12, 2015 at 10:16 AM, Miller, Mark C. 
<[email protected]<mailto:[email protected]>> wrote:
Have a look at this reference . . .

http://www.hdfgroup.org/HDF5/doc_resource/H5Fill_Values.html

as well as documentation on H5Pset_fill_value and H5Pset_fill_time.

I have a vague recollection that if you create a large, chunked dataset but 
then only write to certain parts of it, HDF5 is smart enough to store only 
those chunks in the file that actually have non-fill values within them. The 
above ref seems to be consistent with this (except in parallel I/O settings).

Is this what you mean by a 'sparse format'?

However, I am not sure why you need to know how HDF5 has handled the chunks 
*in*the*file, unless you are attempting to write an out-of-core matrix multiply.

I think you can easily determine which blocks are 'empty' by examining a block 
you've read into memory for all fill value or not. Any block which consists 
entirely of fill-value is, of course, an empty block. And, then you can use 
that information to help bootstrap your sparse matrix multiply. So, you could 
maybe read the matrix several blocks at a time, rather than all at once, 
examining returned blocks for all-fill-value or not and then building up your 
sparse in memory representation from that. If you read the matrix in one 
H5Dread call, however, then you'd wind up with a fully instantiated matrix with
many fill values in memory *before* you could begin to reduce that storage
to a sparse format.
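
A minimal sketch of that block-at-a-time scan, in plain Python with no HDF5
calls. The names collect_nonempty_blocks and read_block are hypothetical;
read_block(i, j) stands in for whatever performs the actual read of one block
(with HDF5 it would presumably wrap a hyperslab selection passed to H5Dread),
and the fill value is assumed to be 0.

```python
def collect_nonempty_blocks(read_block, n_block_rows, n_block_cols, fill_value=0):
    """Return {(block_row, block_col): block} for blocks holding non-fill data.

    read_block(i, j) is a hypothetical callable returning one block as a list
    of rows; only one block is held in memory while it is being checked.
    """
    blocks = {}
    for i in range(n_block_rows):
        for j in range(n_block_cols):
            block = read_block(i, j)
            # A block consisting entirely of the fill value counts as empty.
            if any(v != fill_value for row in block for v in row):
                blocks[(i, j)] = block
    return blocks


# Stand-in for a dataset on disk: a 4x4 matrix read as 2x2 blocks.
dense = [
    [1, 2, 0, 0],
    [3, 4, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
]


def read_from_dense(i, j, size=2):
    """Slice block (i, j) out of the in-memory matrix."""
    return [row[j * size:(j + 1) * size] for row in dense[i * size:(i + 1) * size]]


sparse = collect_nonempty_blocks(read_from_dense, 2, 2)
```

Every block is still read and compared once, which is the cost this approach
carries; what it avoids is holding the fully instantiated matrix in memory.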

I wonder if it might be possible to write your own custom 'filter', applied
during H5Dread, that would do all this for you as chunks are read from the
file. It might be.

Mark



From: Hdf-forum 
<[email protected]<mailto:[email protected]>>
 on behalf of Aidan Macdonald 
<[email protected]<mailto:[email protected]>>
Reply-To: HDF Users Discussion List 
<[email protected]<mailto:[email protected]>>
Date: Wednesday, August 12, 2015 9:05 AM
To: "[email protected]<mailto:[email protected]>" 
<[email protected]<mailto:[email protected]>>
Subject: [Hdf-forum] Fast Sparse Matrix Products by Finding Allocated Chunks

Hi,

I am using HDF5 through Python's h5py, but I am planning on moving to C/C++.

I am using HDF5 to store sparse matrices which I need to do matrix products on. 
I am using chunked storage which 'appears' to be storing the data in a block 
sparse format. PLEASE CONFIRM that this is true. I couldn't find documentation 
stating this to be true, but by looking at file sizes during data loading, my 
block sparse assumption seemed to be true.

I would like to matrix multiply and use the sparsity of the data to make it go 
faster. I can handle the algorithmic aspect, but I can't figure out how to see 
which chunks are allocated so I can iterate over these.

If there is a better way to go about this (existing code!), please let me know.
I am new to HDF5, and thoroughly impressed.

Thank you,

Aidan Plenert Macdonald
Website<http://acsweb.ucsd.edu/~amacdona/>

_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]<mailto:[email protected]>
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5
