In vaex I always write the data to hdf5 as 1 large chunk (per column).
The reason is that it allows the mmapped columns to be exposed as a
single numpy array (talking numerical data only for now), which many
people are quite comfortable with.
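
For the curious, the reading side looks roughly like this (a minimal
sketch; the file name and column path are made up, and it assumes the
dataset is stored contiguously, i.e. without hdf5 chunking or
compression, so get_offset() returns a valid offset):

    import h5py
    import numpy as np

    with h5py.File("data.hdf5", "r") as f:
        ds = f["/table/columns/x/data"]   # hypothetical column path
        offset = ds.id.get_offset()       # byte offset of the raw data
        dtype, shape = ds.dtype, ds.shape

    # a single plain numpy array backed by the file, no copy
    column = np.memmap("data.hdf5", dtype=dtype, mode="r",
                       offset=offset, shape=shape)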

The strategy for vaex to write unchunked data is to first create an
'empty' hdf5 file (filled with zeros), mmap those huge arrays, and
write to them in chunks.
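
In code, the export strategy is roughly this (an untested sketch; the
arange call stands in for whatever actually computes each chunk):

    import h5py
    import numpy as np

    n = 10_000_000
    with h5py.File("out.hdf5", "w") as f:
        ds = f.create_dataset("x", shape=(n,), dtype="f8")  # contiguous layout
        ds[...] = 0.0                    # the 'empty', zero-filled file
        offset = ds.id.get_offset()

    # writable numpy view on the file; fill it chunk by chunk
    out = np.memmap("out.hdf5", dtype="f8", mode="r+",
                    offset=offset, shape=(n,))
    step = 1_000_000
    for i in range(0, n, step):
        out[i:i + step] = np.arange(i, i + step, dtype="f8")  # stand-in computation
    out.flush()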

This means that in vaex I need to support mutable data (only used
internally; vaex's default is immutable data, like arrow), since I need
to write to the memory-mapped data. It also makes the exporting code
relatively simple.

I could not find a way in Arrow to get something similar done, at
least not without having a single pa.array instance for each column. I
think Arrow's mindset is that you should just use chunks, right? Or is
this also something that can be considered for Arrow?
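
For reference, the closest I get is wrapping the memory-mapped numpy
array in a single pa.array, which as far as I know should be zero-copy
for primitive dtypes without nulls (a minimal sketch; the offset and
length here are made-up values standing in for the ones from the
reading sketch above):

    import numpy as np
    import pyarrow as pa

    column = np.memmap("data.hdf5", dtype="f8", mode="r",
                       offset=2048, shape=(10_000_000,))
    single = pa.array(column)          # one pa.Array, not chunked
    table = pa.table({"x": single})
    assert table.column("x").num_chunks == 1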

An alternative would be to implement Arrow in hdf5, which I basically
do now in vaex (with limited support). Again, I'm wondering whether
there is interest from the Arrow community in storing arrow data in
hdf5?
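
To make that concrete: for a primitive column this could be as simple
as persisting the values buffer plus some type metadata (a rough sketch
only; validity bitmaps, offsets, and nested types would need real
design work):

    import h5py
    import numpy as np
    import pyarrow as pa

    arr = pa.array(np.arange(10, dtype="f8"))
    # buffers()[0] is the validity bitmap (None here), buffers()[1] the values
    values = np.frombuffer(arr.buffers()[1], dtype="f8", count=len(arr))
    with h5py.File("arrow.hdf5", "w") as f:
        ds = f.create_dataset("x", data=values)
        ds.attrs["arrow_type"] = str(arr.type)   # 'double'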

cheers,

Maarten
