Re: Non-chunked large files / hdf5 support

2019-12-19 Thread Wes McKinney
On Tue, Dec 17, 2019 at 5:15 AM Maarten Breddels wrote:
> Hi,
>
> I had to catch up a bit with the Arrow documentation before I could respond properly. My fear was that Arrow demanded that the in-memory representation was always 'packed', or 'flat'. After going through the docs, it seems that only when doing IPC or stream writing is it written in this form.

Re: Non-chunked large files / hdf5 support

2019-12-17 Thread Maarten Breddels
Hi, I had to catch up a bit with the Arrow documentation before I could respond properly. My fear was that Arrow demanded that the in-memory representation was always 'packed', or 'flat'. After going through the docs, it seems that only when doing IPC or stream writing is it written in this form.
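
A minimal pyarrow sketch (not from the thread; the file name is illustrative) of the distinction Maarten draws: a Table column can remain a multi-chunk ChunkedArray in memory, and only the IPC writer serializes the buffers into their packed form.

    import pyarrow as pa

    # Two independently allocated chunks; nothing forces them to be
    # contiguous in memory.
    chunked = pa.chunked_array([pa.array([1.0, 2.0]), pa.array([3.0, 4.0])])
    table = pa.table({"x": chunked})
    assert table.column("x").num_chunks == 2  # still chunked in memory

    # Only at IPC/stream-writing time are the buffers written out packed.
    with pa.OSFile("example.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)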

Re: Non-chunked large files / hdf5 support

2019-11-27 Thread Wes McKinney
Hi, there have been a number of discussions over the years about on-disk pre-allocation strategies. No volunteers have implemented anything, though. Developing an HDF5 integration library with pre-allocation and buffer management utilities seems like a reasonable growth area for the project. …
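
As a sketch of what such a pre-allocation utility might look like (my illustration, not an existing Arrow API; file, dataset, and size names are made up), h5py can pre-allocate a contiguous, non-chunked dataset and then fill it slice by slice:

    import h5py
    import numpy as np

    n_rows = 1_000_000  # row count known up front
    batch = 100_000     # size of each incremental write

    with h5py.File("columns.h5", "w") as f:
        # No chunks= argument, so HDF5 uses a contiguous layout:
        # one flat, pre-sized region per column on disk.
        col = f.create_dataset("x", shape=(n_rows,), dtype="f8")
        for start in range(0, n_rows, batch):
            stop = min(start + batch, n_rows)
            col[start:stop] = np.random.random(stop - start)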

Re: Non-chunked large files / hdf5 support

2019-11-26 Thread Francois Saint-Jacques
Hello Maarten, In theory, you could provide a custom mmap allocator and use the builder facility. Since the array is still in its "build" phase and not sealed, it should be fine if mremap changes the pointer address. This might fail in practice, since the allocator is also used for auxiliary data, e.g. …
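
A small Python sketch of the mechanism Francois is relying on (my illustration; the scratch file name is made up). On Linux, mmap.resize() is implemented with mremap(), which preserves the contents but is free to move the mapping, which is exactly why raw pointers captured before the resize would dangle:

    import mmap

    with open("scratch.bin", "w+b") as f:
        f.truncate(4096)
        m = mmap.mmap(f.fileno(), 4096)
        m[:5] = b"hello"
        # Grow the mapping, as a builder's allocator would on append.
        # mremap() may return a new virtual address; any raw pointer
        # taken earlier (e.g. for auxiliary data structures) is now stale.
        m.resize(1 << 20)
        assert m[:5] == b"hello"  # contents move with the mapping
        m.close()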

Non-chunked large files / hdf5 support

2019-11-26 Thread Maarten Breddels
In vaex I always write the data to hdf5 as one large chunk (per column). The reason is that this allows the mmapped columns to be exposed as a single numpy array (talking about numerical data only for now), which many people are quite comfortable with. The strategy for vaex to write unchunked data is to fi…
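
A sketch of the pattern Maarten describes (my illustration; file and column names are made up). Because a contiguous HDF5 dataset is a single flat block in the file, h5py's low-level API can report its byte offset, and numpy can memory-map that region as one ordinary array:

    import h5py
    import numpy as np

    # Write one column as a single contiguous (non-chunked) dataset.
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("x", data=np.arange(1_000_000, dtype="f8"))

    # Re-expose the column as a plain numpy array backed by mmap.
    with h5py.File("data.h5", "r") as f:
        ds = f["x"]
        offset = ds.id.get_offset()  # byte offset of the raw data; only
                                     # defined for contiguous datasets
        dtype, shape = ds.dtype, ds.shape
    x = np.memmap("data.h5", dtype=dtype, mode="r", offset=offset, shape=shape)
    print(x[:5])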