Motivation:

We have memory-mappable Arrow IPC files with N batches where column(s) are
sorted to support binary search.  A single search over n rows costs log2(n)
probes, whereas searching two batches of n/2 rows costs 2*log2(n/2) =
2*log2(n) - 2 probes, and since a binary search is required on each batch
we prefer the batches to be as large as possible to reduce total search
time, perhaps larger than available RAM.  On the read side, only the pages
needed for the search bisections and the subsequent slice traversal are
mapped in, of course.

The question then becomes how to create large IPC-format files whose
individual batches never exist in RAM first, because of their size.

Conceptually, this would seem to entail:
* allocating a fixed mmap'd area to write into
* using builders to create buffers at the locations they would end up at
in the IPC format, and freezing these as arrays (if I understand the
terminology correctly; see the sketch after this list)
* plopping in various other things such as metadata, schema, etc.
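
For the second bullet, here is roughly what I mean (an untested sketch; the
function and the `region`/`n_values` arguments are placeholders for wherever
the IPC layout says the values buffer should live): wrap a caller-chosen
address inside the mmap'd area as an arrow::MutableBuffer, fill it in place,
and then freeze it into an array with ArrayData::Make / MakeArray instead of
letting a builder allocate its own memory:

#include <arrow/api.h>

// Untested sketch: wrap caller-managed memory (e.g. a slice of the mmap'd
// file) as an Arrow buffer and freeze it into an array without copying.
arrow::Result<std::shared_ptr<arrow::Array>> FreezeInt64Region(
    uint8_t* region, int64_t n_values) {
  // The values buffer sits at a caller-chosen address instead of being
  // allocated by a builder.
  auto values = std::make_shared<arrow::MutableBuffer>(
      region, n_values * static_cast<int64_t>(sizeof(int64_t)));

  // Fill the buffer in place (a real writer would append incrementally).
  auto* out = reinterpret_cast<int64_t*>(values->mutable_data());
  for (int64_t i = 0; i < n_values; ++i) out[i] = i;

  // No validity bitmap, so every value is non-null.
  auto data = arrow::ArrayData::Make(arrow::int64(), n_values,
                                     {nullptr, values}, /*null_count=*/0);
  return arrow::MakeArray(data);
}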

One difficulty is that we want to size this area without having first
analyzed the data to be written into it, since such an analysis consumes
compute resources.  Therefore the area set aside for (e.g.) a
variable-length string column would be a guess based on statistics, and we
would want to just write the column buffers until the first one fills up,
which may leave the others (or itself) partially unpopulated.
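
To make the guess concrete, the reservation might be something like rows *
(expected average length * safety factor); a trivial illustration (the
numbers and the helper name are made up, not from any real dataset):

#include <cstdint>

// Illustration only: reserve space for a utf8 column from rough statistics
// rather than from a prior scan of the data.  avg_len and safety are guesses.
int64_t EstimateUtf8Bytes(int64_t expected_rows, double avg_len, double safety) {
  // int32 offsets buffer plus the variable-length data buffer
  int64_t offsets = (expected_rows + 1) * static_cast<int64_t>(sizeof(int32_t));
  int64_t data = static_cast<int64_t>(expected_rows * avg_len * safety);
  return offsets + data;
}

// e.g. EstimateUtf8Bytes(100000000, 24.0, 1.25) reserves ~3.4 GB, and
// writing stops when either the offsets region or the data region fills up.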

This could result in some "wasted space" in the file, which is a tradeoff
we can live with for the above reasons.  That brings me back to
https://issues.apache.org/jira/browse/ARROW-5916 where this was discussed
before (and another discussion is linked there).  The idea was to allow
record batch lengths to be smaller than the associated buffer lengths,
which seemed like an easy change at the time... although I'll grant that we
only use trivial Arrow types, and in more complex cases there may be
side effects I can't envision.

One of the ideas was to go ahead and fill in the buffers to create a valid
record batch, but then store the sliced-down size in (e.g.) the
user-defined metadata; this forces anyone using the IPC file to use a
non-standard mechanism to reject the "data" that fills the unpopulated
buffer sections.
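
For illustration, this is roughly what that workaround looks like with
schema-level key/value metadata (the "logical_length" key and the helper
names are invented; per-batch custom metadata may or may not be exposed,
depending on the binding):

#include <arrow/api.h>
#include <string>

// Sketch of the metadata workaround.  Writer side: record the real row
// count; reader side: slice the padded batch back down before using it.
std::shared_ptr<arrow::Schema> WithLogicalLength(
    const std::shared_ptr<arrow::Schema>& schema, int64_t logical_length) {
  auto md = arrow::key_value_metadata({"logical_length"},
                                      {std::to_string(logical_length)});
  return schema->WithMetadata(md);
}

std::shared_ptr<arrow::RecordBatch> SliceToLogicalLength(
    const std::shared_ptr<arrow::RecordBatch>& padded) {
  auto md = padded->schema()->metadata();
  int index = md ? md->FindKey("logical_length") : -1;
  if (index < 0) return padded;  // no hint, take the batch as written
  int64_t logical = std::stoll(md->value(index));
  return padded->Slice(0, logical);
}

This works, but every consumer has to know to call the second function,
which is exactly the non-standard mechanism I'd like to avoid.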

Even with the ability for a batch to be smaller than its buffers (to help
readers reject the residual of the buffers without referring to custom
metadata), I think I'm left with needing to write low-level code outside of
the Arrow library to create such a file, since I cannot first build the
batch in RAM and then copy it out: the batch is too large, and I also want
to avoid the copy operation.

Any thoughts on creating large IPC-format record batch(es) in place in a
single pre-allocated buffer that could be used with mmap?

Here is someone with a similar concern:
https://www.mail-archive.com/user@arrow.apache.org/msg01187.html

It seems like the
https://arrow.apache.org/docs/cpp/examples/row_columnar_conversion.html
example could be tweaked to use "pools" that define exactly where to put
each buffer, but then the final `arrow::Table::Make` (or the equivalent for
batches/IPC) must also receive instructions about where exactly to write
the user metadata, schema, footer, etc.
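
The "pools" part I picture as something like a bump allocator over the
pre-mapped region.  The MemoryPool virtuals below follow the older
two-argument signatures (newer Arrow versions add an alignment argument,
and this ignores the 64-byte alignment Arrow normally wants), so it is a
sketch rather than working code:

#include <arrow/memory_pool.h>
#include <arrow/status.h>
#include <atomic>
#include <string>

// Sketch only: hand out addresses sequentially from a fixed pre-mapped arena.
class ArenaPool : public arrow::MemoryPool {
 public:
  ArenaPool(uint8_t* base, int64_t capacity) : base_(base), capacity_(capacity) {}

  arrow::Status Allocate(int64_t size, uint8_t** out) override {
    int64_t offset = used_.fetch_add(size);
    if (offset + size > capacity_) {
      return arrow::Status::OutOfMemory("arena exhausted");
    }
    *out = base_ + offset;
    return arrow::Status::OK();
  }

  arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                           uint8_t** ptr) override {
    // Builders do call this; a real implementation would need a growth story.
    return arrow::Status::NotImplemented("no reallocation in a bump arena");
  }

  void Free(uint8_t* buffer, int64_t size) override {
    // Space is reclaimed only when the whole region is unmapped.
  }

  int64_t bytes_allocated() const override { return used_.load(); }
  std::string backend_name() const override { return "mmap-arena"; }

 private:
  uint8_t* base_;
  int64_t capacity_;
  std::atomic<int64_t> used_{0};
};

Even with such a pool placing the buffers, I'd still need the schema, the
record batch message, and the footer written at the right offsets around
them, which is the part I don't see a hook for.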

Thanks for any ideas,
John
