Hi,
I'm adding dev@ because this may require an improvement to Apache
Arrow C++. It seems that we need the following new feature for this
use case (combining chunks on a small-memory machine so that large
data can be processed with pandas via mmap):
* Writing the chunks in an arrow::Table as one large
  arrow::RecordBatch without creating intermediate
  combined chunks
The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatches. We may be able to add a feature that
writes the given arrow::Table as one combined
arrow::RecordBatch without materializing the intermediate
combined chunks in memory.
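For reference, this is roughly what we can do today with the C++ API
(the file name and the wrapper function below are only for
illustration, and exact factory names may differ between Arrow
versions): either write the chunked table as-is, or call
arrow::Table::CombineChunks() first, which materializes the whole
combined table in memory.

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <arrow/ipc/api.h>

  arrow::Status WriteIPCFile(const std::shared_ptr<arrow::Table>& table) {
    ARROW_ASSIGN_OR_RAISE(
        auto sink, arrow::io::FileOutputStream::Open("/tmp/data.arrow"));
    ARROW_ASSIGN_OR_RAISE(
        auto writer, arrow::ipc::MakeFileWriter(sink, table->schema()));

    // Today: each chunk in the table becomes its own
    // arrow::RecordBatch in the file, so the reader still gets a
    // chunked table.
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table));

    // Workaround: combine the chunks first. This needs enough memory
    // to hold the whole combined table, which is what we want to
    // avoid on a small-memory machine:
    //
    //   ARROW_ASSIGN_OR_RAISE(auto combined, table->CombineChunks());
    //   ARROW_RETURN_NOT_OK(writer->WriteTable(*combined));

    return writer->Close();
  }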
Do C++ developers have any opinion on this?
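(To make the idea concrete, here is a strawman of how it could be
exposed from the caller's side; the option below is purely
hypothetical and doesn't exist yet:)

  // Hypothetical sketch, not an existing API: ask the writer to emit
  // the whole table as a single arrow::RecordBatch, concatenating
  // each column's chunks while serializing instead of building a
  // combined table in memory first.
  auto options = arrow::ipc::IpcWriteOptions::Defaults();
  // options.unify_table_chunks = true;  // hypothetical option
  ARROW_ASSIGN_OR_RAISE(
      auto writer,
      arrow::ipc::MakeFileWriter(sink, table->schema(), options));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));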
Thanks,
--
kou
In
<ch2pr20mb30950806b40fe286d414ac97eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
"[Python/C-Glib] writing IPC file format column-by-column " on Wed, 9 Sep
2020 10:11:54 +0000,
Ishan Anand <[email protected]> wrote:
> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with
> datasets that don't fit in memory. This is the workflow I'm trying to
> implement.
>
>
> * Write record batches in IPC streaming format to a file from a C
>   runtime.
> * Consume it one row at a time from python/C by loading the file in
>   chunks.
> * If the schema is simple enough to support zero-copy operations, make
>   the table readable from pandas. This requires me to:
>    * convert it into a Table with a single chunk per column (since
>      pandas can't use mmap with chunked arrays).
>    * write the table in IPC random access format to a file.
>
> PyArrow provides a method `combine_chunks` to combine the chunks of
> each column into a single chunk. However, it needs to create the
> entire combined table in memory (I suspect the peak is 2x, since both
> versions of the table are held in memory at once, though that could
> be avoided).
>
> Since the Arrow layout is columnar, I'm curious whether it is possible
> to write the table one column at a time, and whether the existing
> glib/python APIs support it. The C++ file writer objects seem to
> serialize one record batch at a time rather than one column at a time.
>
>
> Thank you,
> Ishan