Hi,
I'm adding dev@ because this may require an improvement to Apache
Arrow C++. It seems that we need the following new feature for this
use case (combining chunks on a small-memory machine so that large
data can be processed with pandas via mmap):
* Writing the chunks in an arrow::Table as one large
  arrow::RecordBatch without creating intermediate
  combined chunks
The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatches. We may be able to add a feature that
writes the given arrow::Table as one combined
arrow::RecordBatch without materializing the intermediate
combined chunks in memory.
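For reference, this is roughly what we can do today with the C++ API
(the file name and the wrapper function below are only for
illustration, and exact factory names may differ between Arrow
versions): either write the chunked table as-is, or call
arrow::Table::CombineChunks() first, which materializes the whole
combined table in memory.

  #include <arrow/api.h>
  #include <arrow/io/api.h>
  #include <arrow/ipc/api.h>

  arrow::Status WriteIPCFile(const std::shared_ptr<arrow::Table>& table) {
    ARROW_ASSIGN_OR_RAISE(
        auto sink, arrow::io::FileOutputStream::Open("/tmp/data.arrow"));
    ARROW_ASSIGN_OR_RAISE(
        auto writer, arrow::ipc::MakeFileWriter(sink, table->schema()));

    // Today: each chunk in the table becomes its own
    // arrow::RecordBatch in the file, so the reader still gets a
    // chunked table.
    ARROW_RETURN_NOT_OK(writer->WriteTable(*table));

    // Workaround: combine the chunks first. This needs enough memory
    // to hold the whole combined table, which is what we want to
    // avoid on a small-memory machine:
    //
    //   ARROW_ASSIGN_OR_RAISE(auto combined, table->CombineChunks());
    //   ARROW_RETURN_NOT_OK(writer->WriteTable(*combined));

    return writer->Close();
  }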
Do C++ developers have any opinion on this?
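(To make the idea concrete, here is a strawman of how it could be
exposed from the caller's side; the option below is purely
hypothetical and doesn't exist yet:)

  // Hypothetical sketch, not an existing API: ask the writer to emit
  // the whole table as a single arrow::RecordBatch, concatenating
  // each column's chunks while serializing instead of building a
  // combined table in memory first.
  auto options = arrow::ipc::IpcWriteOptions::Defaults();
  // options.unify_table_chunks = true;  // hypothetical option
  ARROW_ASSIGN_OR_RAISE(
      auto writer,
      arrow::ipc::MakeFileWriter(sink, table->schema(), options));
  ARROW_RETURN_NOT_OK(writer->WriteTable(*table));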
Thanks,
--
kou
In
<ch2pr20mb30950806b40fe286d414ac97eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
"[Python/C-Glib] writing IPC file format column-by-column " on Wed, 9 Sep
2020 10:11:54 +0000,
Ishan Anand <[email protected]> wrote:
> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with
> datasets that don't fit in memory. This is the workflow I'm trying to
> implement.
>
>
> * Write record batches in IPC streaming format to a file from a C
>   runtime.
> * Consume it one row at a time from python/C by loading the file in
>   chunks.
> * If the schema is simple enough to support zero-copy operations, make
>   the table readable from pandas. This requires me to:
>    * convert it into a Table with a single chunk per column (since
>      pandas can't use mmap with chunked arrays).
>    * write the table in IPC random access format to a file.
>
> PyArrow provides a method `combine_chunks` to combine the chunks of
> each column into a single chunk. However, it needs to create the
> entire combined table in memory (I suspect the peak is 2x, since both
> versions of the table are held in memory at once, though that could
> be avoided).
>
> Since the Arrow layout is columnar, I'm curious whether it is possible
> to write the table one column at a time, and whether the existing
> glib/python APIs support it. The C++ file writer objects seem to
> serialize one record batch at a time rather than one column at a time.
>
>
> Thank you,
> Ishan