Hi,

I've added dev@ because this may require an improvement to
Apache Arrow C++.

It seems that we need the following new feature for this use
case (combining chunks within a small memory budget so that
large data can be processed with pandas and mmap):

  * Writing the chunks in an arrow::Table as one large
    arrow::RecordBatch without creating intermediate
    combined chunks

The current arrow::ipc::RecordBatchWriter::WriteTable()
always splits the given arrow::Table into one or more
arrow::RecordBatch. We may be able to add a feature that
writes the given arrow::Table as one combined
arrow::RecordBatch without creating intermediate combined
chunks.

Do C++ developers have any opinion on this?

Thanks,
--
kou

In <ch2pr20mb30950806b40fe286d414ac97eb...@ch2pr20mb3095.namprd20.prod.outlook.com>
  "[Python/C-Glib] writing IPC file format column-by-column " on Wed, 9 Sep 2020 10:11:54 +0000,
  Ishan Anand <anand.is...@outlook.com> wrote:

> Hi
>
> I'm looking at using Arrow primarily on low-resource instances with out of
> memory datasets. This is the workflow I'm trying to implement.
>
>
>   * Write record batches in IPC streaming format to a file from a C runtime.
>   * Consume it one row at a time from python/C by loading the file in
>     chunks.
>   * If the schema is simple enough to support zero copy operations, make
>     the table readable from pandas. This needs me to,
>      * convert it into a Table with a single chunk per column (since pandas
>        can't use mmap with chunked arrays).
>      * write the table in IPC random access format to a file.
>
> PyArrow provides a method `combine_chunks` to combine chunks into a single
> chunk. However, it needs to create the entire table in memory (I suspect it
> is 2x, since it loads both versions of the table in memory but that can be
> avoided).
>
> Since the Arrow layout is columnar, I'm curious if it is possible to write
> the table one column at a time. And if the existing glib/python APIs support
> it? The C++ file writer objects seem to go down to serializing a single
> record batch at a time and not per column.
>
>
> Thank you,
> Ishan
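
To make the idea above concrete, here is a minimal sketch in plain Python. It is not the Arrow IPC format and it uses no Arrow API: the function names, the flat file layout and the int64-only columns are all assumptions made for illustration. It only shows the principle under discussion: the chunks of each column are streamed to the file back to back, so the combined column never exists in memory, yet the file holds one contiguous buffer per column that can be read through mmap.

```python
import mmap
import os
import tempfile
from array import array


def write_columns_contiguously(path, columns):
    """Write each column's chunks back to back (hypothetical helper).

    `columns` maps a column name to an iterable of array('q') chunks.
    Only one chunk is serialized at a time; no combined chunk is built.
    Returns {name: (byte_offset, byte_length)} describing the file layout.
    """
    layout = {}
    with open(path, "wb") as f:
        for name, chunks in columns.items():
            start = f.tell()
            for chunk in chunks:
                f.write(chunk.tobytes())
            layout[name] = (start, f.tell() - start)
    return layout


def read_column(path, layout, name):
    """Read one column through mmap as a single contiguous int64 buffer."""
    offset, nbytes = layout[name]
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The memoryviews are released before the mmap is closed.
            with memoryview(mm) as whole:
                with whole[offset:offset + nbytes] as part:
                    with part.cast("q") as typed:
                        # Copied into a list only to return plain values.
                        return list(typed)
```

For example, writing a column "a" given as the chunks array("q", [1, 2]) and array("q", [3]) and then calling read_column() for "a" gives back the values 1, 2, 3 from a single contiguous region of the file. A real implementation in Arrow would also have to write the IPC metadata, validity bitmaps and offsets buffers, which this sketch leaves out.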