> Please confirm there is no way around this. If you want a table on disk
> to be zero copy compatible, you must build it from contiguous arrays in
> memory. You can’t build an infinite-sized, memory-mapped, zero-copy
> compatible file on disk. Right?
For Pandas conversion I believe this is true, but I'm not an expert. If you
can process the data incrementally, then reading a record batch at a time
and converting to pandas should still be zero copy.

On Wed, May 5, 2021 at 8:46 AM Jeff Wickstrom <[email protected]> wrote:

> Hi Micah, All,
>
> Thank you for the tip to use WriteRecordBatch. I got that to work. But,
> probably no surprise, it leads to a table on disk that can’t be loaded
> zero copy: “pyarrow.lib.ArrowInvalid: Needed to copy 2 chunks with 0
> nulls, but zero_copy_only was True”
>
> Please confirm there is no way around this. If you want a table on disk
> to be zero copy compatible, you must build it from contiguous arrays in
> memory. You can’t build an infinite-sized, memory-mapped, zero-copy
> compatible file on disk. Right?
>
> Does RecordBatchBuilder help with this? I haven’t figured out how to use
> that yet.
>
> I am using Arrow 1.0.1. Has anything changed the story on this topic in
> more recent releases? Seems we need to upgrade anyway.
>
> This is how I am using the RecordBatchWriter so far:
>
>     // Write the current batch.
>     arrow::ArrayDataVector arrayDataVector{};
>     for (auto& builder : arrayBuilders)
>     {
>         std::shared_ptr<arrow::ArrayData> arrayData;
>         ARROW_RETURN_NOT_OK(builder->FinishInternal(&arrayData));
>         arrayDataVector.push_back(arrayData);
>     }
>
>     auto recordBatch = arrow::RecordBatch::Make(
>         schema, totalRecordsAddedForCurrentBatch, arrayDataVector);
>     auto writeRecordBatchStatus =
>         pRecordBatchWriter->WriteRecordBatch(*recordBatch);
>     if (!writeRecordBatchStatus.ok())
>         return arrow::Status::ExecutionError("Failed to WriteRecordBatch");
>
> Thank you in advance for your insights. Confirming how to best approach
> this is a big help for our design in progress.
>
> Jeff
>
> *From:* Micah Kornfield <[email protected]>
> *Sent:* Monday, May 3, 2021 12:10 PM
> *To:* [email protected]
> *Subject:* Re: [C++] Writing a large Arrow table to disk
>
> Hi Jeff,
>
> > Maybe the ArrayBuilder() classes themselves are already managing the
> > business of spilling over available RAM?
>
> They do not (unless the underlying allocator already does this). Either
> way there would be a big memcopy/flush when writing to disk.
>
> > Do I need to point up front to a MemoryMappedFile or use a
> > MemoryManager to make this scale?
>
> Per above this could help but also might not.
>
> > Maybe I am done already, but it seems like I’ll need to break up the
> > content added to the ArrayBuilder() classes while also only adding one
> > chunk to the ChunkedArray(s) for zero copy compatibility.
>
> Off the top of my head, the best way of doing this, if possible, is to
> create either smaller tables or record batches and write them
> incrementally (with WriteTable or WriteRecordBatch [1]).
>
> [1]
> https://arrow.apache.org/docs/cpp/api/ipc.html#_CPPv4N5arrow3ipc17RecordBatchWriter16WriteRecordBatchERK11RecordBatch
>
> On Mon, May 3, 2021 at 11:24 AM Jeff Wickstrom <[email protected]>
> wrote:
>
> > Hello,
> >
> > I am getting started and prototyping with the Arrow C++ API. Great
> > stuff! I have some newbie questions.
> >
> > My first goal is to write a bunch of “records” to an Arrow table on
> > disk.
> > The resulting file can be bigger than RAM, so I want to use a zero
> > copy approach when it is subsequently consumed in a Python script,
> > to_pandas(), etc.
> >
> > My prototype seems to be working. I build the table following the “Row
> > to columnar conversion” example. Then I open a FileOutputStream, get a
> > RecordBatchWriter from arrow::ipc::NewFileWriter(), and call
> > WriteTable() on it to save the table to disk. I get my desired zero
> > copy usage of it in Python.
> >
> > My question is how do I build that table on disk in a scalable way,
> > like when there are billions of records? Maybe the ArrayBuilder()
> > classes themselves are already managing the business of spilling over
> > available RAM? Do I need to point up front to a MemoryMappedFile or
> > use a MemoryManager to make this scale? Maybe I am done already, but
> > it seems like I’ll need to break up the content added to the
> > ArrayBuilder() classes while also only adding one chunk to the
> > ChunkedArray(s) for zero copy compatibility.
> >
> > Please help me understand conceptually the best way to write out a
> > bigger-than-RAM Arrow table. A modified “Row to columnar conversion”
> > example would be great as well, if applicable.
> >
> > Thank you in advance for your help!
> >
> > Cheers,
> >
> > Jeff
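
To make the thread's advice concrete, below is a minimal C++ sketch of the
incremental-write pattern Micah describes and Jeff's snippet implements:
build one batch's worth of data at a time and pass each batch to
RecordBatchWriter::WriteRecordBatch, so only a single batch needs to fit in
builder memory. The schema, column name, batch size, and generated data are
illustrative assumptions, not taken from the thread; arrow::ipc::NewFileWriter
is the Arrow 1.0.x entry point used in the thread (newer releases expose an
equivalent MakeFileWriter).

    // Minimal sketch: stream record batches to an Arrow IPC file so only one
    // batch's worth of builder memory is resident at a time.
    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/writer.h>

    arrow::Status WriteBatchesIncrementally(const std::string& path) {
      // Illustrative single-column schema.
      auto schema = arrow::schema({arrow::field("value", arrow::int64())});

      ARROW_ASSIGN_OR_RAISE(auto sink,
                            arrow::io::FileOutputStream::Open(path));
      ARROW_ASSIGN_OR_RAISE(auto writer,
                            arrow::ipc::NewFileWriter(sink.get(), schema));

      constexpr int64_t kBatchSize = 1 << 20;  // rows per batch (illustrative)
      constexpr int64_t kNumBatches = 100;     // total row count is unbounded

      for (int64_t b = 0; b < kNumBatches; ++b) {
        arrow::Int64Builder builder;
        ARROW_RETURN_NOT_OK(builder.Reserve(kBatchSize));
        for (int64_t i = 0; i < kBatchSize; ++i) {
          builder.UnsafeAppend(b * kBatchSize + i);  // stand-in for real data
        }
        std::shared_ptr<arrow::Array> column;
        ARROW_RETURN_NOT_OK(builder.Finish(&column));

        auto batch = arrow::RecordBatch::Make(schema, kBatchSize, {column});
        ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
        // Builder memory is released here; the batch is already on disk.
      }
      return writer->Close();
    }

As Jeff observed, each WriteRecordBatch call becomes a separate record batch
in the file, so reading the whole file back as a Table yields multi-chunk
columns and fails a to_pandas() conversion with zero_copy_only=True.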

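On the reading side, the reply at the top suggests processing one record
batch at a time rather than converting the whole table at once. Here is a
hedged C++ sketch of that pattern, memory-mapping the file and visiting
batches individually (function and variable names are illustrative); in
Python the analogous loop would open the file with pyarrow.ipc.open_file()
and convert each batch with to_pandas() separately.

    // Minimal sketch: memory-map the IPC file written above and visit it one
    // record batch at a time, so at most one batch is processed in memory.
    #include <iostream>
    #include <memory>
    #include <string>

    #include <arrow/api.h>
    #include <arrow/io/file.h>
    #include <arrow/ipc/reader.h>

    arrow::Status ReadBatchesIncrementally(const std::string& path) {
      ARROW_ASSIGN_OR_RAISE(
          auto mmap_file,
          arrow::io::MemoryMappedFile::Open(path, arrow::io::FileMode::READ));
      ARROW_ASSIGN_OR_RAISE(
          auto reader, arrow::ipc::RecordBatchFileReader::Open(mmap_file));

      for (int i = 0; i < reader->num_record_batches(); ++i) {
        ARROW_ASSIGN_OR_RAISE(auto batch, reader->ReadRecordBatch(i));
        // Process the batch here; its buffers reference the mapped region.
        std::cout << "batch " << i << ": " << batch->num_rows() << " rows\n";
      }
      return arrow::Status::OK();
    }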