[ https://issues.apache.org/jira/browse/ARROW-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoine Pitrou updated ARROW-9441:
----------------------------------
    Fix Version/s:     (was: 4.0.0)
                   5.0.0

> [C++] Optimize RecordBatchReader::ReadAll
> -----------------------------------------
>
>                 Key: ARROW-9441
>                 URL: https://issues.apache.org/jira/browse/ARROW-9441
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Ji Liu
>            Priority: Major
>             Fix For: 5.0.0
>
>
> Based on perf reports, more time is spent manipulating C++ data structures
> than reconstructing record batches from IPC messages, which strikes me as not
> what we want.
> Here is an excerpt from a perf report based on the Python code
> {code}
> import pyarrow as pa
>
> for i in range(100):
>     pa.ipc.open_stream('nyctaxi.arrow').read_all()
> {code}
> {code}
> - 50.40% 0.06% python libarrow.so.100.0.0 [.] arrow::RecordBatchReader::ReadAll
>    - 50.34% arrow::RecordBatchReader::ReadAll
>       - 25.86% arrow::Table::FromRecordBatches
>          - 18.41% arrow::SimpleRecordBatch::column
>             - 16.00% arrow::MakeArray
>                - 10.49% arrow::VisitTypeInline<arrow::internal::ArrayDataWrapper>
>                     7.71% arrow::PrimitiveArray::SetData
>                     1.87% arrow::StringArray::StringArray
>               1.54% __pthread_mutex_lock
>               0.88% __pthread_mutex_unlock
>            0.67% std::_Hash_bytes
>            0.60% arrow::ChunkedArray::ChunkedArray
>       - 22.30% arrow::RecordBatchReader::ReadAll
>          - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
>             - 15.91% arrow::ipc::ReadRecordBatchInternal
>                - 15.15% arrow::ipc::LoadRecordBatch
>                   - 14.45% arrow::ipc::ArrayLoader::Load
>                      + 13.15% arrow::VisitTypeInline<arrow::ipc::ArrayLoader>
>             + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage
>         1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
> {code}
> Perhaps {{ChunkedArray}} internally should be changed to contain a vector of
> {{ArrayData}} instead of boxed Arrays.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
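
The closing suggestion in the issue (keep raw ArrayData in ChunkedArray and box arrays only on demand) can be illustrated with a small toy model. This is only a sketch in pure Python, not Arrow code: the classes ArrayData, BoxedArray, EagerChunkedArray and LazyChunkedArray and the helper count_boxes are hypothetical stand-ins invented for illustration, and they say nothing about how the C++ change would actually be implemented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ArrayData:
    """Toy stand-in for arrow::ArrayData: a cheap description of one chunk."""
    values: List[int]


class BoxedArray:
    """Toy stand-in for a boxed arrow::Array; counts how many get built."""
    constructed = 0

    def __init__(self, data: ArrayData) -> None:
        BoxedArray.constructed += 1
        self.data = data

    def __len__(self) -> int:
        return len(self.data.values)


class EagerChunkedArray:
    """Boxes every chunk at construction, loosely modeled on the MakeArray cost."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self._chunks = [BoxedArray(c) for c in chunks]

    def __len__(self) -> int:
        return sum(len(c) for c in self._chunks)


class LazyChunkedArray:
    """Keeps raw ArrayData and boxes a chunk only when it is requested."""

    def __init__(self, chunks: List[ArrayData]) -> None:
        self._chunks = chunks

    def __len__(self) -> int:
        # Length comes straight from the raw data; nothing is boxed.
        return sum(len(c.values) for c in self._chunks)

    def chunk(self, i: int) -> BoxedArray:
        return BoxedArray(self._chunks[i])


def count_boxes(make_chunked, chunks):
    """Return (total length, number of BoxedArray objects built)."""
    BoxedArray.constructed = 0
    chunked = make_chunked(chunks)
    return len(chunked), BoxedArray.constructed


chunks = [ArrayData(list(range(4))) for _ in range(1000)]
eager = count_boxes(EagerChunkedArray, chunks)
lazy = count_boxes(LazyChunkedArray, chunks)
print(eager, lazy)
```

In this toy model the eager variant builds one boxed object per chunk just to assemble the container, while the lazy variant builds none until chunk(i) is called. Whether the same trade-off holds in the real library would have to be confirmed by profiling.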