[ https://issues.apache.org/jira/browse/ARROW-4713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267009#comment-17267009 ]
Ying Zhou edited comment on ARROW-4713 at 1/18/21, 4:47 AM:
------------------------------------------------------------

Hmm, this looks interesting. If @Yurui Zhou won’t take it, I potentially can. However, I don’t think I will have time for it before July. So if I take it, the work will have to happen about half a year from now and won’t be available in 4.0.

was (Author: yingzhou474):
Hmm, this looks interesting. I don’t think I will have time for it before July, though. So if I take it, the work will have to happen about half a year from now and won’t be available in 4.0.

> [C++] Improve C++ Orc Adapter performance and memory footprint
> --------------------------------------------------------------
>
>                 Key: ARROW-4713
>                 URL: https://issues.apache.org/jira/browse/ARROW-4713
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Yurui Zhou
>            Assignee: Yurui Zhou
>            Priority: Major
>              Labels: orc, pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 7h 20m
>  Remaining Estimate: 0h
>
> Currently, Arrow C++ provides a naive adapter implementation that allows
> users to read ORC files into Arrow RecordBatches. However, this
> implementation has several drawbacks:
> * Inefficient conversion that incurs a huge memcpy overhead
> ** The ORC adapter currently performs a byte-by-byte memcpy to move data
> from an ORC VectorBatch to an Arrow RecordBatch, even though the ORC
> VectorBatch shares the same memory layout as Arrow for most data types.
> * Huge memory footprint because of the lack of a TableReader implementation
> ** The ORC adapter currently only allows users to read data one stripe at a
> time. However, because ORC is a columnar format with a high compression
> ratio, the data read from a single ORC stripe can potentially take up
> gigabytes of memory, which makes the ORC adapter hard to use in a
> production environment.
> Here we propose a new ORC adapter implementation to fix the issues mentioned
> above:
> * To reduce conversion overhead, instead of performing a naive data copy,
> the new adapter would take full advantage of the memory layout similarity
> between ORC VectorBatch and Arrow RecordBatch. Namely, the new adapter will
> use pointer manipulation to transfer memory ownership from the VectorBatch
> to the Arrow RecordBatch whenever possible.
> * The new ORC adapter would give users row-level granularity when reading
> data from an ORC file. The user should be able to specify how many rows to
> expect in each output RecordBatch, and the ORC adapter should make sure
> that no more than the requested number of rows is returned.
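
The two proposals above can be illustrated with a minimal, standard-library-only sketch. It is not the Arrow or ORC API: `FakeOrcSource`, `AdoptBuffer`, and `RowLimitedReader` are hypothetical names made up for this illustration, and a plain `std::vector<int64_t>` stands in for both the ORC column data and the Arrow buffer. It assumes the decoder can be asked for at most a given number of rows per call, and it shows (1) handing a decoded buffer over by moving it rather than copying it, and (2) a reader that never returns more rows than requested.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ORC row reader: each call decodes at most
// `max_rows` rows, so only one batch worth of values is materialized at a time.
class FakeOrcSource {
 public:
  explicit FakeOrcSource(std::size_t total_rows) : total_rows_(total_rows) {}

  std::vector<int64_t> DecodeNext(std::size_t max_rows) {
    const std::size_t n = std::min(max_rows, total_rows_ - next_row_);
    std::vector<int64_t> out(n);
    for (std::size_t i = 0; i < n; ++i) {
      out[i] = static_cast<int64_t>(next_row_ + i);  // dummy values: the row index
    }
    next_row_ += n;
    return out;
  }

 private:
  std::size_t total_rows_;
  std::size_t next_row_ = 0;
};

using SharedColumn = std::shared_ptr<const std::vector<int64_t>>;

// Takes over the decoded storage by moving the vector into shared ownership.
// The element data is not copied; only the vector's internal pointer changes hands.
inline SharedColumn AdoptBuffer(std::vector<int64_t>&& decoded) {
  return std::make_shared<std::vector<int64_t>>(std::move(decoded));
}

// Hypothetical reader that yields batches of at most `batch_rows` rows.
class RowLimitedReader {
 public:
  RowLimitedReader(FakeOrcSource source, std::size_t batch_rows)
      : source_(std::move(source)), batch_rows_(batch_rows) {}

  // Returns false once the source is exhausted (or if batch_rows is zero).
  bool Next(SharedColumn* out) {
    std::vector<int64_t> decoded = source_.DecodeNext(batch_rows_);
    if (decoded.empty()) return false;
    *out = AdoptBuffer(std::move(decoded));
    return true;
  }

 private:
  FakeOrcSource source_;
  std::size_t batch_rows_;
};
```

A caller would loop on `reader.Next(&batch)` and process each batch before requesting the next one, so peak memory is bounded by the requested row count rather than by the stripe size. Whether a real adapter can hand buffers over this directly depends on how the ORC library allocates and releases its batch memory, which this sketch does not model.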