[ https://issues.apache.org/jira/browse/ARROW-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623958#comment-17623958 ]
Weston Pace commented on ARROW-18160:
-------------------------------------

Oh. Yes, that is copying. Sorry, I didn't read your message carefully.

> [C++] Scanner slicing large row groups leads to inefficient RAM usage
> ---------------------------------------------------------------------
>
>                 Key: ARROW-18160
>                 URL: https://issues.apache.org/jira/browse/ARROW-18160
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> As an example, consider a 4 GB Parquet file with one giant row group. At the
> moment it is inevitable that we read this in as one large 4 GB record batch
> (there are other JIRAs for sub-row-group reads which, if implemented, would
> obsolete this one).
>
> We then slice off pieces of that 4 GB record batch for processing:
>
> {noformat}
> next_batch = current.slice(0, batch_size)
> current = current.slice(batch_size)
> {noformat}
>
> However, even though {{current}} shrinks each time, it still references the
> entire data (slicing doesn't allow memory to be freed). We may want to
> investigate alternative strategies here so that we can free up memory when we
> are done processing it.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)