Weston Pace created ARROW-18160:
-----------------------------------

             Summary: [C++] Scanner slicing large row groups leads to inefficient RAM usage
                 Key: ARROW-18160
                 URL: https://issues.apache.org/jira/browse/ARROW-18160
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace
As an example, consider a 4GB parquet file with one giant row group. At the moment it is inevitable that we read this in as one large 4GB record batch (there are other JIRAs for sub-row-group reads which, if implemented, would obsolete this one). We then slice off pieces of that 4GB record batch for processing:

{noformat}
next_batch = current.slice(0, batch_size)
current = current.slice(batch_size)
{noformat}

However, even though {{current}} is shrinking each time, it still references the entire 4GB of data (slicing doesn't allow any memory to be freed). We may want to investigate alternative strategies here so that we can free up memory as soon as we are done processing it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
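The retention problem above can be sketched in plain Python with hypothetical {{Buffer}}/{{Batch}} stand-ins (these are not the real Arrow classes; they just mimic the zero-copy slicing behavior, where every slice keeps a reference to the parent buffer):

```python
import weakref

class Buffer:
    """Stand-in for one large contiguous allocation (e.g. a 4GB column buffer)."""
    def __init__(self, size):
        self.size = size

class Batch:
    """Stand-in for a record batch; slice() is zero-copy and shares the Buffer."""
    def __init__(self, buffer, offset, length):
        self.buffer = buffer      # every slice keeps the whole buffer alive
        self.offset = offset
        self.length = length

    def slice(self, offset, length=None):
        if length is None:
            length = self.length - offset
        return Batch(self.buffer, self.offset + offset, length)

# Build a "4GB" batch and watch its buffer through a weak reference.
buf = Buffer(4 * 1024**3)
current = Batch(buf, 0, buf.size)
alive = weakref.ref(buf)
del buf

# Consume the batch in slices, as the scanner loop in the description does.
batch_size = 1024**3
while current.length > 0:
    next_batch = current.slice(0, min(batch_size, current.length))
    current = current.slice(next_batch.length)
    del next_batch  # the processed piece is "done", but nothing is reclaimed:
    if current.length > 0:
        # even a tiny tail slice still pins the entire 4GB allocation
        assert alive() is not None

del current
assert alive() is None  # memory becomes reclaimable only after the last slice dies
```

This mirrors why the loop in the description cannot release memory incrementally: each {{current.slice(...)}} shares the original buffers, so the allocator sees the full 4GB as in use until the final slice is dropped. An alternative strategy would have to break that sharing, for example by copying each piece out before processing or by freeing the underlying buffers in chunks.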