Weston Pace created ARROW-18160:
-----------------------------------

             Summary: [C++] Scanner slicing large row groups leads to 
inefficient RAM usage
                 Key: ARROW-18160
                 URL: https://issues.apache.org/jira/browse/ARROW-18160
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
            Reporter: Weston Pace


As an example, consider a 4GB Parquet file with a single giant row group.  At the 
moment we inevitably read this in as one large 4GB record batch 
(there are other JIRAs for sub-row-group reads which, if implemented, would 
make this issue obsolete).

We then slice off pieces of that 4GB record batch for processing:

{noformat}
next_batch = current.slice(0, batch_size)
current = current.slice(batch_size)
{noformat}

However, even though {{current}} shrinks with each iteration, it still references 
the entire underlying data (slices are zero-copy, so none of the memory can be 
freed until every slice is released).  We may want to investigate alternative 
strategies here so that memory can be freed as soon as we are done processing it.
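
As a minimal, standalone sketch of the retention behavior (not from the original 
report; it assumes the Arrow C++ builder and memory pool APIs and uses an 
arbitrary 10M-element array purely for illustration):

{noformat}
#include <arrow/api.h>

#include <iostream>

// Build a large Int64 array, slice off a small piece, then drop the parent.
// bytes_allocated() stays high because the slice still holds a reference to
// the parent's buffers.
arrow::Status Demo() {
  constexpr int64_t kLength = 10 * 1000 * 1000;
  arrow::Int64Builder builder;
  ARROW_RETURN_NOT_OK(builder.Reserve(kLength));
  for (int64_t i = 0; i < kLength; ++i) {
    builder.UnsafeAppend(i);
  }
  std::shared_ptr<arrow::Array> big;
  ARROW_RETURN_NOT_OK(builder.Finish(&big));
  std::cout << "after build: "
            << arrow::default_memory_pool()->bytes_allocated() << " bytes\n";

  std::shared_ptr<arrow::Array> small = big->Slice(0, 10);
  big.reset();
  // Still ~80MB allocated: the 10-element slice keeps the whole buffer alive.
  std::cout << "after dropping parent: "
            << arrow::default_memory_pool()->bytes_allocated() << " bytes\n";
  return arrow::Status::OK();
}

int main() {
  arrow::Status st = Demo();
  if (!st.ok()) {
    std::cerr << st.ToString() << std::endl;
    return 1;
  }
  return 0;
}
{noformat}

One strategy worth investigating (a suggestion, not something decided in this 
issue) is to copy each slice into freshly allocated buffers before handing it 
downstream, trading an extra memcpy for the ability to release the large parent 
allocation early.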



