[ https://issues.apache.org/jira/browse/ARROW-18160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17623957#comment-17623957 ]

Weston Pace commented on ARROW-18160:
-------------------------------------

Well, there is a somewhat drastic approach, which would be to add a "clone" 
capability, i.e. a "reallocate, but reference only the sliced portion" 
capability.  If a small hit to performance can deliver large RAM savings, it 
might sometimes be worth it.  I'm not sure it's justified in this case yet, 
but I wanted to mention it for consideration.
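For illustration, a rough sketch of what a clone could look like with the 
existing C++ API ({{CloneBatch}} is a hypothetical helper; I believe 
{{arrow::Concatenate}} copies even a single sliced input into freshly 
allocated buffers, but if it ever short-circuits, an explicit buffer copy 
would be needed):

{noformat}
#include <memory>
#include <utility>
#include <vector>

#include <arrow/array/concatenate.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>

// Hypothetical helper: copy the referenced range of every column of a sliced
// batch into freshly allocated buffers, so that dropping the parent batch
// actually releases its memory.
arrow::Result<std::shared_ptr<arrow::RecordBatch>> CloneBatch(
    const std::shared_ptr<arrow::RecordBatch>& batch,
    arrow::MemoryPool* pool = arrow::default_memory_pool()) {
  std::vector<std::shared_ptr<arrow::Array>> columns;
  columns.reserve(batch->num_columns());
  for (const auto& column : batch->columns()) {
    // Concatenating a single sliced array copies just the referenced range
    // into new buffers allocated from `pool`.
    ARROW_ASSIGN_OR_RAISE(auto copied, arrow::Concatenate({column}, pool));
    columns.push_back(std::move(copied));
  }
  return arrow::RecordBatch::Make(batch->schema(), batch->num_rows(),
                                  std::move(columns));
}
{noformat}

Dropping the original batch after cloning a slice would then actually release 
the large parent allocation, at the cost of one extra copy per slice.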

Alternatively, we could consider some kind of special-purpose allocator that 
allows returning portions of buffers to the OS as we read through them.
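To make that idea concrete, here is a rough POSIX-only sketch of the 
mechanism ({{ReleaseConsumedPrefix}} is a hypothetical name, and it assumes 
the buffer is a page-aligned, mmap-backed allocation whose released range is 
no longer referenced by any live slice):

{noformat}
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>
#include <cstdint>

// Hypothetical sketch: once the scanner has finished the first `consumed`
// bytes of a buffer, hand the fully-consumed pages back to the OS while
// keeping the virtual mapping valid for the rest of the buffer.
void ReleaseConsumedPrefix(uint8_t* data, int64_t consumed) {
  const int64_t page_size = sysconf(_SC_PAGESIZE);
  // madvise works on whole pages, so round the consumed range down.
  const int64_t releasable = (consumed / page_size) * page_size;
  if (releasable > 0) {
    // For anonymous mappings, MADV_DONTNEED drops the physical pages; any
    // later access sees zero-filled pages instead of the original data, so
    // this must only run once the prefix is truly dead.
    madvise(data, static_cast<std::size_t>(releasable), MADV_DONTNEED);
  }
}
{noformat}

The appeal of this direction is that downstream consumers keep valid pointers 
into the buffer; only the physical pages behind the already-consumed prefix 
are returned to the OS.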

> [C++] Scanner slicing large row groups leads to inefficient RAM usage
> ---------------------------------------------------------------------
>
>                 Key: ARROW-18160
>                 URL: https://issues.apache.org/jira/browse/ARROW-18160
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> As an example, consider a 4GB Parquet file with one giant row group.  At the 
> moment it is inevitable that we read this in as one large 4GB record batch 
> (there are other JIRAs for sub-row-group reads which, if implemented, would 
> obsolete this one).
> We then slice off pieces of that 4GB record batch for processing:
> {noformat}
> next_batch = current.slice(0, batch_size)
> current = current.slice(batch_size)
> {noformat}
> However, even though {{current}} shrinks on each iteration, it still 
> references the entire 4GB of data (slicing never frees memory).  We may want 
> to investigate alternative strategies here so that memory can be freed as we 
> finish processing it.
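> A minimal standalone illustration of the retention behavior (sizes scaled 
> down; this assumes the C++ API and the default memory pool):
> {noformat}
> #include <iostream>
> #include <memory>
>
> #include <arrow/api.h>
>
> int main() {
>   // Build a large array (1M int64 values, ~8MB).
>   arrow::Int64Builder builder;
>   for (int64_t i = 0; i < 1000000; ++i) {
>     if (!builder.Append(i).ok()) return 1;
>   }
>   std::shared_ptr<arrow::Array> big = builder.Finish().ValueOrDie();
>
>   // Slice off 10 elements, then drop the parent.
>   std::shared_ptr<arrow::Array> small = big->Slice(0, 10);
>   big.reset();
>
>   // Still reports the full ~8MB: the tiny slice pins the parent's buffers.
>   std::cout << arrow::default_memory_pool()->bytes_allocated() << std::endl;
>   return 0;
> }
> {noformat}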



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
