[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607297#comment-17607297 ]

David Li commented on ARROW-17599:
----------------------------------

I think a lot of this was just because of how it was historically added: 
originally, the cache was added without an iterator interface, so the cache 
necessarily had to preserve the input data. Now that we're changing things 
here, we should perhaps consider adding an explicit iterator/generator-based 
API on top of the cache so that the API contract is clear.
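To make that contract concrete, something along the lines of the sketch below 
is what I have in mind. None of these names exist in Arrow today 
(PreBufferedRowGroupGenerator etc. are hypothetical); the point is just that 
each row group is yielded once, and its cached ranges can be dropped as soon 
as it has been consumed:

#include <memory>
#include <vector>

#include "arrow/result.h"
#include "arrow/table.h"
#include "arrow/util/future.h"
#include "parquet/arrow/reader.h"

// Hypothetical API sketch only -- not an existing Arrow class.
class PreBufferedRowGroupGenerator {
 public:
  // Start coalesced reads for the requested row groups (eagerly or lazily,
  // depending on what benchmarking says) without pinning the buffers.
  static arrow::Result<std::shared_ptr<PreBufferedRowGroupGenerator>> Make(
      std::shared_ptr<parquet::arrow::FileReader> reader,
      std::vector<int> row_groups);

  // Yields the next row group's table; once the future is consumed, the
  // cache entries that were only needed for that row group are released.
  arrow::Future<std::shared_ptr<arrow::Table>> Next();
};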

However, I think we still do want to pre-buffer all row groups, because that 
way the read coalescing can do the best job possible. That said, it probably 
needs benchmarking to determine what makes sense. It could also work to only 
start the reads for a row group when we actually read it (with the 
understanding that some I/O may 'spill over' into the next row group for 
optimal I/O patterns).
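For reference, the lazy variant could look roughly like the sketch below, 
using the existing PreBuffer() hook on the underlying 
parquet::ParquetFileReader. Error handling is kept minimal, and whether 
issuing PreBuffer per row group still coalesces well enough (versus 
pre-buffering everything up front) is exactly the part that needs 
benchmarking:

#include <memory>
#include <numeric>
#include <vector>

#include "arrow/io/caching.h"
#include "arrow/io/interfaces.h"
#include "arrow/memory_pool.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"

// Sketch: only start the coalesced reads for row group i right before we
// consume it, instead of pre-buffering every row group up front.
arrow::Status ReadLazily(std::shared_ptr<arrow::io::RandomAccessFile> file) {
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(
      file, arrow::default_memory_pool(), &reader));

  std::vector<int> columns(
      reader->parquet_reader()->metadata()->num_columns());
  std::iota(columns.begin(), columns.end(), 0);  // read all columns

  for (int i = 0; i < reader->num_row_groups(); ++i) {
    // Issue the coalesced reads for this row group only.
    reader->parquet_reader()->PreBuffer({i}, columns, arrow::io::IOContext(),
                                        arrow::io::CacheOptions::Defaults());
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->ReadRowGroup(i, columns, &table));
    // table gets handed off (e.g. to Acero); ideally the cached ranges for
    // row group i would be released here, which is what this issue is about.
  }
  return arrow::Status::OK();
}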

> [C++] ReadRangeCache should not retain data after read
> ------------------------------------------------------
>
>                 Key: ARROW-17599
>                 URL: https://issues.apache.org/jira/browse/ARROW-17599
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>              Labels: good-second-issue
>
> I've added a unit test demonstrating the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read in row group X and passed it on to Acero, etc., we 
> do not release the data for row group X.  The read range cache's entries 
> vector still holds a pointer to the buffer.  The data is not released until 
> the file reader itself is destroyed, which only happens when we have 
> finished processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> This could potentially be a little difficult to implement because a single 
> read range's cache entry could be shared by multiple ranges, so we will need 
> some kind of reference counting to know when we have fully finished with an 
> entry and can release it.
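One possible shape for that reference counting, purely as a sketch (no such 
type exists in ReadRangeCache today): each cached buffer tracks how many 
logical ranges still need it and is released as soon as the last one has been 
consumed.

#include <cstdint>
#include <memory>

#include "arrow/buffer.h"

// Hypothetical cache entry: the buffer is dropped once every range that was
// satisfied from it has been read, instead of living until the file reader
// is destroyed.
struct RefCountedCacheEntry {
  std::shared_ptr<arrow::Buffer> buffer;
  int64_t remaining_consumers = 0;

  // Called once for each logical range served from this entry.
  void MarkConsumed() {
    if (--remaining_consumers == 0) {
      buffer.reset();  // release the (possibly multi-gigabyte) buffer
    }
  }
};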


