[ https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607293#comment-17607293 ]
Weston Pace commented on ARROW-17599:
-------------------------------------

Although the more I think about it, the less sure I am that the parquet reader API makes sense. Why would someone want to prebuffer a chunk of data and then read from it multiple times?

[~lidavidm] any thoughts on which approach we should take?

> [C++] ReadRangeCache should not retain data after read
> -------------------------------------------------------
>
>                 Key: ARROW-17599
>                 URL: https://issues.apache.org/jira/browse/ARROW-17599
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Assignee: Percy Camilo Triveño Aucahuasi
>            Priority: Major
>              Labels: good-second-issue
>
> I've added a unit test of the issue here:
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
>
> We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes those files are quite large (gigabytes). The usage is roughly:
>
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
>
> However, once we've read in row group X and passed it on to Acero, etc., we do not release the data for row group X. The read range cache's entries vector still holds a pointer to the buffer. The data is not released until the file reader itself is destroyed, which only happens when we have finished processing the entire file.
>
> This leads to excessive memory usage when pre-buffering is enabled.
>
> This could potentially be a little difficult to implement because a single read range's cache entry could be shared by multiple ranges, so we will need some kind of reference counting to know when we have fully finished with an entry and can release it.
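
The per-row-group loop in the description maps onto the ReadRangeCache roughly as in the sketch below. This is a minimal sketch only, assuming the arrow::io::internal::ReadRangeCache interface (Cache/WaitFor/Read) from arrow/io/caching.h; the per-row-group range lists are supplied by the caller here (in the real readers they come from file metadata), and this is not the actual parquet or IPC reader code.

    // Sketch: mirrors the "cache / wait / read" loop from the description,
    // assuming the arrow::io::internal::ReadRangeCache interface.
    #include <memory>
    #include <vector>

    #include "arrow/buffer.h"
    #include "arrow/io/caching.h"
    #include "arrow/io/interfaces.h"
    #include "arrow/result.h"
    #include "arrow/status.h"

    arrow::Status ReadAllRowGroups(
        std::shared_ptr<arrow::io::RandomAccessFile> file,
        const std::vector<std::vector<arrow::io::ReadRange>>& row_group_ranges) {
      arrow::io::internal::ReadRangeCache cache(file, arrow::io::IOContext{},
                                                arrow::io::CacheOptions::Defaults());
      for (const auto& ranges : row_group_ranges) {
        // CacheAllThePiecesWeNeedForRowGroupX: issue (coalesced) reads up front.
        ARROW_RETURN_NOT_OK(cache.Cache(ranges));
        // WaitForPiecesToArriveForRowGroupX: block until this row group's bytes land.
        ARROW_RETURN_NOT_OK(cache.WaitFor(ranges).status());
        // ReadThePiecesWeNeedForRowGroupX: hand each buffer to the decoder / Acero.
        for (const auto& range : ranges) {
          ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, cache.Read(range));
          // ... decode `buf` ...
          // The issue described above: `cache` keeps its own reference to the
          // underlying data, so it is not freed until the cache (i.e. the file
          // reader) is destroyed, long after row group X has been consumed.
        }
      }
      return arrow::Status::OK();
    }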
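
One way to read the last paragraph of the description: the fix needs per-entry bookkeeping so a buffer can be dropped as soon as every requested range that maps onto it has been read. Below is a standalone illustration of that reference-counting idea; CountedEntry and MarkRangeRead are hypothetical names for this sketch, not the actual ReadRangeCache internals.

    // Standalone illustration of the reference-counting idea; names are hypothetical.
    #include <cstdint>
    #include <memory>

    #include "arrow/buffer.h"

    struct CountedEntry {
      int64_t offset = 0;
      int64_t length = 0;
      std::shared_ptr<arrow::Buffer> buffer;  // set once the underlying read completes
      int remaining_reads = 0;  // requested ranges that still need this entry
    };

    // Call after one of the ranges backed by `entry` has been read and handed off.
    // When the last such range is done, the cache's reference to the buffer is
    // dropped, so the memory can be reclaimed while the cache is still alive.
    inline void MarkRangeRead(CountedEntry& entry) {
      if (--entry.remaining_reads == 0) {
        entry.buffer.reset();
      }
    }

Since a single coalesced cache entry can back several requested ranges, remaining_reads would be initialized to the number of ranges assigned to the entry when the ranges are cached.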