[jira] [Updated] (ARROW-17599) [C++] ReadRangeCache should not retain data after read

2022-09-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-17599:
---
Labels: good-second-issue pull-request-available  (was: good-second-issue)

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Major
> Labels: good-second-issue, pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I've added a unit test of the issue here: 
> https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention
> We use the ReadRangeCache for pre-buffering IPC and parquet files.  Sometimes 
> those files are quite large (gigabytes).  The usage is roughly:
> for X in num_row_groups:
>   CacheAllThePiecesWeNeedForRowGroupX
>   WaitForPiecesToArriveForRowGroupX
>   ReadThePiecesWeNeedForRowGroupX
> However, once we've read row group X and passed it on to Acero, etc., we do 
> not release the data for row group X.  The read range cache's entries vector 
> still holds a pointer to the buffer.  The data is not released until the file 
> reader itself is destroyed, which only happens when we have finished 
> processing an entire file.
> This leads to excessive memory usage when pre-buffering is enabled.
> Fixing this could be a little difficult because a single cache entry could 
> be shared by multiple read ranges, so we will need some kind of reference 
> counting to know when we have fully finished with an entry and can release 
> it.
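
To make the reference-counting idea in the description concrete, here is a minimal sketch. It is not Arrow's ReadRangeCache and uses no Arrow APIs: the names ToyRangeCache, Entry, AddEntry and pending_reads are hypothetical, a std::string stands in for a fetched buffer, and the caller is assumed to know how many requested ranges were coalesced into each entry. Each entry counts the ranges still waiting to be read and drops its buffer reference when the last one is served.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not Arrow's actual classes.
struct ReadRange {
  int64_t offset;
  int64_t length;
};

struct Entry {
  ReadRange range;                      // coalesced range covered by this entry
  std::shared_ptr<std::string> buffer;  // stands in for the fetched buffer
  int pending_reads;                    // requested ranges not yet read
};

class ToyRangeCache {
 public:
  // Registers one coalesced entry that serves `num_ranges` requested ranges.
  void AddEntry(ReadRange range, std::shared_ptr<std::string> buffer,
                int num_ranges) {
    assert(num_ranges > 0);
    entries_.push_back(Entry{range, std::move(buffer), num_ranges});
  }

  // Returns a copy of the requested bytes. Once every range sharing the
  // entry has been read, the entry is erased so the cache no longer keeps
  // the buffer alive.
  std::string Read(ReadRange want) {
    for (auto it = entries_.begin(); it != entries_.end(); ++it) {
      const int64_t begin = it->range.offset;
      const int64_t end = begin + it->range.length;
      if (want.offset >= begin && want.offset + want.length <= end) {
        std::string out =
            it->buffer->substr(static_cast<std::size_t>(want.offset - begin),
                               static_cast<std::size_t>(want.length));
        if (--it->pending_reads == 0) {
          entries_.erase(it);  // drop the cache's reference to the buffer
        }
        return out;
      }
    }
    throw std::out_of_range("range was not cached");
  }

  std::size_t num_entries() const { return entries_.size(); }

 private:
  std::vector<Entry> entries_;
};
```

In this sketch, an entry registered with num_ranges = 2 survives the first Read and is released by the second. A real implementation would also have to handle concurrent readers and ranges that are cached but never read, which this toy version ignores.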



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (ARROW-17599) [C++] ReadRangeCache should not retain data after read

2022-09-02 Thread Weston Pace (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-17599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weston Pace updated ARROW-17599:

Labels: good-second-issue  (was: )

> [C++] ReadRangeCache should not retain data after read
> --
>
> Key: ARROW-17599
> URL: https://issues.apache.org/jira/browse/ARROW-17599
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
> Labels: good-second-issue
>


