[ https://issues.apache.org/jira/browse/ARROW-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-16294:
-----------------------------------
    Labels: pull-request-available  (was: )

> [C++] Improve performance of parquet readahead
> ----------------------------------------------
>
>                 Key: ARROW-16294
>                 URL: https://issues.apache.org/jira/browse/ARROW-16294
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Weston Pace
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The 7.0.0 readahead for parquet would read up to 256 row groups at once which
> meant that, if the consumer were too slow, we would almost certainly run out
> of memory.
> ARROW-15410 improved readahead as a whole and, in the process, changed
> parquet so it's always reading 1 row group in advance.
> This is not always ideal in S3 scenarios. We may want to read many row
> groups in advance if the row groups are small. To fix this we should
> continue reading in parallel until there are at least batch_size *
> batch_readahead rows being fetched.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
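
The policy proposed in the description (keep starting row-group reads until at least batch_size * batch_readahead rows are being fetched) could be sketched roughly as below. This is only an illustration of the idea, not the Arrow C++ implementation: the function name `row_groups_to_start` and its parameters are hypothetical, and the row counts in the usage lines are made-up values.

```python
from typing import Sequence


def row_groups_to_start(
    pending_rows: Sequence[int],
    rows_in_flight: int,
    batch_size: int,
    batch_readahead: int,
) -> int:
    """Return how many pending row groups to begin fetching now.

    Hypothetical helper: `pending_rows` holds the row count of each row
    group that has not been requested yet, in read order, and
    `rows_in_flight` is the number of rows already being fetched.
    """
    # Target taken from the issue text: batch_size * batch_readahead rows.
    target_rows = batch_size * batch_readahead
    started = 0
    for rows in pending_rows:
        # Stop as soon as enough rows are already being fetched.
        if rows_in_flight >= target_rows:
            break
        rows_in_flight += rows
        started += 1
    return started


# Small row groups: several are fetched in parallel to reach the target.
small = row_groups_to_start([500] * 20, 0, batch_size=1000, batch_readahead=4)

# Large row groups: a single row group already exceeds the target.
large = row_groups_to_start([1_000_000] * 5, 0, batch_size=1000, batch_readahead=4)
```

With small row groups the sketch starts many reads at once, which is the case the description calls out for S3. With large row groups it starts one read at a time. Because the bound is expressed in rows rather than in row groups, the amount of data in flight stays roughly proportional to what the consumer has asked to buffer.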