[ https://issues.apache.org/jira/browse/ARROW-16294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526714#comment-17526714 ]
David Li commented on ARROW-16294:
----------------------------------

This is very similar to ARROW-14648 right? Or rather ARROW-14648 is the fully general solution?

> [C++] Improve performance of parquet readahead
> ----------------------------------------------
>
>                 Key: ARROW-16294
>                 URL: https://issues.apache.org/jira/browse/ARROW-16294
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Weston Pace
>            Priority: Major
>
> The 7.0.0 readahead for parquet would read up to 256 row groups at once which
> meant that, if the consumer were too slow, we would almost certainly run out
> of memory.
> ARROW-15410 improved readahead as a whole and, in the process, changed
> parquet so it's always reading 1 row group in advance.
> This is not always ideal in S3 scenarios. We may want to read many row
> groups in advance if the row groups are small. To fix this we should
> continue reading in parallel until there are at least batch_size *
> batch_readahead rows being fetched.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
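For illustration, the policy proposed in the quoted description (keep issuing row-group reads until at least batch_size * batch_readahead rows are being fetched) might be sketched as below. This is a minimal sketch, not Arrow's implementation: the helper name `row_groups_to_fetch` and its parameters are hypothetical, and it assumes the row count of each row group is known up front from the file metadata.

```python
from typing import List


def row_groups_to_fetch(
    row_group_sizes: List[int],
    next_row_group: int,
    rows_in_flight: int,
    batch_size: int,
    batch_readahead: int,
) -> List[int]:
    """Return the indices of row groups whose reads should start now.

    Hypothetical helper: it keeps scheduling row groups, in order,
    until at least batch_size * batch_readahead rows are in flight
    or the file has no more row groups.
    """
    target_rows = batch_size * batch_readahead
    to_fetch: List[int] = []
    index = next_row_group
    while rows_in_flight < target_rows and index < len(row_group_sizes):
        to_fetch.append(index)
        rows_in_flight += row_group_sizes[index]
        index += 1
    return to_fetch


# Small row groups: several are fetched in parallel to reach the target.
small = row_groups_to_fetch([1000] * 10, 0, 0, batch_size=1000, batch_readahead=4)
print(small)  # [0, 1, 2, 3]

# Large row groups: a single row group already exceeds the target.
large = row_groups_to_fetch([1_000_000] * 10, 0, 0, batch_size=1000, batch_readahead=4)
print(large)  # [0]
```

With small row groups the sketch schedules several reads at once, which is the S3 case the description is concerned about; with large row groups it degrades to one row group in advance, so memory use stays bounded by roughly the target plus one row group.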