[ https://issues.apache.org/jira/browse/ARROW-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andy Grove resolved ARROW-10995.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 9029
[https://github.com/apache/arrow/pull/9029]

> [Rust] [DataFusion] Improve parallelism when reading Parquet files
> ------------------------------------------------------------------
>
>                 Key: ARROW-10995
>                 URL: https://issues.apache.org/jira/browse/ARROW-10995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently the level of parallelism is determined by the number of Parquet
> files being read. For example, if we run a query against a Parquet table
> that consists of 8 partitions, we will attempt to run 8 async tasks in
> parallel, and if there is a single Parquet file, we will only try to run
> 1 async task, so this does not scale well. Also, if there are hundreds or
> thousands of Parquet files, we will try to process them all concurrently,
> which also doesn't scale well.
>
> These are the options for improving this situation:
>
> # Use Parquet row groups as the unit of partitioning and divide the number
> of row groups by the desired level of concurrency (defaulting to the number
> of cores).
> # Keep the file as the unit of partitioning and add a RepartitionExec into
> the plan if there are fewer partitions (files) than cores. In the case where
> there are more files than cores, split the files up into lists so that each
> partition is a list of files rather than a single file. Each partition task
> will process one file at a time.
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
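
The second option in the issue description (splitting the files into lists when there are more files than cores) can be illustrated with a minimal sketch. This is not the code from the pull request and not a DataFusion API: `group_files` and the `max_partitions` parameter are hypothetical names used only for illustration, and the sketch does not cover the RepartitionExec case where there are fewer files than cores.

```rust
/// Hypothetical helper (not part of DataFusion): distribute `files` across
/// at most `max_partitions` groups. Each group would be handled by one
/// partition task, which reads its files one at a time.
fn group_files(files: &[String], max_partitions: usize) -> Vec<Vec<String>> {
    if files.is_empty() || max_partitions == 0 {
        return Vec::new();
    }
    // Ceiling division: the smallest group size that fits every file
    // into no more than `max_partitions` groups.
    let chunk_size = (files.len() + max_partitions - 1) / max_partitions;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}

fn main() {
    // Assumed example input: 10 files and a target of 4 partitions.
    let files: Vec<String> = (0..10).map(|i| format!("part-{}.parquet", i)).collect();
    let groups = group_files(&files, 4);
    for (i, group) in groups.iter().enumerate() {
        println!("partition {}: {:?}", i, group);
    }
}
```

With 10 files and a target of 4 partitions, this yields groups of 3, 3, 3 and 1 files. Contiguous chunking is only one possible choice; a real implementation might balance groups by file size or row count instead.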