[ 
https://issues.apache.org/jira/browse/ARROW-10995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andy Grove resolved ARROW-10995.
--------------------------------
    Resolution: Fixed

Issue resolved by pull request 9029
[https://github.com/apache/arrow/pull/9029]

> [Rust] [DataFusion] Improve parallelism when reading Parquet files
> ------------------------------------------------------------------
>
>                 Key: ARROW-10995
>                 URL: https://issues.apache.org/jira/browse/ARROW-10995
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust - DataFusion
>            Reporter: Andy Grove
>            Assignee: Andy Grove
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Currently the unit of parallelism is the number of Parquet files being read.
> For example, if we run a query against a Parquet table that consists of 8 
> partitions, we will attempt to run 8 async tasks in parallel. If there is a 
> single Parquet file, we will only run 1 async task, so this does not scale 
> well. Conversely, if there are hundreds or thousands of Parquet files, we 
> will try to process them all concurrently, which also doesn't scale well.
> These are the options for improving this situation:
>  
>  # Use Parquet row groups as the unit of partitioning and divide the number 
> of row groups by the desired level of concurrency (defaulting to number of 
> cores)
>  # Keep the file as the unit of partitioning. If there are fewer partitions 
> (files) than cores, add a RepartitionExec to the plan. If there are more 
> files than cores, split the files into lists so that each partition is a 
> list of files rather than a single file; each partition task then processes 
> one file at a time.
>  
>  
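
The grouping step that both options rely on (dividing row groups, or files, among a fixed number of partitions) can be sketched roughly as follows. This is only an illustration using the standard library: `split_into_groups` and `target_partitions` are hypothetical names, not DataFusion APIs, and nothing here is meant to describe what pull request 9029 actually implemented.

```rust
/// Hypothetical helper (not part of DataFusion): split `items` (file paths or
/// row-group indices) into at most `target_partitions` contiguous groups whose
/// sizes differ by at most one. Each group would be handled by one task, which
/// processes its items one at a time.
fn split_into_groups<T: Clone>(items: &[T], target_partitions: usize) -> Vec<Vec<T>> {
    if items.is_empty() || target_partitions == 0 {
        return Vec::new();
    }
    // Never create more groups than there are items.
    let n = target_partitions.min(items.len());
    let base = items.len() / n;
    let extra = items.len() % n;

    let mut groups = Vec::with_capacity(n);
    let mut start = 0;
    for i in 0..n {
        // The first `extra` groups take one additional item each.
        let len = base + if i < extra { 1 } else { 0 };
        groups.push(items[start..start + len].to_vec());
        start += len;
    }
    groups
}
```

With 10 files and a target of 4 partitions this yields groups of 3, 3, 2 and 2 files. With fewer items than the target (for example a single file), it yields one group per item, which is the case where option 2 would additionally need a RepartitionExec to reach the desired concurrency.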



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
