[ 
https://issues.apache.org/jira/browse/SPARK-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reopened SPARK-22411:
-----------------------------

> Heuristic to combine splits in DataSourceScanExec isn't accurate when dynamic 
> allocation is enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22411
>                 URL: https://issues.apache.org/jira/browse/SPARK-22411
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Vinitha Reddy Gankidi
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Major
>             Fix For: 2.3.0
>
>
> The heuristic to calculate the maxSplitSize in DataSourceScanExec is as 
> follows:
> https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
> Default parallelism in this case is the number of total cores of all the 
> registered executors for this application. This works well with static 
> allocation but with dynamic allocation enabled, this value is usually one 
> (with default config of min and initial executors as zero) at the time of 
> split calculation. This heuristic was introduced in SPARK-14582. 
> When Dynamic allocation it is confusing to tune the split size with this 
> heuristic. It is better to ignore bytesPerCore and use the values of 
> 'spark.sql.files.maxPartitionBytes' as the max split size. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to