Vinitha Reddy Gankidi created SPARK-22411:
---------------------------------------------

             Summary: Heuristic to combine splits in DataSourceScanExec isn't 
accurate when dynamic allocation is enabled
                 Key: SPARK-22411
                 URL: https://issues.apache.org/jira/browse/SPARK-22411
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Vinitha Reddy Gankidi
            Priority: Major


The heuristic to calculate the maxSplitSize in DataSourceScanExec is as follows:
https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431

Default parallelism in this case is the number of total cores of all the 
registered executors for this application. This works well with static 
allocation but with dynamic allocation enabled, this value is usually one (with 
default config of min and initial executors as zero) at the time of split 
calculation. This heuristic was introduced in SPARK-14582. 

When Dynamic allocation it is confusing to tune the split size with this 
heuristic. It is better to ignore bytesPerCore and use the values of 
'spark.sql.files.maxPartitionBytes' as the max split size. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to