[ https://issues.apache.org/jira/browse/SPARK-22411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiao Li reopened SPARK-22411:
-----------------------------

> Heuristic to combine splits in DataSourceScanExec isn't accurate when dynamic allocation is enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22411
>                 URL: https://issues.apache.org/jira/browse/SPARK-22411
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Vinitha Reddy Gankidi
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Major
>             Fix For: 2.3.0
>
>
> The heuristic to calculate the maxSplitSize in DataSourceScanExec is as
> follows:
> https://github.com/apache/spark/blob/d28d5732ae205771f1f443b15b10e64dcffb5ff0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L431
> Default parallelism in this case is the total number of cores across all
> executors registered for this application. This works well with static
> allocation, but with dynamic allocation enabled this value is usually one
> (with the default config of min and initial executors as zero) at the time
> of split calculation. This heuristic was introduced in SPARK-14582.
> When dynamic allocation is enabled, it is confusing to tune the split size
> with this heuristic. It is better to ignore bytesPerCore and use the value
> of 'spark.sql.files.maxPartitionBytes' as the max split size.
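
To make the effect of default parallelism concrete, here is a minimal sketch of the arithmetic as I understand it from the linked line. It is plain Python, not Spark code: the function name and parameters are hypothetical, and the two defaults are assumed to match the documented defaults of 'spark.sql.files.maxPartitionBytes' (128 MB) and 'spark.sql.files.openCostInBytes' (4 MB).

```python
MB = 1024 * 1024


def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MB,
                    open_cost_in_bytes=4 * MB):
    """Hypothetical re-statement of the split-size heuristic.

    file_sizes:          sizes in bytes of the files selected for the scan
    default_parallelism: total cores of the executors registered at planning time
    """
    # Every file is padded with a fixed "open cost" before summing.
    total_bytes = sum(size + open_cost_in_bytes for size in file_sizes)
    # This is the term the issue proposes to ignore under dynamic allocation.
    bytes_per_core = total_bytes // default_parallelism
    # Clamp between the open cost and the configured maximum.
    return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))


files = [16 * MB] * 10  # ten 16 MB files, 200 MB once open cost is included

static_split = max_split_bytes(files, default_parallelism=8)
dynamic_split = max_split_bytes(files, default_parallelism=1)

print(static_split // MB, dynamic_split // MB)
```

With 8 cores registered, bytesPerCore is 25 MB and becomes the split size. With a default parallelism of one, bytesPerCore equals the whole input, so the result is simply capped at the configured maximum of 128 MB. The outcome therefore depends on how many executors happen to be registered when the splits are computed, which is what makes the setting hard to tune.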