[ https://issues.apache.org/jira/browse/SPARK-49485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-49485:
-----------------------------------
    Labels: pull-request-available  (was: )

> When dynamic allocation is enabled and the remaining executors all lie on the same host, speculative tasks will not be launched
> ------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-49485
>                 URL: https://issues.apache.org/jira/browse/SPARK-49485
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.2
>        Environment: Run this job on a YARN cluster with only 1 host, with
> spark.dynamicAllocation.minExecutors = 3.
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> object SpecHangJob {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
>     val sc = spark.sparkContext
>     val taskNum = sc.getConf.getInt("spark.test.parallelism", 3)
>     val seq = (0 until taskNum).toList
>     sc.parallelize(seq, taskNum).map { i =>
>       // Task 0 is the straggler: it sleeps 30 minutes; every other task sleeps 30 seconds.
>       val sleepMillis = if (i == 0) 1000 * 60 * 30 else 1000 * 30
>       try {
>         Thread.sleep(sleepMillis)
>       } catch {
>         case e: Exception => println(e.getMessage)
>       }
>       "haha"
>     }.collect()
>   }
> }
> {code}
> You will find that task 0 triggers speculative execution, but the speculative task never starts.
>
>            Reporter: Jianfu Li
>            Priority: Critical
>              Labels: pull-request-available
>
> We have seen cases where, with dynamic allocation enabled, tasks launched on the remaining executors of a single slow host run for a long time, keeping the Spark job in the running state. Speculative execution cannot fix this: with one slow task and three executors (spark.dynamicAllocation.minExecutors = 3), the maximum number of executors needed is computed as 2, so ExecutorAllocationManager considers the current executor count sufficient and does not request new executors. Even if a new executor is allocated, it may land on the same host. Either way, the job takes a long time to finish. We should detect this case, exclude the old host, and request new executors.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
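The sizing argument in the description can be sketched as follows. This is an illustrative model only, assuming one task slot per executor; the object and method names are hypothetical and this is not Spark's actual ExecutorAllocationManager code:

```scala
// Illustrative sketch of the capacity check described in the issue
// (hypothetical names, NOT Spark's real implementation).
object CapacitySketch {
  // Assume one task slot per executor for simplicity:
  // executors needed = running tasks + pending speculative copies.
  def maxExecutorsNeeded(runningTasks: Int, pendingSpeculativeTasks: Int): Int =
    runningTasks + pendingSpeculativeTasks

  def main(args: Array[String]): Unit = {
    val currentExecutors = 3  // spark.dynamicAllocation.minExecutors
    val needed = maxExecutorsNeeded(runningTasks = 1, pendingSpeculativeTasks = 1)
    // needed = 2 <= 3, so the manager sees capacity as sufficient and never
    // asks for an executor that could land on a different host; the
    // speculative copy has nowhere off the slow host to run.
    println(s"needed=$needed currentExecutors=$currentExecutors " +
      s"requestMore=${needed > currentExecutors}")
  }
}
```

Under this simplified model, the cluster already has more executors than the computed need, which matches the report that no new executor (and hence no off-host speculative task) is ever requested.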