[ https://issues.apache.org/jira/browse/SPARK-49485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-49485:
-----------------------------------
    Labels: pull-request-available  (was: )

> When dynamic allocation is enabled and the remaining executors are on the same 
> host, speculative tasks will not be triggered
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-49485
>                 URL: https://issues.apache.org/jira/browse/SPARK-49485
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.5.2
>         Environment: run this job on a YARN cluster with only 1 host and set 
> spark.dynamicAllocation.minExecutors = 3 (a sketch of the full configuration assumed 
> for this repro follows the job code below).
> {code:java}
> import org.apache.spark.sql.SparkSession
> 
> object SpecHangJob {
>   def main(args: Array[String]): Unit = {
>     val spark = SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
>     val sc = spark.sparkContext
>     val conf = sc.getConf
>     val taskNum = conf.getInt("spark.test.parallelism", 3)
>     val seq = (0 until taskNum).toList
>     sc.parallelize(seq, taskNum).map(
>       i => {
>         if (i == 0) {
>           // task 0 is the slow task: it sleeps for 30 minutes
>           try {
>             Thread.sleep(1000 * 60 * 30)
>           } catch {
>             case exception: Exception => println(exception.getMessage)
>           }
>         } else {
>           // all other tasks finish after 30 seconds
>           try {
>             Thread.sleep(1000 * 30)
>           } catch {
>             case e: Exception => println(e.getMessage)
>           }
>         }
>         "haha"
>       }
>     ).collect()
>   }
> } {code}
> You will find that task 0 is marked for speculative execution, but the speculative 
> task never starts.
>  
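> A minimal sketch of the configuration assumed for this repro (only 
> spark.dynamicAllocation.minExecutors = 3 is stated above; speculation must also be 
> enabled for a speculative copy of task 0 to be requested at all, and the remaining 
> values are illustrative assumptions, not taken from the report):
> {code:java}
> import org.apache.spark.SparkConf
> 
> // Assumed repro configuration: only minExecutors = 3 comes from the report;
> // the quantile/multiplier values are the Spark defaults, shown for clarity.
> val conf = new SparkConf()
>   .set("spark.dynamicAllocation.enabled", "true")
>   .set("spark.dynamicAllocation.minExecutors", "3")
>   .set("spark.dynamicAllocation.shuffleTracking.enabled", "true") // or an external shuffle service
>   .set("spark.speculation", "true")           // required for any speculative task
>   .set("spark.speculation.quantile", "0.75")  // default
>   .set("spark.speculation.multiplier", "1.5") // default
> {code}
> With these settings the scheduler marks task 0 speculatable once the other tasks 
> finish; the bug reported here is that no executor off the slow host is ever requested 
> to run the speculative copy.
>  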
>            Reporter: Jianfu Li
>            Priority: Critical
>              Labels: pull-request-available
>
> We have seen cases where, with dynamic allocation enabled, tasks launched on the 
> remaining executors of a single slow host run for a very long time, keeping the 
> Spark job in a running state far longer than it should. Speculative execution cannot 
> fix the problem: with one slow task and three executors 
> (spark.dynamicAllocation.minExecutors = 3), maxNeed will be 2, so 
> ExecutorAllocationManager considers the current number of executors sufficient and 
> does not request new ones. Even when a new executor is allocated, it may land on the 
> same host. In both cases the job does not finish for a long time. We should detect 
> this case, exclude the slow host, and request new executors.
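> A rough, REPL-style sketch of the executor-count arithmetic described above 
> (illustrative only, not the actual ExecutorAllocationManager code; the function and 
> parameter names are made up for this example):
> {code:java}
> // Approximates the "how many executors are needed" calculation to show why one
> // slow task plus one pending speculative copy never pushes the target above
> // minExecutors = 3, so no extra executor on a different host is ever requested.
> def targetExecutors(runningTasks: Int,
>                     pendingTasks: Int,
>                     pendingSpeculativeTasks: Int,
>                     tasksPerExecutor: Int,
>                     minExecutors: Int): Int = {
>   val maxNeed = math.ceil(
>     (runningTasks + pendingTasks + pendingSpeculativeTasks).toDouble / tasksPerExecutor
>   ).toInt
>   math.max(maxNeed, minExecutors)
> }
> 
> // One slow task running, its speculative copy pending, nothing else queued:
> // maxNeed = ceil((1 + 0 + 1) / 1) = 2, which is below minExecutors = 3, so the
> // manager keeps the existing 3 executors and requests nothing new, even though
> // all 3 sit on the same slow host.
> targetExecutors(runningTasks = 1, pendingTasks = 0, pendingSpeculativeTasks = 1,
>   tasksPerExecutor = 1, minExecutors = 3)   // => 3, no growth
> {code}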



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to