Jianfu Li created SPARK-49485:
---------------------------------

             Summary: When dynamic allocation is enabled and the remaining 
executors are on the same host, speculative tasks will not be triggered
                 Key: SPARK-49485
                 URL: https://issues.apache.org/jira/browse/SPARK-49485
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.5.2
         Environment: Run this job on a YARN cluster with only one host.

set spark.dynamicAllocation.minExecutors = 3
{code:java}
import org.apache.spark.sql.SparkSession

object SpecHangJob {
  def main(args: Array[String]): Unit = {
    val spark =
      SparkSession.builder.appName("test").enableHiveSupport().getOrCreate()
    val sc = spark.sparkContext

    val conf = sc.getConf
    val taskNum = conf.getInt("spark.test.parallelism", 3)
    val seq = (0 until taskNum).toList
    sc.parallelize(seq, taskNum).map(
      i => {
        if (i == 0) {
          // Task 0 sleeps for 30 minutes to simulate a straggler ...
          try {
            Thread.sleep(1000 * 60 * 30)
          } catch {
            case e: Exception => println(e.getMessage)
          }
        } else {
          // ... while the other tasks finish after 30 seconds.
          try {
            Thread.sleep(1000 * 30)
          } catch {
            case e: Exception => println(e.getMessage)
          }
        }
        "haha"
      }
    ).collect()
  }
}
{code}
You will find that task 0 triggers speculative execution, but the speculative 
task never starts.

 
            Reporter: Jianfu Li


We have seen cases where, with dynamic allocation enabled, tasks launched on 
the remaining executors of a single slow host run for a very long time, 
keeping the Spark job in the running state. Speculative execution cannot fix 
the problem: with one slow task and three executors (set 
spark.dynamicAllocation.minExecutors = 3), maxNeeded will be 2, so 
ExecutorAllocationManager considers the current number of executors 
sufficient and does not request new ones. Even when a new executor is 
allocated, it may land on the same slow host. In both cases the job fails to 
finish for a long time. We should detect this situation, exclude the slow 
host, and request new executors.
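
The executor-count reasoning above can be sketched as follows. This is an 
illustrative model, not the actual ExecutorAllocationManager code: the 
function name and the exact formula are assumptions, but they capture why a 
pending speculative copy of one slow task never pushes the target above 
minExecutors.

{code:java}
object MaxNeededSketch {
  // Simplified version of the "how many executors do we need" calculation:
  // total outstanding tasks (running + pending speculative copies) divided
  // by the number of task slots per executor, rounded up.
  def maxExecutorsNeeded(runningTasks: Int,
                         pendingSpeculativeTasks: Int,
                         tasksPerExecutor: Int): Int = {
    val totalTasks = runningTasks + pendingSpeculativeTasks
    math.ceil(totalTasks.toDouble / tasksPerExecutor).toInt
  }

  def main(args: Array[String]): Unit = {
    // One slow task plus its pending speculative copy, one slot per executor:
    val needed = maxExecutorsNeeded(runningTasks = 1,
                                    pendingSpeculativeTasks = 1,
                                    tasksPerExecutor = 1)
    val currentExecutors = 3 // spark.dynamicAllocation.minExecutors
    // needed = 2 < 3, so no new executor is requested and the speculative
    // task has nowhere to run off the slow host.
    println(s"maxNeeded=$needed, requestNew=${needed > currentExecutors}")
  }
}
{code}

With maxNeeded = 2 and three executors already held by minExecutors, the 
manager sees a surplus, so the speculative task can only be scheduled on the 
existing (same-host) executors.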



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
