[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21729#discussion_r200990424

--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---

@@ -1365,6 +1365,113 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
     assert(taskOption4.get.addedJars === addedJarsMidTaskSet)
   }

+  test("SPARK-24755 Executor loss can cause task to not be resubmitted") {
+    val conf = new SparkConf().set("spark.speculation", "true")
+    sc = new SparkContext("local", "test", conf)
+    // Set the speculation multiplier to be 0 so speculative tasks are launched immediately
+    sc.conf.set("spark.speculation.multiplier", "0.0")
+    sc.conf.set("spark.speculation.quantile", "0.5")
+    sc.conf.set("spark.speculation", "true")
+
+    var killTaskCalled = false
+    sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+      ("exec2", "host2"), ("exec3", "host3"))
+    sched.initialize(new FakeSchedulerBackend() {
+      override def killTask(taskId: Long,
+        executorId: String,
+        interruptThread: Boolean,
+        reason: String): Unit = {
+        // Check the only one killTask event in this case, which triggered by
+        // task 2.1 completed.
+        assert(taskId === 2)
+        assert(executorId === "exec3")
+        assert(interruptThread)
+        assert(reason === "another attempt succeeded")
+        killTaskCalled = true
+      }
+    })
+
+    // Keep track of the index of tasks that are resubmitted,
+    // so that the test can check that task is resubmitted correctly
+    var resubmittedTasks = new mutable.HashSet[Int]
+    val dagScheduler = new FakeDAGScheduler(sc, sched) {
+      override def taskEnded(task: Task[_],
+        reason: TaskEndReason,
--- End diff --

ditto

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21729#discussion_r200990279

--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---

@@ -1365,6 +1365,113 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
     assert(taskOption4.get.addedJars === addedJarsMidTaskSet)
   }

+  test("SPARK-24755 Executor loss can cause task to not be resubmitted") {
+    val conf = new SparkConf().set("spark.speculation", "true")
+    sc = new SparkContext("local", "test", conf)
+    // Set the speculation multiplier to be 0 so speculative tasks are launched immediately
+    sc.conf.set("spark.speculation.multiplier", "0.0")
+    sc.conf.set("spark.speculation.quantile", "0.5")
+    sc.conf.set("spark.speculation", "true")
+
+    var killTaskCalled = false
+    sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+      ("exec2", "host2"), ("exec3", "host3"))
+    sched.initialize(new FakeSchedulerBackend() {
+      override def killTask(taskId: Long,
+        executorId: String,
--- End diff --

nit: indent
Github user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21729#discussion_r200989413

--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---

@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
--- End diff --

typo I made before, coresponding -> corresponding.
GitHub user hthuynh2 opened a pull request:

    https://github.com/apache/spark/pull/21729

SPARK-24755 Executor loss can cause task to not be resubmitted

**Description**
As described in [SPARK-24755](https://issues.apache.org/jira/browse/SPARK-24755), when speculation is enabled, there is a scenario in which executor loss can cause a task to not be resubmitted. This patch changes the variable killedByOtherAttempt to keep track of the taskIds of tasks that are killed by another attempt. This way, we still avoid resubmitting a task that was killed by another attempt, while correctly resubmitting a task whose successful attempt ran on the lost executor.

**How was this patch tested?**
A unit test is added, based on the one written by @xuanyuanking, modified to simulate the scenario described in SPARK-24755.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/hthuynh2/spark SPARK_24755

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21729.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21729

----
commit 093e39cf76378821284ef7d771e819afb69930ae
Author: Hieu Huynh <hieu.huynh@...>
Date: 2018-07-08T18:20:26Z

    SPARK-24755 Executor loss can cause task to not be resubmitted
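The idea in the description above can be sketched as follows. This is a minimal, illustrative sketch, not the actual TaskSetManager code: the object and method names here are hypothetical. The key point is that tracking killed attempts by taskId (rather than a per-index Boolean) lets the scheduler skip resubmission only for the exact attempts killed by another attempt, while a successful attempt that ran on a lost executor is still resubmitted.

```scala
import scala.collection.mutable

// Hypothetical sketch of the fix described above (names are illustrative,
// not Spark's actual API).
object KilledByOtherAttemptSketch {
  // taskIds of attempts that were killed because another attempt succeeded
  private val killedByOtherAttempt = mutable.HashSet.empty[Long]

  def recordKilledByOtherAttempt(taskId: Long): Unit =
    killedByOtherAttempt += taskId

  // On executor loss, a completed attempt must be resubmitted unless that
  // exact attempt was killed by another (successful) attempt of the task.
  def shouldResubmit(taskId: Long): Boolean =
    !killedByOtherAttempt.contains(taskId)

  def main(args: Array[String]): Unit = {
    // Attempt with taskId 2 was killed because another attempt succeeded.
    recordKilledByOtherAttempt(2L)
    println(shouldResubmit(2L)) // prints: false (do not resubmit)
    println(shouldResubmit(3L)) // prints: true  (resubmit on executor loss)
  }
}
```

With the pre-fix Boolean-per-index tracking, losing the executor that ran the successful attempt could incorrectly suppress resubmission of the whole task; keying on taskId distinguishes the killed attempt from the successful one.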