[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...

2018-07-09 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21729#discussion_r200990424
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -1365,6 +1365,113 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
     assert(taskOption4.get.addedJars === addedJarsMidTaskSet)
   }
 
+  test("SPARK-24755 Executor loss can cause task to not be resubmitted") {
+    val conf = new SparkConf().set("spark.speculation", "true")
+    sc = new SparkContext("local", "test", conf)
+    // Set the speculation multiplier to be 0 so speculative tasks are launched immediately
+    sc.conf.set("spark.speculation.multiplier", "0.0")
+    sc.conf.set("spark.speculation.quantile", "0.5")
+    sc.conf.set("spark.speculation", "true")
+
+    var killTaskCalled = false
+    sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+      ("exec2", "host2"), ("exec3", "host3"))
+    sched.initialize(new FakeSchedulerBackend() {
+      override def killTask(taskId: Long,
+                            executorId: String,
+                            interruptThread: Boolean,
+                            reason: String): Unit = {
+        // Check that the only killTask event in this case is triggered by
+        // the completion of task 2.1.
+        assert(taskId === 2)
+        assert(executorId === "exec3")
+        assert(interruptThread)
+        assert(reason === "another attempt succeeded")
+        killTaskCalled = true
+      }
+    })
+
+    // Keep track of the indexes of tasks that are resubmitted,
+    // so that the test can check that tasks are resubmitted correctly.
+    var resubmittedTasks = new mutable.HashSet[Int]
+    val dagScheduler = new FakeDAGScheduler(sc, sched) {
+      override def taskEnded(task: Task[_],
+                             reason: TaskEndReason,
--- End diff ---

ditto


---




[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...

2018-07-09 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21729#discussion_r200990279
  
--- Diff: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ---
@@ -1365,6 +1365,113 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg
     assert(taskOption4.get.addedJars === addedJarsMidTaskSet)
   }
 
+  test("SPARK-24755 Executor loss can cause task to not be resubmitted") {
+    val conf = new SparkConf().set("spark.speculation", "true")
+    sc = new SparkContext("local", "test", conf)
+    // Set the speculation multiplier to be 0 so speculative tasks are launched immediately
+    sc.conf.set("spark.speculation.multiplier", "0.0")
+    sc.conf.set("spark.speculation.quantile", "0.5")
+    sc.conf.set("spark.speculation", "true")
+
+    var killTaskCalled = false
+    sched = new FakeTaskScheduler(sc, ("exec1", "host1"),
+      ("exec2", "host2"), ("exec3", "host3"))
+    sched.initialize(new FakeSchedulerBackend() {
+      override def killTask(taskId: Long,
+                            executorId: String,
--- End diff ---

nit: indent
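
Presumably this refers to the Spark style convention of breaking after the opening parenthesis and double-indenting each parameter by four spaces, rather than hanging the parameters under the first one. A sketch of the expected form, assuming that is the nit being raised:

      override def killTask(
          taskId: Long,
          executorId: String,
          interruptThread: Boolean,
          reason: String): Unit = {
        ...
      }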


---




[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...

2018-07-09 Thread xuanyuanking
Github user xuanyuanking commented on a diff in the pull request:

https://github.com/apache/spark/pull/21729#discussion_r200989413
  
--- Diff: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ---
@@ -87,7 +87,7 @@ private[spark] class TaskSetManager(
   // Set the coresponding index of Boolean var when the task killed by other attempt tasks,
--- End diff --

typo I made before, coresponding -> corresponding.


---




[GitHub] spark pull request #21729: SPARK-24755 Executor loss can cause task to not b...

2018-07-08 Thread hthuynh2
GitHub user hthuynh2 opened a pull request:

https://github.com/apache/spark/pull/21729

SPARK-24755 Executor loss can cause task to not be resubmitted

**Description**
As described in [SPARK-24755](https://issues.apache.org/jira/browse/SPARK-24755), when speculation is enabled, there is a scenario in which the loss of an executor can cause a task to never be resubmitted.
This patch changes the variable killedByOtherAttempt to keep track of the taskIds of the tasks that were killed by another attempt. This way we still avoid resubmitting a task that was killed because another attempt succeeded, while correctly resubmitting the successful attempt when the executor it ran on is lost.
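
A minimal sketch of the idea (hypothetical stand-in names; the real change lives in TaskSetManager and is in the patch linked below):

    import scala.collection.mutable.HashSet

    // Sketch only, not the actual patch: illustrates the shape of the change.
    object KilledByOtherAttemptSketch {
      val numTasks = 10

      // Before: one Boolean per partition index. Killing speculative attempt
      // 2.1 marks index 2, so when the executor running the successful
      // attempt 2.0 is later lost, resubmission is skipped even though the
      // map output for partition 2 disappeared with the executor.
      val killedByOtherAttemptOld = new Array[Boolean](numTasks)

      // After: track the taskIds of the killed attempts instead, so only
      // those specific attempts are exempted from resubmission.
      val killedByOtherAttempt = new HashSet[Long]

      // Roughly the per-attempt decision on executor loss for a completed
      // shuffle map task that ran on the lost executor:
      def shouldResubmit(tid: Long): Boolean = !killedByOtherAttempt.contains(tid)
    }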

**How was this patch tested?**
A UT is added based on the UT written by @xuanyuanking, modified to simulate the scenario described in SPARK-24755.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/hthuynh2/spark SPARK_24755

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21729


commit 093e39cf76378821284ef7d771e819afb69930ae
Author: Hieu Huynh <hieu.huynh@...>
Date:   2018-07-08T18:20:26Z

SPARK-24755 Executor loss can cause task to not be resubmitted




---
