[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-05-05 Thread tgravescs
Github user tgravescs commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-217158802
  
Sorry, somehow I missed this go by; I haven't looked at the code changes in 
detail yet. The `TaskEnd` event should be sent all the time now, since we fixed 
that bug a while back. Or is it because it's out of order?

Can you describe in more detail the exact issue and how this change fixes it?


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-05-04 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-217052874
  
Also cc @vanzin @srowen


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread jerryshao
Github user jerryshao commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-186486879
  
Hi @andrewor14 , thanks a lot for your comments.

The reason I introduced another data structure to track each executor's 
stage and task numbers was mentioned before; I'll paste it here again:

>Executors may never be idle. Currently we use the executor-to-tasks 
mapping to identify the status of executors; in the maximum-task-failure 
situation, some TaskEnd events may never be delivered, which makes the related 
executor appear busy forever.

According to my tests, `TaskEnd` events may not be delivered in the expected 
numbers, which means the related executors are never released. So compared to 
the old implementation, I changed it to clean up the related task counts when 
a stage completes. That's why I introduced a more complicated data structure.
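To make the intended mechanics concrete, here is a minimal, self-contained sketch of the idea (not the actual patch; the class and method names are illustrative): running tasks are counted per executor *per stage*, so a stage-completed event can wipe out that stage's counts everywhere, even when some `TaskEnd` events were lost.

```scala
import scala.collection.mutable

// Sketch only: counts running tasks per executor per stage, so that a
// stage-completed event can clear a stage's counts even when some TaskEnd
// events were never delivered. Names are illustrative, not Spark's.
class ExecutorBusyTracker {
  // executorId -> (stageId -> number of running tasks of that stage)
  private val executorIdToStageAndNumTasks =
    mutable.HashMap.empty[String, mutable.HashMap[Int, Int]]

  def onTaskStart(executorId: String, stageId: Int): Unit = {
    val stageCounts =
      executorIdToStageAndNumTasks.getOrElseUpdate(executorId, mutable.HashMap.empty)
    stageCounts(stageId) = stageCounts.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(executorId: String, stageId: Int): Unit = {
    executorIdToStageAndNumTasks.get(executorId).foreach { stageCounts =>
      stageCounts.get(stageId).foreach { n =>
        if (n <= 1) stageCounts -= stageId else stageCounts(stageId) = n - 1
      }
      if (stageCounts.isEmpty) executorIdToStageAndNumTasks -= executorId
    }
  }

  // The key difference from relying on TaskEnd alone: when a stage
  // completes (including an abort at max task failures), drop its counts
  // everywhere so executors do not look busy forever.
  def onStageCompleted(stageId: Int): Unit = {
    executorIdToStageAndNumTasks.values.foreach(_ -= stageId)
    val nowIdle = executorIdToStageAndNumTasks.collect {
      case (execId, counts) if counts.isEmpty => execId
    }
    nowIdle.foreach(executorIdToStageAndNumTasks -= _)
  }

  def isExecutorIdle(executorId: String): Boolean =
    !executorIdToStageAndNumTasks.contains(executorId)
}
```

With this shape, a lost `TaskEnd` only leaves a stale count until its stage completes, rather than pinning the executor as busy forever.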


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-186368937
  
@jerryshao I took a look at this and it seems overly complicated. The root 
problem appears to be that `totalRunningTasks` can sometimes go negative, and 
that leads to undesirable behavior. Can't we fix this by deriving it instead, 
e.g. `stageIdToTaskIndices.values.map(_.size).sum`?
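For reference, a tiny sketch of what that derived count could look like, assuming `stageIdToTaskIndices` keeps the shape it has in `ExecutorAllocationManager` (stage id to a set of task indices); the object name and the sample data are illustrative:

```scala
import scala.collection.mutable

object DerivedCountSketch extends App {
  // stageId -> indices of tasks currently tracked for that stage
  val stageIdToTaskIndices = mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  // Derived rather than independently counted, so it cannot drift negative
  // when TaskEnd events go missing or arrive out of order.
  def totalRunningTasks: Int = stageIdToTaskIndices.values.map(_.size).sum

  stageIdToTaskIndices.getOrElseUpdate(0, mutable.HashSet.empty) ++= Seq(0, 1, 2)
  stageIdToTaskIndices.getOrElseUpdate(1, mutable.HashSet.empty) += 0
  assert(totalRunningTasks == 4)
}
```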


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/11205#discussion_r53505158
  
--- Diff: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala ---
@@ -890,6 +890,43 @@ class ExecutorAllocationManagerSuite
     assert(removeTimes(manager) === Map.empty)
   }
 
+  test("correctly handle task failure more than 4 times situation") {
--- End diff --

Say "max failures" here instead of "4 times", in case we want to change the 
limit in the future.
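Something like the following, reading the limit from configuration instead of hardcoding it. A minimal sketch: `spark.task.maxFailures` is the real config key (default 4), but the suite name, test name, and elided body are illustrative, not the actual change.

```scala
import org.apache.spark.SparkConf
import org.scalatest.FunSuite

// Sketch of the suggested rename; the real suite extends Spark's own
// test scaffolding, and the assertions are elided here.
class MaxFailuresNamingSketch extends FunSuite {
  test("correctly handle reaching the maximum number of task failures") {
    val maxTaskFailures = new SparkConf().getInt("spark.task.maxFailures", 4)
    // ... simulate maxTaskFailures failed attempts of one task, then assert
    // that running-task counts stay non-negative and the executor is
    // eventually marked idle ...
    assert(maxTaskFailures >= 1)
  }
}
```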


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/11205#discussion_r53505109
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
 
     private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
     private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-    private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-    // Number of tasks currently running on the cluster.  Should be 0 when no stages are active.
-    private var numRunningTasks: Int = _
+    private val executorIdToStageAndNumTasks =
+      new mutable.HashMap[String, mutable.HashMap[Int, Int]]
+
+    // Track stage id and related running tasks.
+    private val stageIdToRunningTasks = new mutable.HashMap[Int, mutable.HashSet[Int]]
--- End diff --

Wait, actually this is the exact same thing as `stageIdToTaskIndices`


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/11205#discussion_r53498775
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
 
     private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
    private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-    private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-    // Number of tasks currently running on the cluster.  Should be 0 when no stages are active.
-    private var numRunningTasks: Int = _
+    private val executorIdToStageAndNumTasks =
+      new mutable.HashMap[String, mutable.HashMap[Int, Int]]
+
+    // Track stage id and related running tasks.
+    private val stageIdToRunningTasks = new mutable.HashMap[Int, mutable.HashSet[Int]]
--- End diff --

this needs to specify task index. Right now it's not clear whether the 
values are task indices or task IDs.
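For context, Spark's `TaskInfo` carries both a global `taskId: Long` (unique per task attempt) and an `index: Int` (the task's position within its stage, shared across retries), so the name should say which one the set holds. A hedged sketch of the disambiguated version; the class and field names here are illustrative:

```scala
import scala.collection.mutable
import org.apache.spark.scheduler.TaskInfo

class RunningTaskIndexTracker {
  // stageId -> task *indices* (not task IDs) currently running in that stage
  private val stageIdToRunningTaskIndices =
    mutable.HashMap.empty[Int, mutable.HashSet[Int]]

  def onTaskStart(stageId: Int, info: TaskInfo): Unit = {
    // info.index is the within-stage index; info.taskId would be the Long ID
    stageIdToRunningTaskIndices.getOrElseUpdate(stageId, mutable.HashSet.empty) += info.index
  }
}
```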


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on a diff in the pull request:

https://github.com/apache/spark/pull/11205#discussion_r53498726
  
--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
 
     private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
     private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-    private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-    // Number of tasks currently running on the cluster.  Should be 0 when no stages are active.
-    private var numRunningTasks: Int = _
+    private val executorIdToStageAndNumTasks =
+      new mutable.HashMap[String, mutable.HashMap[Int, Int]]
--- End diff --

this is a very complicated data structure. Why do we need to keep track of 
what stage each executor is running? Is there a simpler way to fix this?


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-19 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-186352005
  
Just to add a link here to the previous PR #9288


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-184122268
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51298/


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-184122265
  
Merged build finished. Test PASSed.


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-184121750
  
**[Test build #51298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51298/consoleFull)** for PR 11205 at commit [`966eb89`](https://github.com/apache/spark/commit/966eb891ba8da1936412b9894335ee7ce41b4c4b).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11205#issuecomment-184088639
  
**[Test build #51298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51298/consoleFull)** for PR 11205 at commit [`966eb89`](https://github.com/apache/spark/commit/966eb891ba8da1936412b9894335ee7ce41b4c4b).


[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...

2016-02-14 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/11205

[SPARK-11334][Core] Handle maximum task failure situation in dynamic 
allocation

Currently there are two problems in dynamic allocation when the maximum 
number of task failures is reached:

1. The number of running tasks can go negative, which throws off the 
calculation of how many executors are needed (a stylized sketch of this 
follows below).
2. Executors may never become idle. We currently use the executor-to-tasks 
mapping to identify the status of executors; in the maximum-task-failure 
situation, some `TaskEnd` events may never be delivered, which makes the 
related executor appear busy forever.

This patch tries to fix these two issues; please review, thanks a lot.

CC @andrewor14 and @tgravescs .
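To make problem 1 concrete, here is a stylized illustration (not the Spark code; all names are invented for the example) of how a bare counter drifts negative when a stage is aborted at max task failures and its bookkeeping is cleared before every matching `TaskEnd` has arrived:

```scala
object NegativeCountSketch extends App {
  var numRunningTasks = 0

  def onTaskStart(): Unit = numRunningTasks += 1
  def onTaskEnd(): Unit = numRunningTasks -= 1
  // Stage aborted at max task failures: the count is reset while events
  // for its tasks may still be in flight (or never arrive at all).
  def onStageCompleted(): Unit = numRunningTasks = 0

  onTaskStart(); onTaskStart()
  onStageCompleted()             // abort clears the count
  onTaskEnd()                    // a straggling TaskEnd still arrives
  assert(numRunningTasks == -1)  // the negative state problem 1 describes
}
```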

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-11334

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/11205.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #11205


commit 966eb891ba8da1936412b9894335ee7ce41b4c4b
Author: jerryshao 
Date:   2016-02-15T06:32:01Z

Fix maximum task failure issue in dynamic allocation



