[GitHub] spark pull request: [SPARK-11334][Core] Handle maximum task failur...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-217158802

Sorry, somehow I missed this go by; I haven't looked at the code changes in detail yet. The TaskEnd event should be sent every time now; we fixed that bug a while back. Or is it because it's out of order? Can you describe in more detail the exact issue and how this change fixes it?

--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-217052874

Also cc @vanzin @srowen
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-186486879

Hi @andrewor14, thanks a lot for your comments. The reason I introduced another data structure to track each executor's per-stage task counts was mentioned before; I'll paste it here again:

> Executors may never be idle. Currently we use the executor-to-tasks mapping to identify the status of executors; in the maximum-task-failure situation, some TaskEnd events may never be delivered, which makes the related executor always appear busy.

According to my tests, fewer TaskEnd events than expected may be delivered, which would leave an executor marked busy forever. So compared to the old implementation, I changed the code to clean up the related task counts when a stage completes. That's why I introduced a more complicated data structure.
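The approach described above can be sketched as follows. This is a minimal, hypothetical model with illustrative names (not the actual patch): per-executor task counts are keyed by stage, so a completed stage's counts can be dropped wholesale even when some TaskEnd events were never delivered.

```scala
import scala.collection.mutable

// Hypothetical sketch: track running-task counts per executor, per stage.
object StageScopedCounts {
  val executorIdToStageAndNumTasks =
    new mutable.HashMap[String, mutable.HashMap[Int, Int]]

  def onTaskStart(execId: String, stageId: Int): Unit = {
    val byStage = executorIdToStageAndNumTasks
      .getOrElseUpdate(execId, new mutable.HashMap[Int, Int])
    byStage(stageId) = byStage.getOrElse(stageId, 0) + 1
  }

  def onTaskEnd(execId: String, stageId: Int): Unit = {
    executorIdToStageAndNumTasks.get(execId).foreach { byStage =>
      byStage.get(stageId).foreach { n =>
        if (n <= 1) byStage.remove(stageId) else byStage(stageId) = n - 1
      }
    }
  }

  // When a stage completes (including an abort after max task failures),
  // forget its counts everywhere, so executors can become idle again even
  // if some of the stage's TaskEnd events were lost.
  def onStageCompleted(stageId: Int): Unit =
    executorIdToStageAndNumTasks.values.foreach(_.remove(stageId))

  def isExecutorIdle(execId: String): Boolean =
    executorIdToStageAndNumTasks.get(execId).forall(_.isEmpty)
}

StageScopedCounts.onTaskStart("exec-1", 0)
StageScopedCounts.onTaskStart("exec-1", 0)
StageScopedCounts.onTaskEnd("exec-1", 0)
assert(!StageScopedCounts.isExecutorIdle("exec-1")) // one task still counted

// Stage 0 is aborted; its second TaskEnd never arrives, but cleanup on
// stage completion still releases the executor.
StageScopedCounts.onStageCompleted(0)
assert(StageScopedCounts.isExecutorIdle("exec-1"))
```

The trade-off under discussion in this thread is exactly the nested map: it makes the cleanup robust to lost events, at the cost of a more complicated structure.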
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-186368937

@jerryshao I took a look at this and it looks overly complicated. It seems that the problem is that we sometimes end up with a negative `totalRunningTasks`, and that leads to undesirable behavior. Can't we fix this by expressing that in terms of `stageIdToTaskIndices.map(_.values.size).sum`?
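The suggestion above is to derive the running-task count from the per-stage task-index sets rather than maintain a separate counter that can drift negative. A self-contained sketch of the idea (note: the expression as quoted would not compile on a `HashMap[Int, mutable.HashSet[Int]]`, since mapping over a map yields key/value pairs; `stageIdToTaskIndices.values.map(_.size).sum` is a compiling variant with the apparent intent):

```scala
import scala.collection.mutable

// Sketch: derive the total from the per-stage sets instead of a counter.
val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]

def totalRunningTasks: Int = stageIdToTaskIndices.values.map(_.size).sum

stageIdToTaskIndices.getOrElseUpdate(0, new mutable.HashSet[Int]) ++= Set(0, 1, 2)
stageIdToTaskIndices.getOrElseUpdate(1, new mutable.HashSet[Int]) += 0
assert(totalRunningTasks == 4)

// Removing a completed (or aborted) stage drops all of its tasks at once,
// so the derived total can never go below zero, even if some TaskEnd
// events for that stage were never delivered.
stageIdToTaskIndices.remove(0)
assert(totalRunningTasks == 1)
```

A derived value like this cannot get out of sync the way an independently incremented and decremented counter can, which is the appeal of the simpler fix.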
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/11205#discussion_r53505158

--- Diff: core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala ---
@@ -890,6 +890,43 @@ class ExecutorAllocationManagerSuite
     assert(removeTimes(manager) === Map.empty)
   }

+  test("correctly handle task failure more than 4 times situation") {
--- End diff --

Say "max failures" here instead of "4 times" in case we want to change this in the future.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/11205#discussion_r53505158

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
   private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
   private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-  private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-  // Number of tasks currently running on the cluster. Should be 0 when no stages are active.
-  private var numRunningTasks: Int = _
+  private val executorIdToStageAndNumTasks =
+    new mutable.HashMap[String, mutable.HashMap[Int, Int]]
+
+  // Track stage id and related running tasks.
+  private val stageIdToRunningTasks = new mutable.HashMap[Int, mutable.HashSet[Int]]
--- End diff --

Wait, actually this is the exact same thing as `stageIdToTaskIndices`.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/11205#discussion_r53498775

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
   private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
   private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-  private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-  // Number of tasks currently running on the cluster. Should be 0 when no stages are active.
-  private var numRunningTasks: Int = _
+  private val executorIdToStageAndNumTasks =
+    new mutable.HashMap[String, mutable.HashMap[Int, Int]]
+
+  // Track stage id and related running tasks.
+  private val stageIdToRunningTasks = new mutable.HashMap[Int, mutable.HashSet[Int]]
--- End diff --

This needs to specify "task index". Right now it's not clear whether the values are task indices or task IDs.
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/11205#discussion_r53498726

--- Diff: core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala ---
@@ -540,9 +540,11 @@ private[spark] class ExecutorAllocationManager(
   private val stageIdToNumTasks = new mutable.HashMap[Int, Int]
   private val stageIdToTaskIndices = new mutable.HashMap[Int, mutable.HashSet[Int]]
-  private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
-  // Number of tasks currently running on the cluster. Should be 0 when no stages are active.
-  private var numRunningTasks: Int = _
+  private val executorIdToStageAndNumTasks =
+    new mutable.HashMap[String, mutable.HashMap[Int, Int]]
--- End diff --

This is a very complicated data structure. Why do we need to keep track of what stage each executor is running? Is there a simpler way to fix this?
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-186352005

Just to add a link here to the previous PR #9288
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-184122268

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/51298/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-184122265

Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-184121750

**[Test build #51298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51298/consoleFull)** for PR 11205 at commit [`966eb89`](https://github.com/apache/spark/commit/966eb891ba8da1936412b9894335ee7ce41b4c4b).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11205#issuecomment-184088639

**[Test build #51298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/51298/consoleFull)** for PR 11205 at commit [`966eb89`](https://github.com/apache/spark/commit/966eb891ba8da1936412b9894335ee7ce41b4c4b).
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/11205

[SPARK-11334][Core] Handle maximum task failure situation in dynamic allocation

Currently there are two problems in dynamic allocation when the maximum task failure count is reached:

1. The number of running tasks can go negative, which affects the calculation of the number of executors needed.
2. Executors may never be idle. Currently we use the executor-to-tasks mapping to identify the status of executors; in the maximum-task-failure situation, some `TaskEnd` events may never be delivered, which makes the related executor always appear busy.

This patch tries to fix these two issues; please review, thanks a lot. CC @andrewor14 and @tgravescs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-11334

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11205.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11205

commit 966eb891ba8da1936412b9894335ee7ce41b4c4b
Author: jerryshao
Date: 2016-02-15T06:32:01Z

    Fix maximum task failure issue in dynamic allocation
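The two problems in the PR description can be illustrated with a minimal, hypothetical model of the naive bookkeeping (illustrative names only; this is not Spark's actual listener code):

```scala
import scala.collection.mutable

// Hypothetical model of counter-based task bookkeeping, showing how it
// breaks when TaskEnd events are lost or arrive without a matching start.
object NaiveTaskCounter {
  private var numRunningTasks = 0
  private val executorIdToTaskIds =
    new mutable.HashMap[String, mutable.HashSet[Long]]

  def onTaskStart(execId: String, taskId: Long): Unit = {
    numRunningTasks += 1
    executorIdToTaskIds.getOrElseUpdate(execId, new mutable.HashSet[Long]) += taskId
  }

  def onTaskEnd(execId: String, taskId: Long): Unit = {
    // Decrements unconditionally, so a stray end event with no matching
    // recorded start drives the counter negative (problem 1).
    numRunningTasks -= 1
    executorIdToTaskIds.get(execId).foreach(_ -= taskId)
  }

  def isExecutorBusy(execId: String): Boolean =
    executorIdToTaskIds.get(execId).exists(_.nonEmpty)

  def runningTasks: Int = numRunningTasks
}

NaiveTaskCounter.onTaskStart("exec-1", 0L)

// Two stray end events with no recorded starts: the count goes negative.
NaiveTaskCounter.onTaskEnd("exec-2", 98L)
NaiveTaskCounter.onTaskEnd("exec-2", 99L)
assert(NaiveTaskCounter.runningTasks == -1)

// Task 0's stage is aborted after max failures and its TaskEnd is never
// delivered, so exec-1 looks busy forever and is never released (problem 2).
assert(NaiveTaskCounter.isExecutorBusy("exec-1"))
```

The patch under review addresses both symptoms by cleaning up per-stage state when a stage completes, rather than trusting that every TaskEnd event arrives.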