[GitHub] spark issue #11205: [SPARK-11334][Core] Handle maximum task failure situatio...
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/11205

I guess the issue still exists; let me verify it again, and if it does I will bring the PR up to date with the latest. Thanks!

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rustagi commented on the issue: https://github.com/apache/spark/pull/11205

Sorry, I haven't been able to confirm this patch because we have not seen the issue in production for quite some time. It was much more persistent with 2.0 than 2.1; not sure of the cause.
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/11205

This PR is pretty old and a lot has changed since, but it looks like this can be fixed now by just fixing code to look at `stageIdToTaskIndices` instead of keeping `numRunningTasks` around? (Or maybe use `numRunningTasks` as a cache for `stageIdToTaskIndices.values.sum`.) Also, doesn't `isExecutorIdle` take care of the second bullet in your description?
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11205

**[Test build #83067 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83067/testReport)** for PR 11205 at commit [`59f9c15`](https://github.com/apache/spark/commit/59f9c156c3ad746f84f385bcf277685c9c329286).
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/11205

@vanzin, in the current code `stageIdToTaskIndices` cannot be used to track the number of running tasks, because that structure does not remove a task's index when the task finishes successfully. Yes, `isExecutorIdle` is used to take care of executor idleness, but the way it identifies whether an executor is idle is not robust enough. In this scenario, when a stage is aborted because of max task failures, some task-end events will be missing, so counting tasks per executor leaves residual data behind and makes the executor appear busy forever.
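The residue described here can be sketched as follows. This is a hypothetical simplification, not the actual `ExecutorAllocationManager` code: idleness is derived from a per-executor running-task count that is only decremented by task-end events, so an aborted stage whose task-end events never arrive leaves the executor looking busy forever.

```scala
import scala.collection.mutable

object IdleTrackingSketch {
  // Hypothetical model: one running-task counter per executor, drained
  // only by task-end events.
  private val tasksPerExecutor = mutable.Map[String, Int]().withDefaultValue(0)

  def onTaskStart(exec: String): Unit = tasksPerExecutor(exec) += 1
  def onTaskEnd(exec: String): Unit = tasksPerExecutor(exec) -= 1
  def isExecutorIdle(exec: String): Boolean = tasksPerExecutor(exec) == 0

  def main(args: Array[String]): Unit = {
    onTaskStart("exec-1")             // a task launches on exec-1
    // The stage is aborted after max task failures: the task-end event for
    // this task is never delivered, so the count never drains back to zero.
    assert(!isExecutorIdle("exec-1")) // executor is considered busy forever
  }
}
```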
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/11205

Verified again; it looks like the 2nd bullet is not valid anymore. I cannot reproduce it on the latest master branch, so it might have already been fixed by SPARK-13054. Only the first issue still exists, and I think @sitalkedia's PR is enough to handle it, so I'm going to close this one. @sitalkedia, would you please reopen your PR? Sorry to bring in noise.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/11205

**[Test build #83067 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/83067/testReport)** for PR 11205 at commit [`59f9c15`](https://github.com/apache/spark/commit/59f9c156c3ad746f84f385bcf277685c9c329286).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11205

Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/11205

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/83067/
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11205

gentle ping @rustagi
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/11205

gentle ping @rustagi, have you maybe had some time to confirm this patch? It sounds like the only thing we need here is that confirmation.
Github user sadhen commented on the issue: https://github.com/apache/spark/pull/11205

@jerryshao I think the 2nd bullet has not been fixed by SPARK-13054. I use Spark 2.1.1, and I still find that finished tasks remain in:

```
private val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]
```

But `numRunningTasks` equals 0 since:

```
if (numRunningTasks != 0) {
  logWarning("No stages are running, but numRunningTasks != 0")
  numRunningTasks = 0
}
```
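The mismatch described here can be modeled with a small sketch (a hypothetical simplification of the listener bookkeeping, not the real code): the defensive reset zeroes `numRunningTasks`, but nothing drains the stale task ids, so the executor never looks idle.

```scala
import scala.collection.mutable

object ResidueSketch {
  var numRunningTasks = 0
  val executorIdToTaskIds = new mutable.HashMap[String, mutable.HashSet[Long]]

  def onTaskStart(exec: String, taskId: Long): Unit = {
    numRunningTasks += 1
    executorIdToTaskIds.getOrElseUpdate(exec, new mutable.HashSet[Long]) += taskId
  }

  // Called when no stages are running any more.
  def onStageCompleted(): Unit = {
    if (numRunningTasks != 0) {
      // mirrors the defensive reset quoted above
      numRunningTasks = 0
    }
    // ...but executorIdToTaskIds is not cleaned up here, so a task whose
    // end event was lost leaves a stale id behind.
  }
}
```

With one lost task-end event, `numRunningTasks` ends up 0 while `executorIdToTaskIds` still holds the task id, which matches the observation above.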
Github user rustagi commented on the issue: https://github.com/apache/spark/pull/11205

I am seeing this issue quite frequently. Not sure what is causing it, but we often get an onTaskEnd event after a stage has ended, which causes numRunningTasks to become negative. If the executor count is then updated, the number of required executors (maxNumExecutorsNeeded) becomes negative, and there are issues in new executor allocation and deallocation. In the best case you get executors that cannot be deallocated; over time, Spark stops allocating new executors even when tasks are pending. There is a simple hacky patch here: https://github.com/apache/spark/pull/9288, and this one is an attempt to correct it with more accountability. I am seeing this issue so frequently that I am not sure it is possible to run Spark with dynamic allocation successfully for a long duration without fixing it. I'll try the hacky patch and confirm.
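The failure mode described here can be made concrete with a sketch (assumed names and an assumed rounded-up target formula; the real computation lives in `ExecutorAllocationManager`): once a late onTaskEnd drives the counter negative, the executor target can round down to zero even with tasks pending.

```scala
object NegativeCounterSketch {
  var numRunningTasks = 0
  // Hypothetical: tasks one executor can run at once
  // (roughly spark.executor.cores / spark.task.cpus).
  val tasksPerExecutorSlot = 4

  // Assumed rounded-up executor target based on running + pending tasks.
  def maxNumExecutorsNeeded(pendingTasks: Int): Int =
    (numRunningTasks + pendingTasks + tasksPerExecutorSlot - 1) / tasksPerExecutorSlot

  def main(args: Array[String]): Unit = {
    numRunningTasks = 0
    assert(maxNumExecutorsNeeded(1) == 1) // healthy: 1 pending task -> 1 executor
    numRunningTasks -= 1                  // late onTaskEnd after the stage ended
    assert(maxNumExecutorsNeeded(1) == 0) // target rounds down: nothing is requested
  }
}
```

The second assertion is the bug in miniature: a task is pending, but the negative counter cancels it out and no executor is requested.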
Github user rustagi commented on the issue: https://github.com/apache/spark/pull/11205

I can confirm that disabling speculation and setting max task failures to 1 eliminates this problem. Will try the patch and confirm.