[GitHub] spark pull request: [SPARK-13642][Yarn] Properly handle signal kill in ApplicationMaster
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-196207275 Reopen it and change the title.

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jerryshao closed the pull request at: https://github.com/apache/spark/pull/11512
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-196165073 Sure, thanks @tgravescs and @vanzin for your comments. I will close this and create new PRs accordingly.
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-195370827 Yes, it's only if they explicitly call System.exit, or someone puts a System.exit back into Spark, but that shouldn't happen except on failure. OK @jerryshao, can you update the PR for master to just change the default behavior of the shutdown hook to assume failure? Then open a PR against 1.6 with this signal handler.
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-195137045 The signal handler change looks fine to me; I agree with Saisai that the race would happen very rarely and shouldn't be a problem. Regarding the change in behavior... > I'm wondering if we just don't change the behavior for 2.x to report failed by default So this would only apply to applications that do `System.exit` explicitly, right? At this point in time I hope people have stopped doing that, so it should be fine. A quick grep in Spark code shows that `System.exit` seems to be used only for error cases, so the change would actually help with that (not that Spark should be calling `System.exit` itself, but that's a different problem).
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-194475748 So if they catch exceptions themselves, we have to assume success, which is what we do now. I just hate to add more logic and not really solve the underlying problem, but unfortunately it's really hard for us to know, since it's arbitrary user code; we can't just ask the scheduler if it's done. So, thinking about this some more, I'm wondering whether for 2.x we should just change the behavior to report failed by default. The reason we do it this way is that the Spark code itself and the examples used to call System.exit(); we removed those quite a few releases back. People shouldn't be calling System.exit and should just be calling sc.stop(). 2.x gives us an opportunity to change that behavior. @vanzin @pwendell, do you have any concerns about this? If we want to pull this into the 1.6.x line, I would be OK with adding the signal handler to fix this one issue.
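[Editorial sketch] The default-to-failure proposal above can be illustrated with a small conceptual sketch. This is Python rather than Spark's Scala, and the class, method, and status names here are hypothetical illustrations, not Spark APIs: a shutdown hook reports FAILED unless the application explicitly reached a clean stop.

```python
import atexit

SUCCEEDED, FAILED = "SUCCEEDED", "FAILED"

class AppMasterSketch:
    """Toy stand-in for YARN's ApplicationMaster (all names hypothetical)."""

    def __init__(self):
        # Default to FAILED: an exit that never reached a clean stop()
        # (for example a bare System.exit in user code) is reported as a failure.
        self.final_status = FAILED
        atexit.register(self._shutdown_hook)

    def stop(self):
        # Only an explicit, orderly stop marks the run as succeeded,
        # analogous to user code calling sc.stop().
        self.final_status = SUCCEEDED

    def _shutdown_hook(self):
        # In the real AM this is where the app unregisters with the RM.
        print(f"unregistering with final status: {self.final_status}")

am = AppMasterSketch()
am.stop()  # without this call, the hook would report FAILED
```

The point of the design change is exactly this inversion: instead of assuming success when the JVM exits without an exception, the shutdown hook assumes failure unless the application told it otherwise.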
Github user jerryshao commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-194097212 Thanks a lot for your comment @tgravescs. I think the scenario you mentioned, though it has a race condition, seldom happens, and currently we don't have a solution to handle it, since the driver and AM are independent. Yes, normally we should expect the scheduler to throw all the exceptions, but you cannot expect arbitrary user code to start a job immediately after `sc` is created so that an exception would definitely be thrown to the AM. What if the main function catches all the exceptions?
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-193855548 So I think the problem with this is still that we don't know the real state of the application. There is still a race condition here where the user's application and SparkContext could have succeeded (i.e. wrote output and was about to finish), but if we get a SIGTERM at the wrong point we mark it as failed rather than succeeded. Normally, if Spark was actually running anything, the DAGScheduler would throw an exception. I think your use case here is that the driver is sleeping and not running anything distributed, so the scheduler doesn't throw; it just calls sc.stop(). I would rather explore having the scheduler throw if it has any jobs, including inactive ones, or allowing the application master to know the state of the Spark jobs.
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/11512#discussion_r55375068

--- Diff: yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala ---

```scala
@@ -116,6 +118,21 @@ private[spark] class ApplicationMaster(

   private var delegationTokenRenewerOption: Option[AMDelegationTokenRenewer] = None

+  if (SystemUtils.IS_OS_UNIX) {
+    // Register signal handlers for "TERM", "INT" and "HUP". For the cases where the AM
+    // receives a signal and stops, from the RM's perspective this application needs to be
+    // reattempted, rather than marked as a success.
+    // This replaces the SignalLogger handler on the AM side.
+    val signalHandler = new SignalHandler() {
+      override def handle(sig: Signal): Unit = {
+
+        logInfo(s"received signal: ${sig.getName}")
```

End diff --

I'd rather see this log the same information as the message in the SignalLogger: `log.error("RECEIVED SIGNAL " + signal.getNumber() + ": SIG" + signal.getName())`. You can add something to indicate it is from the AM and not the signal logger if you want.
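[Editorial sketch] The log format requested above ("RECEIVED SIGNAL <number>: SIG<name>") can be reproduced with a short sketch. Python's `signal` module is used here purely for illustration; the actual handler under review is Scala code built on `sun.misc.Signal`:

```python
import signal

def log_signal(signum, frame):
    # Emit the same message shape as Spark's SignalLogger,
    # e.g. "RECEIVED SIGNAL 15: SIGTERM".
    print(f"RECEIVED SIGNAL {signum}: {signal.Signals(signum).name}")

# Register for TERM, INT and HUP, the same signals the patch handles.
for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
    signal.signal(sig, log_signal)
```

Including the signal number alongside the name keeps the AM's log line grep-compatible with the existing SignalLogger output.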
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-192131877 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-192131879 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/52448/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-192131732 **[Test build #52448 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52448/consoleFull)** for PR 11512 at commit [`b461b71`](https://github.com/apache/spark/commit/b461b717ed51b532f823615bcb79f66b17635c4d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11512#issuecomment-192125457 **[Test build #52448 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52448/consoleFull)** for PR 11512 at commit [`b461b71`](https://github.com/apache/spark/commit/b461b717ed51b532f823615bcb79f66b17635c4d).
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/11512 [SPARK-13642][Yarn] Properly handle signal kill in ApplicationMaster

## What changes were proposed in this pull request?

This patch fixes the race condition in ApplicationMaster when receiving a signal. In the current implementation, if a signal is received and no exception is thrown, the application finishes with a successful state in Yarn and there is no further attempt. In fact the application was killed by a signal at runtime, so another attempt is expected. This patch adds a signal handler: if a signal is received, the application is marked as finished with failure rather than success.

## How was this patch tested?

This patch was tested in the following situations:

1. Application finishes normally.
2. Application finishes by calling `System.exit(n)`.
3. Application is killed by a yarn command.
4. ApplicationMaster is killed by a "SIGTERM" sent by the `kill pid` command.
5. ApplicationMaster is killed by the NM with "SIGTERM" in case of NM failure.

All the scenarios return the expected states.

CC @tgravescs, please help to review this fix, thanks a lot.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-13642

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/11512.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #11512

commit b461b717ed51b532f823615bcb79f66b17635c4d
Author: jerryshao
Date: 2016-03-04T05:52:18Z
Properly handle signal kill in ApplicationMaster
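[Editorial sketch] The overall approach of the patch — treat a kill signal as a failed run so YARN can schedule another attempt — can be sketched as follows. The sketch is Python for brevity (the real code is Scala inside ApplicationMaster), and `final_status`, `on_kill`, and `EXIT_SIGNAL_BASE` are illustrative names, not Spark identifiers:

```python
import signal
import sys

EXIT_SIGNAL_BASE = 128  # conventional Unix exit-code offset for signal deaths

# Illustrative stand-in for the final status the AM reports when unregistering.
final_status = {"value": "SUCCEEDED"}

def on_kill(signum, frame):
    # Receiving TERM/INT/HUP means the application was killed, not that it
    # finished cleanly, so record a failure; in the real AM this leads YARN
    # to reattempt the application instead of marking it successful.
    final_status["value"] = "FAILED"
    sys.exit(EXIT_SIGNAL_BASE + signum)

for sig in (signal.SIGTERM, signal.SIGINT, signal.SIGHUP):
    signal.signal(sig, on_kill)
```

Exiting with `128 + signum` follows the shell convention for processes killed by a signal (e.g. 143 for SIGTERM), which makes the cause of death visible to whoever inspects the exit code.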