[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r30820880 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +406,81 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager(): Unit = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = {} + }) +} +catch { + case e: SecurityException => +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) +} + } + + /** + * Start the user class, which contains the spark driver, in a separate Thread. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + * + * Returns the user thread that was started. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED try { - // Copy val mainArgs = new Array[String](args.userArgs.size) args.userArgs.copyToArray(mainArgs, 0, args.userArgs.size) mainMethod.invoke(null, mainArgs) - // Some apps have "System.exit(0)" at the end. The user thread will stop here unless - // it has an uncaught exception thrown out. It needs a shutdown hook to set SUCCEEDED. - status = FinalApplicationStatus.SUCCEEDED + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) + logDebug("Done running users class") } catch { case e: InvocationTargetException => e.getCause match { case _: InterruptedException => // Reporter thread can interrupt to stop user class - - case e => throw e + case e: Exception => --- End diff -- This was changed in a subsequent PR. Check the current code. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user javabrett commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r30769030 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +406,81 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager(): Unit = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = {} + }) +} +catch { + case e: SecurityException => +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) +} + } + + /** + * Start the user class, which contains the spark driver, in a separate Thread. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + * + * Returns the user thread that was started. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED try { - // Copy val mainArgs = new Array[String](args.userArgs.size) args.userArgs.copyToArray(mainArgs, 0, args.userArgs.size) mainMethod.invoke(null, mainArgs) - // Some apps have "System.exit(0)" at the end. The user thread will stop here unless - // it has an uncaught exception thrown out. It needs a shutdown hook to set SUCCEEDED. - status = FinalApplicationStatus.SUCCEEDED + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) + logDebug("Done running users class") } catch { case e: InvocationTargetException => e.getCause match { case _: InterruptedException => // Reporter thread can interrupt to stop user class - - case e => throw e + case e: Exception => --- End diff -- I'm curious, should this be Throwable? If my application throws an uncaught Error, shouldn't that also result in FAILED, and would it (still) do so with this change? P.S. my Scala is not that strong. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/2577 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-58197006 Thanks @andrewor14. I've merged this into 1.2 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-58136343 LGTM, feel free to merge it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-58037937 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21330/Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-58037924 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21330/consoleFull) for PR 2577 at commit [`9c2efbf`](https://github.com/apache/spark/commit/9c2efbfd8d199bf89f911e44c7b07c6afe6b15bd). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class CacheTableCommand(tableName: String, plan: Option[LogicalPlan], isLazy: Boolean)` * `case class UncacheTableCommand(tableName: String) extends Command` * `case class CacheTableCommand(` * `case class UncacheTableCommand(tableName: String) extends LeafNode with Command ` * `case class DescribeCommand(child: SparkPlan, output: Seq[Attribute])(` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-58026499 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21330/consoleFull) for PR 2577 at commit [`9c2efbf`](https://github.com/apache/spark/commit/9c2efbfd8d199bf89f911e44c7b07c6afe6b15bd). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18459205 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +405,82 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager() = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} --- End diff -- In the future please clarify what you want bumped up as you said this prior and I thought you meant remove the extra space between 430 and 431. I assume you actually mean the } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18459132 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +405,82 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager() = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) +} + } + + /** + * Start the user class, which contains the spark driver, in a separate Thread. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + * + * Returns the user thread that was started. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED try { - // Copy val mainArgs = new Array[String](args.userArgs.size) args.userArgs.copyToArray(mainArgs, 0, args.userArgs.size) mainMethod.invoke(null, mainArgs) - // Some apps have "System.exit(0)" at the end. The user thread will stop here unless - // it has an uncaught exception thrown out. It needs a shutdown hook to set SUCCEEDED. - status = FinalApplicationStatus.SUCCEEDED + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) + logDebug("Done running users class") } catch { case e: InvocationTargetException => e.getCause match { case _: InterruptedException => // Reporter thread can interrupt to stop user class - - case e => throw e + case e: Throwable => --- End diff -- that is fine, but note you didn't comment on this one earlier, you commented somewhere else in the code. this one we end up re-throwing so I wasn't as concerned with it. I can change it --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57857279 Hey @tgravescs this LGTM pending a few minor comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18418623 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +405,82 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager() = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} --- End diff -- can you bump this up one line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18418613 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +405,82 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager() = { --- End diff -- can you add `: Unit` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18418362 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +405,82 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * This system security manager applies to the entire process. + * It's main purpose is to handle the case if the user code does a System.exit. + * This allows us to catch that and properly set the YARN application status and + * cleanup if needed. + */ + private def setupSystemSecurityManager() = { +try { + var stopped = false + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) +} + } + + /** + * Start the user class, which contains the spark driver, in a separate Thread. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + * + * Returns the user thread that was started. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED try { - // Copy val mainArgs = new Array[String](args.userArgs.size) args.userArgs.copyToArray(mainArgs, 0, args.userArgs.size) mainMethod.invoke(null, mainArgs) - // Some apps have "System.exit(0)" at the end. The user thread will stop here unless - // it has an uncaught exception thrown out. It needs a shutdown hook to set SUCCEEDED. - status = FinalApplicationStatus.SUCCEEDED + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) + logDebug("Done running users class") } catch { case e: InvocationTargetException => e.getCause match { case _: InterruptedException => // Reporter thread can interrupt to stop user class - - case e => throw e + case e: Throwable => --- End diff -- I still think we should catch only Exception here. All we ever do in `finish` is to kill the threads and log the exit code, and if we get a really bad `Throwable` that kills the JVM then these threads won't survive anyway. It's just that the JVM is not guaranteed to do whatever `finish` does properly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18418213 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +349,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. --- End diff -- minor, but can you add `SPARK-3779` to the comment so others know we're tracking this issue? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18418163 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -231,33 +259,26 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, val t = new Thread { override def run() { var failureCount = 0 - while (!finished) { try { -checkNumExecutorsFailed() -if (!finished) { +if (allocator.getNumExecutorsFailed >= maxNumExecutorFailures) { + finish(FinalApplicationStatus.FAILED, +ApplicationMaster.EXIT_MAX_EXECUTOR_FAILURES, +"Max number of executor failures reached") +} else { logDebug("Sending progress") allocator.allocateResources() } failureCount = 0 } catch { +case i: InterruptedException => case e: Throwable => { failureCount += 1 if (!NonFatal(e) || failureCount >= reporterMaxFailures) { --- End diff -- Since we're catching `InterruptedException` here it's always gonna be `NonFatal` right? I think this check is now outdated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57818934 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21250/consoleFull) for PR 2577 at commit [`e8cc261`](https://github.com/apache/spark/commit/e8cc261ba8e8b639d2fd375638ae5bb0925c1411). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57818940 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21250/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57809815 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21250/consoleFull) for PR 2577 at commit [`e8cc261`](https://github.com/apache/spark/commit/e8cc261ba8e8b639d2fd375638ae5bb0925c1411). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57808148 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21248/consoleFull) for PR 2577 at commit [`24c98e3`](https://github.com/apache/spark/commit/24c98e3154ee2ee93dc4c958ac982b534a798972). * This patch **fails** unit tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57808155 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21248/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57807655 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21248/consoleFull) for PR 2577 at commit [`24c98e3`](https://github.com/apache/spark/commit/24c98e3154ee2ee93dc4c958ac982b534a798972). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57807019 Addressed all the review comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18395494 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +348,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. +val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000) --- End diff -- https://issues.apache.org/jira/browse/SPARK-3779 filed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18395371 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +348,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. +val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000) --- End diff -- ok for this pr I'll leave it applicationMaster.waitTries and match cluster mode and I'll file a separate jira to clean it up. The documentation doesn't state how long each loop is for example. I think these would be better to just change to be a wait times versus number of tries and then they can be used for both modes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18251599 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => { +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) + } +} + --- End diff -- sounds good, I'll separate it out. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18249218 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => { +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) + } +} + --- End diff -- Not a big deal, but I think it'll make the content of this thread easier to read by minimizing the logic we put in it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18249133 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +348,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. +val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000) --- End diff -- It's kind of inconsistent to use `applicationMaster.client.waitTries` for client mode but `applicationMaster.waitTries` for cluster mode, and the existing documentation for the latter makes no mention of cluster mode even though it's only used there. It's fine to keep the `client` config here but we should make the other one `applicationMaster.cluster.waitTries` in a future JIRA and deprecate the less specific one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18248626 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,108 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18243719 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +348,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. +val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000) --- End diff -- yes the client was tacked on to mean it used in the client mode because the timing of the loops are different between the modes. Its an internal config right now so user shouldn't be setting. The timing is different because client mode is already up when this is launched, versus in cluster mode we are launching the user code, which takes some times (10's of seconds). I'll file a separate jira to fix up the mismatch in doc/config. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18243513 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,108 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18240426 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -231,33 +258,26 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, val t = new Thread { override def run() { var failureCount = 0 - -while (!finished) { +while (!finished && !Thread.currentThread().isInterrupted()) { --- End diff -- Is the second check needed? If the thread is interrupted won't this already have exited? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18240375 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -343,6 +371,11 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, Thread.sleep(100) } } + +if (!driverUp) { + throw new Exception("Failed to connect to driver!") --- End diff -- Can you throw `SparkException` here? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18240353 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -328,10 +348,18 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private def waitForSparkDriver(): ActorRef = { logInfo("Waiting for Spark driver to be reachable.") var driverUp = false +var count = 0 val hostport = args.userArgs(0) val (driverHost, driverPort) = Utils.parseHostPort(hostport) -while (!driverUp) { + +// spark driver should already be up since it launched us, but we don't want to +// wait forever, so wait 100 seconds max to match the cluster mode setting. +// Leave this config unpublished for now. +val numTries = sparkConf.getInt("spark.yarn.ApplicationMaster.client.waitTries", 1000) --- End diff -- This config should use camel case for `applicationMaster`. Also, there's already a `spark.yarn.applicationMaster.waitTries`. Does the extra `client` mean it's for client mode? Do we want a separate setting for client vs deploy modes here? By the way there is a mismatch between what is already there `spark.yarn.ApplicationMatser.waitTries` and what we document `spark.yarn.applicationMaster.waitTries`. I think this is a bug that we can fix separately. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18240121 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. --- End diff -- Thanks for documenting this. Can you add that this is started in a separate thread and this method returns that thread? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18240067 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} --- End diff -- can you bump this up one line --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18239755 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => { +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) + } +} + --- End diff -- No it could be pulled out, but its also not needed for yarn-client mode --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18239726 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,108 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18239465 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => { +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_SECURITY, + "Error in setSecurityManager") +logError("Error in setSecurityManager:", e) + } +} + --- End diff -- Does this block have to execute inside the thread, since it's a system-wide setting? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18239431 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +404,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. It's here to handle the case if the user code + // does System.exit + System.setSecurityManager(new java.lang.SecurityManager() { +override def checkExit(paramInt: Int) { + if (!stopped) { +logInfo("In securityManager checkExit, exit code: " + paramInt) +if (paramInt == 0) { + finish(FinalApplicationStatus.SUCCEEDED, ApplicationMaster.EXIT_SUCCESS) +} else { + finish(FinalApplicationStatus.FAILED, +paramInt, +"User class exited with non-zero exit code") +} +stopped = true + } +} + +// required for the checkExit to work properly +override def checkPermission(perm: java.security.Permission): Unit = { +} + }) +} +catch { + case e: SecurityException => { --- End diff -- nit: no need for `{` here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18238861 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -450,6 +511,15 @@ object ApplicationMaster extends Logging { val SHUTDOWN_HOOK_PRIORITY: Int = 30 + // exit codes for different causes, no reason behind the values + val EXIT_SUCCESS = 0 + val EXIT_UNCAUGHT_EXCEPTION = 10 + val EXIT_MAX_EXECUTOR_FAILURES = 11 + val EXIT_REPORTER_FAILURE = 12 + val EXIT_SC_NOT_INITED = 13 + val EXIT_SECURITY = 14 + val EXIT_EXCEPTION_USER_CLASS = 15 + --- End diff -- These should all be private --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18238788 --- Diff: yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala --- @@ -66,14 +70,16 @@ private class YarnRMClientImpl(args: ApplicationMasterArguments) extends YarnRMC appAttemptId } - override def shutdown(status: FinalApplicationStatus, diagnostics: String = "") = { -val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest]) - .asInstanceOf[FinishApplicationMasterRequest] -finishReq.setAppAttemptId(getAttemptId()) -finishReq.setFinishApplicationStatus(status) -finishReq.setDiagnostics(diagnostics) -finishReq.setTrackingUrl(uiHistoryAddress) -resourceManager.finishApplicationMaster(finishReq) + override def unregister(status: FinalApplicationStatus, diagnostics: String = "") = synchronized { +if (registered) { + val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest]) +.asInstanceOf[FinishApplicationMasterRequest] --- End diff -- this pr didn't change this code, other then wrapping it with an if. Its also going to be deprecated soon so I don't see a reason to fix it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user andrewor14 commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18238499 --- Diff: yarn/alpha/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClientImpl.scala --- @@ -66,14 +70,16 @@ private class YarnRMClientImpl(args: ApplicationMasterArguments) extends YarnRMC appAttemptId } - override def shutdown(status: FinalApplicationStatus, diagnostics: String = "") = { -val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest]) - .asInstanceOf[FinishApplicationMasterRequest] -finishReq.setAppAttemptId(getAttemptId()) -finishReq.setFinishApplicationStatus(status) -finishReq.setDiagnostics(diagnostics) -finishReq.setTrackingUrl(uiHistoryAddress) -resourceManager.finishApplicationMaster(finishReq) + override def unregister(status: FinalApplicationStatus, diagnostics: String = "") = synchronized { +if (registered) { + val finishReq = Records.newRecord(classOf[FinishApplicationMasterRequest]) +.asInstanceOf[FinishApplicationMasterRequest] --- End diff -- You probably don't need this cast --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57347769 LGTM. Thanks! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57325372 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21040/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57325361 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21040/consoleFull) for PR 2577 at commit [`fab166d`](https://github.com/apache/spark/commit/fab166dba852f732049dda112daab17a491dd94c). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57314584 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21040/consoleFull) for PR 2577 at commit [`fab166d`](https://github.com/apache/spark/commit/fab166dba852f732049dda112daab17a491dd94c). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57313993 thanks for the review @vanzin. I've updated it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18183146 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18182939 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18181156 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18180991 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18180627 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18179848 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57209937 Looks ok to me, although the exception handling does feel a little paranoid. :-) Just had a few nits. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174887 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -383,40 +432,80 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, } } + /** + * Start the user class, which contains the spark driver. + * If the main routine exits cleanly or exits with System.exit(0) we + * assume it was successful, for all other cases we assume failure. + */ private def startUserClass(): Thread = { logInfo("Starting the user JAR in a separate Thread") System.setProperty("spark.executor.instances", args.numExecutors.toString) +var stopped = false val mainMethod = Class.forName(args.userClass, false, Thread.currentThread.getContextClassLoader).getMethod("main", classOf[Array[String]]) -userClassThread = new Thread { +val userThread = new Thread { override def run() { -var status = FinalApplicationStatus.FAILED + +try { + // Note this security manager applies to the entire process, not + // just this thread. Its here to handle the case if the user code --- End diff -- nit: It's --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174148 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174477 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174440 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -232,32 +285,27 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, override def run() { var failureCount = 0 -while (!finished) { +while (!finished && !Thread.currentThread().isInterrupted()) { try { -checkNumExecutorsFailed() -if (!finished) { + --- End diff -- nit: blank line not needed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174328 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18174075 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -71,80 +74,134 @@ private[spark] class ApplicationMaster(args: ApplicationMasterArguments, private val sparkContextRef = new AtomicReference[SparkContext](null) final def run(): Int = { -val appAttemptId = client.getAttemptId() +try { + val appAttemptId = client.getAttemptId() -if (isDriver) { - // Set the web ui port to be ephemeral for yarn so we don't conflict with - // other spark processes running on the same box - System.setProperty("spark.ui.port", "0") + if (isDriver) { +// Set the web ui port to be ephemeral for yarn so we don't conflict with +// other spark processes running on the same box +System.setProperty("spark.ui.port", "0") - // Set the master property to match the requested mode. - System.setProperty("spark.master", "yarn-cluster") +// Set the master property to match the requested mode. +System.setProperty("spark.master", "yarn-cluster") - // Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. - System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) -} +// Propagate the application ID so that YarnClusterSchedulerBackend can pick it up. +System.setProperty("spark.yarn.app.id", appAttemptId.getApplicationId().toString()) + } -logInfo("ApplicationAttemptId: " + appAttemptId) + logInfo("ApplicationAttemptId: " + appAttemptId) -val cleanupHook = new Runnable { - override def run() { -// If the SparkContext is still registered, shut it down as a best case effort in case -// users do not call sc.stop or do System.exit(). -val sc = sparkContextRef.get() -if (sc != null) { - logInfo("Invoking sc stop from shutdown hook") - sc.stop() - finish(FinalApplicationStatus.SUCCEEDED) -} + val cleanupHook = new Runnable { +override def run() { + // If the SparkContext is still registered, shut it down as a best case effort in case + // users do not call sc.stop or do System.exit(). + val sc = sparkContextRef.get() + if (sc != null) { +logInfo("Invoking sc stop from shutdown hook") +sc.stop() + } + val maxAppAttempts = client.getMaxRegAttempts(yarnConf) + val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts + + if (!finished) { +// this shouldn't ever happen, but if it does assume weird failure +finish(FinalApplicationStatus.FAILED, + ApplicationMaster.EXIT_UNCAUGHT_EXCEPTION, + "shutdown hook called without cleanly finishing") + } -// Cleanup the staging dir after the app is finished, or if it's the last attempt at -// running the AM. -val maxAppAttempts = client.getMaxRegAttempts(yarnConf) -val isLastAttempt = client.getAttemptId().getAttemptId() >= maxAppAttempts -if (finished || isLastAttempt) { - cleanupStagingDir() + if (!unregistered) { +// we only want to unregister if we don't want the RM to retry +if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) { + unregister(finalStatus, finalMsg) + cleanupStagingDir() +} + } } } -} -// Use higher priority than FileSystem. -assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) -ShutdownHookManager - .get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) + // Use higher priority than FileSystem. + assert(ApplicationMaster.SHUTDOWN_HOOK_PRIORITY > FileSystem.SHUTDOWN_HOOK_PRIORITY) + ShutdownHookManager +.get().addShutdownHook(cleanupHook, ApplicationMaster.SHUTDOWN_HOOK_PRIORITY) -// Call this to force generation of secret so it gets populated into the -// Hadoop UGI. This has to happen before the startUserClass which does a -// doAs in order for the credentials to be passed on to the executor containers. -val securityMgr = new SecurityManager(sparkConf) + // Call this to force generation of secret so it gets populated into the + // Hadoop UGI. This has to happen before the startUserClass whi
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18173150 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -450,6 +539,15 @@ object ApplicationMaster extends Logging { val SHUTDOWN_HOOK_PRIORITY: Int = 30 + // exit codes for different causes, no reason behind the values --- End diff -- The application Master is not an executor so I chose not to use it. It also doesn't have the same exit reasons which could be useful if the user has an exit code and wants to know what that matches up to --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user witgo commented on a diff in the pull request: https://github.com/apache/spark/pull/2577#discussion_r18170801 --- Diff: yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala --- @@ -450,6 +539,15 @@ object ApplicationMaster extends Logging { val SHUTDOWN_HOOK_PRIORITY: Int = 30 + // exit codes for different causes, no reason behind the values --- End diff -- We can use this class? [ExecutorExitCode](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/ExecutorExitCode.scala) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57187866 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20975/consoleFull) for PR 2577 at commit [`32f4dfa`](https://github.com/apache/spark/commit/32f4dfa5d0778e3e7bd7c5da27c71f8c3f5930f6). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57187881 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20975/ --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57185274 also note this does change everything to allow yarn to retry. previously when it hit the maximum number of executor failures it didn't retry the AM. I waffled back and forth on this one. At first the thought was that if that many executors are dying its probably an issue with the user code, but then again if you have a really long running job then I can think of situations you want it to retry.Anyone have strong opinion on that? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user tgravescs commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57176392 @witgo can you verify this covers https://github.com/apache/spark/pull/2311 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2577#issuecomment-57176963 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20975/consoleFull) for PR 2577 at commit [`32f4dfa`](https://github.com/apache/spark/commit/32f4dfa5d0778e3e7bd7c5da27c71f8c3f5930f6). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-3627] - [yarn] - fix exit code and fina...
GitHub user tgravescs opened a pull request: https://github.com/apache/spark/pull/2577 [SPARK-3627] - [yarn] - fix exit code and final status reporting to RM See the description and whats handled in the jira comment: https://issues.apache.org/jira/browse/SPARK-3627?focusedCommentId=14150013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14150013 This does not handle yarn client mode reporting of the driver to the AM. I think that should be handled when we make it an unmanaged AM. You can merge this pull request into a Git repository by running: $ git pull https://github.com/tgravescs/spark SPARK-3627 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2577.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2577 commit d3cc80050dfb472c426d1c1399edd9134e53a563 Author: Thomas Graves Date: 2014-09-29T13:38:43Z SPARK-3627 - yarn - fix exit code and final status reporting to RM commit f0b65199f50bae99aa89ea3a5915b610d7134392 Author: Thomas Graves Date: 2014-09-29T15:05:29Z change order of cleanup staging dir commit 32f4dfa5d0778e3e7bd7c5da27c71f8c3f5930f6 Author: Thomas Graves Date: 2014-09-29T15:10:09Z switch back --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org