[jira] [Resolved] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-6449.
------------------------------
    Resolution: Duplicate
    Fix Version/s: (was: 1.3.0)

Driver OOM results in reported application result SUCCESS
---------------------------------------------------------

    Key: SPARK-6449
    URL: https://issues.apache.org/jira/browse/SPARK-6449
    Project: Spark
    Issue Type: Bug
    Components: YARN
    Affects Versions: 1.3.0
    Reporter: Ryan Williams

I ran a job yesterday that, according to the History Server and the YARN RM, finished with status {{SUCCESS}}. Clicking around the history server UI, too few stages had run, and I couldn't figure out why. Finally, inspecting the end of the driver's logs, I saw:

{code}
15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
Exception in thread Driver scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055
{code}

The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log. This should be logged as a failed job and reported as such to YARN.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
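The failure mode described above can be sketched in a few lines of Scala. This is a hypothetical simplification, not the actual ApplicationMaster code: the names {{AmStatus}}, {{reportBuggy}}, and {{reportFixed}} are invented for illustration. The key point is that a non-exhaustive {{match}} on an exception's cause throws {{scala.MatchError}} for any unanticipated cause (such as {{OutOfMemoryError}}), so the status variable is never set to a failure value and the default {{SUCCEEDED}} is reported; adding a wildcard case fixes it.

```scala
import java.lang.reflect.InvocationTargetException

// Hypothetical sketch of the bug pattern, NOT the real Spark code.
object AmStatus {
  // Buggy shape: the inner match on the cause is not exhaustive. An
  // unexpected cause (e.g. OutOfMemoryError) throws scala.MatchError,
  // which kills the thread before `status` can be set to a failure
  // value -- so the default SUCCEEDED is what gets reported.
  def reportBuggy(body: () => Unit): String = {
    var status = "SUCCEEDED" // default reported at shutdown
    try body()
    catch {
      case e: InvocationTargetException =>
        try {
          e.getCause match {
            case _: InterruptedException => status = "KILLED"
            // no wildcard case: any other cause raises scala.MatchError
          }
        } catch {
          // stands in for the MatchError escaping the Driver thread;
          // status is still "SUCCEEDED" afterwards
          case _: MatchError => ()
        }
    }
    status
  }

  // Fixed shape: a wildcard case marks any other cause as a failure.
  def reportFixed(body: () => Unit): String = {
    var status = "SUCCEEDED"
    try body()
    catch {
      case e: InvocationTargetException =>
        e.getCause match {
          case _: InterruptedException => status = "KILLED"
          case _: Throwable            => status = "FAILED"
        }
    }
    status
  }
}
```

With a body that wraps an {{OutOfMemoryError}}, the buggy version reports SUCCEEDED while the fixed version reports FAILED, matching the behavior the log excerpt shows.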
[jira] [Resolved] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Williams resolved SPARK-6449.
----------------------------------
    Resolution: Implemented
    Fix Version/s: 1.3.0

Driver OOM results in reported application result SUCCESS
---------------------------------------------------------

    Key: SPARK-6449
    URL: https://issues.apache.org/jira/browse/SPARK-6449
    Project: Spark
    Issue Type: Bug
    Components: YARN
    Affects Versions: 1.3.0
    Reporter: Ryan Williams
    Fix For: 1.3.0