[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377251#comment-14377251 ]

Ryan Williams commented on SPARK-6449:
--------------------------------------

Seems like this was fixed as of [SPARK-6018|https://issues.apache.org/jira/browse/SPARK-6018]; closing.

> Driver OOM results in reported application result SUCCESS
> ---------------------------------------------------------
>
>                 Key: SPARK-6449
>                 URL: https://issues.apache.org/jira/browse/SPARK-6449
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.0
>            Reporter: Ryan Williams
>
> I ran a job yesterday that, according to the History Server and the YARN RM, finished with status {{SUCCESS}}.
> Clicking around the History Server UI, too few stages had run, and I couldn't figure out why that would have been.
> Finally, inspecting the end of the driver's logs, I saw:
> {code}
> 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
> 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
> Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC overhead limit exceeded (of class java.lang.OutOfMemoryError)
>         at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.
> 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory .sparkStaging/application_1426705269584_0055
> {code}
> The driver OOM'd, [the {{catch}} block that presumably should have caught it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484] threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and written to the event log.
> This should be logged as a failed job and reported as such to YARN.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
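For readers wondering how a {{MatchError}} could mask an {{OutOfMemoryError}}: in Scala, a {{match}} expression with no wildcard case throws {{scala.MatchError}} when handed a value none of its patterns cover. A minimal, self-contained sketch of that failure mode (hypothetical code for illustration, not the actual ApplicationMaster source):

```scala
import java.lang.reflect.InvocationTargetException

object MatchErrorSketch {
  def main(args: Array[String]): Unit = {
    // Simulate the user class's "Driver" thread dying with an OOM,
    // surfaced through reflection as an InvocationTargetException.
    val e = new InvocationTargetException(
      new OutOfMemoryError("GC overhead limit exceeded"))

    try {
      // A non-exhaustive match: it only anticipates one kind of
      // cause, so any other Throwable falls through...
      e.getCause match {
        case _: InterruptedException =>
          println("interrupted; treating as clean shutdown")
      }
    } catch {
      // ...and Scala surfaces the unmatched value as a MatchError,
      // which here hides the original OutOfMemoryError entirely.
      case m: scala.MatchError =>
        println(s"MatchError masked the real failure: ${m.getMessage}")
    }
  }
}
```

Once the {{MatchError}} escapes, the failure path never records a final status, so the shutdown hook falls back to reporting {{SUCCEEDED}}, which matches the log excerpt above.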
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377249#comment-14377249 ]

Ryan Williams commented on SPARK-6449:
--------------------------------------

It doesn't look like it; [here is a gist|https://gist.github.com/ryan-williams/ff74066c127546910cac] with the entire file (8M) and the last 1000 lines, fwiw.
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375905#comment-14375905 ]

Thomas Graves commented on SPARK-6449:
--------------------------------------

[~rdub] Was there an exception in the log higher up? Wondering if it shows the entire exception for the out-of-memory error.
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375353#comment-14375353 ]

Apache Spark commented on SPARK-6449:
-------------------------------------

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/5130
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375244#comment-14375244 ]

Ryan Williams commented on SPARK-6449:
--------------------------------------

Hey [~rxin], yeah, I have some code that I suspect fixes it, but I don't have a complete picture of the expectations around the various kinds of exceptions here, so I wanted to test it and haven't gotten a chance to today. I just opened PR #5130, though, so feel free to take a look.
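The shape of the fix being discussed is to catch {{Throwable}} (not just a narrow set of patterns) and record a failed final status before the shutdown hook reports to YARN. A hedged sketch of that idea, with hypothetical names standing in for the ApplicationMaster's actual bookkeeping:

```scala
object FinalStatusSketch {
  // Hypothetical stand-in for the AM's final-status field.
  @volatile private var finalStatus: String = "UNDEFINED"

  def runDriver(userMain: () => Unit): Unit = {
    try {
      userMain()
      finalStatus = "SUCCEEDED"
    } catch {
      // Catch Throwable, not just Exception, so errors such as
      // OutOfMemoryError also mark the application as FAILED
      // instead of falling through to a default SUCCEEDED.
      case t: Throwable =>
        finalStatus = "FAILED"
        println(s"User class threw: $t")
      // Production code should still rethrow or halt on fatal
      // errors after recording the status; this sketch only
      // demonstrates the status bookkeeping.
    }
  }

  def main(args: Array[String]): Unit = {
    runDriver(() =>
      throw new OutOfMemoryError("GC overhead limit exceeded"))
    // The shutdown hook would now unregister with FAILED.
    println(s"Final app status: $finalStatus")
  }
}
```

With this shape, the shutdown hook reads a status that was set on the failure path, so YARN and the event log see {{FAILED}} rather than the spurious {{SUCCEEDED}} from the original report.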
[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS
[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374827#comment-14374827 ]

Reynold Xin commented on SPARK-6449:
------------------------------------

Ryan - do you want to submit a pull request for this? Seems easy to fix.