[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Ryan Williams (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377251#comment-14377251 ]

Ryan Williams commented on SPARK-6449:
--

Seems like this was fixed as of 
[SPARK-6018|https://issues.apache.org/jira/browse/SPARK-6018]; closing.

> Driver OOM results in reported application result SUCCESS
> ---------------------------------------------------------
>
> Key: SPARK-6449
> URL: https://issues.apache.org/jira/browse/SPARK-6449
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.3.0
>Reporter: Ryan Williams
>
> I ran a job yesterday that, according to the History Server and YARN RM, 
> finished with status {{SUCCESS}}.
> Clicking around the History Server UI, I saw that too few stages had run, and 
> I couldn't figure out why.
> Finally, inspecting the end of the driver's logs, I saw:
> {code}
> 15/03/20 15:08:13 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Shutting down remote daemon.
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Remote daemon shut down; proceeding with flushing remote transports.
> 15/03/20 15:08:13 INFO spark.SparkContext: Successfully stopped SparkContext
> Exception in thread "Driver" scala.MatchError: java.lang.OutOfMemoryError: GC 
> overhead limit exceeded (of class java.lang.OutOfMemoryError)
> at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:485)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Final app status: SUCCEEDED, 
> exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Unregistering 
> ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before 
> final status was reported.)
> 15/03/20 15:08:13 INFO remote.RemoteActorRefProvider$RemotingTerminator: 
> Remoting shut down.
> 15/03/20 15:08:13 INFO impl.AMRMClientImpl: Waiting for application to be 
> successfully unregistered.
> 15/03/20 15:08:13 INFO yarn.ApplicationMaster: Deleting staging directory 
> .sparkStaging/application_1426705269584_0055
> {code}
> The driver OOM'd, [the {{catch}} block that presumably should have caught 
> it|https://github.com/apache/spark/blob/b6090f902e6ec24923b4dde4aabc9076956521c1/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L484]
>  threw a {{MatchError}}, and then {{SUCCESS}} was returned to YARN and 
> written to the event log.
> This should be logged as a failed job and reported as such to YARN.
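>
> To make the failure mode concrete, here is a simplified, self-contained 
> sketch (hypothetical object name; the exact cases are illustrative, not the 
> literal Spark source). Invoking the user's {{main}} via reflection wraps 
> whatever it throws, including {{Error}} subclasses, in an 
> {{InvocationTargetException}}, and a non-exhaustive {{match}} on the cause 
> lets an {{OutOfMemoryError}} escape as a {{scala.MatchError}}:
> {code}
> import java.lang.reflect.InvocationTargetException
>
> object MatchErrorDemo {
>   def main(args: Array[String]): Unit = {
>     try {
>       // Reflection wraps anything the user's main() throws, including
>       // java.lang.Error subclasses, in an InvocationTargetException.
>       throw new InvocationTargetException(
>         new OutOfMemoryError("GC overhead limit exceeded"))
>     } catch {
>       case e: InvocationTargetException =>
>         e.getCause match {
>           case _: InterruptedException =>
>             () // reporter thread may interrupt the user class; not a failure
>           case ex: Exception =>
>             println(s"would report FAILED to YARN: $ex")
>           // OutOfMemoryError extends java.lang.Error, not Exception, so
>           // neither case matches: the match itself throws scala.MatchError,
>           // and the shutdown hook later defaults to reporting SUCCEEDED.
>         }
>     }
>   }
> }
> {code}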






[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Ryan Williams (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14377249#comment-14377249 ]

Ryan Williams commented on SPARK-6449:
--

It doesn't look like it; [here is a 
gist|https://gist.github.com/ryan-williams/ff74066c127546910cac] with the 
entire file (8M) and the last 1000 lines, FWIW.






[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-23 Thread Thomas Graves (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375905#comment-14375905 ]

Thomas Graves commented on SPARK-6449:
--

[~rdub] Was there an exception higher up in the log? I'm wondering if it shows 
the full stack trace for the out-of-memory error.




[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-22 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375353#comment-14375353 ]

Apache Spark commented on SPARK-6449:
-

User 'ryan-williams' has created a pull request for this issue:
https://github.com/apache/spark/pull/5130




[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-22 Thread Ryan Williams (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375244#comment-14375244 ]

Ryan Williams commented on SPARK-6449:
--

Hey [~rxin], yeah, I have some code that I suspect fixes it, but I don't have a 
complete picture of the expectations around the various kinds of exceptions 
here, so I wanted to test it first and haven't gotten a chance to today. I just 
opened PR #5130, though, so feel free to take a look.
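
Roughly, the idea is to make the match on the exception's cause exhaustive, so 
anything unexpected, {{Error}}s included, is reported as {{FAILED}} instead of 
escaping as a {{MatchError}}. Here is a hypothetical, self-contained sketch of 
that shape (stand-in names, and not necessarily what the PR ends up doing):
{code}
import java.lang.reflect.InvocationTargetException

object ExhaustiveHandlerSketch {
  // Hypothetical stand-in for the ApplicationMaster's status plumbing.
  def reportFinalStatus(status: String, cause: Throwable): Unit =
    println(s"final status: $status ($cause)")

  def main(args: Array[String]): Unit = {
    try {
      // Simulate the user's main() dying with an OutOfMemoryError,
      // wrapped by reflection as in the real driver thread.
      throw new InvocationTargetException(
        new OutOfMemoryError("GC overhead limit exceeded"))
    } catch {
      case e: InvocationTargetException =>
        e.getCause match {
          case _: InterruptedException =>
            () // reporter thread interrupting the user class is expected
          case cause: Throwable =>
            // Exhaustive for non-null causes: covers Exception and Error
            // alike, so YARN sees FAILED rather than a spurious SUCCEEDED.
            reportFinalStatus("FAILED", cause)
        }
    }
  }
}
{code}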




[jira] [Commented] (SPARK-6449) Driver OOM results in reported application result SUCCESS

2015-03-22 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6449?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14374827#comment-14374827 ]

Reynold Xin commented on SPARK-6449:


Ryan - do you want to submit a pull request for this? Seems easy to fix.

