[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034798#comment-15034798 ]

Imran Rashid commented on SPARK-11801:
--------------------------------------

This surprised me too, but [~vsr] reported (offline) that the exception was 
handled *before* the shutdown handlers kicked in.  Of course that doesn't 
contradict the {{OnOutOfMemoryError}} handler getting executed immediately; I 
guess it's just a race that may even vary among JVMs.  But like I said earlier, 
I'm OK with generic executor-lost messages; the problem is that there are 
plenty of times when the other, more confusing error messages *do* get back to 
the driver.
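
For reference, the handler we're racing against is just a JVM option on the 
executor launch command. A minimal sketch of the wiring, simplified from the 
YARN code linked below (the duplicate-flag check is my own paraphrase, not the 
exact code):

{code:scala}
// Ask the JVM to run a shell command the moment an OutOfMemoryError is
// raised, killing the executor process outright; whether this or the
// uncaught exception handler wins is the race discussed above.
val javaOpts = scala.collection.mutable.ListBuffer[String]()
if (!javaOpts.exists(_.contains("-XX:OnOutOfMemoryError"))) {
  javaOpts += "-XX:OnOutOfMemoryError='kill %p'" // %p expands to the JVM pid
}
{code}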

So Marcelo, you're basically proposing that we unify the behavior of YARN and 
the other cluster managers in the other direction, by eliminating 
[{{OnOutOfMemoryError}} here|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L311]?
  The uncaught exception handler 
[already exits with a special code on OOM|https://github.com/apache/spark/blob/de64b65f7cf2ac58c1abc310ba547637fdbb8557/core/src/main/scala/org/apache/spark/util/SparkUncaughtExceptionHandler.scala#L42],
 which works for the other cluster managers.
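
For anyone following along, a minimal sketch of the pattern that handler 
follows (the exit-code values here are my assumption for illustration, not 
necessarily the ones Spark uses):

{code:scala}
// Sketch of the alternative approach: a process-wide uncaught exception
// handler that maps OOM to a distinctive exit code the cluster manager
// can surface back to the driver.
object OomAwareHandler extends Thread.UncaughtExceptionHandler {
  val UNCAUGHT_EXCEPTION = 50 // assumed value, for illustration
  val OOM = 52                // assumed value, for illustration

  override def uncaughtException(thread: Thread, exception: Throwable): Unit = {
    try {
      System.err.println(s"Uncaught exception in thread $thread: $exception")
      exception match {
        case _: OutOfMemoryError => System.exit(OOM)
        case _                   => System.exit(UNCAUGHT_EXCEPTION)
      }
    } catch {
      // If handling itself OOMs, halt without running shutdown hooks.
      case _: OutOfMemoryError => Runtime.getRuntime.halt(OOM)
    }
  }
}

// Installed once, this covers any thread that lets the error propagate:
// Thread.setDefaultUncaughtExceptionHandler(OomAwareHandler)
{code}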

The argument I see in favor of {{OnOutOfMemoryError}} is that it seems more 
reliable.  An OOM really can be triggered by *any* thread in the executor, 
not just the main task-running threads.  We'd have to make sure that every 
thread (even those created by 3rd-party libs) properly handled OOMs.  We could 
carefully audit all Spark threads, but I don't think this is possible in 
general.
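
To make that concrete, here's a hypothetical demo: a handler installed 
process-wide only sees exceptions that escape a thread's {{run()}} method, so 
a library thread that catches {{Throwable}} silently defeats it; that's 
exactly what we can't audit for in general.

{code:scala}
// Hypothetical demo (names are made up): the default handler never
// fires, because the "library" thread swallows the error before it can
// propagate out of run(). Nothing distinctive reaches the driver.
object SwallowedOomDemo {
  def main(args: Array[String]): Unit = {
    Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
      override def uncaughtException(t: Thread, e: Throwable): Unit = {
        System.err.println(s"handler saw $e in ${t.getName}") // never printed
        System.exit(52)
      }
    })
    val worker = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          throw new OutOfMemoryError("simulated")
        } catch {
          case _: Throwable => // third-party code quietly swallows it
        }
      }
    }, "third-party-worker")
    worker.start()
    worker.join() // JVM exits normally; the OOM is invisible to the driver
  }
}
{code}

A real heap exhaustion would still trigger the JVM-level 
{{OnOutOfMemoryError}} hook in a case like this, since that fires when the 
error is raised rather than when it escapes a thread.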

> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> The customer got an OOM exception in one of the tasks, and the executor was 
> killed with {{kill %p}}. It is unclear from the driver logs/Spark UI why the 
> task or the executor was lost; the customer has to look into the executor 
> logs to see that OOM was the cause.
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures, by making sure the task updates the driver with the OOM.


