[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032699#comment-15032699 ]

Imran Rashid commented on SPARK-11801:
--------------------------------------

Hi [~vsr],

Thanks for reporting and working on this.  This is a pretty tricky issue, so I'd 
like to have a thorough discussion of it.  First, I think it would help to 
clarify exactly what we're trying to do here:

1) After an OOM, there are no "spurious" failure msgs (eg., about directories 
not existing).  All error messages should clearly indicate it was from an OOM.
2) It's clear that there was an OOM from looking at any of (a) the UI, (b) the 
driver logs, or (c) the executor logs.  In fact, I'd much rather have clear error 
messages on the driver -- cleaning up the executor is a bonus, since the user 
is going to look at the driver first.
3) All tasks that fail because of the OOM clearly indicate that they hit an 
OOM.  That is, if you have 16 tasks running concurrently and only one of 
them gets the OOM, really they all fail from the OOM.  I don't think it helps 
much to distinguish the original task that got the OOM from the other tasks 
that get killed later, but I don't have a really strong opinion.  (A rough 
sketch of the kind of task-level reporting I mean follows this list.)
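
To make (3) concrete, here is the rough shape of the task-level reporting I have 
in mind.  This is a hypothetical sketch, not our actual {{TaskRunner}} code, and 
{{reportFailure}} is a made-up callback standing in for whatever status-update 
path we end up using:

{code:scala}
// Hypothetical sketch (not Spark's actual TaskRunner): run the task body and,
// if an OutOfMemoryError surfaces directly or as a cause, report a failure
// reason that names the OOM before the 'kill %p' hook tears the JVM down.
def runTask(taskBody: () => Unit, reportFailure: String => Unit): Unit = {
  try {
    taskBody()
  } catch {
    case oom: OutOfMemoryError =>
      // Report first, so the driver sees "OOM" rather than a generic lost task.
      reportFailure(s"Task failed due to OutOfMemoryError: ${oom.getMessage}")
      throw oom
    case t: Throwable if t.getCause != null && t.getCause.isInstanceOf[OutOfMemoryError] =>
      reportFailure(s"Task failed, caused by OutOfMemoryError: ${t.getCause.getMessage}")
      throw t
  }
}
{code}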

One thing which seems unusual here is that when you run under yarn, you 
automatically get {{-XX:OnOutOfMemoryError='kill %p'}} added to the executor 
args, but it is *not* added with any of the other cluster managers.  So the 
first question is to understand whether that is really intentional -- why isn't 
it included for all cluster managers?  Do real users of the standalone cluster 
manager just always add those args themselves?  And if that distinction 
really is intentional, then we need to make sure this approach works with any 
cluster manager.  (eg., it should be tested on at least yarn and standalone.)
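
As a point of reference, a user on standalone would have to opt in to the same 
behavior themselves today, along these lines (just a sketch -- I haven't checked 
that the quoting of {{%p}} survives every launcher path):

{code:scala}
import org.apache.spark.SparkConf

// Opting in, by hand, to the kill-on-OOM behavior that yarn adds automatically.
// The quoting of the %p argument may need adjusting depending on how the
// options string gets tokenized on its way to the executor JVM.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:OnOutOfMemoryError='kill %p'")
{code}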

It seems that {{-XX:OnOutOfMemoryError='kill %p'}} was introduced for yarn in 
the initial version -- I'm not sure it was a conscious decision to have it 
differ between the cluster managers.
https://github.com/apache/spark/commit/d90d2af1036e909f81cf77c85bfe589993c4f9f3

It's worth noting that in all environments, there is already *some* handling for 
the OOM, via the uncaught exception handler, which is [added to the main 
executor threads|https://github.com/apache/spark/blob/de64b65f7cf2ac58c1abc310ba547637fdbb8557/core/src/main/scala/org/apache/spark/executor/Executor.scala#L76] 
and even [invoked sometimes when the exception is caught|https://github.com/apache/spark/blob/de64b65f7cf2ac58c1abc310ba547637fdbb8557/core/src/main/scala/org/apache/spark/executor/Executor.scala#L317]. 
However, I assume that relying on {{-XX:OnOutOfMemoryError='kill %p'}} is 
still a better idea, since the OOM could occur in some thread where we haven't 
installed the uncaught exception handler, and it also seems safer to rely on 
the jvm to do this itself.  But the downside is that right now, when running 
under yarn, the {{kill %p}} sometimes triggers the shutdown hooks before even 
the first OOM failure gets sent back to the driver.
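
For context, the handler mechanism looks roughly like this (a simplified sketch, 
not the actual handler code in Spark); the key limitation is that it only fires 
for exceptions that escape the run() of a thread it is actually installed on:

{code:scala}
// Simplified sketch of the uncaught-exception-handler mechanism: log the OOM
// explicitly, then halt, so threads covered by the handler at least produce a
// clear message before the process exits.
val oomHandler = new Thread.UncaughtExceptionHandler {
  override def uncaughtException(thread: Thread, throwable: Throwable): Unit = {
    throwable match {
      case oom: OutOfMemoryError =>
        System.err.println(s"Uncaught OutOfMemoryError in ${thread.getName}: ${oom.getMessage}")
        // halt() deliberately skips shutdown hooks, so it can't race them the
        // way 'kill %p' appears to under yarn.
        Runtime.getRuntime.halt(1)
      case other =>
        System.err.println(s"Uncaught exception in ${thread.getName}: $other")
    }
  }
}

// An OOM in a thread the handler doesn't cover, or one caught and swallowed on
// the way up, is exactly why relying on 'kill %p' at the JVM level still seems safer.
val workerThread = new Thread(new Runnable {
  override def run(): Unit = { /* task work would go here */ }
})
workerThread.setUncaughtExceptionHandler(oomHandler)
workerThread.start()
{code}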

Trying to ping some folks who might have an idea on why the cluster managers 
differ (and to confirm my reading of the code): [~andrewor14] [~tgraves] 
[~mrid...@yahoo-inc.com] [~vanzin]

Keeping all of that in mind, in general I'm in favor of this approach.  I think 
it's impossible to guarantee that we get perfect messages in all cases, but 
we can make a best effort to improve the error handling in most cases.

> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> Customer got an OOM exception in one of the tasks and the executor got killed 
> with kill %p. It is unclear from the driver logs/Spark UI why the task or 
> executor was lost. The customer has to look into the executor logs to see that 
> an OOM is the cause of the task/executor loss. 
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure that the task updates the driver with the OOM. 


