[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036302#comment-15036302 ]

Imran Rashid commented on SPARK-11801:
--------------------------------------

To summarize, it seems we agree that:

1) we want to keep {{OnOutOfMemoryError}} in yarn mode (the flag itself is 
sketched just after this summary)
2) we can't guarantee that anything makes it back to the driver at all on an OOM

I *think* there is agreement that:

3) the current situation can lead to misleading messages, which can be improved

and we do not fully agree on:

4) should other cluster managers use {{OnOutOfMemoryError}}?  But this is 
independent; I've opened SPARK-12099 to deal with it, so I think we can 
set it aside here
5) *how* exactly should we improve the messages?
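For anyone skimming, the flag in (1) is just a JVM option that yarn mode puts on 
the executor command line.  A minimal sketch of the relevant bit (the names and 
layout here are simplified from what {{ExecutorRunnable}} assembles, so don't 
treat this as the exact command):

{code}
// Sketch of the executor launch command in yarn mode (simplified; the real
// assembly lives in spark-yarn's ExecutorRunnable). The OnOutOfMemoryError
// option makes the JVM run `kill %p` (where %p expands to the JVM's own pid)
// on the first OOM, so the executor dies promptly instead of limping along
// with an exhausted heap.
val executorMemoryMb = 4096 // illustrative value
val commands = Seq(
  "java",
  "-server",
  s"-Xmx${executorMemoryMb}m",
  "-XX:OnOutOfMemoryError='kill %p'",
  "org.apache.spark.executor.CoarseGrainedExecutorBackend"
  // ... plus the usual driver-url / executor-id / cores arguments ...
)
{code}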

So the main thing to discuss is #5.  I think there are a few options:

(a) given that we can't guarantee anything useful gets back to the driver, let's 
try to keep things consistent and always have the driver just receive a generic 
"executor lost" type of message.  We could still try to improve the logs on the 
executor somewhat with the shutdown reprioritization (sketched below)
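To make that "shutdown reprioritization" concrete, something along these lines 
against the (private[spark]) {{ShutdownHookManager}}; the priority offset and 
the log text are just my assumptions, not a worked-out patch:

{code}
import org.apache.spark.util.ShutdownHookManager

// Sketch: register a shutdown hook that runs *before* the temp-dir cleanup
// hooks, so that when kill %p (SIGTERM) tears the executor down, the probable
// cause of death is logged near the end of the executor log instead of being
// buried under pages of disk-cleanup chatter. Higher priority values run
// earlier; the "+ 10" offset is an assumption about a suitable slot.
val oomLoggingPriority = ShutdownHookManager.DEFAULT_SHUTDOWN_PRIORITY + 10

ShutdownHookManager.addShutdownHook(oomLoggingPriority) { () =>
  System.err.println(
    "Executor JVM shutting down; if an OutOfMemoryError was just thrown, " +
      "it is the likely root cause of any 'executor lost' the driver reports.")
}
{code}

One caveat: {{kill %p}} sends SIGTERM, so shutdown hooks do get a chance to run; 
a hard SIGKILL would skip them entirely.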

(b) make a best effort at getting a better error message to the driver, and 
improve the logs on the executor.  I think this would include all of:
(i) handle OOM specially in {{TaskRunner}}, sending a message back to the driver 
immediately (sketched after this list)
(ii) kill the running tasks, in a shutdown handler with higher priority than the 
disk cleanup
(iii) have {{YarnAllocator}} deal with {{SparkExitCode.OOM}} (sketched at the 
end of this comment)
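For (i), roughly this inside {{TaskRunner.run()}}.  A sketch only: the field 
names ({{env}}, {{execBackend}}, {{taskId}}) and the private[spark] 
{{ExceptionFailure(Throwable, Option[TaskMetrics])}} constructor are my reading 
of the 1.5 code, and on a truly exhausted heap even this may fail to serialize 
or send anything, which is why point (2) above still stands:

{code}
import org.apache.spark.{ExceptionFailure, TaskState}

// Sketch of an extra catch clause in Executor.TaskRunner.run():
try {
  // ... run the task as today ...
} catch {
  case oom: OutOfMemoryError =>
    // Best effort: tell the driver *why* before the JVM is killed.
    val reason = new ExceptionFailure(oom, None)
    val ser = env.closureSerializer.newInstance()
    execBackend.statusUpdate(taskId, TaskState.FAILED, ser.serialize(reason))
    // Re-throw so -XX:OnOutOfMemoryError=kill %p still fires and the
    // executor dies quickly, as intended.
    throw oom
}
{code}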

There are probably more details to discuss on (b), and I am not sure that 
proposal is 100% correct, but maybe to start, can we discuss (a) vs (b)?  
[~mrid...@yahoo-inc.com] you voiced a preference for (a) -- but do you feel 
very strongly about that?  I have a strong preference for (b), since I think we 
can do decently in most cases, and it would really help a lot of users out.
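And for (b)(iii), roughly what I have in mind in {{YarnAllocator}}'s 
completed-container handling.  The method shape here is simplified to a 
standalone function, and the message text is just a suggestion; only 
{{SparkExitCode.OOM}} itself is the existing (private[spark]) constant:

{code}
import org.apache.spark.util.SparkExitCode

// Sketch: map a completed container's exit status to a human-readable loss
// reason, instead of the generic "executor lost" the driver shows today.
def containerLossReason(containerId: String, exitStatus: Int): String =
  exitStatus match {
    case SparkExitCode.OOM =>
      s"Container $containerId exited with an OutOfMemoryError; consider " +
        "raising spark.executor.memory or the yarn memory overhead"
    case other =>
      s"Container $containerId exited with exit status $other"
  }
{code}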

> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> A customer got an OOM exception in one of the tasks, and the executor was 
> killed with kill %p. It is unclear from the driver logs/Spark UI why the task 
> or the executor was lost. The customer has to look at the executor logs to 
> see that an OOM caused the task/executor loss. 
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure that the task updates the driver on OOM. 


