[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034124#comment-15034124 ]

Imran Rashid commented on SPARK-11801:
--------------------------------------

[~mrid...@yahoo-inc.com] thanks for the input, I think we agree on most points. 
I was just being a little sloppy on your point (1); I agree that the weird 
errors come from the shutdown hook, which comes from the kill, which comes from 
the OOM.

For (2), I agree that we can't guarantee anything; we can only make a best 
effort.  I was somewhat surprised by this as well, but the fact is that those 
messages often do get back to the driver, which just leads to customer 
confusion.  I would say that our order of preference for what the user sees is:

1) clear messages on the driver that the executor had an OOM (as well as on the 
executor)
2) generic "executor lost" task failure messages on the driver, hopefully more 
complete logs on the executor
3) driver sees "tasks lost" msgs with spurious errors, and executor has OOM in 
logs but also spurious msgs

Right now, we mostly get #3.  I think we can change the behavior so that we try 
to do #1 and, failing that, fall back to #2.  In other words, *if* we do get to 
send a message, let's make sure it's the right message.
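
To make that concrete, here is a minimal, self-contained sketch of the "send 
the right message first" idea.  The {{TaskFailureReason}}, {{DriverEndpoint}}, 
and {{runTask}} names are hypothetical stand-ins for illustration, not Spark's 
actual Executor internals:

{code:scala}
// Hypothetical sketch: on a fatal OOM, push a specific failure reason to the
// driver *before* the normal fatal-error handling kills the executor JVM.
object OomReportingSketch {
  sealed trait TaskFailureReason
  case class ExceptionFailureReason(taskId: Long, cause: String) extends TaskFailureReason
  case class ExecutorLostReason(taskId: Long) extends TaskFailureReason

  // Stand-in for whatever channel the executor uses to report task status.
  trait DriverEndpoint {
    def statusUpdate(reason: TaskFailureReason): Unit
  }

  def runTask(taskId: Long, driver: DriverEndpoint)(body: => Unit): Unit = {
    try {
      body
    } catch {
      case oom: OutOfMemoryError =>
        // Best effort: report the real root cause first, swallowing any error
        // in the report itself so we never mask the OOM.
        try driver.statusUpdate(ExceptionFailureReason(taskId, oom.toString))
        catch { case _: Throwable => }
        // Then let the usual fatal-error path (uncaught handler / kill %p) run.
        throw oom
    }
  }
}
{code}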

Regarding:

bq. You will see exactly similar behavior when SIGTERM is raised when executor 
runs beyond memory limits in YARN.  So would be good to decouple this from OOM 
issue.
 
I agree that would be better, but is there a general way to handle this?  If 
the executor is lost without sending any messages back to the driver, I think 
the handling is already as good as we can hope for -- the tasks are failed with 
[{{ExecutorLostFailure}} | 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L806].
  This issue is just about doing the best we can for one special (and very 
important) case: OOM.
https://issues.apache.org/jira/browse/SPARK-11799 added a small amount of 
cleanup to the executor logs in general, but it isn't that great.  My one 
thought is that maybe you could check whether the executor was shutting down in 
[task error handling in Executor | 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L307
 ], e.g. if the executor was already in shutdown, you'd send some generic 
"ExecutorLost" failure message.  But I worry that you might then never send the 
original root cause in that case (i.e., you'd force solution #2 when #1 was 
possible.)
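
A rough sketch of that concern, reusing the hypothetical stand-in types from 
the sketch above (again, not Spark's actual Executor code); the {{inShutdown}} 
flag stands in for however the executor would detect that a shutdown hook is 
already running:

{code:scala}
// Hypothetical sketch of "check for shutdown in the task error handler".
object ShutdownCheckSketch {
  import OomReportingSketch._

  def reportTaskFailure(
      taskId: Long,
      error: Throwable,
      inShutdown: () => Boolean,
      driver: DriverEndpoint): Unit = {
    if (inShutdown()) {
      // The JVM is already going down (e.g. YARN sent SIGTERM), so any exception
      // caught here is likely spurious; report a generic "executor lost" reason.
      // This is exactly the worry above: if the shutdown was itself triggered by
      // an OOM, this path hides the root cause and forces #2 even when #1 was
      // possible.
      driver.statusUpdate(ExecutorLostReason(taskId))
    } else {
      driver.statusUpdate(ExceptionFailureReason(taskId, error.toString))
    }
  }
}
{code}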

(Incidentally, the messages you get when an executor is killed by YARN were 
already improved by https://issues.apache.org/jira/browse/SPARK-9790.)

> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> A customer got an OOM exception in one of the tasks and the executor got 
> killed with kill %p. It is unclear from the driver logs/Spark UI why the task 
> or executor was lost. The customer has to look into the executor logs to see 
> that OOM is the cause of the task/executor loss. 
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure that the task updates the driver with the OOM. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
