[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15034124#comment-15034124 ]
Imran Rashid commented on SPARK-11801:
--------------------------------------

[~mrid...@yahoo-inc.com] thanks for the input, I think we agree on most points. I was just being a little sloppy on your point (1); agreed that the weird errors come from the shutdown hook, which comes from the kill, which comes from the OOM. For (2), I agree that we can't guarantee anything, we can only make a best effort. I was somewhat surprised by this as well, but the fact is that often those messages do get back to the driver, which just leads to customer confusion. I would say our order of preference for what the user sees is:

1) clear messages on the driver that the executor had an OOM (as well as on the executor)
2) generic "executor lost" task failure messages on the driver, hopefully with more complete logs on the executor
3) the driver sees "tasks lost" messages with spurious errors, and the executor has the OOM in its logs but also spurious messages

Right now, we mostly get #3. I think we can change the behavior so that we try for #1, and fall back to #2 if that isn't possible. In other words, *if* we do get to send a message, let's make sure it's the right message.

Regarding:

bq. You will see exactly similar behavior when SIGTERM is raised when executor runs beyond memory limits in YARN. So would be good to decouple this from OOM issue.

I agree that would be better, but is there a general way to handle this? If the executor is lost without sending any messages back to the driver, I think the handling is already as good as we can hope for -- the tasks are failed with [{{ExecutorLostFailure}} | https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L806]. This is about trying to do the best we can for one special (and very important) case: OOM. https://issues.apache.org/jira/browse/SPARK-11799 added a small amount of cleanup to the executor logs in general, but it isn't that great.

My one thought is that maybe you could check whether the executor was shutting down in the [task error handling in Executor | https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L307], e.g. if the executor was already in shutdown, you send some generic "ExecutorLost" failure message (a rough sketch of this is at the end of this message). But I worry that you might then avoid ever sending the original root cause, i.e., you'd force solution #2 when #1 was still possible.

(Incidentally, getting better messages when killed by YARN has been improved by https://issues.apache.org/jira/browse/SPARK-9790.)

> Notify driver when OOM is thrown before executor JVM is killed
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> The customer got an OOM exception in one of the tasks and the executor got killed with kill %p. It is unclear from the driver logs/Spark UI why the task or the executor was lost; the customer has to look into the executor logs to see that OOM is the cause.
> It would be helpful if the driver logs/Spark UI showed the reason for task failures, by making sure that the task updates the driver with the OOM.
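To make the order of preference in the comment above concrete, here is a minimal, standalone sketch of the idea. It is not Spark source: {{reportToDriver}}, {{TaskFailureReason}}, and the {{shuttingDown}} flag are illustrative stand-ins for {{ExecutorBackend.statusUpdate}}, Spark's task end reasons, and whatever shutdown tracking the executor would actually use. It only shows the decision: report the OOM itself if we can (#1), and fall back to a generic executor-lost reason when the JVM is already shutting down (#2).

{code:scala}
// Standalone sketch only -- the names below are illustrative, not Spark APIs.
object OomReportingSketch {

  sealed trait TaskFailureReason
  case class ExceptionFailure(t: Throwable) extends TaskFailureReason
  case object ExecutorLostFailure extends TaskFailureReason

  // Stand-in for sending a status update back to the driver
  // (ExecutorBackend.statusUpdate in the real code).
  def reportToDriver(taskId: Long, reason: TaskFailureReason): Unit =
    println(s"task $taskId failed: $reason")

  // Stand-in for "is the JVM already going down?" (e.g. because the
  // OOM-triggered kill has already fired and shutdown hooks are running).
  @volatile private var shuttingDown = false
  sys.addShutdownHook { shuttingDown = true }

  def handleTaskFailure(taskId: Long, t: Throwable): Unit = t match {
    case oom: OutOfMemoryError =>
      // Preference #1: if we can still get a message out, make sure it carries
      // the real root cause rather than a side effect of the dying JVM.
      reportToDriver(taskId, ExceptionFailure(oom))
    case other if shuttingDown =>
      // Preference #2: the JVM is already shutting down, so any other exception
      // seen here is likely spurious; send a generic "executor lost" reason.
      reportToDriver(taskId, ExecutorLostFailure)
    case other =>
      // Normal task failure: report the actual exception.
      reportToDriver(taskId, ExceptionFailure(other))
  }

  def main(args: Array[String]): Unit = {
    handleTaskFailure(0L, new OutOfMemoryError("Java heap space"))
  }
}
{code}

Checking the OOM case before the shutdown check is one way to avoid the concern above about forcing solution #2 when #1 was still possible.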