[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15036302#comment-15036302 ]
Imran Rashid commented on SPARK-11801:
--------------------------------------

To summarize, it seems we agree that:
1) we want to keep {{OnOutOfMemoryError}} in yarn mode
2) we can't guarantee anything making it back to the driver at all on OOM

I *think* there is agreement that:
3) the current situation can lead to misleading msgs, which can be improved

and we are not totally agreed on:
4) should other cluster managers use {{OnOutOfMemoryError}}? But this is independent; I've opened SPARK-12099 to deal with that, and I think we can ignore it here
5) *how* exactly should we improve the msgs?

So the main thing to discuss is #5. I think there are a few options:

(a) Given that we can't guarantee anything useful gets back to the driver, let's try to keep things consistent and always have the driver just receive a generic "executor lost" type of message. We could still try to improve the logs on the executor somewhat with the shutdown reprioritization.

(b) Make a best effort at getting a better error msg to the driver, and improve the logs on the executor. I think this would include all of:
(i) handle OOM specially in {{TaskRunner}}, sending a msg back to the driver immediately
(ii) kill the running tasks, in a handler that is higher priority than the disk cleanup
(iii) have {{YarnAllocator}} deal with {{SparkExitCode.OOM}}

There are probably more details to discuss on (b), and I am not sure that proposal is 100% correct, but maybe to start, can we discuss (a) vs (b)? [~mrid...@yahoo-inc.com] you voiced a preference for (a) -- but do you feel very strongly about that? I have a strong preference for (b), since I think we can do decently in most cases, and it would really help a lot of users out.
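To make step (b)(i) concrete, here is a minimal sketch of a task runner that catches {{OutOfMemoryError}}, makes a best-effort report to the driver, and then rethrows so the JVM-level {{OnOutOfMemoryError}} handler still fires. This is not Spark's actual {{TaskRunner}}; the {{DriverEndpoint}} interface and class names are invented for illustration only.

```java
import java.util.concurrent.atomic.AtomicReference;

public class OomReportingTaskRunner {
    // Invented stand-in for the driver RPC endpoint (not a Spark API).
    interface DriverEndpoint {
        void reportTaskFailure(long taskId, String reason);
    }

    private final DriverEndpoint driver;

    OomReportingTaskRunner(DriverEndpoint driver) {
        this.driver = driver;
    }

    // Runs the task body; on OOM, makes a best-effort report to the
    // driver, then rethrows so the process-level handler can kill the JVM.
    void run(long taskId, Runnable taskBody) {
        try {
            taskBody.run();
        } catch (OutOfMemoryError oom) {
            try {
                driver.reportTaskFailure(taskId, "OutOfMemoryError: " + oom.getMessage());
            } catch (Throwable t) {
                // Best effort only: the heap may be too exhausted to send anything.
            }
            throw oom;
        }
    }

    public static void main(String[] args) {
        AtomicReference<String> lastReport = new AtomicReference<>();
        OomReportingTaskRunner runner = new OomReportingTaskRunner(
            (taskId, reason) -> lastReport.set(taskId + ": " + reason));
        try {
            // Simulate an OOM rather than exhausting the real heap.
            runner.run(7L, () -> { throw new OutOfMemoryError("simulated"); });
        } catch (OutOfMemoryError expected) {
            // rethrown as designed
        }
        System.out.println(lastReport.get()); // prints "7: OutOfMemoryError: simulated"
    }
}
```

This is exactly the "best effort" caveat from point 2 above: if the heap is truly gone, the report may never leave the executor, which is why the rethrow (and the process kill) must not depend on the report succeeding.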
> Notify driver when OOM is thrown before executor JVM is killed
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> A customer got an OOM exception in one of the tasks, and the executor was
> killed with kill %p. It is unclear from the driver logs/Spark UI why the
> task or executor was lost; the customer had to look into the executor logs
> to see that OOM was the cause of the task/executor loss.
> It would be helpful if the driver logs/Spark UI showed the reason for task
> failures by making sure that the task updates the driver on OOM.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
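The "kill %p" in the report above is the JVM's {{OnOutOfMemoryError}} handler killing the process, which is why ordering the shutdown work matters: the hook that kills running tasks (and could surface the OOM) should run before disk cleanup. Here is a minimal sketch of that priority ordering; it is not Spark's actual ShutdownHookManager, and the priorities and names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch of priority-ordered shutdown hooks: higher priority runs first,
// so killing tasks happens before the disk cleanup hook.
public class PriorityShutdownHooks {
    private static final class Hook {
        final int priority;
        final Runnable body;
        Hook(int priority, Runnable body) { this.priority = priority; this.body = body; }
    }

    private final List<Hook> hooks = new ArrayList<>();

    void addHook(int priority, Runnable body) {
        hooks.add(new Hook(priority, body));
    }

    // Runs hooks highest-priority first, swallowing failures so one
    // broken hook cannot block the rest of shutdown.
    void runAll() {
        hooks.sort(Comparator.comparingInt((Hook h) -> h.priority).reversed());
        for (Hook h : hooks) {
            try { h.body.run(); } catch (Throwable t) { /* best effort */ }
        }
    }

    public static void main(String[] args) {
        List<String> order = new ArrayList<>();
        PriorityShutdownHooks mgr = new PriorityShutdownHooks();
        mgr.addHook(50, () -> order.add("disk cleanup"));        // illustrative priority
        mgr.addHook(100, () -> order.add("kill running tasks")); // runs first
        mgr.runAll();
        System.out.println(order); // prints [kill running tasks, disk cleanup]
    }
}
```

With this ordering, even when the process is about to die, the task-killing hook gets its chance to run (and log) before cleanup work consumes the remaining shutdown window.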