[ https://issues.apache.org/jira/browse/SPARK-11801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15032736#comment-15032736 ]

Mridul Muralidharan commented on SPARK-11801:
---------------------------------------------

To give background:

It was a conscious decision to kill the executor - once an OOM is thrown, the VM 
has already tried to free up memory and failed (including direct buffers, 
references, etc).
Outside of niche use cases, what causes the OOM and what it impacts is 
unpredictable - memory usage by one thread can cause an OOM in another (often 
unrelated) thread, daemon threads can die, Akka messaging can fail, DFS/local 
writes can fail, etc.
A clean restart is much more predictable than leaving the VM in an inconsistent 
state.
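
To illustrate (a minimal sketch, not Spark's actual handler - the names and the 
exit code are only for illustration): an uncaught-exception handler installed at 
executor startup that exits the VM as soon as an OOM reaches the top of a thread 
looks roughly like this:

    // Sketch only: halt the whole VM once any thread surfaces an OutOfMemoryError.
    object OomKillingHandler extends Thread.UncaughtExceptionHandler {
      override def uncaughtException(thread: Thread, throwable: Throwable): Unit = {
        throwable match {
          case _: OutOfMemoryError =>
            // System.exit runs the shutdown hooks; Runtime.halt would skip them.
            System.exit(52) // illustrative exit code for "died due to OOM"
          case t =>
            System.err.println(s"Uncaught exception in ${thread.getName}: $t")
        }
      }
    }

    // Installed once, early in executor startup:
    // Thread.setDefaultUncaughtExceptionHandler(OomKillingHandler)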

Why other cluster managers are not doing the same, I am not sure - probably 
something to change there as well?


To address points raised:
1) The "spurious" messages are actually coming from the execution of shutdown 
hooks - not from the OOM itself.
To elaborate: the OOM triggered the kill, which causes the VM to exit gracefully 
by invoking shutdown hooks, which in turn close sockets, close the DFS 
filesystem/references, delete local directories, etc - and those threads then 
log the errors/warnings.
You will see exactly the same behavior when SIGTERM is sent because the executor 
runs beyond its memory limits in YARN.
So it would be good to decouple this from the OOM issue.
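
As a small self-contained illustration (hypothetical cleanup hook, not Spark 
code): hooks registered at startup run when the OOM path calls System.exit, and 
it is their cleanup - not the OOM - that produces the log noise:

    // Sketch: the "spurious" warnings come from cleanup done on the way out.
    object ShutdownHookDemo {
      def main(args: Array[String]): Unit = {
        Runtime.getRuntime.addShutdownHook(new Thread("cleanup-hook") {
          override def run(): Unit = {
            // Closing sockets/DFS references and deleting local dirs happens here;
            // anything that fails gets logged even though the root cause was the OOM.
            System.err.println("WARN: connection closed while tasks were still running")
          }
        })
        // Simulate the OOM path: the handler decides to exit, hooks run on the way out.
        System.exit(52)
      }
    }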

2) Trying to send messages to the driver when facing an OOM (or from shutdown 
hooks) relies on unstable behavior with no guarantee of success - and it sets 
the wrong expectation with users about behavior they cannot actually rely on.
Actually, I would expect it to fail more often than it succeeds.
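
A sketch of why (notifyDriver is a hypothetical callback, not an existing Spark 
API) - the notification itself needs heap that may no longer be available:

    object OomNotify {
      // Best-effort notification from an OOM path; it can easily fail itself.
      def runTask(notifyDriver: String => Unit)(body: => Unit): Unit = {
        try {
          body
        } catch {
          case oom: OutOfMemoryError =>
            try {
              // Building/serializing this message allocates memory that may already
              // be exhausted, so this can throw a second OOM and the driver never
              // hears about the first one.
              notifyDriver(s"Executor hit OOM: ${oom.getMessage}")
            } catch {
              case _: Throwable => () // best effort only; swallow and keep exiting
            }
            throw oom
        }
      }
    }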


> Notify driver when OOM is thrown before executor JVM is killed 
> ---------------------------------------------------------------
>
>                 Key: SPARK-11801
>                 URL: https://issues.apache.org/jira/browse/SPARK-11801
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.5.1
>            Reporter: Srinivasa Reddy Vundela
>            Priority: Minor
>
> Here is some background for the issue.
> The customer got an OOM exception in one of the tasks and the executor was 
> killed with kill %p. It is unclear from the driver logs/Spark UI why the task 
> or the executor was lost. The customer has to look into the executor logs to 
> see that OOM was the cause of the lost task/executor.
> It would be helpful if the driver logs/Spark UI showed the reason for task 
> failures by making sure the task updates the driver when an OOM occurs.


