[ https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973506#comment-14973506 ]

Apache Spark commented on SPARK-11306:
--------------------------------------

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/9273

> Executor JVM loss can lead to a hang in Standalone mode
> -------------------------------------------------------
>
>                 Key: SPARK-11306
>                 URL: https://issues.apache.org/jira/browse/SPARK-11306
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>
> This commit:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
> introduced a bug where, in Standalone mode, if a task fails and crashes the
> JVM, the failure is treated as a "normal failure" (i.e., one unrelated to the
> task), so it isn't counted against the task's maximum number of failures:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138
> As a result, a task that crashes the JVM is re-launched indefinitely, and
> the job hangs.
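> To illustrate the mechanics, here is a minimal, self-contained sketch (NOT
> Spark's actual code; the names are illustrative stand-ins for the
> ExecutorLossReason handling):
> {code:scala}
> // Sketch of the failure-accounting decision on executor loss.
> object FailureAccountingSketch {
>   sealed trait ExecutorLossReason
>   // exitCausedByApp models whether the exit is attributed to the app's tasks.
>   final case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean)
>       extends ExecutorLossReason
>   case object ExecutorKilled extends ExecutorLossReason
>
>   // Should a task failure on the lost executor count toward
>   // spark.task.maxFailures?
>   def countsTowardMaxFailures(reason: ExecutorLossReason): Boolean =
>     reason match {
>       case ExecutorExited(_, exitCausedByApp) => exitCausedByApp
>       case ExecutorKilled                     => false
>     }
>
>   def main(args: Array[String]): Unit = {
>     // The buggy commit effectively classified a task-induced JVM crash as a
>     // "normal" exit (exitCausedByApp = false), so the failure count never
>     // grew and the task was re-launched forever:
>     println(countsTowardMaxFailures(ExecutorExited(exitCode = 134, exitCausedByApp = false)))
>   }
> }
> {code}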
> Unfortunately, this issue is difficult to reproduce because of a race
> condition: multiple code paths handle executor losses, and in the setup I'm
> using, Akka's notification that the executor was lost always reaches the
> TaskSchedulerImpl first, so the task eventually gets killed (see my recent
> email to the dev list).
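> A rough sketch of that race (all names here are hypothetical
> simplifications of the real code paths):
> {code:scala}
> // Two independent paths can report the same executor loss; whichever
> // reaches the scheduler first determines the loss reason that is used.
> object ExecutorLossRaceSketch {
>   sealed trait LossReason
>   final case class SlaveLost(msg: String) extends LossReason    // Akka disassociation path
>   final case class ExecutorExited(code: Int) extends LossReason // Master's exit-status path
>
>   // Stand-in for TaskSchedulerImpl: only the first report per executor is
>   // acted on, so the winner of the race picks the loss reason.
>   final class SchedulerStandIn {
>     private val seen = scala.collection.mutable.Set.empty[String]
>     def executorLost(execId: String, reason: LossReason): Unit =
>       if (seen.add(execId)) println(s"$execId lost: $reason")
>   }
>
>   def main(args: Array[String]): Unit = {
>     val scheduler = new SchedulerStandIn
>     // In the setup described above, the Akka notification always arrives
>     // first, so the task eventually gets killed and the hang never shows:
>     scheduler.executorLost("exec-1", SlaveLost("remote endpoint disassociated"))
>     scheduler.executorLost("exec-1", ExecutorExited(134)) // loses the race; ignored
>   }
> }
> {code}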


