Kay Ousterhout created SPARK-11306:
--------------------------------------

             Summary: Executor JVM loss can lead to a hang in Standalone mode
                 Key: SPARK-11306
                 URL: https://issues.apache.org/jira/browse/SPARK-11306
             Project: Spark
          Issue Type: Bug
            Reporter: Kay Ousterhout
            Assignee: Kay Ousterhout


This commit: 
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 
introduced a bug where, in Standalone mode, if a task fails and crashes the 
JVM, the failure is treated as a "normal" executor exit (i.e., one unrelated 
to the task), so it is not counted against the task's maximum number of 
failures: 
https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
As a result, a task that fails by crashing the JVM will be continuously 
re-launched, resulting in a hang.
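
For illustration, here is a minimal Scala sketch of the failure-accounting 
logic at issue. The names and class structure here are illustrative 
approximations, not the actual Spark internals: the point is that when an 
executor loss is classified as not caused by the application, the failure 
counter is never incremented, so a task that deterministically crashes the 
JVM is relaunched forever.

    // Hypothetical sketch of the failure-accounting bug (names are
    // illustrative, not the real Spark code paths).
    sealed trait ExecutorLossReason
    case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean)
      extends ExecutorLossReason

    class TaskSetManagerSketch(maxTaskFailures: Int) {
      private var numFailures = 0

      /** Called when the executor running a task is lost. */
      def executorLost(reason: ExecutorLossReason): Unit = reason match {
        case ExecutorExited(_, exitCausedByApp) =>
          if (exitCausedByApp) {
            // The failure counts toward spark.task.maxFailures; the task
            // set is aborted once the limit is reached.
            numFailures += 1
            if (numFailures >= maxTaskFailures) abortTaskSet()
            else relaunchTask()
          } else {
            // The bug: a JVM crash triggered by the task is classified as
            // a "normal" (app-unrelated) loss, so numFailures is never
            // incremented and the task is relaunched indefinitely -- a hang.
            relaunchTask()
          }
      }

      private def relaunchTask(): Unit = println("re-launching task")
      private def abortTaskSet(): Unit = println("aborting: too many failures")
    }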

Unfortunately, this issue is difficult to reproduce because of a race between 
the multiple code paths that handle executor losses: in the setup I'm using, 
Akka's notification that the executor was lost always reaches the 
TaskSchedulerImpl first, so the task eventually gets killed (see my recent 
email to the dev list).


