[ https://issues.apache.org/jira/browse/SPARK-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14973506#comment-14973506 ]
Apache Spark commented on SPARK-11306:
--------------------------------------

User 'kayousterhout' has created a pull request for this issue:
https://github.com/apache/spark/pull/9273

> Executor JVM loss can lead to a hang in Standalone mode
> -------------------------------------------------------
>
>                 Key: SPARK-11306
>                 URL: https://issues.apache.org/jira/browse/SPARK-11306
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Kay Ousterhout
>            Assignee: Kay Ousterhout
>
> This commit:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0
> introduced a bug where, in Standalone mode, if a task fails and crashes the
> JVM, the failure is considered a "normal failure" (that is, unrelated to the
> task), so it isn't counted against the task's maximum number of failures:
> https://github.com/apache/spark/commit/af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0#diff-a755f3d892ff2506a7aa7db52022d77cL138.
> As a result, a task that fails in a way that crashes the JVM will be
> re-launched continuously, resulting in a hang.
> Unfortunately, this issue is difficult to reproduce because of a race
> condition: multiple code paths handle executor losses, and in my setup,
> Akka's notification that the executor was lost always reaches the
> TaskSchedulerImpl first, so the task eventually gets killed (see my recent
> email to the dev list).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
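The failure-counting logic at issue can be illustrated with a minimal standalone sketch. This is not Spark's actual code; the class and method names (`TaskSetManager`, `handleExecutorLost*`, `maxTaskFailures`) merely mirror Spark concepts for illustration. The key point is the `exitCausedByApp` distinction: when an executor exit is caused by the running task (e.g. it crashed the JVM), the failure must count toward the task's failure limit, otherwise the task is relaunched forever.

```scala
// Standalone sketch (hypothetical names, not Spark's real classes) of the
// failure-counting behavior described in SPARK-11306.

sealed trait ExecutorLossReason

// exitCausedByApp = true means the application (the running task) caused
// the executor's exit, e.g. by crashing the JVM.
case class ExecutorExited(exitCausedByApp: Boolean) extends ExecutorLossReason

class TaskSetManager(maxTaskFailures: Int) {
  private var failures = 0
  var aborted = false

  // Buggy behavior: every executor loss is treated as a "normal failure",
  // so nothing is ever counted and an app-caused crash loops forever.
  def handleExecutorLostBuggy(reason: ExecutorLossReason): Unit = ()

  // Intended behavior: count app-caused exits against maxTaskFailures and
  // abort the task set once the limit is reached; losses unrelated to the
  // task (node decommission, OOM-killer, etc.) are not charged to it.
  def handleExecutorLostFixed(reason: ExecutorLossReason): Unit = reason match {
    case ExecutorExited(true) =>
      failures += 1
      if (failures >= maxTaskFailures) aborted = true
    case _ => ()
  }

  def failureCount: Int = failures
}
```

With the fixed logic, simulating the same task crashing the JVM four times against `maxTaskFailures = 4` drives `aborted` to true, whereas the buggy handler leaves the counter at zero indefinitely, matching the hang described above.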