[ https://issues.apache.org/jira/browse/SPARK-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matei Zaharia resolved SPARK-644.
---------------------------------
    Resolution: Fixed

> Jobs canceled due to repeated executor failures may hang
> --------------------------------------------------------
>
>                 Key: SPARK-644
>                 URL: https://issues.apache.org/jira/browse/SPARK-644
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.6.1
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>
> To prevent an infinite loop, the standalone master aborts jobs that
> experience more than 10 executor failures (see
> https://github.com/mesos/spark/pull/210). Currently, the master crashes when
> aborting jobs (this is the issue that uncovered SPARK-643). If we fix the
> crash, which involves removing a {{throw}} from the actor's {{receive}}
> method, then these failures can lead to a hang: the job is removed from the
> master's scheduler, but the upstream scheduler components are not notified
> of the failure and keep waiting for the job to finish.
>
> I've considered fixing this by adding callbacks that propagate the failure
> to the higher-level schedulers. It might be cleaner to move the decision to
> abort the job into the higher-level layers of the scheduler, which would
> send an {{AbortJob(jobId)}} message to the Master. The Client is already
> notified of executor state changes, so it may be able to make the decision
> to abort (or defer that decision to a higher layer).
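
The second proposal in the description (let the client decide to abort, then send {{AbortJob(jobId)}} to the Master) could look roughly like the sketch below. This is not Spark code: Spark's standalone master is written in Scala, and Python is used here only to keep the illustration short. The class names `Master` and `Client`, the `on_job_failed` callback, the `executor_failed` method, and the constant `MAX_EXECUTOR_FAILURES` are all hypothetical; only the `AbortJob(jobId)` message name and the "more than 10 failures" threshold come from the issue text.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Threshold taken from the issue text ("more than 10 executor failures").
MAX_EXECUTOR_FAILURES = 10


@dataclass
class AbortJob:
    """Hypothetical message asking the master to drop a job."""
    job_id: str


@dataclass
class Master:
    """Stand-in for the standalone master; it only tracks active jobs."""
    active_jobs: set = field(default_factory=set)

    def receive(self, message):
        # Handle the abort without raising, so the message loop keeps running.
        if isinstance(message, AbortJob):
            self.active_jobs.discard(message.job_id)


class Client:
    """Stand-in for the client, which already sees executor state changes."""

    def __init__(self, master, on_job_failed):
        self.master = master
        self.on_job_failed = on_job_failed  # callback into the upper scheduler
        self.failures = defaultdict(int)
        self.aborted = set()

    def executor_failed(self, job_id):
        if job_id in self.aborted:
            return  # the abort decision was already made for this job
        self.failures[job_id] += 1
        if self.failures[job_id] > MAX_EXECUTOR_FAILURES:
            self.aborted.add(job_id)
            # Tell the upper layer first so nothing keeps waiting on the job,
            # then ask the master to remove it.
            self.on_job_failed(job_id, "too many executor failures")
            self.master.receive(AbortJob(job_id))


failed_jobs = []
master = Master(active_jobs={"job-1"})
client = Client(master, lambda job_id, reason: failed_jobs.append((job_id, reason)))
for _ in range(MAX_EXECUTOR_FAILURES + 1):
    client.executor_failed("job-1")
```

After the eleventh failure, `failed_jobs` holds one entry for `job-1` and the job is no longer in `master.active_jobs`. Because the layer that decides to abort is also the one that reports the failure upward, the master never removes a job that the upper scheduler still believes is running, which is the hang described above.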