Saisai Shao created SPARK-13642:
-----------------------------------

             Summary: Inconsistent finishing state between driver and AM
                 Key: SPARK-13642
                 URL: https://issues.apache.org/jira/browse/SPARK-13642
             Project: Spark
          Issue Type: Bug
          Components: YARN
    Affects Versions: 1.6.0
            Reporter: Saisai Shao
Currently, when running Spark on YARN in yarn-cluster mode, the application's default final state is "SUCCEEDED"; if any exception occurs, the final state is changed to "FAILED", which triggers a reattempt if possible. This is fine in the normal case, but there is a race condition when the AM receives a signal (SIGTERM) and no exception is thrown. In that situation the shutdown hook is invoked and marks the application as finished successfully, and no further attempt is made. From Spark's perspective the application actually failed and needs another attempt, but from YARN's perspective it finished successfully. This can happen when an NM fails: the NM failure sends SIGTERM to the AM, and the AM should mark the attempt as failed and rerun, not unregister.

To increase the chance of hitting this race condition, here is code to reproduce it:

{code}
val sc = ... // create the SparkContext as usual
Thread.sleep(30000L)
sc.parallelize(1 to 100).collect()
{code}

If the AM is killed while sleeping, no exception is thrown, so from YARN's point of view the application finished successfully, but from Spark's point of view it should be reattempted.

So basically, I think we should only mark the application as "SUCCESS" after the user class has finished; otherwise, especially in the signal-stopped scenario, it would be better to mark the attempt as failed and try again (except for an explicit KILL command from YARN).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
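The proposed behavior can be sketched as follows. This is a hypothetical illustration, not Spark's actual ApplicationMaster code; the names ({{AmStateSketch}}, {{runUserClass}}, {{statusOnShutdown}}) are made up. The idea is simply that a shutdown hook firing before the user class completes should report FAILED so YARN schedules another attempt:

{code}
// Hypothetical sketch: only report SUCCEEDED once the user class has
// finished; if the shutdown hook fires earlier (e.g. on SIGTERM from an
// NM failure), report FAILED so YARN can retry the attempt.
object FinalStatus extends Enumeration {
  val Undefined, Succeeded, Failed = Value
}

class AmStateSketch {
  @volatile private var userClassFinished = false
  @volatile private var finalStatus = FinalStatus.Undefined

  // Run the user class; mark success only after it returns normally.
  def runUserClass(body: => Unit): Unit = {
    try {
      body
      userClassFinished = true
      finalStatus = FinalStatus.Succeeded
    } catch {
      case _: Throwable => finalStatus = FinalStatus.Failed
    }
  }

  // Called from the shutdown hook: decide what to report to YARN.
  // A hook that fires before the user class finished means this attempt
  // was interrupted, so it must not be reported as a success.
  def statusOnShutdown: FinalStatus.Value =
    if (userClassFinished) finalStatus else FinalStatus.Failed
}
{code}

With this shape, the reproduce snippet above would report FAILED if the SIGTERM lands during the sleep, because {{userClassFinished}} is still false when the hook runs.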