[ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177347#comment-15177347 ]
Saisai Shao commented on SPARK-13642: ------------------------------------- [~tgraves] [~vanzin], would you please comment on this, why the default application final state is "SUCCESS"? Is it better to mark this application as "SUCCESS" only after user class is exited? Thanks a lot. > Inconsistent finishing state between driver and AM > --------------------------------------------------- > > Key: SPARK-13642 > URL: https://issues.apache.org/jira/browse/SPARK-13642 > Project: Spark > Issue Type: Bug > Components: YARN > Affects Versions: 1.6.0 > Reporter: Saisai Shao > > Currently when running Spark on Yarn with yarn cluster mode, the default > application final state is "SUCCEED", if any exception is occurred, this > final state will be changed to "FAILED" and trigger the reattempt if > possible. > This is OK in normal case, but if there's a race condition when AM received a > signal (SIGTERM) and no any exception is occurred. In this situation, > shutdown hook will be invoked and marked this application as finished with > success, and there's no another attempt. > In such situation, actually from Spark's aspect this application is failed > and need another attempt, but from Yarn's aspect the application is finished > with success. > This could happened in NM failure situation, the failure of NM will send > SIGTERM to AM, AM should make this attempt as failure and rerun again, not > invoke unregister. > So to increase the chance of this race condition, here is the reproduced code: > {code} > val sc = ... > Thread.sleep(30000L) > sc.parallelize(1 to 100).collect() > {code} > If the AM is failed in sleeping, there's no exception been thrown, so from > Yarn's point this application is finished successfully, but from Spark's > point, this application should be reattempted. > So basically, I think only after the finish of user class, we could mark this > application as "SUCCESS", otherwise, especially in the signal stopped > scenario, it would be better to mark as failed and try again (except > explicitly KILL command by yarn). -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org