[ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097653#comment-15097653 ]
Rohith Sharma K S commented on YARN-4497:
-----------------------------------------

bq. "If attempt 1~28 are removed and attempt 29~31 has been saved to appstore successfully, there will be no NPE for RM recovery." I think we need analyze the RM log more. Removing attempts will cause NPE only when RM continues to run when failing to operate(e.g. store/remove) on RMStateStore. Is there any other case might cause NPE? Maybe we need fix it.

The issue reproduces directly in the above case. Can you run the test written for this JIRA, without the fix, with a slight change like the one below, which is similar to YARN-3480?
{code}
memStore.removeApplicationAttemptInternal(am0.getApplicationAttemptId());
memStore.removeApplicationAttemptInternal(am1.getApplicationAttemptId());
{code}

{color:red}Reason{color} : While recovering, nextAttemptId is set to firstAttemptIdInStateStore only if {{submissionContext.getAttemptFailuresValidityInterval() > 0}}.
{code}
if (submissionContext.getAttemptFailuresValidityInterval() > 0) {
  this.firstAttemptIdInStateStore = appState.getFirstAttemptId();
  this.nextAttemptId = firstAttemptIdInStateStore;
}
{code}
What if {{submissionContext.getAttemptFailuresValidityInterval()}} is not set? The attempt id will then always start from 1, even though that attempt has been removed.
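To make the reason above concrete, here is a minimal, self-contained sketch of the same condition. It is not the real RMAppImpl code: the class {{AttemptRecoverySketch}}, its methods, and the map standing in for the state store are all hypothetical, and a non-positive interval is assumed to mean "not set".

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, simplified model of attempt recovery; not the real RMAppImpl. */
class AttemptRecoverySketch {

    /**
     * Mirrors the quoted condition: recovery starts from the first attempt id
     * found in the state store only when the validity interval is positive.
     */
    static int startingAttemptId(long validityIntervalMs, int firstAttemptIdInStateStore) {
        int nextAttemptId = 1; // default when the interval is not set (assumed non-positive)
        if (validityIntervalMs > 0) {
            nextAttemptId = firstAttemptIdInStateStore;
        }
        return nextAttemptId;
    }

    /** True when the first attempt that recovery looks up still exists in the store. */
    static boolean firstLookupSucceeds(Map<Integer, String> store,
                                       long validityIntervalMs,
                                       int firstAttemptIdInStateStore) {
        int id = startingAttemptId(validityIntervalMs, firstAttemptIdInStateStore);
        return store.get(id) != null; // a null here corresponds to the NPE in recover()
    }

    public static void main(String[] args) {
        // Attempts 1 and 2 were removed; only attempt 3 remains in the store.
        Map<Integer, String> store = new HashMap<>();
        store.put(3, "attempt-3 state");

        System.out.println(firstLookupSucceeds(store, -1L, 3));     // prints "false"
        System.out.println(firstLookupSucceeds(store, 60_000L, 3)); // prints "true"
    }
}
```

With the interval unset, the lookup starts at attempt 1, which is no longer in the store, so the recovered state is null.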

*log*:
*Before recovery*
{noformat}
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore (MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing state for attempt: appattempt_1452750396633_0001_000001
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore (MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing state for attempt: appattempt_1452750396633_0001_000002
{noformat}

*After Recovery*
I removed *attemptState.getState()* from the log message to show that the attempt is created with attempt id 1.
{noformat}
2016-01-14 11:16:52,885 INFO  [Thread-2] attempt.RMAppAttemptImpl (RMAppAttemptImpl.java:recover(886)) - Recovering attempt: appattempt_1452750396633_0001_000001 with final state:
2016-01-14 11:16:52,885 ERROR [Thread-2] resourcemanager.ResourceManager (ResourceManager.java:serviceStart(599)) - Failed to load/recover state
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:889)
{noformat}

I will reopen YARN-4584 for a quick fix, i.e. before removing attempts from the state store, we need to check the validity interval. Let's move the discussion there.

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing
> attempts in RMStateStore. For example, when storing attempt1, attempt2 and
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store
> attempt2.
> When RM restarts, in *RMAppImpl#recover*, we recover attempts one
> by one. In this case, we will recover attempt1, then attempt2. When
> recovering attempt2, we call
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up
> its ApplicationAttemptStateData. Because it cannot find it, an error occurs
> at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).

--
This message was sent by Atlassian JIRA (v6.3.4#6332)