[ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095389#comment-15095389 ]
Jun Gong commented on YARN-4497:
--------------------------------

[~jianhe] Thanks for the review and comments.

{quote}
for the patch, I think making below change in RMAppImpl#recover may be enough ?
{quote}

There might be some problems:
1. *appState.attempts.keySet()* is not sorted by attempt ID, but we need to recover the attempts in order, because we use *currentAttempt* to get the AMBlacklist and we call *getNumFailedAppAttempts()* in *createNewAttempt()*.
2. We need to update *nextAttemptId* after recovering the attempts.
3. We need to deal with case 2 from the previous comment: the attempt's final state is missing (RM failed to store it). Otherwise RM will relaunch this attempt: it will be in the *LAUNCHED* state after recovery and will time out (the attempt has already failed), and then RM will relaunch it.

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing
> attempts in RMStateStore. For example, when storing attempt1, attempt2 and
> attempt3, RM successfully stored attempt1 and attempt3 but failed to store
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one
> by one, so in this case we will recover attempt1, then attempt2. When
> recovering attempt2, we call
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks up
> its ApplicationAttemptStateData but cannot find it, so an error occurs
> at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
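
The three concerns in the comment can be illustrated with a small, self-contained sketch. It is not the YARN-4497 patch and does not use the real YARN classes: `AttemptRecoverySketch`, `StoredAttempt`, and `RecoveryResult` are hypothetical names introduced only for illustration. It assumes attempts are keyed by an integer attempt ID, that a missing attempt is simply absent from the map, and that a `null` final state models an attempt whose final state was never stored. Treating a non-latest attempt without a final state as failed is one possible policy, assumed here, and may differ from what the patch actually does.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these types are part of the YARN API.
final class AttemptRecoverySketch {

    /** Stand-in for stored attempt data; finalState == null means it was never stored. */
    static final class StoredAttempt {
        final int attemptId;
        final String finalState;

        StoredAttempt(int attemptId, String finalState) {
            this.attemptId = attemptId;
            this.finalState = finalState;
        }
    }

    /** Outcome of the sketched recovery pass. */
    static final class RecoveryResult {
        final List<Integer> recoveredIds = new ArrayList<>();
        final List<Integer> treatedAsFailed = new ArrayList<>();
        int nextAttemptId = 1;
    }

    static RecoveryResult recover(Map<Integer, StoredAttempt> stored) {
        RecoveryResult result = new RecoveryResult();

        // Point 1: a plain key set has no guaranteed order, so copy into a
        // TreeMap, which iterates keys in ascending attempt-ID order.
        TreeMap<Integer, StoredAttempt> sorted = new TreeMap<>(stored);
        int lastId = sorted.isEmpty() ? 0 : sorted.lastKey();

        // Only attempts that are actually present are visited, so a gap
        // (e.g. a missing attempt2) is skipped instead of tripping an assert.
        for (Map.Entry<Integer, StoredAttempt> entry : sorted.entrySet()) {
            StoredAttempt attempt = entry.getValue();
            result.recoveredIds.add(attempt.attemptId);

            // Point 3 (assumed policy): an older attempt with no stored final
            // state cannot still be running if a newer attempt exists.
            if (attempt.finalState == null && attempt.attemptId != lastId) {
                result.treatedAsFailed.add(attempt.attemptId);
            }
        }

        // Point 2: the next attempt ID must follow the highest recovered ID,
        // not the number of recovered attempts.
        result.nextAttemptId = lastId + 1;
        return result;
    }

    public static void main(String[] args) {
        // attempt1 and attempt3 were stored, attempt2 is missing.
        Map<Integer, StoredAttempt> stored = new HashMap<>();
        stored.put(3, new StoredAttempt(3, null));
        stored.put(1, new StoredAttempt(1, "FAILED"));

        RecoveryResult result = recover(stored);
        System.out.println("recovered=" + result.recoveredIds
            + " treatedAsFailed=" + result.treatedAsFailed
            + " nextAttemptId=" + result.nextAttemptId);
    }
}
```

In this sketch the next attempt ID is derived from the highest recovered ID rather than from the count of recovered attempts, so a gap in the stored attempts does not cause an ID to be reused.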