[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15075118#comment-15075118
 ] 

Jun Gong commented on YARN-4497:
--------------------------------

In the patch, it deals with two cases:

1. attempt is missed
When recovering a attempt, remove the attempt from app.attempts if we could not 
find corresponding ApplicationAttemptStateData from RMStateStore. There is no 
ApplicationAttemptStateData found, it means corresponding AM is never 
launched(launching AM is after receiving event 
*RMAppAttemptEventType.ATTEMPT_NEW_SAVED*, and we must not have received the 
event.).

2. attempt's final state is missed(fail to store its final state)
When recovering these attempts, we set their state to FAILED(or any other final 
state, or adding a state UNKOWN if needed), then the attempt could deal well 
with event *RMAppAttemptEventType.RECOVER*. 

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to