[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

Sunil G (JIRA) Wed, 13 Jan 2016 07:09:07 -0800

    [ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15096309#comment-15096309
 ]


Sunil G commented on YARN-4497:
-------------------------------

Hi [~hex108]
I second the idea of sorting {{appState.attempts.keySet()}}, which looks 
cleaner.
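
For illustration, a minimal sketch of the sorting idea, assuming the stored attempts can be modelled as a map keyed by attempt number. The class name, the method name and the Integer/String types are hypothetical stand-ins, not the actual RMAppImpl code or the attached patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for recovering only the attempts that were stored.
class SortedAttemptRecovery {
    // Returns the stored attempt ids in ascending order, so recovery
    // visits existing attempts only and never looks up a missing one.
    static List<Integer> recoveryOrder(Map<Integer, String> storedAttempts) {
        List<Integer> ids = new ArrayList<>(storedAttempts.keySet());
        Collections.sort(ids);
        return ids;
    }

    public static void main(String[] args) {
        Map<Integer, String> stored = new HashMap<>();
        stored.put(3, "attempt3");
        stored.put(1, "attempt1"); // attempt2 was never stored
        for (int id : recoveryOrder(stored)) {
            System.out.println("recovering " + stored.get(id));
        }
    }
}
```

Iterating over the sorted key set rather than over a counter means the gap left by attempt2 is skipped instead of being dereferenced.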

bq. attempt.recoveredFinalStatus is always being set to FAILED. These 
attempts might also be KILLED/FINISHED.
Yes, there is no clear way to update this. We cannot rely much on the 
diagnostics either. I feel keeping FAILED is fine until we have clear 
information to update it to KILLED. I don't think having a final state UNKNOWN 
is a good idea; a new final state would add too much complexity.
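
A rough sketch of the fallback being discussed, assuming a missing attempt simply has no stored final status. The enum, class and method below are hypothetical and only illustrate defaulting to FAILED; they are not taken from the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of choosing a final status for a recovered attempt.
class MissingAttemptStatus {
    enum FinalStatus { FINISHED, FAILED, KILLED }

    // Uses the stored status when present; falls back to FAILED when the
    // attempt is missing from the store, since nothing better is known.
    static FinalStatus recoveredFinalStatus(Map<Integer, FinalStatus> stored,
                                            int attemptId) {
        FinalStatus status = stored.get(attemptId);
        return status != null ? status : FinalStatus.FAILED;
    }

    public static void main(String[] args) {
        Map<Integer, FinalStatus> stored = new HashMap<>();
        stored.put(1, FinalStatus.KILLED);
        stored.put(3, FinalStatus.FINISHED); // attempt2 is missing
        for (int id = 1; id <= 3; id++) {
            System.out.println("attempt" + id + " -> "
                + recoveredFinalStatus(stored, id));
        }
    }
}
```

The fallback keeps the existing set of final states unchanged, which is the trade-off argued for above: no UNKNOWN state, at the cost of possibly reporting a KILLED or FINISHED attempt as FAILED.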



> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, those attempts will be 
> missing from RMStateStore. For example, when storing attempt1, attempt2 and 
> attempt3, RM successfully stores attempt1 and attempt3 but fails to store 
> attempt2. When RM restarts, *RMAppImpl#recover* recovers the attempts one 
> by one, so in this case it recovers attempt1, then attempt2. When 
> recovering attempt2, it calls 
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks 
> up the attempt's ApplicationAttemptStateData. The lookup finds nothing, so an 
> error is raised at *assert attemptState != null* (*RMAppAttemptImpl#recover*, 
> line 880).
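
The failure described in the quoted issue text can be modelled in miniature. This sketch assumes recovery walks attempt numbers 1..N by counter and looks each one up in the store; the class, method and types are hypothetical, and the real code paths are RMAppImpl#recover and RMAppAttemptImpl#recover.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of counter-based recovery hitting a missing attempt.
class MissingAttemptLookup {
    // Returns the first attempt number in 1..attemptCount with no stored
    // state (where a non-null assertion would fail), or -1 if none.
    static int firstMissingAttempt(Map<Integer, String> stored,
                                   int attemptCount) {
        for (int id = 1; id <= attemptCount; id++) {
            if (stored.get(id) == null) {
                return id;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<Integer, String> stored = new HashMap<>();
        stored.put(1, "attempt1");
        stored.put(3, "attempt3"); // the write of attempt2 failed
        System.out.println("recovery would fail at attempt "
            + firstMissingAttempt(stored, 3));
    }
}
```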



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
