[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

Rohith Sharma K S (JIRA) Wed, 13 Jan 2016 21:54:07 -0800

    [ https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097653#comment-15097653 ]


Rohith Sharma K S commented on YARN-4497:
-----------------------------------------

bq. "If attempt 1~28 are removed and attempt 29~31 has been saved to appstore 
successfully, there will be no NPE for RM recovery." I think we need analyze 
the RM log more. Removing attempts will cause NPE only when RM continues to run 
when failing to operate(e.g. store/remove) on RMStateStore. Is there any other 
case might cause NPE? Maybe we need fix it.
The issue reproduces straightforwardly in the above case. Can you run the test 
written for this JIRA, without the fix, with a slight change like the one 
below, which is similar to YARN-3480?
{code}
memStore.removeApplicationAttemptInternal(am0.getApplicationAttemptId());
memStore.removeApplicationAttemptInternal(am1.getApplicationAttemptId());
{code}
{color:red}Reason{color} : While recovering, nextAttemptId is set to 
firstAttemptIdInStateStore only if 
{{submissionContext.getAttemptFailuresValidityInterval() > 0}}.
{code}
    if (submissionContext.getAttemptFailuresValidityInterval() > 0) {
      this.firstAttemptIdInStateStore = appState.getFirstAttemptId();
      this.nextAttemptId = firstAttemptIdInStateStore;
    }
{code}
What if {{submissionContext.getAttemptFailuresValidityInterval()}} is not set? 
The attempt id will always start from 1 even though that attempt has been 
removed.
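To illustrate the point above, here is a minimal, self-contained sketch of the recovery behaviour. It is not the RMAppImpl code: the class, the method names, and the map standing in for the state store are hypothetical, and the value -1 is only assumed to represent an unset interval. Only the {{validity interval > 0}} condition is taken from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the recovery logic; not the actual RMAppImpl code.
class AttemptRecoverySketch {

    // Mirrors the quoted condition: the first stored attempt id is used only
    // when the validity interval is positive; otherwise recovery starts at 1.
    static int firstAttemptIdToRecover(long validityIntervalMs,
                                       int firstAttemptIdInStateStore) {
        int nextAttemptId = 1;
        if (validityIntervalMs > 0) {
            nextAttemptId = firstAttemptIdInStateStore;
        }
        return nextAttemptId;
    }

    // Stand-in for the state store lookup; returns null for a removed attempt.
    static String lookupAttemptState(Map<Integer, String> store, int attemptId) {
        return store.get(attemptId);
    }

    public static void main(String[] args) {
        Map<Integer, String> store = new HashMap<>();
        store.put(3, "FAILED"); // attempts 1 and 2 were removed, only 3 remains

        int withoutInterval = firstAttemptIdToRecover(-1, 3);
        int withInterval = firstAttemptIdToRecover(10_000, 3);

        System.out.println("interval unset: recover from attempt " + withoutInterval
            + ", state = " + lookupAttemptState(store, withoutInterval));
        System.out.println("interval set:   recover from attempt " + withInterval
            + ", state = " + lookupAttemptState(store, withInterval));
    }
}
```

With the interval unset, the sketch starts from attempt 1 and the lookup returns null, which corresponds to the missing attempt state behind the NPE in the log below.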
*log*:
*Before recovery*
{noformat}
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore 
(MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing 
state for attempt: appattempt_1452750396633_0001_000001
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore 
(MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing 
state for attempt: appattempt_1452750396633_0001_000002
{noformat}
*After Recovery* I have removed *attemptState.getState()* from the log message 
to show that the attempt is created with attempt id 1
{noformat}
2016-01-14 11:16:52,885 INFO  [Thread-2] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:recover(886)) - Recovering attempt: 
appattempt_1452750396633_0001_000001 with final state: 
2016-01-14 11:16:52,885 ERROR [Thread-2] resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(599)) - Failed to load/recover state
java.lang.NullPointerException
        at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:889)
{noformat}

I will reopen YARN-4584 for a quick fix, i.e. before removing attempts from 
the state store, we need to check the validity interval. We can move the 
discussion there.
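
A rough sketch of the kind of check proposed above, under stated assumptions: the class, the method, the {{maxStored}} parameter, and the list standing in for the state store are all hypothetical and do not reflect the actual RMAppImpl or RMStateStore code. Treating a non-positive interval as "not set" is an assumption as well.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of guarding attempt removal; not the YARN-4584 patch.
class AttemptRemovalGuardSketch {

    // Drops the oldest stored attempt ids beyond maxStored, but only when the
    // validity interval is positive, because only then does recovery start
    // from the first attempt id that is still present in the store.
    static List<Integer> pruneStoredAttempts(List<Integer> storedAttemptIds,
                                             int maxStored,
                                             long validityIntervalMs) {
        List<Integer> kept = new ArrayList<>(storedAttemptIds);
        if (validityIntervalMs <= 0) {
            return kept; // interval not set: keep every attempt
        }
        while (kept.size() > maxStored) {
            kept.remove(0); // remove(int index): drops the oldest entry
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Integer> ids = List.of(1, 2, 3);
        System.out.println(pruneStoredAttempts(ids, 1, -1));
        System.out.println(pruneStoredAttempts(ids, 1, 10_000));
    }
}
```

The guard keeps attempt 1 in the store whenever recovery would start from attempt id 1, so the lookup during recovery cannot come back empty for that reason.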

> RM might fail to restart when recovering apps whose attempts are missing
> ------------------------------------------------------------------------
>
>                 Key: YARN-4497
>                 URL: https://issues.apache.org/jira/browse/YARN-4497
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>            Priority: Critical
>         Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Found the following problem while discussing YARN-3480.
> If RM fails to store some attempts in RMStateStore, those attempts will be 
> missing from RMStateStore. For example, when storing attempt1, attempt2 and 
> attempt3, RM successfully stores attempt1 and attempt3 but fails to store 
> attempt2. When RM restarts, *RMAppImpl#recover* recovers attempts one by 
> one, so in this case it recovers attempt1, then attempt2. When recovering 
> attempt2, it calls 
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, which first looks 
> up its ApplicationAttemptStateData; that lookup fails, and an error is raised 
> at *assert attemptState != null* (*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
