[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112244#comment-15112244
 ] 

Jun Gong commented on YARN-4497:


[~rohithsharma] thanks, I just attached a rebased patch.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch, YARN-4497.04.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112134#comment-15112134
 ] 

Rohith Sharma K S commented on YARN-4497:
-

[~hex108] would mind rebase the patch? Since YARN-4584 has gone in first, there 
are few conflicts:-(

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112126#comment-15112126
 ] 

Rohith Sharma K S commented on YARN-4497:
-

+1, committing shortly

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112378#comment-15112378
 ] 

Hadoop QA commented on YARN-4497:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
20s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 37s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 21s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
15s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
20s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 18s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 41s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 68m 31s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 152m 31s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesAppsModification |
|   | hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_91 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12783797/YARN-4497.04.patch |
| JIRA Issue | YARN-4497 |
| Optional Tests |  

[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112551#comment-15112551
 ] 

Jun Gong commented on YARN-4497:


[~rohithsharma]  Thanks for the review, comments and commit! Thanks [~jianhe], 
[~sunilg] for the review and comments!

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Fix For: 2.9.0
>
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch, YARN-4497.04.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-22 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112511#comment-15112511
 ] 

Hudson commented on YARN-4497:
--

FAILURE: Integrated in Hadoop-trunk-Commit #9166 (See 
[https://builds.apache.org/job/Hadoop-trunk-Commit/9166/])
YARN-4497. RM might fail to restart when recovering apps whose attempts 
(rohithsharmaks: rev d6258b33a7428a0725ead96bc43f4dd444c7c8f1)
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/TestRMRestart.java
* 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/RMAppImpl.java
* hadoop-yarn-project/CHANGES.txt


> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Fix For: 2.9.0
>
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch, YARN-4497.04.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-21 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111507#comment-15111507
 ] 

Jian He commented on YARN-4497:
---

looks good to me, minor comments is I think setRecoveredFinalState and 
getRecoveredFinalState does not need to acquire the lock, as they happen 
sequentially.
this code can be formatted into single lines like below.
{code}
  if (preAttempt != null && preAttempt.getRecoveredFinalState() == null) {
preAttempt.setRecoveredFinalState(RMAppAttemptState.FAILED);
  }
{code}

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-21 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111823#comment-15111823
 ] 

Jun Gong commented on YARN-4497:


[~jianhe] Thanks for review. Attach a new patch to address above problems.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-21 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15112023#comment-15112023
 ] 

Jian He commented on YARN-4497:
---

+1, thanks

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch, 
> YARN-4497.03.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-21 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15111943#comment-15111943
 ] 

Hadoop QA commented on YARN-4497:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 8m 
31s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 32s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
17s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
13s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 30s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
33s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 27s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
26s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 66m 21s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 67m 21s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
18s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 152m 58s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_91 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12783740/YARN-4497.03.patch |
| JIRA Issue | YARN-4497 |
| Optional Tests |  asflicense  compile  javac  javadoc  mvninstall  mvnsite  
unit  findbugs  checkstyle  |
| 

[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-17 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15104225#comment-15104225
 ] 

Rohith Sharma K S commented on YARN-4497:
-

+1 LGTM, I will wait for couple of days before committing this in. 
[~sunilg]/[~jianhe] do you have any comments on the patch?

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095956#comment-15095956
 ] 

Jun Gong commented on YARN-4497:


[~rohithsharma] Thanks for the comments and suggestion.

{quote}
As a side note : since YARN-3840 removes the attempts from RMStateStore, it is 
very prone to get this issue (YARN-4584) nevertheless of without RM HA is 
configured and fail fast is false.
{quote}
As I commented in YARN-4584, "If attempt 1~28 are removed and attempt 29~31 has 
been saved to appstore successfully, there will be no NPE for RM recovery." I 
think we need analyze the RM log more. Removing attempts will cause NPE only 
when RM continues to run when failing to operate(e.g. store/remove) on 
RMStateStore. Is there any other case might cause NPE? Maybe we need fix it.

{quote}
About the solution, it is bit tricky to identify during recovery that 
whether-application-is-failed-to-store VS 
failed-attempts-were-removed-after-interval.
{quote}
I think we do not need to identify these two cases, because it makes no 
different for recovery.

{quote}
So I think you can club both your solution and Jian He's thought together, so 
that we can eliminate failed-attempts-were-removed-after-interval attempts. And 
assume that attempts recovered are of failed to store only. 
{quote}
In *RMAppImpl#createNewAttempt()*, the first new attempt id is *nextAttemptId* 
which is initialized to the minimum attempt ID in RMStateStore in 
*RMAppImpl#recover()*. So we have skipped recovering those 
*failed-attempts-were-removed-after-interval* attempts. 

{quote}
Regarding iterating appState.attempts, it can be sorted before iterating it. If 
attempts are sorted, then there should not be problem with nextAttemptId.
{quote}
Yes, we could sort it.I will update the patch if needed. 

{quote}
attempt.recoveredFinalStatus is being set to always to FAILED. These attempts 
might be KILLED/FINISHED also.
{quote}
These attempts might be KILLED actually, but we could not make sure about it. 
If it is not reasonable to set it to FAILED, how about adding another 
state(e.g. UNKOWN)? My concern that is it will make things complex.

{quote}
getNumFailedAppAttempts() is violated if attempt is failed to store since this 
attempt is removed from attempts. And also note that if attempts is failed to 
store, then many information such as getNumFailedAppAttempts also wont be exact 
number since attempt failure is taken from attempt.
{quote}
Yes, the number is not exact number. I have not figured out a good method to 
solve it now :(.  Since RM HA is not so often and removed attempts are kept in 
memory, it might be acceptable.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097634#comment-15097634
 ] 

Bibin A Chundatt commented on YARN-4497:


# Also AM killed by RM cases too if possible

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Bibin A Chundatt (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097633#comment-15097633
 ] 

Bibin A Chundatt commented on YARN-4497:


[~hex108]
YARN-4584 logs i have shared to [~rohithsharma] offline.
# Could you also try adding testcases with 
{{conf.setInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 2)}} related to recovery 
scenarios too as part of patch  and amattempts more than 2.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097653#comment-15097653
 ] 

Rohith Sharma K S commented on YARN-4497:
-

bq. "If attempt 1~28 are removed and attempt 29~31 has been saved to appstore 
successfully, there will be no NPE for RM recovery." I think we need analyze 
the RM log more. Removing attempts will cause NPE only when RM continues to run 
when failing to operate(e.g. store/remove) on RMStateStore. Is there any other 
case might cause NPE? Maybe we need fix it.
The issue happens straightforwardly in the above case. Can you run test written 
for this JIRA without fix with slight  change like below which is similar to 
YARN-3480.
{code}
memStore.removeApplicationAttemptInternal(am0.getApplicationAttemptId());
memStore.removeApplicationAttemptInternal(am1.getApplicationAttemptId());
{code}
{color:red}Reason{color} : While recovering, nextAttemptId is set to 
firstAttemptIdInStateStore only if 
{{submissionContext.getAttemptFailuresValidityInterval() > 0}}.
{code}
if (submissionContext.getAttemptFailuresValidityInterval() > 0) {
  this.firstAttemptIdInStateStore = appState.getFirstAttemptId();
  this.nextAttemptId = firstAttemptIdInStateStore;
}
{code}
What if {{submissionContext.getAttemptFailuresValidityInterval()}} is not set? 
Attempt id will always start from 1 event thought attempt is removed.
*log*:
*Before recovery*
{noformat}
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore 
(MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing 
state for attempt: appattempt_1452750396633_0001_01
2016-01-14 11:16:52,775 INFO  [Thread-2] recovery.RMStateStore 
(MemoryRMStateStore.java:removeApplicationAttemptInternal(151)) - Removing 
state for attempt: appattempt_1452750396633_0001_02
{noformat}
*After Recovery* I have removed *attemptState.getState()* in the log message to 
show attempt is created is attempt id 1
{noformat}
2016-01-14 11:16:52,885 INFO  [Thread-2] attempt.RMAppAttemptImpl 
(RMAppAttemptImpl.java:recover(886)) - Recovering attempt: 
appattempt_1452750396633_0001_01 with final state: 
2016-01-14 11:16:52,885 ERROR [Thread-2] resourcemanager.ResourceManager 
(ResourceManager.java:serviceStart(599)) - Failed to load/recover state
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.recover(RMAppAttemptImpl.java:889)
{noformat}

I will reopen the YARN-4584 for quick fix i.e before removing attempts from 
state store, need to check of validity interval. We shall move the discussion 
there.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097687#comment-15097687
 ] 

Jun Gong commented on YARN-4497:


[~rohithsharma] Thanks for the analysis. Yes, it is a bug.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097434#comment-15097434
 ] 

Jun Gong commented on YARN-4497:


[~sunilg] Thanks for confirm and comments.

I just updated a new patch. Thanks for review.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch, YARN-4497.02.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097600#comment-15097600
 ] 

Hadoop QA commented on YARN-4497:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
27s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 31s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 38s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
11s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
32s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 25s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 16s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 241, now 241). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 35s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
13s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
17s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 60m 26s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 61m 36s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
17s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 139m 34s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_91 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | 
hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesDelegationTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 

[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-13 Thread Sunil G (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096309#comment-15096309
 ] 

Sunil G commented on YARN-4497:
---

Hi [~hex108]
I second the idea of sorting {{appState.attempts.keySet()}} which looks more 
clean.

bq.attempt.recoveredFinalStatus is being set to always to FAILED. These 
attempts might be KILLED/FINISHED also.
Yes, there are no clear way to update this. We cannot rely much on diagnostics 
also. I feel keeping FAILED is fine till we have some clear information to 
updates as KILLED. I dont this having a final state UNKNOWN is a good idea. Too 
much of complexity to have a new final state.



> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-12 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095740#comment-15095740
 ] 

Rohith Sharma K S commented on YARN-4497:
-

As a side node : since YARN-3840 removes the attempts from RMStateStore, it is 
very prone to get this issue (YARN-4584) nevertheless of *without RM HA is 
configured and fail fast is false*.

About the solution, it is bit tricky to identify during recovery that 
*whether-application-is-failed-to-store* VS 
*failed-attempts-were-removed-after-interval*. So I think you can club both 
your solution and [~jianhe]'s thought together, so that we can eliminate 
*failed-attempts-were-removed-after-interval* attempts. And assume that 
attempts recovered are of failed to store only. Thoughts?
Regarding iterating appState.attempts, it can be sorted before iterating it. If 
attempts are sorted, then there should not be problem with nextAttemptId.

About the patch,
# attempt.recoveredFinalStatus is being set to always to FAILED. These attempts 
might be KILLED/FINISHED also.
# *getNumFailedAppAttempts()* is violated if attempt is failed to store since 
this attempt is removed from *attempts*. And also note that if attempts is 
failed to store, then many information such as getNumFailedAppAttempts also 
wont be exact number since attempt failure is taken from attempt. 





> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-12 Thread Jian He (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15094923#comment-15094923
 ] 

Jian He commented on YARN-4497:
---

[~hex108], thanks for working on this. 
 for the patch, I think making below change in RMAppImpl#recover may be enough 
? 
{code}
-for(int i=0; i RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2016-01-12 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095389#comment-15095389
 ] 

Jun Gong commented on YARN-4497:


[~jianhe] Thanks for review and comments.

{quote}
for the patch, I think making below change in RMAppImpl#recover may be enough ?
{quote}
There might be some problems:
1. *appState.attempts.keySet()* is not sorted by attempt ID, however we need 
recover them by order because we use *currentAttempt* to get AMBlacklist and we 
calle *getNumFailedAppAttempts()* in *createNewAttempt()* .
2. We need update *nextAttemptId* after recovering attempts.
3. We need to deal with the case 2 in previous comment: attempt's final state 
is missed(fail to store its final state), otherwise it will cause RM to 
relaunch this attempt: it will be in *LAUNCEHD* state after recover, and will 
time out(the attempt has already failed), then RM will relaunch it.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>Priority: Critical
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-30 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075118#comment-15075118
 ] 

Jun Gong commented on YARN-4497:


In the patch, it deals with two cases:

1. attempt is missed
When recovering a attempt, remove the attempt from app.attempts if we could not 
find corresponding ApplicationAttemptStateData from RMStateStore. There is no 
ApplicationAttemptStateData found, it means corresponding AM is never 
launched(launching AM is after receiving event 
*RMAppAttemptEventType.ATTEMPT_NEW_SAVED*, and we must not have received the 
event.).

2. attempt's final state is missed(fail to store its final state)
When recovering these attempts, we set their state to FAILED(or any other final 
state, or adding a state UNKOWN if needed), then the attempt could deal well 
with event *RMAppAttemptEventType.RECOVER*. 

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
> Attachments: YARN-4497.01.patch
>
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-30 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15075257#comment-15075257
 ] 

Hadoop QA commented on YARN-4497:
-

| (x) *{color:red}-1 overall{color}* |
\\
\\
|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 0s 
{color} | {color:blue} Docker mode activated. {color} |
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s 
{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 
0s {color} | {color:green} The patch appears to include 1 new or modified test 
files. {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 7m 
34s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 25s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 
16s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 36s 
{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
15s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
10s {color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 20s 
{color} | {color:green} trunk passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 26s 
{color} | {color:green} trunk passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 
31s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 28s 
{color} | {color:green} the patch passed {color} |
| {color:red}-1{color} | {color:red} checkstyle {color} | {color:red} 0m 15s 
{color} | {color:red} Patch generated 1 new checkstyle issues in 
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager
 (total was 241, now 241). {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 34s 
{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} mvneclipse {color} | {color:green} 0m 
12s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 
0s {color} | {color:green} Patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 1m 
16s {color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 19s 
{color} | {color:green} the patch passed with JDK v1.8.0_66 {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 24s 
{color} | {color:green} the patch passed with JDK v1.7.0_91 {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 63m 42s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.8.0_66. {color} |
| {color:red}-1{color} | {color:red} unit {color} | {color:red} 65m 13s {color} 
| {color:red} hadoop-yarn-server-resourcemanager in the patch failed with JDK 
v1.7.0_91. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 
19s {color} | {color:green} Patch does not generate ASF License warnings. 
{color} |
| {color:black}{color} | {color:black} {color} | {color:black} 146m 12s {color} 
| {color:black} {color} |
\\
\\
|| Reason || Tests ||
| JDK v1.8.0_66 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
| JDK v1.7.0_91 Failed junit tests | 
hadoop.yarn.server.resourcemanager.TestClientRMTokens |
|   | hadoop.yarn.server.resourcemanager.TestAMAuthorization |
\\
\\
|| Subsystem || Report/Notes ||
| Docker |  Image:yetus/hadoop:0ca8df7 |
| JIRA Patch URL | 
https://issues.apache.org/jira/secure/attachment/12779998/YARN-4497.01.patch |
| JIRA 

[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069443#comment-15069443
 ] 

Rohith Sharma K S commented on YARN-4497:
-

Thinking when it can happen attempt1 is stored , attempt2  is not stored and 
attempt3 is stored? One way is manually delete the attempt2 node from 
zookeeper. 

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-23 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069454#comment-15069454
 ] 

Jun Gong commented on YARN-4497:


In *RMStateStore#notifyStoreOperationFailedInternal*, RMStateStore might skip 
store errors, so RMStateStore might fail to store attempt2 for some 
reasons(e.g. network error), but the app could continue running, and starts a 
new attempt attempt3, then RMStateStore stores attempt3 successfully(suppose 
network is OK now).

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069470#comment-15069470
 ] 

Rohith Sharma K S commented on YARN-4497:
-

Currently, If any errors happened while storing into RMstateStore then 
RMStatestore is FENCED. So no more attempts are stored in state-store. And the 
RMStatState store state machine has transition is only from {{ACTIVE to 
FENCED}} but there is No {{FENCED to ACTIVE}}. 

If I am missing anything in flow, could you explain elaborately? 

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-23 Thread Rohith Sharma K S (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069477#comment-15069477
 ] 

Rohith Sharma K S commented on YARN-4497:
-

I got your point, if RM HA is not configured and fail fast is false, this would 
happen.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (YARN-4497) RM might fail to restart when recovering apps whose attempts are missing

2015-12-23 Thread Jun Gong (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-4497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15069480#comment-15069480
 ] 

Jun Gong commented on YARN-4497:


Yes, it is the problem.

> RM might fail to restart when recovering apps whose attempts are missing
> 
>
> Key: YARN-4497
> URL: https://issues.apache.org/jira/browse/YARN-4497
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Jun Gong
>Assignee: Jun Gong
>
> Find following problem when discussing in YARN-3480.
> If RM fails to store some attempts in RMStateStore, there will be missing 
> attempts in RMStateStore, for the case storing attempt1, attempt2 and 
> attempt3, RM successfully stored attempt1 and attempt3, but failed to store 
> attempt2. When RM restarts, in *RMAppImpl#recover*, we recover attempts one 
> by one, for this case, we will recover attmept1, then attempt2. When 
> recovering attempt2, we call  
> *((RMAppAttemptImpl)this.currentAttempt).recover(state)*, it will first find 
> its ApplicationAttemptStateData, but it could not find it, an error will come 
> at *assert attemptState != null*(*RMAppAttemptImpl#recover*, line 880).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)