[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066251#comment-14066251 ] Hudson commented on YARN-1341:
--
FAILURE: Integrated in Hadoop-Yarn-trunk #616 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/616/])
YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611512)
* /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common/src/main/java/org/apache/hadoop/yarn/server/security/BaseNMTokenSecretManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeManager.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMLeveldbStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMNullStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/security/NMTokenSecretManagerInNM.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/NMMemoryStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/recovery/TestNMLeveldbStateStoreService.java
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security
* /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/security/TestNMTokenSecretManagerInNM.java

Recover NMTokens upon nodemanager restart
Key: YARN-1341
URL: https://issues.apache.org/jira/browse/YARN-1341
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.3.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Fix For: 2.6.0
Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch, YARN-1341v4-and-YARN-1987.patch, YARN-1341v5.patch, YARN-1341v6.patch, YARN-1341v7.patch
--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066342#comment-14066342 ] Hudson commented on YARN-1341:
--
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1835 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1835/])
YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611512)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066365#comment-14066365 ] Hudson commented on YARN-1341:
--
FAILURE: Integrated in Hadoop-Hdfs-trunk #1808 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1808/])
YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611512)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064870#comment-14064870 ] Hadoop QA commented on YARN-1341:
--
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656064/YARN-1341v7.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4344//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4344//console
This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065716#comment-14065716 ] Junping Du commented on YARN-1341:
--
+1. Patch looks good. Will commit it shortly.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065715#comment-14065715 ] Junping Du commented on YARN-1341:
--
I can confirm the test failure is not related to the patch, as it also shows up in YARN-2045. A similar issue happens in the AM WebServices (MAPREDUCE-5973) and the RM WebServices (YARN-2304) as well. Already filed YARN-2316 to track these failures.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065795#comment-14065795 ] Hudson commented on YARN-1341:
--
FAILURE: Integrated in Hadoop-trunk-Commit #5906 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/5906/])
YARN-1341. Recover NMTokens upon nodemanager restart. (Contributed by Jason Lowe) (junping_du: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1611512)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063381#comment-14063381 ] Junping Du commented on YARN-1341:
--
Thanks [~jlowe] for the reply above and the documentation work on the umbrella JIRA. I think we can add an error-handling section later covering all cases according to the discussion above, right? I also agree that a retry (configurable or not) may not be necessary at this point; we can add it in the future if we really need it. [~devaraj.k], what do you think?
The latest patch looks good to me overall. Some comments on minor issues. In NMTokenSecretManagerInNM.java:
{code}
+// if there was no master key, try the previous key
+if (super.currentMasterKey == null) {
+  super.currentMasterKey = previousMasterKey;
+}
{code}
Is the above code still necessary, given that currentMasterKey will be updated soon via RM registration as we discussed above?
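For context, the fallback being questioned can be sketched as a small standalone class. This is an illustrative simplification, not the actual NMTokenSecretManagerInNM internals: keys are reduced to Integer ids, and the class and method names are made up for this sketch.

```java
// Simplified stand-in for the recovery fallback quoted above. The
// real code works with org.apache.hadoop.yarn.server.api.records.MasterKey;
// here a key is just an Integer id.
public class KeyRecoverySketch {
    Integer currentMasterKey;   // may be null if it was never persisted
    Integer previousMasterKey;

    // Mirrors the quoted patch logic: if no current key was recovered
    // from the state store, fall back to the previous key so the NM can
    // still validate recently issued NMTokens until re-registration
    // with the RM pushes a fresh current key.
    void recover(Integer storedCurrent, Integer storedPrevious) {
        currentMasterKey = storedCurrent;
        previousMasterKey = storedPrevious;
        if (currentMasterKey == null) {
            currentMasterKey = previousMasterKey;
        }
    }

    public static void main(String[] args) {
        KeyRecoverySketch s = new KeyRecoverySketch();
        s.recover(null, 42);
        System.out.println(s.currentMasterKey);  // prints 42
    }
}
```

The question in the comment is whether this fallback matters in practice, since the RM re-registration that follows restart will overwrite currentMasterKey anyway; the fallback only narrows the window before that registration completes.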
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063704#comment-14063704 ] Hadoop QA commented on YARN-1341:
--
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12656064/YARN-1341v7.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager:
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesContainers
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServices
org.apache.hadoop.yarn.server.nodemanager.webapp.TestNMWebServicesApps
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4323//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4323//console
This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063981#comment-14063981 ] Jason Lowe commented on YARN-1341:
--
I believe the test failures are unrelated. All of the nodemanager tests pass for me locally, and the build report with the test failures shows they're all of the java.net.BindException: Address already in use variety. Other web services tests have been failing in MAPREDUCE in a similar way -- I suspect a process from a previous test run got stuck on a popular web service port.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14064427#comment-14064427 ] Junping Du commented on YARN-1341:
--
Agree. The test failures shouldn't be related to the latest patch. Kicking off the Jenkins test again.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14061891#comment-14061891 ] Junping Du commented on YARN-1341:
--
Hey [~jlowe], I also agree it is better to discuss the inconsistent-state scenarios for each case on separate JIRAs. However, for now, our conclusions from these discussions can only be shown true in theory; there may still be bugs or issues in practice. Thus, I also suggest we have a central place to document these assumptions/conclusions from the discussions; it would help us and others in the community identify potential issues when coming up with unit tests or other integration tests on negative cases later. What do you think? If you also agree, we can separate this documentation effort into another JIRA (the umbrella or a dedicated one, whatever you like) and continue the discussion on this particular case.
On this particular one, the assumptions from the discussion above seem to be: if the NM restarts with stale keys,
a. if currentMasterKey is stale, it will be updated and overridden soon when registering with the RM later. Nothing is affected.
b. if previousMasterKey is stale, then the real previous master key is lost, so the effect is: AMs holding the real previous master key cannot connect to the NM to launch containers.
c. if applicationMasterKeys are stale, then the old keys tracked in applicationMasterKeys are lost after restart. The effect is: AMs holding old keys cannot connect to the NM to launch containers.
Given the effects listed here, I would prefer option 1 too. Anything I am missing?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14062861#comment-14062861 ] Jason Lowe commented on YARN-1341:
--
Thanks for commenting, Devaraj! My apologies for the late reply, as I was on vacation and am still catching up.
bq. In addition to option 1), I'd think of bringing the NM down if the NM fails to store RM keys a certain (configurable) number of times consecutively.
As for retries, I mentioned earlier that if retries are likely to help then the state store implementation should do so rather than have the common code do so. For the leveldb implementation it is very unlikely that a retry is going to do anything other than just make the operation take longer to ultimately fail. The firmware of the drive is already going to implement a large number of retries to attempt to recover from hardware errors, and non-hardware local filesystem errors are highly unlikely to be fixed by simply retrying immediately. If that were the case then I'd expect retries to be implemented in many other places where the local filesystem is used by Hadoop code.
bq. And also we can make it (i.e. tear down the NM or not) configurable
I'd like to avoid adding yet more config options unless we think we really need them, but if people agree this needs to be configurable then we can do so. Also I assume in that scenario you would want the NM to shut down while also tearing down containers, cleaning up, etc. as if it didn't support recovery. Tearing down the NM on a state store error just to have it start up again and try to recover with stale state seems pointless -- might as well have just kept running, which is a better outcome. Or am I missing a use case for that?
And thanks, Junping, for the recent comments!
bq. If you also agree, we can separate this documentation effort into another JIRA (the umbrella or a dedicated one, whatever you like) and continue the discussion on this particular case.
Sure, we can discuss general error handling or an overall document for it either on YARN-1336 or a new JIRA.
bq. a. if currentMasterKey is stale, it will be updated and overridden soon when registering with the RM later. Nothing is affected.
Correct, the NM should receive the current master key upon re-registration with the RM after it restarts.
bq. b. if previousMasterKey is stale, then the real previous master key is lost, so the effect is: AMs holding the real previous master key cannot connect to the NM to launch containers.
AMs that have the current master key will still be able to connect because the NM just got the current master key as described in a). AMs that have the previous master key will not be able to connect to the NM unless that particular master key also happened to be successfully associated with the attempt in the state store (related to case c).
bq. c. if applicationMasterKeys are stale, then the old keys tracked in applicationMasterKeys are lost after restart. The effect is: AMs holding old keys cannot connect to the NM to launch containers.
AMs that use an old key (i.e.: not the current or previous master key) would be unable to connect to the NM.
bq. Anything I am missing?
I don't believe so. The bottom line is that an AM may not be able to successfully connect to an NM after a restart with stale NM token state.
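The accept/reject behavior described in cases a) through c) can be summarized in a small sketch. This is illustrative only (not the real NMTokenSecretManagerInNM API): keys are Integer ids, and the names are made up for this sketch. The NM accepts a token only if its key id matches the current key, the previous key, or a key still tracked for the application.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative model of why an AM's NMToken only works after restart
// if the key it was signed with survived in the recovered state.
public class TokenCheckSketch {
    Integer currentKeyId;    // refreshed at RM re-registration (case a)
    Integer previousKeyId;   // lost if the stored copy was stale (case b)
    Set<Integer> recoveredAppKeyIds = new HashSet<>();  // case c

    boolean canConnect(int tokenKeyId) {
        return (currentKeyId != null && currentKeyId == tokenKeyId)
            || (previousKeyId != null && previousKeyId == tokenKeyId)
            || recoveredAppKeyIds.contains(tokenKeyId);
    }

    public static void main(String[] args) {
        TokenCheckSketch t = new TokenCheckSketch();
        t.currentKeyId = 7;          // freshly pushed by the RM
        t.recoveredAppKeyIds.add(3); // an app key that was persisted
        System.out.println(t.canConnect(7));  // true: current key
        System.out.println(t.canConnect(3));  // true: recovered app key
        System.out.println(t.canConnect(5));  // false: key lost in stale state
    }
}
```

With stale state, currentKeyId is repaired by re-registration, but an AM holding a key that falls only under previousKeyId or recoveredAppKeyIds and was not persisted gets canConnect == false, which is the "AM cannot connect to the NM" outcome discussed above.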
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063031#comment-14063031 ] Hadoop QA commented on YARN-1341:
--
{color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651342/YARN-1341v6.patch against trunk revision .
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4319//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4319//console
This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047910#comment-14047910 ] Devaraj K commented on YARN-1341:
--
Sorry for coming late here. +1 for limiting the implementation/discussion to the JIRA title and handling other cases in their respective JIRAs.
In addition to option 1), I'd think of bringing the NM down if the NM fails to store RM keys a certain (configurable) number of times consecutively. And also we can make it (i.e. tear down the NM or not) configurable and let users choose whether to enable or disable bringing the NM down on RM key state-store failures.
Similarly, for container/application state-store failures, the NM can mark that container/application as failed and report it to the RM. These can be discussed in more detail in the corresponding JIRAs, YARN-1337 and YARN-1354. However, for all these NM state-store operations, we could think of having retries before throwing the IOException. Thoughts?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14045959#comment-14045959 ] Jason Lowe commented on YARN-1341:
--
Agree it's not ideal to discuss handling state store errors for all NM components in this JIRA. In general I'd prefer to discuss and address each case in the corresponding JIRA, e.g.: application state store errors discussed and addressed in YARN-1354, container state store errors in YARN-1337, etc. If we feel there's significant utility in committing a JIRA before all the issues are addressed then we can file one or more followup JIRAs to track those outstanding issues. That's the normal process we follow with other features/fixes as well.
So if we follow that process then we're back to the discussion about RM master keys not being able to be stored in the state store. The choices we've discussed are:
1) Log an error, update the master key in memory, and continue
2) Log an error, _not_ update the master key in memory, and continue
3) Log an error and tear down the NM
I'd prefer 1) since that is the option that preserves the most work in all scenarios I can think of, and I don't know of a scenario where 2) would handle it better. However I could be convinced given the right scenario. I'd really rather avoid 3) since that seems like a severe way to handle the error and guarantees work is lost.
Oh, there is one more handling scenario we briefly discussed, where we flag the NM as undesirable. When that occurs we don't shoot the containers that are running, but we avoid adding new containers since the node is having issues (i.e.: a drain-decommission). I feel that would be a separate JIRA since it needs YARN-914, and we'd still need to decide how to handle the error until the decommission is complete (i.e.: choice 1 or 2 above).
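Option 1 above can be sketched in a few lines. This is a hypothetical illustration, not the actual NMStateStoreService or NMTokenSecretManagerInNM code: the Store interface and all names here are stand-ins invented for the sketch.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Sketch of option 1: on a state-store write failure, log the error
// but still apply the new RM master key in memory, preserving as much
// running work as possible instead of tearing the NM down.
public class KeyUpdateSketch {
    interface Store { void storeKey(int keyId) throws IOException; }

    private static final Logger LOG = Logger.getLogger("KeyUpdateSketch");
    private final Store store;
    Integer currentKeyId;

    KeyUpdateSketch(Store store) { this.store = store; }

    void onNewMasterKey(int keyId) {
        try {
            store.storeKey(keyId);
        } catch (IOException e) {
            // Option 1: don't tear down the NM; a later key update may
            // persist once the underlying problem (e.g. full disk) clears.
            LOG.severe("Failed to persist master key " + keyId + ": " + e);
        }
        currentKeyId = keyId;  // update in memory regardless
    }

    public static void main(String[] args) {
        KeyUpdateSketch nm =
            new KeyUpdateSketch(k -> { throw new IOException("disk full"); });
        nm.onNewMasterKey(9);
        System.out.println(nm.currentKeyId);  // prints 9
    }
}
```

Option 2 would move the `currentKeyId = keyId;` assignment inside the try block, and option 3 would rethrow from the catch; the trade-off Jason describes is that only option 1 keeps the NM able to validate tokens signed with the newest key while the store is unhealthy.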
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045522#comment-14045522 ] Junping Du commented on YARN-1341: -- [~jlowe], I agree we should treat each state-inconsistency case individually, and it is good that we have already had many discussions covering cases that could go beyond the work in this JIRA. I prefer to continue these discussions in separate JIRAs until we are sure all scenarios are covered properly (without blocking this JIRA's work). What do you think?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14042559#comment-14042559 ] Jason Lowe commented on YARN-1341: -- bq. As far as I know, RM restart doesn't track this because these metrics are recovered during event replay on RM restart. With the current NM restart, some metrics could be lost, e.g. allocatedContainers. I think we should either count them back as part of the events replayed during recovery or persist them. Thoughts? Not all of the RM metrics will be recovered, correct? RPC metrics will be zeroed since those aren't persisted (nor should they be, IMHO). Aggregate containers allocated/released in the queue metrics will be wrong since the RM restart work, by design, doesn't store per-container state. If the cluster stays up long enough then apps submitted/completed/failed/killed will not be correct, as I believe it will only count the applications that haven't been reaped due to retention policies. Anyway this is outside the scope of this JIRA, and I'll file a separate JIRA underneath the YARN-1336 umbrella to discuss what we should do about NM metrics and restart. bq. If so, how about we don't apply these changes until they can be persisted? That way we keep the state store consistent with the NM's current state. Even if we choose to fail the NM, we can still load the state and recover the work. Again I think this is a case-by-case thing. For the RM master key, I'd rather keep going with the current master key and hope the next key update is able to persist (e.g.: a full disk where the state is stored that is later cleared up) rather than ditch the new key update and risk bringing down the NM because it can no longer keep talking to the RM or AMs. 
As I mentioned earlier, the consequence of failing to persist the RM master key or the master key used by an AM is that _if_ the NM happens to restart then some AMs _might_ not be able to authenticate with the NM until they get updated to the new master key. If we take down the NM, or keep going but fail to update the master key in memory, then this seems purely worse. The opportunity for error has widened, but I don't see any advantage gained by doing so. bq. Do we expect some operations to fail while other operations succeed? If this means the persistence layer is briefly unavailable, we can just handle it by adding retries. If not, we should expect the fatal operations to fail soon enough as well, in which case logging the error and moving on for non-fatal operations doesn't make much difference. No? I don't expect immediate retry to help, and if the state store implementation is such that immediate retry is likely to help then the state store implementation should do that directly before throwing the error rather than relying on the upper-layer code to do so. However I do expect there to be common failure modes where the error state is temporary but not in the immediate sense (e.g.: the full disk scenario). And although an NM can't launch containers without a working state store, there's still a lot of useful stuff an NM can do with a broken state store -- report status of active containers, serve up shuffle data, etc. So far I don't think any of the state store updates should result in a teardown of the NM if there is a failure, although please let me know if you have a scenario where we should. 
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14041077#comment-14041077 ] Junping Du commented on YARN-1341: -- bq. Yes, applications should be like containers. If we fail to store an application start in the state store then we should fail the container launch that triggered the application to be added. This already happens in the current patch for YARN-1354. If we fail to store the completion of an application then worst-case we will report an application to the RM on restart that isn't active, and the RM will correct the NM when it re-registers. That makes sense. I guess we should do some additional work to check that the behavior is what we expect. bq. I wasn't planning on persisting metrics during restart, as there are quite a few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be preserved across a restart. Does RM restart do this or are there plans to do so? I think these metrics are important, especially for users' monitoring tools, and we should keep this info consistent across a restart. As far as I know, RM restart doesn't track this because these metrics are recovered during event replay on RM restart. With the current NM restart, some metrics could be lost, e.g. allocatedContainers. I think we should either count them back as part of the events replayed during recovery or persist them. Thoughts? bq. Therefore I don't believe the effort to maintain a stale tag is going to be worth it. Also if we refuse to load a state store that's stale then we are going to leak containers because we won't try to recover anything from a stale state store. If so, how about we don't apply these changes until they can be persisted? That way we keep the state store consistent with the NM's current state. Even if we choose to fail the NM, we can still load the state and recover the work. bq. 
Instead I think we should decide in the various store failure cases whether the error should be fatal to the operation (which may lead to it being fatal to the NM overall) or if we feel the recovery with stale information is a better outcome than taking the NM down. In the latter case we should just log the error and move on. Do we expect some operations to fail while other operations succeed? If this means the persistence layer is briefly unavailable, we can just handle it by adding retries. If not, we should expect the fatal operations to fail soon enough as well, in which case logging the error and moving on for non-fatal operations doesn't make much difference. No?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039138#comment-14039138 ] Jason Lowe commented on YARN-1341: -- bq. The worst case, it seems to me, is: the NM restarts with partial state recovered, and this inconsistent state is not visible to running containers, which could cause some weird bugs. Yes, you're correct. The worst case is likely where we come up, fail to realize a container is running, and therefore the container leaks. I think we should handle store errors on a case-by-case basis, based on the ramifications of how the system will recover without that information. For containers, a container should fail to launch if state store errors occur while recording the container request and/or the container launch. The YARN-1336 prototype patch already does this when containers are requested and in ContainerLaunch. That way the worst-case scenario for a container is that we throw an error for the container request or the container fails before launch due to a state store error. We failed to launch a container, but the whole NM doesn't go down. If we fail to mark the container completed in the store then the worst-case scenario is that we try to recover a container that isn't there, which again will mark the container as failed and we'll report that to the RM. If the RM doesn't know about the failed container (because the container/app is long gone) then it will just ignore it. For the deletion service, if we fail to update the store then we may fail to delete something when we recover, if we happened to restart in between. If we ignore the error then it's very likely the NM will _not_ restart before the deletion time expires and the file is deleted. However if we tear down the NM on a store error then we will also fail to delete it when the NM restarts later since we failed to record it, meaning we made things purely worse -- we lost work _and_ leaked the thing we were supposed to delete. 
Therefore for deletion tasks I think the current behavior is appropriate. For localized resources, failing to update the store means we could end up leaking a resource or thinking a resource is there when it's really not. The latter isn't a huge problem because when we try to reference the resource again it checks whether it's there, and if it isn't it re-localizes it. Not knowing a resource is there is a bigger issue, and there are a couple of ways to tackle that one -- either fail the localization of the resource when the state store error occurs, or have the NM scan the local resource directories for unknown resources when it recovers. For the RM master key, I see it as very similar to the deletion task case. If we fail to store it then the NM will update it in memory and can keep going. If we restart without recovering an older key (the current key will be obtained when the NM re-registers with the RM) then we may fail to let AMs connect that only have an older key. Containers that were still on the NM will still continue. If we take down the NM when the store hiccups then we lose work, which seems worse than the possibility that an AM could fail to connect to the NM (which can and does already happen today due to network cuts, etc.) bq. Maybe we add a stale tag on the NMStateStore, mark it when a store failure happens, and never load a stale store. If we had an error storing then we're likely to have the same error trying to store a stale tag, or am I misunderstanding the proposal? Also as I mentioned above, there are many cases where a partial recovery isn't a bad thing, as the system can recover via other means (e.g.: trying to recover a container that already completed should be benign, trying to delete a container directory that's already deleted is benign, etc.). 
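The second option mentioned above, having the NM scan the local resource directories for resources its recovered state doesn't know about, might look roughly like the following. The class and method names are illustrative, not from the patch:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative sketch: compare a local resource directory against the
 *  set of resources recovered from the state store. */
class OrphanResourceScanner {
  /** Return entries present on disk but absent from the recovered
   *  state; each one is a candidate leak to delete or re-register. */
  static List<Path> findUnknown(Path localDir, Set<Path> knownResources)
      throws IOException {
    List<Path> unknown = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(localDir)) {
      for (Path entry : entries) {
        if (!knownResources.contains(entry)) {
          unknown.add(entry);
        }
      }
    }
    return unknown;
  }
}
```

Such a scan trades a one-time recovery cost for not having to make every localization store update fatal.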
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039206#comment-14039206 ] Junping Du commented on YARN-1341: -- Thanks [~jlowe] for the detailed explanation here! I totally agree that we should deal with this case by case, and I appreciate your analysis of the cases above. I think there are still other cases we should double-check; some of them may suffer more from inconsistency. - Application state - If we fail to store an application update, e.g. from init to finished, then we get the wrong application state after recovery. - NodeManagerMetrics - The NM's metrics will get messed up if partially updated. (We don't have a JIRA to store/recover this yet, do we?) On the side effect of bringing the NM down, as in the deletion-service case: I think we can just clean up these directories (as we plan to do in node decommission cases). About the stale tag on the NMStateStore - I don't mean to put it in the NMStateStore, but I haven't thought through where to put it - maybe we can persist it on local disk directly, or send it to the RM and retrieve it during NM registration?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14039235#comment-14039235 ] Jason Lowe commented on YARN-1341: -- bq. Application state - If we fail to store an application update, e.g. from init to finished, then we get the wrong application state after recovery. Yes, applications should be like containers. If we fail to store an application start in the state store then we should fail the container launch that triggered the application to be added. This already happens in the current patch for YARN-1354. If we fail to store the completion of an application then worst-case we will report an application to the RM on restart that isn't active, and the RM will correct the NM when it re-registers. bq. NodeManagerMetrics - The NM's metrics will get messed up if partially updated. I wasn't planning on persisting metrics during restart, as there are quite a few (e.g.: RPC metrics, etc.), and I'm not sure it's critical that they be preserved across a restart. Does RM restart do this or are there plans to do so? bq. About the stale tag on the NMStateStore - I don't mean to put it in the NMStateStore, but I haven't thought through where to put it - maybe we can persist it on local disk directly, or send it to the RM and retrieve it during NM registration? I think in most cases the attempt to update the stale tag, even if it's separate from the NMStateStore, will often fail in a similar way when the state store fails (e.g.: full local disk, read-only filesystem, etc.). Therefore I don't believe the effort to maintain a stale tag is going to be worth it. Also if we refuse to load a state store that's stale then we are going to leak containers because we won't try to recover anything from a stale state store. 
Instead I think we should decide in the various store failure cases whether the error should be fatal to the operation (which may lead to it being fatal to the NM overall) or if we feel the recovery with stale information is a better outcome than taking the NM down. In the latter case we should just log the error and move on.
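The per-operation decision described above could be captured in a small policy table. This is a minimal sketch; the enum, its members, and the classification (only the container request/launch path fails outright, per the earlier discussion of containers, deletion tasks, and master keys) are illustrative, not from the committed code:

```java
/** Sketch: each kind of state-store update declares whether a store
 *  failure should fail the triggering operation or just be logged. */
class StoreFailurePolicy {
  enum StoreOp {
    CONTAINER_REQUEST,   // failing to record a requested container
    CONTAINER_COMPLETED, // failing to record a finished container
    DELETION_TASK,       // failing to record a pending deletion
    MASTER_KEY           // failing to record an RM master key update
  }

  /** Per the discussion above: only the container request/launch path
   *  should fail outright; the rest log the error and move on. */
  static boolean isFatalToOperation(StoreOp op) {
    return op == StoreOp.CONTAINER_REQUEST;
  }
}
```

Centralizing the choice this way keeps the fatal-versus-log decision auditable in one place instead of scattered across callers.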
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037656#comment-14037656 ] Junping Du commented on YARN-1341: -- bq. I'm not sure I understand what you're requesting. Recovering the NM tokens is one line of code (3 if we count the if canRecover part), and recovering the container tokens in YARN-1342 will add one more line for that (inside the same if canRecover block). I went ahead and factored this into a separate method, however I'm not sure it matches what you were expecting as I don't see where we're saving duplicated code. If what's in the updated patch isn't what you expected, please provide some sample pseudo-code to demonstrate how we can avoid duplication of code. I think it is fine for now. However, I would like to refactor NodeManager#serviceInit() a bit once we finish all this recovery work, to avoid some duplicate work; in some code, like createNMContext(), we set some handlers more than once. Anyway, we can do this later. bq. The problem with throwing an exception is what to do with the exception – do we take down the NM? That seems like a drastic answer since the NM will likely chug along just fine without the key stored. It only becomes a problem when the NM restarts and restores an old key. However if we rollback the old key here then we take that only-breaks-if-we-happened-to-restart case and make it an always-breaks scenario. Eventually the old key will no longer be valid to the RM, and none of the AMs will be able to authenticate to the NM. Therefore I thought it would be better to log the error, press onward, and hope we don't restart before we store a valid key again (maybe the store error was transient) rather than either take down the NM or have things start failing even without a restart. We already have a similar tradeoff on the RM side: if any exception happens in the RMStore, it brings down the RM. 
In the NM case, if leveldb stops working, I think we should bring the NM down to avoid any inconsistency after an NM restart. I am not sure what weird things could happen from inconsistency here, but considering it is cheaper to bring down an NM, we should play it safer in our case than in the RM's. Actually, I brought up some thoughts on playing it riskier on the RM side in YARN-2019, which targets reducing RM service downtime. But here, I prefer to be safer. Jason, what do you think?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14037868#comment-14037868 ] Junping Du commented on YARN-1341: -- bq. Restarts should be rare, and I'd rather not force a loss of work by taking the NM down instantly when the state store hiccups. Yes, but considering the rolling-upgrade case, restarts should be much more frequent than state store failures (correct me if I am wrong, as I am not a leveldb expert). In that case we always expect some work loss, since even if we don't bring the NM down now, we will suffer after the NM restarts during an upgrade. bq. If the state store is missing some things, we might not be able to recover a localized resource, a token, a container, or possibly anything at all. I am not worried about losing them all, but if we can only partially recover these, would that become a problem and break some assumptions we have? I don't know, but it seems to make things more complicated. bq. in the worst-case, the state store is so corrupted on startup that we don't even survive the NM restart and the NM crashes, which would have an end result just like if we took it down when the state store failed. I am not sure this is the worst case. The worst case, it seems to me, is: the NM restarts with partial state recovered, and this inconsistent state is not visible to running containers, which could cause some weird bugs. I am not sure how likely that is here, please correct me if I am wrong. bq. Therefore I'd rather not guarantee that we'll lose work by crashing the NM on any store error and instead try to preserve the work we have. The NM could theoretically recover (e.g.: if the error is transient then the next RM key store could succeed). If we take the NM down immediately then we're guaranteeing the work is lost. Is that really better? I think it is better to guarantee the work is lost, so that the expectation presented to users is consistent. 
We don't know when a new token from the RM will arrive to refresh the stale one, so preserving work would only succeed by luck. Users shouldn't expect work to still be preserved after an NM restart if the state store failed at some point. bq. May be a better approach is to have errors like this trigger an unhealthy state for the NM when we have the ability to do a graceful decommission. I agree. This could be a better approach. Overall, I agree that we can just log the error here without bringing the NM down (otherwise we would have to change the existing code that updates localizedResources/deletionServices), for the reasons you specified above. However, to avoid loading inconsistent state and to manage users' expectations, I think we shouldn't allow the state to be loaded again if some store failure happened earlier. Maybe we add a stale tag on the NMStateStore, mark it when a store failure happens, and never load a stale store. [~jlowe], what do you think?
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036592#comment-14036592 ] Junping Du commented on YARN-1341: -- Thanks for updating the patch, [~jlowe]! Some minor comments: The change in BaseContainerTokenSecretManager.java is not necessary, and I believe it belongs to YARN-1342. Let's remove it from this patch. In NodeManager.java,
{code}
 NMTokenSecretManagerInNM nmTokenSecretManager =
-    new NMTokenSecretManagerInNM();
+    new NMTokenSecretManagerInNM(nmStore);
+
+if (nmStore.canRecover()) {
+  nmTokenSecretManager.recover(nmStore.loadNMTokenState());
+}
{code}
Can we consolidate this code into a separate method, together with the NMContainerTokenSecretManager, since we will do a similar thing to recover the ContainerToken state and the code would otherwise be duplicated? In NMTokenSecretManagerInNM.java,
{code}
+  private void updateCurrentMasterKey(MasterKeyData key) {
+    super.currentMasterKey = key;
+    try {
+      stateStore.storeNMTokenCurrentMasterKey(key.getMasterKey());
+    } catch (IOException e) {
+      LOG.error("Unable to update current master key in state store", e);
+    }
+  }
+
+  private void updatePreviousMasterKey(MasterKeyData key) {
+    previousMasterKey = key;
+    try {
+      stateStore.storeNMTokenPreviousMasterKey(key.getMasterKey());
+    } catch (IOException e) {
+      LOG.error("Unable to update previous master key in state store", e);
+    }
+  }
{code}
Is logging an error here enough in case the store fails? If the master key is updated in memory but not persisted, it could cause some inconsistency when we recover it. I think we should throw an exception here if the store fails, and roll back the key we just set. Thoughts? 
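The throw-and-rollback alternative proposed above could be sketched as follows. Persisting the key before mutating memory achieves the same end without explicit rollback bookkeeping: a store failure leaves the old key current both on disk and in memory. This is a self-contained illustration; the KeyStore interface and class names are stand-ins, not the real NMStateStoreService wiring:

```java
import java.io.IOException;

/** Illustrative stand-in for the state-store dependency. */
interface KeyStore {
  void storeCurrentMasterKey(byte[] key) throws IOException;
}

/** Sketch of the alternative: persist first, then update the in-memory
 *  key, so memory and disk can never disagree after a store failure. */
class RollbackKeyUpdater {
  private final KeyStore stateStore;
  private byte[] currentMasterKey;

  RollbackKeyUpdater(KeyStore stateStore) {
    this.stateStore = stateStore;
  }

  void updateCurrentMasterKey(byte[] newKey) throws IOException {
    // If this throws, the old key remains current everywhere and the
    // exception propagates to the caller to handle (or take the NM down).
    stateStore.storeCurrentMasterKey(newKey);
    currentMasterKey = newKey;
  }

  byte[] getCurrentMasterKey() {
    return currentMasterKey;
  }
}
```

The trade-off debated in the following comments is exactly whether propagating that exception (and possibly stopping the NM) is better than updating memory anyway and merely logging the store failure.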
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14036853#comment-14036853 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12651342/YARN-1341v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4022//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4022//console This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034021#comment-14034021 ] Junping Du commented on YARN-1341: -- [~jlowe], thanks for the patch here. I am currently reviewing it, and it looks like some of the code, e.g. LeveldbIterator and NMStateStoreService, has already been committed in other patches. Would you resync the patch against trunk? Thanks!
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14034588#comment-14034588 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12650914/YARN-1341v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/4017//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/4017//console This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13985928#comment-13985928 ] Hadoop QA commented on YARN-1341: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12642696/YARN-1341v4-and-YARN-1987.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3667//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3667//console This message is automatically generated.
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13963581#comment-13963581 ]

Hadoop QA commented on YARN-1341:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12639280/YARN-1341v3.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3534//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3534//console

This message is automatically generated.

Recover NMTokens upon nodemanager restart
-----------------------------------------

                Key: YARN-1341
                URL: https://issues.apache.org/jira/browse/YARN-1341
            Project: Hadoop YARN
         Issue Type: Sub-task
         Components: nodemanager
   Affects Versions: 2.3.0
           Reporter: Jason Lowe
           Assignee: Jason Lowe
        Attachments: YARN-1341.patch, YARN-1341v2.patch, YARN-1341v3.patch

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923276#comment-13923276 ]

Hadoop QA commented on YARN-1341:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12633251/YARN-1341.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 11 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3283//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3283//console

This message is automatically generated.

Recover NMTokens upon nodemanager restart
-----------------------------------------

                Key: YARN-1341
                URL: https://issues.apache.org/jira/browse/YARN-1341
            Project: Hadoop YARN
         Issue Type: Sub-task
         Components: nodemanager
   Affects Versions: 2.3.0
           Reporter: Jason Lowe
           Assignee: Jason Lowe
        Attachments: YARN-1341.patch

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (YARN-1341) Recover NMTokens upon nodemanager restart
[ https://issues.apache.org/jira/browse/YARN-1341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13923372#comment-13923372 ]

Hadoop QA commented on YARN-1341:
---------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12633265/YARN-1341v2.patch
against trunk revision .

{color:green}+1 @author{color}. The patch does not contain any @author tags.

{color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files.

{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

{color:green}+1 javadoc{color}. There were no new javadoc warning messages.

{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.

{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

{color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager.

{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/3285//testReport/
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/3285//console

This message is automatically generated.

Recover NMTokens upon nodemanager restart
-----------------------------------------

                Key: YARN-1341
                URL: https://issues.apache.org/jira/browse/YARN-1341
            Project: Hadoop YARN
         Issue Type: Sub-task
         Components: nodemanager
   Affects Versions: 2.3.0
           Reporter: Jason Lowe
           Assignee: Jason Lowe
        Attachments: YARN-1341.patch, YARN-1341v2.patch

--
This message was sent by Atlassian JIRA
(v6.2#6252)