[jira] [Updated] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-1879:
Attachment: YARN-1879.24.patch
Updated:
* Renamed TestApplicationMasterServiceProtocolOnHA#initiate to initialize().
* For registerApplicationMaster(), added a test case for a duplicated request.
{quote} We should probably move the following check in finishApplicationMaster call upfront so that ApplicationDoesNotExistInCacheException is not thrown for already completed apps. {quote}
In this case, the token for the application has already expired after the AM's container finishes, so I think we don't need to handle it. [~jianhe], what do you think?
Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
Key: YARN-1879
URL: https://issues.apache.org/jira/browse/YARN-1879
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Jian He
Assignee: Tsuyoshi OZAWA
Priority: Critical
Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.11.patch, YARN-1879.12.patch, YARN-1879.13.patch, YARN-1879.14.patch, YARN-1879.15.patch, YARN-1879.16.patch, YARN-1879.17.patch, YARN-1879.18.patch, YARN-1879.19.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.20.patch, YARN-1879.21.patch, YARN-1879.22.patch, YARN-1879.23.patch, YARN-1879.23.patch, YARN-1879.24.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch, YARN-1879.6.patch, YARN-1879.7.patch, YARN-1879.8.patch, YARN-1879.9.patch
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
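The discussion above is about making registerApplicationMaster safe to retry across an RM failover, so a duplicated request gets the same answer instead of an error. The following is a minimal sketch of that idea, not the actual YARN code: the class and response strings are illustrative, and the real implementation works on protocol-buffer request/response objects.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of an idempotent register call: cache the first
// response per application attempt, so a retried (duplicated) request
// after a failover receives the identical response.
public class IdempotentRegisterSketch {
    private final Map<String, String> responses = new ConcurrentHashMap<>();

    public String registerApplicationMaster(String appAttemptId) {
        // computeIfAbsent builds the response once; a duplicated request
        // for the same attempt returns the cached response unchanged.
        return responses.computeIfAbsent(appAttemptId,
                id -> "RegisterResponse for " + id);
    }

    public static void main(String[] args) {
        IdempotentRegisterSketch rm = new IdempotentRegisterSketch();
        String first = rm.registerApplicationMaster("appattempt_1");
        String retry = rm.registerApplicationMaster("appattempt_1");
        if (!first.equals(retry)) {
            throw new AssertionError("retry must be idempotent");
        }
        System.out.println("duplicated register returned an identical response");
    }
}
```

A test for a duplicated request, as added in the patch, would essentially assert exactly this: calling register twice for the same attempt yields equal responses rather than an exception.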
[jira] [Commented] (YARN-1879) Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol for RM fail over
[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168061#comment-14168061 ] Hadoop QA commented on YARN-1879:
{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12674346/YARN-1879.24.patch against trunk revision 554250c.
{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files.
{color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:red}-1 release audit{color}. The applied patch generated 1 release audit warning.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager.TestAMRestart
{color:green}+1 contrib tests{color}. The patch passed contrib unit tests.
Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5370//testReport/
Release audit warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5370//artifact/patchprocess/patchReleaseAuditProblems.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5370//console
This message is automatically generated.
[jira] [Commented] (YARN-2494) Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168102#comment-14168102 ] Hudson commented on YARN-2494:
SUCCESS: Integrated in Hadoop-Yarn-trunk #708 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/708/])
YARN-2494. Added NodeLabels Manager internal API and implementation. Contributed by Wangda Tan. (vinodkv: rev db7f1653198b950e89567c06898d64f6b930a0ee)
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/NodeLabelsStoreEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/RemoveClusterNodeLabels.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/DummyRMNodeLabelsManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/StoreNewClusterNodeLabels.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestFileSystemNodeLabelsStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/CommonNodeLabelsManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/NodeLabelsStoreEventType.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/NodeLabelsStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/NodeLabelTestBase.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/FileSystemNodeLabelsStore.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/TestRMNodeLabelsManager.java
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/DummyCommonNodeLabelsManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/nodelabels/TestCommonNodeLabelsManager.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/YarnConfiguration.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/nodelabels/event/UpdateNodeToLabelsMappingsEvent.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/nodelabels/RMNodeLabelsManager.java
Node label manager API and storage implementations
Key: YARN-2494
URL: https://issues.apache.org/jira/browse/YARN-2494
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
Fix For: 2.6.0
Attachments: YARN-2494.20141009-1.patch, YARN-2494.20141009-2.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch, YARN-2494.patch
This JIRA includes the APIs and storage implementations of the node label manager. NodeLabelManager is an abstract class used to manage labels of nodes in the cluster. It has APIs to query/modify:
- Nodes according to a given label
- Labels according to a given hostname
- Add/remove labels
- Set labels of nodes in the cluster
- Persist/recover changes of labels/labels-on-nodes to/from storage
And it has two implementations to store modifications:
- Memory based storage: it does not persist changes, so all labels are lost when the RM restarts
- FileSystem based storage: it persists/recovers to/from a FileSystem (like HDFS), and all labels and labels-on-nodes are recovered upon RM restart
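The two storage behaviors described above can be sketched as a small interface with two implementations. This is an illustrative sketch, not the actual CommonNodeLabelsManager/NodeLabelsStore code: a static in-memory map stands in for HDFS, and the class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the two node-label stores: a memory store whose
// labels vanish on "restart", and a persistent store that recovers them.
public class NodeLabelsStoreSketch {
    interface Store {
        void persist(Set<String> labels);
        Set<String> recover(); // what survives an RM restart
    }

    static class MemoryStore implements Store {
        public void persist(Set<String> labels) { /* nothing written anywhere */ }
        public Set<String> recover() { return new HashSet<>(); } // all lost
    }

    static class FileLikeStore implements Store {
        // A static set stands in for a file on HDFS in this sketch.
        static final Set<String> backing = new HashSet<>();
        public void persist(Set<String> labels) { backing.addAll(labels); }
        public Set<String> recover() { return new HashSet<>(backing); }
    }

    public static void main(String[] args) {
        Set<String> labels = new HashSet<>(Arrays.asList("gpu", "ssd"));
        Store mem = new MemoryStore();
        Store fs = new FileLikeStore();
        mem.persist(labels);
        fs.persist(labels);
        // After a simulated restart, only the persistent store recovers labels.
        System.out.println("memory store recovers: " + mem.recover());
        System.out.println("filesystem store recovers: " + fs.recover());
    }
}
```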
[jira] [Commented] (YARN-2501) Changes in AMRMClient to support labels
[ https://issues.apache.org/jira/browse/YARN-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168160#comment-14168160 ] Hudson commented on YARN-2501:
FAILURE: Integrated in Hadoop-Hdfs-trunk #1898 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/])
YARN-2501. Enhanced AMRMClient library to support requests against node labels. Contributed by Wangda Tan. (vinodkv: rev a5ec3d080978a67837946a991843a081ea712539)
* hadoop-yarn-project/CHANGES.txt
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/test/java/org/apache/hadoop/yarn/client/api/impl/TestAMRMClient.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/AMRMClient.java
* hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client/src/main/java/org/apache/hadoop/yarn/client/api/impl/AMRMClientImpl.java
Changes in AMRMClient to support labels
Key: YARN-2501
URL: https://issues.apache.org/jira/browse/YARN-2501
Project: Hadoop YARN
Issue Type: Sub-task
Components: resourcemanager
Reporter: Wangda Tan
Assignee: Wangda Tan
Fix For: 2.6.0
Attachments: YARN-2501-20141009.1.patch, YARN-2501-20141009.2.patch, YARN-2501.patch
Changes in AMRMClient to support labels
[jira] [Commented] (YARN-2494) Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168162#comment-14168162 ] Hudson commented on YARN-2494:
FAILURE: Integrated in Hadoop-Hdfs-trunk #1898 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1898/])
YARN-2494. Added NodeLabels Manager internal API and implementation. Contributed by Wangda Tan. (vinodkv: rev db7f1653198b950e89567c06898d64f6b930a0ee)
[jira] [Commented] (YARN-2494) Node label manager API and storage implementations
[ https://issues.apache.org/jira/browse/YARN-2494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168194#comment-14168194 ] Hudson commented on YARN-2494:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1923 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1923/])
YARN-2494. Added NodeLabels Manager internal API and implementation. Contributed by Wangda Tan. (vinodkv: rev db7f1653198b950e89567c06898d64f6b930a0ee)
[jira] [Commented] (YARN-2501) Changes in AMRMClient to support labels
[ https://issues.apache.org/jira/browse/YARN-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168192#comment-14168192 ] Hudson commented on YARN-2501:
FAILURE: Integrated in Hadoop-Mapreduce-trunk #1923 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1923/])
YARN-2501. Enhanced AMRMClient library to support requests against node labels. Contributed by Wangda Tan. (vinodkv: rev a5ec3d080978a67837946a991843a081ea712539)
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168278#comment-14168278 ] Karthik Kambatla commented on YARN-2641:
Thanks for clarifying it, Zhihai. Verified that if we decommission the node after the NM goes down, the tasks are rescheduled only when the liveliness monitor kicks in.
improve node decommission latency in RM.
Key: YARN-2641
URL: https://issues.apache.org/jira/browse/YARN-2641
Project: Hadoop YARN
Issue Type: Improvement
Components: resourcemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Attachments: YARN-2641.000.patch, YARN-2641.001.patch
Improve node decommission latency in the RM. Currently, node decommission happens only after the RM receives a nodeHeartbeat from the Node Manager. The node heartbeat interval is configurable, with a default value of 1 second. It would be better to do the decommission during the RM refresh (NodesListManager) instead of at nodeHeartbeat (ResourceTrackerService). This can become a much more serious issue: after the RM is refreshed (refreshNodes), if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode is never decommissioned in the RM; it only expires after yarn.nm.liveness-monitor.expiry-interval-ms (default value 10 minutes).
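The proposal above can be sketched as moving the decommission transition from the heartbeat path to the refresh path. This is an illustrative sketch under invented names, not the actual RM/NodesListManager code; the point is only that the state change happens at refreshNodes time, so a node whose NM is already dead still gets decommissioned immediately instead of after the 10-minute liveness expiry.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: decommission excluded nodes directly when the
// exclude list is refreshed, rather than waiting for the next heartbeat.
public class DecommissionOnRefreshSketch {
    private final Map<String, String> nodeStates = new HashMap<>(); // nodeId -> state

    public void addNode(String nodeId) {
        nodeStates.put(nodeId, "RUNNING");
    }

    // Invoked from the refresh path (NodesListManager in the real RM),
    // not from the heartbeat path (ResourceTrackerService).
    public void refreshNodes(Set<String> excludeList) {
        for (String nodeId : excludeList) {
            if (nodeStates.containsKey(nodeId)) {
                nodeStates.put(nodeId, "DECOMMISSIONED");
            }
        }
    }

    public String stateOf(String nodeId) {
        return nodeStates.get(nodeId);
    }

    public static void main(String[] args) {
        DecommissionOnRefreshSketch rm = new DecommissionOnRefreshSketch();
        rm.addNode("node1:8041");
        // The NM on node1 may already be dead; no heartbeat is required
        // for the transition in this design.
        rm.refreshNodes(new HashSet<>(Arrays.asList("node1:8041")));
        System.out.println("node1 state: " + rm.stateOf("node1:8041"));
    }
}
```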
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566:
Attachment: YARN-2566.007.patch
IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
Key: YARN-2566
URL: https://issues.apache.org/jira/browse/YARN-2566
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager
Affects Versions: 2.5.0
Reporter: zhihai xu
Assignee: zhihai xu
Priority: Critical
Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch
startLocalizer in DefaultContainerExecutor only uses the first localDir to copy the token file. If the copy fails for the first localDir because it does not have enough disk space, localization fails even though there is plenty of disk space in the other localDirs. We see the following error for this case:
{code}
2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004
java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed
    at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062)
    at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157)
    at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189)
    at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721)
    at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed
java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist
    at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724)
    at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501)
    at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111)
    at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76)
    at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
    at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390)
    at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677)
    at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673)
    at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90)
    at org.apache.hadoop.fs.FileContext.create(FileContext.java:673)
    at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021)
    at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987)
2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED
2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
{code}
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168282#comment-14168282 ] zhihai xu commented on YARN-2566:
Optimized the code in YARN-2566.007.patch to exclude leading directories with zero available space on disk, to address item 3.
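The fix direction discussed in this issue can be sketched as follows. This is a hypothetical illustration, not the actual DefaultContainerExecutor code: the method and parameter names are invented, and the available-space figures are supplied as a map for the example, where the real NodeManager would query the filesystem.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: instead of always copying the token file to the
// first localDir, skip dirs whose available space is below the needed
// size and use the first dir that fits.
public class LocalDirChooserSketch {
    static String chooseLocalDir(List<String> dirs,
                                 Map<String, Long> availableSpace,
                                 long neededBytes) {
        for (String dir : dirs) {
            // Skip full (or unknown) disks rather than failing outright.
            if (availableSpace.getOrDefault(dir, 0L) >= neededBytes) {
                return dir;
            }
        }
        throw new RuntimeException("No localDir with enough disk space");
    }

    public static void main(String[] args) {
        Map<String, Long> space = new HashMap<>();
        space.put("/hadoop/d1", 0L);         // full disk: the old code would fail here
        space.put("/hadoop/d2", 1_000_000L); // plenty of space
        String chosen = chooseLocalDir(
                Arrays.asList("/hadoop/d1", "/hadoop/d2"), space, 1024L);
        System.out.println("chosen localDir: " + chosen);
    }
}
```

With the original first-dir-only behavior, the /hadoop/d1 case above is exactly the reported failure: localization aborts even though /hadoop/d2 has plenty of room.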
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168283#comment-14168283 ] zhihai xu commented on YARN-2641:
[~kasha], thanks for spending so much effort to verify the issue.
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14168327#comment-14168327 ] Karthik Kambatla commented on YARN-2566: The patch doesn't compile against latest trunk: WindowsSecureContainerExecutor extends DefaultContainerExecutor. The patch looks okay to me, otherwise. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Priority: Critical Attachments: YARN-2566.000.patch, YARN-2566.001.patch, YARN-2566.002.patch, YARN-2566.003.patch, YARN-2566.004.patch, YARN-2566.005.patch, YARN-2566.006.patch, YARN-2566.007.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at 
org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger:
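One way to address the single-localDir failure described above is to try each local directory in turn and fall back on IOException, rather than failing on the first. A minimal sketch of that fallback loop, with hypothetical method names rather than the actual DefaultContainerExecutor patch:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;

public class LocalDirFallbackSketch {
    // Try to copy the token file into each candidate localDir in order,
    // returning the destination that succeeded, instead of failing when
    // the first directory is full or unwritable.
    static Path copyTokenFile(Path tokenSrc, List<Path> localDirs) throws IOException {
        IOException last = null;
        for (Path dir : localDirs) {
            try {
                Files.createDirectories(dir);
                Path dst = dir.resolve(tokenSrc.getFileName());
                Files.copy(tokenSrc, dst, StandardCopyOption.REPLACE_EXISTING);
                return dst;
            } catch (IOException e) {
                last = e; // e.g. "No space left on device" -- try the next dir
            }
        }
        throw last != null ? last : new IOException("no localDirs configured");
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("container", ".tokens");
        Path badDir = Paths.get("/proc/unwritable/d0");  // first dir fails (cannot mkdir)
        Path goodDir = Files.createTempDirectory("d1");  // second dir works
        Path dst = copyTokenFile(src, List.of(badDir, goodDir));
        if (!Files.exists(dst)) throw new AssertionError();
        System.out.println("copied to " + dst);
    }
}
```

The demo relies on mkdir under /proc failing on Linux to simulate a full first localDir; the point is only the ordering of attempts and the preservation of the last IOException.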
[jira] [Commented] (YARN-2641) improve node decommission latency in RM.
[ https://issues.apache.org/jira/browse/YARN-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168333#comment-14168333 ] Karthik Kambatla commented on YARN-2641: Patch looks good to me, except for the following: - Are the changes to ResourceTrackerService#registerNodeManager required? NodeListManager#isValidNode is synchronized on hostsReader, and that should be sufficient. No? improve node decommission latency in RM. Key: YARN-2641 URL: https://issues.apache.org/jira/browse/YARN-2641 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2641.000.patch, YARN-2641.001.patch Currently, node decommission only happens after the RM receives a nodeHeartbeat from the NodeManager. The heartbeat interval is configurable, with a default of 1 second. It would be better to do the decommission during RM refresh (NodesListManager) instead of in nodeHeartbeat (ResourceTrackerService). The issue can be much more serious: after refreshNodes, if the NM to be decommissioned is killed before it sends a heartbeat to the RM, the RMNode will never be decommissioned in the RM; it will only expire after yarn.nm.liveness-monitor.expiry-interval-ms (default 10 minutes).
[jira] [Commented] (YARN-2668) yarn-registry JAR won't link against ZK 3.4.5
[ https://issues.apache.org/jira/browse/YARN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168453#comment-14168453 ] Steve Loughran commented on YARN-2668: -- The release audit warning is RAT complaining that the 0-byte file resources/.keep doesn't have an ASF header. yarn-registry JAR won't link against ZK 3.4.5 - Key: YARN-2668 URL: https://issues.apache.org/jira/browse/YARN-2668 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2668-001.patch, YARN-2668-002.patch Original Estimate: 0.5h Remaining Estimate: 0.5h It's been reported that the registry code doesn't link against ZK 3.4.5, as the enable/disable SASL client property isn't there; it went in with ZOOKEEPER-1657. Pulling in the constant and the {{isEnabled()}} check will ensure registry linkage, even though the ability for a client to disable SASL auth will be lost.
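The linkage-safe approach described above, declaring the property name locally rather than referencing a constant that only exists in ZK >= 3.4.6, might look like the following sketch. The class name is hypothetical; the property value mirrors ZooKeeper's `zookeeper.sasl.client` system property introduced by ZOOKEEPER-1657.

```java
// Hypothetical sketch: duplicate the property key locally so the class
// still links against ZK 3.4.5, which lacks the 3.4.6 constant.
public class SaslClientCheckSketch {
    // Same value as ZK 3.4.6's client-SASL toggle property.
    static final String ENABLE_CLIENT_SASL_KEY = "zookeeper.sasl.client";
    static final String ENABLE_CLIENT_SASL_DEFAULT = "true";

    // Local equivalent of the isEnabled() check, with no compile-time
    // dependency on ZK 3.4.6 classes.
    static boolean isClientSaslEnabled() {
        return Boolean.parseBoolean(
            System.getProperty(ENABLE_CLIENT_SASL_KEY, ENABLE_CLIENT_SASL_DEFAULT));
    }

    public static void main(String[] args) {
        // Defaults to enabled when the property is unset.
        if (!isClientSaslEnabled()) throw new AssertionError();
        System.setProperty(ENABLE_CLIENT_SASL_KEY, "false");
        if (isClientSaslEnabled()) throw new AssertionError();
        System.out.println("SASL client toggle works without ZK 3.4.6 constants");
    }
}
```

The trade-off is exactly the one noted in the issue: with the constant duplicated, linkage is safe on 3.4.5, but on that version setting the property has no effect on the actual ZK client.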
[jira] [Commented] (YARN-2616) Add CLI client to the registry to list/view entries
[ https://issues.apache.org/jira/browse/YARN-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168458#comment-14168458 ] Steve Loughran commented on YARN-2616: -- The core code went in with YARN-2652, but without tests and documentation it effectively doesn't exist. Please add them. SLIDER-365 implements a resolve operation; the lessons from using that can help define what is needed here. A key early learning is SLIDER-306: the path {{~}} should refer to the user's home dir. That means a path like {{~/services/hbase}} is valid across all users, so it is ideal for testing, scripting, etc. Add CLI client to the registry to list/view entries --- Key: YARN-2616 URL: https://issues.apache.org/jira/browse/YARN-2616 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Steve Loughran Assignee: Akshay Radia Attachments: YARN-2616-003.patch, yarn-2616-v1.patch, yarn-2616-v2.patch The registry needs a CLI interface.
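The {{~}} convention from SLIDER-306, expanding a leading {{~}} to the current user's registry home so that {{~/services/hbase}} works identically for every user, might be resolved along these lines. This is a hypothetical helper, not the actual registry code, and the {{/users/<name>}} home layout is an assumption for illustration.

```java
public class RegistryHomeResolveSketch {
    // Expand a leading "~" to the per-user home path so that a path
    // such as "~/services/hbase" is valid across all users.
    static String resolve(String path, String user) {
        if (path.equals("~") || path.startsWith("~/")) {
            // Assumed per-user home layout: /users/<name>
            return "/users/" + user + path.substring(1);
        }
        return path; // absolute paths pass through untouched
    }

    public static void main(String[] args) {
        String p = resolve("~/services/hbase", "alice");
        if (!p.equals("/users/alice/services/hbase")) throw new AssertionError(p);
        if (!resolve("~", "alice").equals("/users/alice")) throw new AssertionError();
        if (!resolve("/services/hbase", "alice").equals("/services/hbase"))
            throw new AssertionError();
        System.out.println(p);
    }
}
```

Because the expansion depends only on the caller's identity, the same scripted path works unchanged for every user, which is what makes it convenient for testing and scripting.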
[jira] [Commented] (YARN-2668) yarn-registry JAR won't link against ZK 3.4.5
[ https://issues.apache.org/jira/browse/YARN-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14168461#comment-14168461 ] Hudson commented on YARN-2668: -- FAILURE: Integrated in Hadoop-trunk-Commit #6244 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6244/]) YARN-2668 yarn-registry JAR won't link against ZK 3.4.5. (stevel) (stevel: rev ac64ff77cfd87b5dd6b017468acd6698dbff43ae) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/test/java/org/apache/hadoop/registry/secure/TestSecureRMRegistryOperations.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/impl/zk/RegistrySecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry/src/main/java/org/apache/hadoop/registry/client/impl/zk/ZookeeperConfigOptions.java yarn-registry JAR won't link against ZK 3.4.5 - Key: YARN-2668 URL: https://issues.apache.org/jira/browse/YARN-2668 Project: Hadoop YARN Issue Type: Sub-task Components: client Affects Versions: 2.6.0 Reporter: Steve Loughran Assignee: Steve Loughran Fix For: 2.6.0 Attachments: YARN-2668-001.patch, YARN-2668-002.patch Original Estimate: 0.5h Remaining Estimate: 0.5h It's been reported that the registry code doesn't link against ZK 3.4.5, as the enable/disable SASL client property isn't there; it went in with ZOOKEEPER-1657. Pulling in the constant and the {{isEnabled()}} check will ensure registry linkage, even though the ability for a client to disable SASL auth will be lost.