[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629837#comment-13629837 ] Xuan Gong commented on YARN-561: org.apache.hadoop.yarn.api.records.Container has a containerId and a NodeId (which provides the address and port), which are enough for a container to talk to its local NM. And by YARN-486, we have already added org.apache.hadoop.yarn.api.records.Container to ContainerImpl, so it will get that information now. Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as the containerId, nodemanager hostname, and nodemanager port is not set in the environment when a container is launched. For an AM, the RM does all of this for it, but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At a minimum, the container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager-related information would also come in handy.
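For illustration, a minimal sketch of how a container process could consume such information once the NM exports it. The variable names (CONTAINER_ID, NM_HOST, NM_PORT) are assumptions for the sketch; YARN-561 had not settled on names at this point.
{code}
// Hypothetical: a container's main class reading identity information
// from environment variables exported by the NodeManager.
public class ContainerEnvExample {
  public static void main(String[] args) {
    String containerId = System.getenv("CONTAINER_ID"); // assumed name
    String nmHost = System.getenv("NM_HOST");           // assumed name
    String nmPort = System.getenv("NM_PORT");           // assumed name
    if (containerId == null) {
      System.err.println("Container id was not exported by the NM");
      return;
    }
    System.out.println("Running as " + containerId
        + "; local NM at " + nmHost + ":" + nmPort);
  }
}
{code}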
[jira] [Commented] (YARN-441) Clean up unused collection methods in various APIs
[ https://issues.apache.org/jira/browse/YARN-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629839#comment-13629839 ] Hadoop QA commented on YARN-441: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578368/YARN-441.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 3 warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warning. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/725//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/725//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-api.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/725//console This message is automatically generated. Clean up unused collection methods in various APIs -- Key: YARN-441 URL: https://issues.apache.org/jira/browse/YARN-441 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Xuan Gong Attachments: YARN-441.1.patch, YARN-441.2.patch, YARN-441.3.patch, YARN-441.4.patch There's a bunch of unused methods like getAskCount() and getAsk(index) in AllocateRequest and other interfaces. These should be removed. In YARN, found them in AllocateRequest and StartContainerResponse; MR will have its own set.
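As a sketch of the cleanup being proposed, the redundant index-based accessors go away and callers use the list-returning method directly. ResourceRequest below is a placeholder standing in for org.apache.hadoop.yarn.api.records.ResourceRequest, and the interface name is hypothetical; this is not the actual patch.
{code}
import java.util.List;

interface ResourceRequest {} // placeholder for the YARN record

interface AllocateRequestSketch {
  // Removed: ResourceRequest getAsk(int index);
  // Removed: int getAskCount();
  List<ResourceRequest> getAskList(); // kept: callers work with the list
}
{code}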
[jira] [Updated] (YARN-441) Clean up unused collection methods in various APIs
[ https://issues.apache.org/jira/browse/YARN-441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-441: --- Attachment: YARN-441.5.patch 1. Fix the -1 on javadoc. 2. Fix the -1 on findbugs by removing the synchronized keyword from StartContainerResponsePBImpl.java. Clean up unused collection methods in various APIs -- Key: YARN-441 URL: https://issues.apache.org/jira/browse/YARN-441 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Xuan Gong Attachments: YARN-441.1.patch, YARN-441.2.patch, YARN-441.3.patch, YARN-441.4.patch, YARN-441.5.patch There's a bunch of unused methods like getAskCount() and getAsk(index) in AllocateRequest and other interfaces. These should be removed. In YARN, found them in AllocateRequest and StartContainerResponse; MR will have its own set.
[jira] [Commented] (YARN-570) Time strings are formatted in different timezones
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629880#comment-13629880 ] Hadoop QA commented on YARN-570: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12577997/MAPREDUCE-5141.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/726//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/726//console This message is automatically generated. Time strings are formatted in different timezones --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Reporter: PengZhang Attachments: MAPREDUCE-5141.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT; if it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezone.
[jira] [Commented] (YARN-441) Clean up unused collection methods in various APIs
[ https://issues.apache.org/jira/browse/YARN-441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629887#comment-13629887 ] Hadoop QA commented on YARN-441: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578389/YARN-441.5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/727//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/727//console This message is automatically generated. Clean up unused collection methods in various APIs -- Key: YARN-441 URL: https://issues.apache.org/jira/browse/YARN-441 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Xuan Gong Attachments: YARN-441.1.patch, YARN-441.2.patch, YARN-441.3.patch, YARN-441.4.patch, YARN-441.5.patch There's a bunch of unused methods like getAskCount() and getAsk(index) in AllocateRequest and other interfaces. These should be removed. In YARN, found them in AllocateRequest and StartContainerResponse; MR will have its own set.
[jira] [Commented] (YARN-570) Time strings are formatted in different timezones
[ https://issues.apache.org/jira/browse/YARN-570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629907#comment-13629907 ] Harsh J commented on YARN-570: -- Thanks for the report and the patch! With this patch it now renders this way: renderHadoopDate() - Wed, 10 Apr 2013 08:29:56 GMT+05:30 format() - 10-Apr-2013 08:29:56 Which I think is still inconsistent. Ideally, I think, we'd want the former everywhere for consistency. Can you update format() as well to print in the same style, if you agree? Time strings are formatted in different timezones --- Key: YARN-570 URL: https://issues.apache.org/jira/browse/YARN-570 Project: Hadoop YARN Issue Type: Bug Reporter: PengZhang Attachments: MAPREDUCE-5141.patch Time strings on different pages are displayed in different timezones. If a time is rendered by renderHadoopDate() in yarn.dt.plugins.js, it appears as Wed, 10 Apr 2013 08:29:56 GMT; if it is formatted by format() in yarn.util.Times, it appears as 10-Apr-2013 16:29:56. Same value, but different timezone.
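The suggestion above amounts to pinning both renderers to one explicit timezone and pattern. A minimal sketch of that idea, assuming GMT and the renderHadoopDate() style; this is illustrative, not the actual yarn.util.Times change:
{code}
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ConsistentTimeFormat {
  public static void main(String[] args) {
    long ts = 1365582596000L; // Wed, 10 Apr 2013 08:29:56 GMT
    SimpleDateFormat fmt =
        new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz");
    // Pin the zone explicitly so output no longer depends on the
    // server's default timezone.
    fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
    System.out.println(fmt.format(new Date(ts)));
    // -> Wed, 10 Apr 2013 08:29:56 GMT, on any host
  }
}
{code}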
[jira] [Updated] (YARN-457) Setting updated nodes from null to null causes NPE in AllocateResponsePBImpl
[ https://issues.apache.org/jira/browse/YARN-457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kenji Kikushima updated YARN-457: - Attachment: YARN-457-4.patch Attached a patch for trunk. Thank you. Setting updated nodes from null to null causes NPE in AllocateResponsePBImpl Key: YARN-457 URL: https://issues.apache.org/jira/browse/YARN-457 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Kenji Kikushima Priority: Minor Labels: Newbie Attachments: YARN-457-2.patch, YARN-457-3.patch, YARN-457-4.patch, YARN-457.patch
{code}
if (updatedNodes == null) {
  this.updatedNodes.clear();
  return;
}
{code}
If updatedNodes is already null, a NullPointerException is thrown.
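A minimal sketch of the null-guard fix implied by the report, assuming the field layout shown in the snippet; the class and method names are simplified stand-ins for AllocateResponsePBImpl, and NodeReport is a type parameter standing in for the YARN record:
{code}
import java.util.List;

class UpdatedNodesGuardSketch<NodeReport> {
  private List<NodeReport> updatedNodes;

  void setUpdatedNodes(final List<NodeReport> updatedNodes) {
    if (updatedNodes == null) {
      if (this.updatedNodes != null) {
        this.updatedNodes.clear(); // guard added: was an unconditional clear()
      }
      return; // null-to-null set is now a no-op instead of an NPE
    }
    this.updatedNodes = updatedNodes;
  }
}
{code}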
[jira] [Commented] (YARN-486) Change startContainer NM API to accept Container as a parameter and make ContainerLaunchContext user land
[ https://issues.apache.org/jira/browse/YARN-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629987#comment-13629987 ] Hudson commented on YARN-486: - Integrated in Hadoop-Yarn-trunk #181 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/181/]) YARN-486. Changed NM's startContainer API to accept Container record given by RM as a direct parameter instead of as part of the ContainerLaunchContext record. Contributed by Xuan Gong. MAPREDUCE-5139. Update MR AM to use the modified startContainer API after YARN-486. Contributed by Xuan Gong. (Revision 1467063) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467063 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerRemoteLaunchEvent.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttemptContainerRequest.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/YARNRunner.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/StartContainerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/StartContainerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerLaunchContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationSubmissionContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java *
[jira] [Commented] (YARN-319) Submit a job to a queue that is not allowed in fairScheduler, client will hang forever.
[ https://issues.apache.org/jira/browse/YARN-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13629991#comment-13629991 ] Hudson commented on YARN-319: - Integrated in Hadoop-Yarn-trunk #181 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/181/]) Fixing CHANGES.txt entry for YARN-319. (Revision 1467133) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467133 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Submit a job to a queue that is not allowed in fairScheduler, client will hang forever. Key: YARN-319 URL: https://issues.apache.org/jira/browse/YARN-319 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: shenhong Assignee: shenhong Fix For: 2.0.5-beta Attachments: YARN-319-1.patch, YARN-319-2.patch, YARN-319-3.patch, YARN-319.patch The RM uses fairScheduler; when a client submits a job to a queue that does not allow the user to submit jobs to it, the client will hang forever.
[jira] [Commented] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630008#comment-13630008 ] Hudson commented on YARN-412: - Integrated in Hadoop-trunk-Commit #3607 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3607/]) YARN-412. Fixed FifoScheduler to check hostname of a NodeManager rather than its host:port during scheduling which caused incorrect locality for containers. Contributed by Roger Hoover. (Revision 1467244) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467244 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Assignee: Roger Hoover Priority: Minor Labels: patch Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method checks whether the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect, as it should be checking the hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234). In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129: application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local.
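The mismatch is easy to see with a map keyed the way requests are. A self-contained illustration, with hypothetical names rather than FifoScheduler's actual structures:
{code}
import java.util.HashMap;
import java.util.Map;

public class LocalityKeyMismatch {
  public static void main(String[] args) {
    // Outstanding requests are keyed by hostname.
    Map<String, Integer> requestsByHost = new HashMap<>();
    requestsByHost.put("host1.foo.com", 3);

    // Looking up by host:port (the bug) finds nothing.
    System.out.println(requestsByHost.get("host1.foo.com:1234")); // null
    // Looking up by hostname (the fix) finds the requests.
    System.out.println(requestsByHost.get("host1.foo.com"));      // 3
  }
}
{code}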
[jira] [Commented] (YARN-486) Change startContainer NM API to accept Container as a parameter and make ContainerLaunchContext user land
[ https://issues.apache.org/jira/browse/YARN-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630040#comment-13630040 ] Hudson commented on YARN-486: - Integrated in Hadoop-Hdfs-trunk #1370 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1370/]) YARN-486. Changed NM's startContainer API to accept Container record given by RM as a direct parameter instead of as part of the ContainerLaunchContext record. Contributed by Xuan Gong. MAPREDUCE-5139. Update MR AM to use the modified startContainer API after YARN-486. Contributed by Xuan Gong. (Revision 1467063) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467063 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerRemoteLaunchEvent.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttemptContainerRequest.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/YARNRunner.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/StartContainerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/StartContainerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerLaunchContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationSubmissionContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java *
[jira] [Commented] (YARN-319) Submit a job to a queue that is not allowed in fairScheduler, client will hang forever.
[ https://issues.apache.org/jira/browse/YARN-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630044#comment-13630044 ] Hudson commented on YARN-319: - Integrated in Hadoop-Hdfs-trunk #1370 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1370/]) Fixing CHANGES.txt entry for YARN-319. (Revision 1467133) Result = FAILURE vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467133 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Submit a job to a queue that is not allowed in fairScheduler, client will hang forever. Key: YARN-319 URL: https://issues.apache.org/jira/browse/YARN-319 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: shenhong Assignee: shenhong Fix For: 2.0.5-beta Attachments: YARN-319-1.patch, YARN-319-2.patch, YARN-319-3.patch, YARN-319.patch The RM uses fairScheduler; when a client submits a job to a queue that does not allow the user to submit jobs to it, the client will hang forever.
[jira] [Commented] (YARN-160) nodemanagers should obtain cpu/memory values from underlying OS
[ https://issues.apache.org/jira/browse/YARN-160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630059#comment-13630059 ] Timothy St. Clair commented on YARN-160: If it's possible to tag along development on this one, I would be interested in the approach. IMHO, referencing existing solutions helps gauge a baseline. Ref: http://www.open-mpi.org/projects/hwloc/ http://www.rce-cast.com/Podcast/rce-33-hwloc-portable-hardware-locality.html http://gridscheduler.sourceforge.net/projects/hwloc/GridEnginehwloc.html nodemanagers should obtain cpu/memory values from underlying OS --- Key: YARN-160 URL: https://issues.apache.org/jira/browse/YARN-160 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.0.3-alpha Reporter: Alejandro Abdelnur Assignee: Alejandro Abdelnur Fix For: 2.0.5-beta As mentioned in YARN-2 *NM memory and CPU configs*, these values currently come from the NM's config; we should be able to obtain them from the OS (i.e., in the case of Linux, from /proc/meminfo and /proc/cpuinfo). As this is highly OS dependent, we should have an interface that obtains this information. In addition, implementations of this interface should be able to specify a mem/cpu offset (the amount of mem/cpu not to be made available as YARN resources); this would allow reserving mem/cpu for the OS and other services outside of YARN containers.
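A minimal sketch of the probing idea under discussion, assuming Linux and a hypothetical reserved-memory offset; it is not the interface YARN-160 would define:
{code}
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class LinuxResourceProbe {
  public static void main(String[] args) throws IOException {
    long memTotalKb = -1;
    // Read total physical memory from /proc/meminfo (Linux-only).
    try (BufferedReader r = new BufferedReader(new FileReader("/proc/meminfo"))) {
      String line;
      while ((line = r.readLine()) != null) {
        if (line.startsWith("MemTotal:")) {
          memTotalKb = Long.parseLong(line.replaceAll("\\D+", ""));
          break;
        }
      }
    }
    int cpus = Runtime.getRuntime().availableProcessors();
    long reservedKb = 1024 * 1024; // hypothetical 1 GB offset kept for the OS
    System.out.println("YARN-usable memory (kB): " + (memTotalKb - reservedKb));
    System.out.println("YARN-usable vcores: " + cpus);
  }
}
{code}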
[jira] [Commented] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630075#comment-13630075 ] Hudson commented on YARN-412: - Integrated in Hadoop-Mapreduce-trunk #1397 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1397/]) YARN-412. Fixed FifoScheduler to check hostname of a NodeManager rather than its host:port during scheduling which caused incorrect locality for containers. Contributed by Roger Hoover. (Revision 1467244) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467244 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/FifoScheduler.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/MockNodes.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fifo/TestFifoScheduler.java FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Assignee: Roger Hoover Priority: Minor Labels: patch Fix For: 2.0.4-alpha Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method checks whether the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect, as it should be checking the hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234). In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129: application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local.
[jira] [Commented] (YARN-486) Change startContainer NM API to accept Container as a parameter and make ContainerLaunchContext user land
[ https://issues.apache.org/jira/browse/YARN-486?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630079#comment-13630079 ] Hudson commented on YARN-486: - Integrated in Hadoop-Mapreduce-trunk #1397 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1397/]) YARN-486. Changed NM's startContainer API to accept Container record given by RM as a direct parameter instead of as part of the ContainerLaunchContext record. Contributed by Xuan Gong. MAPREDUCE-5139. Update MR AM to use the modified startContainer API after YARN-486. Contributed by Xuan Gong. (Revision 1467063) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467063 Files : * /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerLauncherImpl.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/launcher/ContainerRemoteLaunchEvent.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TestTaskAttemptContainerRequest.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/launcher/TestContainerLauncher.java * /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/main/java/org/apache/hadoop/mapred/YARNRunner.java * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/StartContainerRequest.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/protocolrecords/impl/pb/StartContainerRequestPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/ContainerLaunchContext.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ApplicationSubmissionContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/api/records/impl/pb/ContainerLaunchContextPBImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/proto/yarn_service_protos.proto * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/ApplicationMaster.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell/src/main/java/org/apache/hadoop/yarn/applications/distributedshell/Client.java * 
/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/util/BuilderUtils.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestContainerLaunchRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/TestRPC.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/LinuxContainerExecutor.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/ContainerManagerImpl.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/Container.java * /hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/container/ContainerImpl.java *
[jira] [Commented] (YARN-319) Submit a job to a queue that is not allowed in fairScheduler, client will hang forever.
[ https://issues.apache.org/jira/browse/YARN-319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630083#comment-13630083 ] Hudson commented on YARN-319: - Integrated in Hadoop-Mapreduce-trunk #1397 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1397/]) Fixing CHANGES.txt entry for YARN-319. (Revision 1467133) Result = SUCCESS vinodkv : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1467133 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt Submit a job to a queue that is not allowed in fairScheduler, client will hang forever. Key: YARN-319 URL: https://issues.apache.org/jira/browse/YARN-319 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, scheduler Affects Versions: 2.0.2-alpha Reporter: shenhong Assignee: shenhong Fix For: 2.0.5-beta Attachments: YARN-319-1.patch, YARN-319-2.patch, YARN-319-3.patch, YARN-319.patch The RM uses fairScheduler; when a client submits a job to a queue that does not allow the user to submit jobs to it, the client will hang forever.
[jira] [Commented] (YARN-514) Delayed store operations should not result in RM unavailability for app submission
[ https://issues.apache.org/jira/browse/YARN-514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630320#comment-13630320 ] Bikas Saha commented on YARN-514: - You only need to add the new field in the enum. I don't think we should change the values of all existing enums. Delayed store operations should not result in RM unavailability for app submission -- Key: YARN-514 URL: https://issues.apache.org/jira/browse/YARN-514 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Zhijie Shen Attachments: YARN-514.1.patch, YARN-514.2.patch, YARN-514.3.patch, YARN-514.4.patch Currently, app submission is the only store operation performed synchronously because the app must be stored before the request returns with success. This makes the RM susceptible to blocking all client threads on slow store operations, resulting in the RM being perceived as unavailable by clients.
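A tiny illustration of the review comment, with a hypothetical enum and constant names: the new field is appended so the existing constants keep their positions and meanings.
{code}
// Hypothetical state enum; only SAVING is new.
enum RMAppStateSketch {
  NEW,        // existing constants stay where they are...
  SUBMITTED,
  RUNNING,
  FINISHED,
  SAVING      // ...and the new field is simply added, nothing renumbered
}
{code}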
[jira] [Updated] (YARN-571) User should not be part of ContainerLaunchContext
[ https://issues.apache.org/jira/browse/YARN-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah updated YARN-571: - Issue Type: Sub-task (was: Bug) Parent: YARN-386 User should not be part of ContainerLaunchContext - Key: YARN-571 URL: https://issues.apache.org/jira/browse/YARN-571 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Today, a user is expected to set the user name in the CLC when either submitting an application or launching a container from the AM. This does not make sense, as the user can be/has been identified by the RM as part of the RPC layer. The solution would be to move the user information into either the Container object or directly into the ContainerToken, which can then be used by the NM to launch the container. This user information would be set into the container by the RM.
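As a sketch of the proposed move, with simplified stand-ins for the real records (field and class names are hypothetical): the user field leaves the user-settable launch context and lives in the RM-issued token instead.
{code}
import java.util.HashMap;
import java.util.Map;

// Stand-in for ContainerLaunchContext after the change: no user field.
class ContainerLaunchContextSketch {
  // String user;  <-- removed: user-supplied, redundant with RPC identity
  Map<String, String> environment = new HashMap<>();
}

// Stand-in for the RM-issued token: the user is set by the RM and
// trusted by the NM when launching the container.
class ContainerTokenSketch {
  final String user;
  ContainerTokenSketch(String user) { this.user = user; }
}
{code}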
[jira] [Created] (YARN-572) Remove duplication of data in Container
Hitesh Shah created YARN-572: Summary: Remove duplication of data in Container Key: YARN-572 URL: https://issues.apache.org/jira/browse/YARN-572 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Most of the information needed to launch a container is duplicated in both the Container class as well as in the ContainerToken object that the Container object already contains. It would be good to remove this level of duplication.
[jira] [Commented] (YARN-457) Setting updated nodes from null to null causes NPE in AllocateResponsePBImpl
[ https://issues.apache.org/jira/browse/YARN-457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630425#comment-13630425 ] Xuan Gong commented on YARN-457: +1, Looks good Setting updated nodes from null to null causes NPE in AllocateResponsePBImpl Key: YARN-457 URL: https://issues.apache.org/jira/browse/YARN-457 Project: Hadoop YARN Issue Type: Sub-task Components: api Affects Versions: 2.0.3-alpha Reporter: Sandy Ryza Assignee: Kenji Kikushima Priority: Minor Labels: Newbie Attachments: YARN-457-2.patch, YARN-457-3.patch, YARN-457-4.patch, YARN-457.patch
{code}
if (updatedNodes == null) {
  this.updatedNodes.clear();
  return;
}
{code}
If updatedNodes is already null, a NullPointerException is thrown.
[jira] [Assigned] (YARN-513) Verify all clients will wait for RM to restart
[ https://issues.apache.org/jira/browse/YARN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong reassigned YARN-513: -- Assignee: Xuan Gong (was: Jian He) Verify all clients will wait for RM to restart -- Key: YARN-513 URL: https://issues.apache.org/jira/browse/YARN-513 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up.
[jira] [Commented] (YARN-392) Make it possible to schedule to specific nodes without dropping locality
[ https://issues.apache.org/jira/browse/YARN-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630440#comment-13630440 ] Sandy Ryza commented on YARN-392: - Any further thoughts on this? Make it possible to schedule to specific nodes without dropping locality Key: YARN-392 URL: https://issues.apache.org/jira/browse/YARN-392 Project: Hadoop YARN Issue Type: Sub-task Reporter: Bikas Saha Assignee: Sandy Ryza Attachments: YARN-392-1.patch, YARN-392.patch Currently it's not possible to specify scheduling requests for specific nodes and nowhere else. The RM automatically relaxes locality to rack and * and assigns non-specified machines to the app.
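One hypothetical way to express such a requirement is a per-request flag that disables the automatic relaxation; the field and method names below are illustrative, not a settled API from this discussion.
{code}
// Hypothetical request record with a strict-locality switch.
class ResourceRequestSketch {
  String resourceName = "host1.foo.com"; // a specific node
  boolean relaxLocality = true;          // default: RM may fall back to rack/*

  ResourceRequestSketch strictToNode() {
    this.relaxLocality = false;          // schedule on this node or not at all
    return this;
  }
}
{code}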
[jira] [Updated] (YARN-530) Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
[ https://issues.apache.org/jira/browse/YARN-530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-530: Attachment: YARN-530.4.patch No tangible change from the previous patch; publishing to keep in sync with the updated YARN-117 "everything" patch. Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services - Key: YARN-530 URL: https://issues.apache.org/jira/browse/YARN-530 Project: Hadoop YARN Issue Type: Sub-task Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-117changes.pdf, YARN-530-2.patch, YARN-530-3.patch, YARN-530.4.patch, YARN-530.patch # Extend the YARN {{Service}} interface as discussed in YARN-117 # Implement the changes in {{AbstractService}} and {{FilterService}}. # Migrate all services in yarn-common to the more robust service model, test.
[jira] [Updated] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-117: Attachment: YARN-117.4.patch The changes here since the last patch relate to the test {{TestNodeStatusUpdater}}, which was failing on Jenkins but not locally. # Adding timeouts in the {{syncBarrier.await()}} clause to better handle the situation where the rollback of a failing {{start()}} doesn't block, as the barrier in the test case isn't reached as it would be on the same thread. # Lots of extra assertions and debugging to see why {{testNMConnectionToRM()}} fails most of the time on a Linux test VM. It looks like the time-based assertions are brittle there (not fixed). Enhance YARN service model -- Key: YARN-117 URL: https://issues.apache.org/jira/browse/YARN-117 Project: Hadoop YARN Issue Type: Improvement Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, YARN-117.patch Having played with the YARN service model, there are some issues that I've identified based on past work and initial use. This JIRA issue is an overall one to cover the issues, with solutions pushed out to separate JIRAs. h2. The state model prevents the stopped state being entered if you could not successfully start the service. In the current lifecycle you cannot stop a service unless it was successfully started, but * {{init()}} may acquire resources that need to be explicitly released * if the {{start()}} operation fails partway through, the {{stop()}} operation may be needed to release resources. *Fix:* make {{stop()}} a valid state transition from all states and require the implementations to be able to stop safely without requiring all fields to be non-null. Before anyone points out that the {{stop()}} operations assume that all fields are valid, and if called before a {{start()}} they will NPE: MAPREDUCE-3431 shows that this problem arises today, and MAPREDUCE-3502 is a fix for it. It is independent of the rest of the issues in this doc, but it will aid making {{stop()}} execute from all states other than stopped. MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review and take-up; this can be done with issues linked to this one. h2. AbstractService doesn't prevent duplicate state change requests. The {{ensureState()}} checks to verify whether or not a state transition is allowed from the current state are performed in the base {{AbstractService}} class -yet subclasses tend to call this *after* their own {{init()}}, {{start()}} and {{stop()}} operations. This means that these operations can be performed out of order, and even if the outcome of the call is an exception, all actions performed by the subclasses will have taken place. MAPREDUCE-3877 demonstrates this. This is a tricky one to address. In HADOOP-3128 I used a base class instead of an interface and made the {{init()}}, {{start()}} and {{stop()}} methods {{final}}. These methods would do the checks, and then invoke protected inner methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to retrofit the same behaviour to everything that extends {{AbstractService}} -something that must be done before the class is considered stable (because once the lifecycle methods are declared final, all subclasses that are out of the source tree will need fixing by the respective developers). h2. AbstractService state change doesn't defend against race conditions. There are no concurrency locks on the state transitions.
Whatever fix for wrong state calls is added should correct this to prevent re-entrancy, such as {{stop()}} being called from two threads. h2. Static methods to choreograph lifecycle operations Helper methods to move things through lifecycles. init-start is common; stop-if-service!=null is another. Some static methods can execute these, and even call {{stop()}} if {{init()}} raises an exception. These could go into a class {{ServiceOps}} in the same package. These can be used by those services that wrap other services, and help manage more robust shutdowns. h2. State transition failures are something that registered service listeners may wish to be informed of. When a state transition fails, a {{RuntimeException}} can be thrown -and the service listeners are not informed, as the notification point isn't reached. They may wish to know this, especially for management and diagnostics. *Fix:* extend {{ServiceStateChangeListener}} with a callback such as {{stateChangeFailed(Service service, Service.State targeted-state, RuntimeException e)}} that is invoked from the (final) state change methods in the {{AbstractService}} class (once they delegate to their inner
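To make the retrofit pattern concrete, a minimal sketch assuming a simplified state set; the real {{Service}} API and state names differ, so treat the class and hooks below as illustrative only.
{code}
// Final lifecycle methods own the state checks and transitions;
// subclasses override only the protected inner hooks.
public abstract class GuardedService {
  public enum State { NOTINITED, INITED, STARTED, STOPPED }

  private State state = State.NOTINITED;

  public final synchronized void init() {
    if (state != State.NOTINITED) {
      throw new IllegalStateException("Cannot init from " + state);
    }
    innerInit();            // subclass work runs only after the check passes
    state = State.INITED;
  }

  public final synchronized void start() {
    if (state != State.INITED) {
      throw new IllegalStateException("Cannot start from " + state);
    }
    innerStart();
    state = State.STARTED;
  }

  public final synchronized void stop() {
    if (state == State.STOPPED) {
      return;               // stop() is valid (and idempotent) from any state
    }
    innerStop();            // must cope with fields left null by a failed start
    state = State.STOPPED;
  }

  protected abstract void innerInit();
  protected abstract void innerStart();
  protected abstract void innerStop();
}
{code}
The synchronized final methods also address the re-entrancy concern: two threads calling {{stop()}} serialize on the lock, and the second call becomes a no-op.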
[jira] [Commented] (YARN-117) Enhance YARN service model
[ https://issues.apache.org/jira/browse/YARN-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630478#comment-13630478 ] Hadoop QA commented on YARN-117: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578484/YARN-117.4.patch against trunk revision . {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/729//console This message is automatically generated. Enhance YARN service model -- Key: YARN-117 URL: https://issues.apache.org/jira/browse/YARN-117 Project: Hadoop YARN Issue Type: Improvement Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-117-2.patch, YARN-117-3.patch, YARN-117.4.patch, YARN-117.patch Having played with the YARN service model, there are some issues that I've identified based on past work and initial use. This JIRA issue is an overall one to cover the issues, with solutions pushed out to separate JIRAs. h2. The state model prevents the stopped state being entered if you could not successfully start the service. In the current lifecycle you cannot stop a service unless it was successfully started, but * {{init()}} may acquire resources that need to be explicitly released * if the {{start()}} operation fails partway through, the {{stop()}} operation may be needed to release resources. *Fix:* make {{stop()}} a valid state transition from all states and require the implementations to be able to stop safely without requiring all fields to be non-null. Before anyone points out that the {{stop()}} operations assume that all fields are valid, and if called before a {{start()}} they will NPE: MAPREDUCE-3431 shows that this problem arises today, and MAPREDUCE-3502 is a fix for it. It is independent of the rest of the issues in this doc, but it will aid making {{stop()}} execute from all states other than stopped. MAPREDUCE-3502 is too big a patch and needs to be broken down for easier review and take-up; this can be done with issues linked to this one. h2. AbstractService doesn't prevent duplicate state change requests. The {{ensureState()}} checks to verify whether or not a state transition is allowed from the current state are performed in the base {{AbstractService}} class -yet subclasses tend to call this *after* their own {{init()}}, {{start()}} and {{stop()}} operations. This means that these operations can be performed out of order, and even if the outcome of the call is an exception, all actions performed by the subclasses will have taken place. MAPREDUCE-3877 demonstrates this. This is a tricky one to address. In HADOOP-3128 I used a base class instead of an interface and made the {{init()}}, {{start()}} and {{stop()}} methods {{final}}. These methods would do the checks, and then invoke protected inner methods, {{innerStart()}}, {{innerStop()}}, etc. It should be possible to retrofit the same behaviour to everything that extends {{AbstractService}} -something that must be done before the class is considered stable (because once the lifecycle methods are declared final, all subclasses that are out of the source tree will need fixing by the respective developers). h2. AbstractService state change doesn't defend against race conditions. There are no concurrency locks on the state transitions. Whatever fix for wrong state calls is added should correct this to prevent re-entrancy, such as {{stop()}} being called from two threads. h2.
Static methods to choreograph lifecycle operations Helper methods to move things through lifecycles. init-start is common; stop-if-service!=null is another. Some static methods can execute these, and even call {{stop()}} if {{init()}} raises an exception. These could go into a class {{ServiceOps}} in the same package. These can be used by those services that wrap other services, and help manage more robust shutdowns. h2. State transition failures are something that registered service listeners may wish to be informed of. When a state transition fails, a {{RuntimeException}} can be thrown -and the service listeners are not informed, as the notification point isn't reached. They may wish to know this, especially for management and diagnostics. *Fix:* extend {{ServiceStateChangeListener}} with a callback such as {{stateChangeFailed(Service service, Service.State targeted-state, RuntimeException e)}} that is invoked from the (final) state change methods in the {{AbstractService}} class (once they delegate to their inner {{innerStart()}}, {{innerStop()}} methods; make a no-op on the existing implementations of the interface. h2. Service listener failures not handled Is this
[jira] [Commented] (YARN-530) Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services
[ https://issues.apache.org/jira/browse/YARN-530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630489#comment-13630489 ] Hadoop QA commented on YARN-530: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578482/YARN-530.4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 5 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/728//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/728//artifact/trunk/patchprocess/newPatchFindbugsWarningshadoop-yarn-common.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/728//console This message is automatically generated. Define Service model strictly, implement AbstractService for robust subclassing, migrate yarn-common services - Key: YARN-530 URL: https://issues.apache.org/jira/browse/YARN-530 Project: Hadoop YARN Issue Type: Sub-task Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-117changes.pdf, YARN-530-2.patch, YARN-530-3.patch, YARN-530.4.patch, YARN-530.patch # Extend the YARN {{Service}} interface as discussed in YARN-117 # Implement the changes in {{AbstractService}} and {{FilterService}}. # Migrate all services in yarn-common to the more robust service model, test.
[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630493#comment-13630493 ] Vinod Kumar Vavilapalli commented on YARN-561: -- Xuan, what Hitesh is saying is that when a container starts as a process, it doesn't know its containerId. We should make NM export it as part of the env. Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched. For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-571) User should not be part of ContainerLaunchContext
[ https://issues.apache.org/jira/browse/YARN-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli reassigned YARN-571: Assignee: Vinod Kumar Vavilapalli Taking a shot at this.. User should not be part of ContainerLaunchContext - Key: YARN-571 URL: https://issues.apache.org/jira/browse/YARN-571 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Vinod Kumar Vavilapalli Today, a user is expected to set the user name in the CLC when either submitting an application or launching a container from the AM. This does not make sense, as the user can be/has been identified by the RM as part of the RPC layer. The solution would be to move the user information into either the Container object or directly into the ContainerToken, which can then be used by the NM to launch the container. This user information would be set into the container by the RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-572) Remove duplication of data in Container
[ https://issues.apache.org/jira/browse/YARN-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reassigned YARN-572: Assignee: Hitesh Shah Remove duplication of data in Container Key: YARN-572 URL: https://issues.apache.org/jira/browse/YARN-572 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Hitesh Shah Most of the information needed to launch a container is duplicated in both the Container class as well as in the ContainerToken object that the Container object already contains. It would be good to remove this level of duplication. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Assigned] (YARN-435) Make it easier to access cluster topology information in an AM
[ https://issues.apache.org/jira/browse/YARN-435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hitesh Shah reassigned YARN-435: Assignee: Hitesh Shah Make it easier to access cluster topology information in an AM -- Key: YARN-435 URL: https://issues.apache.org/jira/browse/YARN-435 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Hitesh Shah ClientRMProtocol exposes a getClusterNodes api that provides a report on all nodes in the cluster including their rack information. However, this requires the AM to open and establish a separate connection to the RM in addition to one for the AMRMProtocol. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-503) DelegationTokens will be renewed forever if multiple jobs share tokens and the first one sets JOB_CANCEL_DELEGATION_TOKEN to false
[ https://issues.apache.org/jira/browse/YARN-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630619#comment-13630619 ] Daryn Sharp commented on YARN-503: -- bq. There's likely a race between the RenewalTask and AbortTask. [...] I think it's possible for a schedulesTask to be executing - in which case the abortScheduleTask() may have no effect and can result in the wrong task being cancelled / scheduled. It's ok if another task is executing, because it's just trying to abort any pending task. Since there's only one possible pending task per token at any given time, the wrong task can't be cancelled. Did I miss an edge case? bq. ManagedApp.add - instead of adding to the app and the token here, this composite add can be kept outside of ManagedApp/ManagedToken Not sure I follow. Are you suggesting to move {{managedToken.add(appId)}} into the loop in {{addApplication}}? I was trying to encapsulate the implementation details of adding/removing the appId within ManagedApp. Is it ok to leave it as-is? bq. ManagedApp.expunge() - is synchronization on 'appTokens' required ? Strictly speaking, probably not. It's a throwback to an earlier implementation that was doing trickier stuff. It was to avoid concurrent modification exceptions while iterating, but appTokens isn't mutating in multiple threads. And the {{remove}} is essentially guarding it too. For that matter, I don't think {{appTokens}} needs to be a synch-ed set. I'll change it. bq. ManagedToken.expunge() - tokenApps.clear() required ? Probably not. Seemed like good housekeeping, but I'll remove it. bq. In the unit test - the 1 second sleep seems rather low. Instead of the sleep, this can be changed to a timed wait on one of the fields being verified. I don't like sleeps either. 1s is an eternity in this case because the initial renew and cancel timer tasks fire immediately on mocked objects, so it should run in a few ms. I assume you are suggesting using notify in a mock'ed answer method? Multiple timers are expected to fire in some cases, so it would probably require something like a CountDownLatch, which will get tricky to keep swapping in a new one by re-adding mocked responses with the new latch. Let me know if you feel it's worth it to change it. DelegationTokens will be renewed forever if multiple jobs share tokens and the first one sets JOB_CANCEL_DELEGATION_TOKEN to false -- Key: YARN-503 URL: https://issues.apache.org/jira/browse/YARN-503 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.3, 3.0.0, 2.0.0-alpha Reporter: Siddharth Seth Assignee: Daryn Sharp Attachments: YARN-503.patch, YARN-503.patch The first Job/App to register a token is the one which DelegationTokenRenewer associates with a specific Token. An attempt to remove/cancel these shared tokens by subsequent jobs doesn't work - since the JobId will not match. As a result, even if subsequent jobs have MRJobConfig.JOB_CANCEL_DELEGATION_TOKEN set to true - tokens will not be cancelled when those jobs complete. Tokens will eventually be removed from the RM / JT when the service that issued them considers them to have expired or via an explicit cancelDelegationTokens call (not implemented yet in 23). A side effect of this is that the same delegation token will end up being renewed multiple times (a separate TimerTask for each job which uses the token). DelegationTokenRenewer could maintain a reference count/list of jobIds for shared tokens. -- This message is automatically generated by JIRA. 
If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630626#comment-13630626 ] Xuan Gong commented on YARN-561: When a container starts as a process, it does not know its containerId. Does it mean that when we execute the script to launch the container, the script does not include this containerId? If I understand it correctly, we can solve this issue like this: 1. We need to add some entries to the enum Environment, such as ContainerId (String) (which can be converted back by using ConverterUtils.toContainerId(String containerId)), NM hostName (String), and NMPort (int). 2. The container launch script is written out at ContainerLaunch::call(), and the environment is also set there. At ContainerLaunch, we already have org.apache.hadoop.yarn.server.nodemanager.containermanager.container, so the containerId can simply be obtained. The NM hostName and NMPort can be obtained from NM_NodeId, which is in the NMContext. And ContainerLaunch is initialized from ContainerLauncher, which already has the NMContext. So we can make the change here: when we initialize the ContainerLaunch, we either pass the NMContext as a parameter, or simply give NM_NodeId, or just the NM hostName and NMPort; then we have all the information we need. Any other suggestions? Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched. For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
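As a rough illustration of step 2 above, the sketch below sets hypothetical variables into the launch environment and reads one back inside the container process; the variable names are placeholders, not the final constants (which would presumably live alongside the existing Environment enum):

{code:java}
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: the variable names are hypothetical placeholders.
public class ContainerEnvSketch {

  // What the NM would do while writing the launch script in
  // ContainerLaunch::call(), given the info from the NMContext/NodeId.
  static void exportNmInfo(Map<String, String> env,
      String containerIdStr, String nmHost, int nmPort) {
    env.put("CONTAINER_ID", containerIdStr);
    env.put("NM_HOST", nmHost);
    env.put("NM_PORT", Integer.toString(nmPort));
  }

  // A launched process could then read the value back and, e.g., feed it
  // to ConverterUtils.toContainerId(...) to recover the ContainerId.
  public static void main(String[] args) {
    System.out.println("containerId = " + System.getenv("CONTAINER_ID"));

    // demo of the NM-side half, using dummy values
    Map<String, String> env = new HashMap<String, String>();
    exportNmInfo(env, "container_1365000000000_0001_01_000001", "host1", 45454);
    System.out.println(env);
  }
}
{code}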
[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630635#comment-13630635 ] Hitesh Shah commented on YARN-561: -- @Xuan, one thing to be careful of: certain env settings should only be set by the NodeManager when it launches the container, and not by an application. So you would need a notion of whitelisted environment variables that may be set only by the NM and cannot be overridden by the env in the CLC provided by the application. Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched. For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Arun C Murthy updated YARN-412: --- Fix Version/s: (was: 2.0.4-alpha) 2.0.5-beta FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Assignee: Roger Hoover Priority: Minor Labels: patch Fix For: 2.0.5-beta Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
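For clarity, the two lookups quoted in the report, side by side (the change itself is this one line in FifoScheduler):

{code:java}
// Before: keys the lookup by "host:port", which never matches a
// hostname-keyed ResourceRequest, so node-locality is never detected.
application.getResourceRequest(priority, node.getRMNode().getNodeAddress());

// After: keys by hostname, matching what the CapacityScheduler does in
// LeafQueue.assignNodeLocalContainers.
application.getResourceRequest(priority, node.getHostName());
{code}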
[jira] [Commented] (YARN-412) FifoScheduler incorrectly checking for node locality
[ https://issues.apache.org/jira/browse/YARN-412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630649#comment-13630649 ] Hudson commented on YARN-412: - Integrated in Hadoop-trunk-Commit #3610 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/3610/]) YARN-412. Pushing to 2.0.5-beta only. (Revision 1467470) Result = SUCCESS acmurthy : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1467470 Files : * /hadoop/common/trunk/hadoop-yarn-project/CHANGES.txt FifoScheduler incorrectly checking for node locality Key: YARN-412 URL: https://issues.apache.org/jira/browse/YARN-412 Project: Hadoop YARN Issue Type: Bug Components: scheduler Reporter: Roger Hoover Assignee: Roger Hoover Priority: Minor Labels: patch Fix For: 2.0.5-beta Attachments: YARN-412.patch In the FifoScheduler, the assignNodeLocalContainers method is checking if the data is local to a node by searching for the nodeAddress of the node in the set of outstanding requests for the app. This seems to be incorrect as it should be checking hostname instead. The offending line of code is 455: application.getResourceRequest(priority, node.getRMNode().getNodeAddress()); Requests are formatted by hostname (e.g. host1.foo.com) whereas node addresses are a concatenation of hostname and command port (e.g. host1.foo.com:1234) In the CapacityScheduler, it's done using hostname. See LeafQueue.assignNodeLocalContainers, line 1129 application.getResourceRequest(priority, node.getHostName()); Note that this bug does not affect the actual scheduling decisions made by the FifoScheduler because even though it incorrectly determines that a request is not local to the node, it will still schedule the request immediately because it's rack-local. However, this bug may be adversely affecting the reporting of job status by underreporting the number of tasks that were node local. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630652#comment-13630652 ] Xuan Gong commented on YARN-561: Just like ApplicationConstants, which includes some variables that can only be set in the AppMaster environment? At the beginning (from the code in ContainerLaunch::call()), the env originally comes from CLC.getEnvironment(); we can then set the ContainerId, Node_hostName, and Node_portNumber after that. Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched. For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-562) NM should reject containers allocated by previous RM
[ https://issues.apache.org/jira/browse/YARN-562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-562: - Attachment: YARN-562.1.patch This patch does: 1. Add the RM's cluster timestamp in the NM to reject old containers. 2. Block container requests while the NM is resyncing with the RM. 3. Add test cases for both. NM should reject containers allocated by previous RM Key: YARN-562 URL: https://issues.apache.org/jira/browse/YARN-562 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-562.1.patch It's possible that after RM shutdown, before the AM goes down, the AM still calls startContainer on the NM with containers allocated by the previous RM. When the RM comes back, the NM doesn't know whether this container launch request comes from the previous RM or the current RM. We should reject containers allocated by the previous RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
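A minimal sketch of the timestamp idea from the patch summary, using simplified stand-in types (the real patch wires this through the container token and the NM-RM resync protocol; names here are illustrative):

{code:java}
// Simplified stand-in for the NM-side check; not the actual patch code.
public class ContainerOriginCheck {
  private volatile long currentRMClusterTimestamp;

  public ContainerOriginCheck(long rmClusterTimestamp) {
    this.currentRMClusterTimestamp = rmClusterTimestamp;
  }

  /** On startContainer: reject containers minted by a previous RM. */
  boolean isFromCurrentRM(long containerRMTimestamp) {
    return containerRMTimestamp == currentRMClusterTimestamp;
  }

  /** On NM resync with a restarted RM: adopt the new timestamp. */
  void onResync(long newRMClusterTimestamp) {
    this.currentRMClusterTimestamp = newRMClusterTimestamp;
  }
}
{code}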
[jira] [Commented] (YARN-513) Verify all clients will wait for RM to restart
[ https://issues.apache.org/jira/browse/YARN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630672#comment-13630672 ] Xuan Gong commented on YARN-513: From the ApplicationMaster perspective: 1. The very first communication it has with the RM is registering itself with the RM, via AMRMClientImpl::registerApplicationMaster(), so we can add waiting logic here, to retry several times until it is accepted or the exception is finally thrown. From the Client perspective: 1. The very first communication it has with the RM is getNewApplication(), which is in YarnClientImpl::getNewApplication(request); we can add waiting logic here. In order to do that, we need to add several constants to YarnConfiguration, such as AM_RM_CONNECTION_RETRY_INTERVAL_SECS, AM_RM_CONNECT_WAIT_SECS, CLIENT_RM_CONNECTION_RETRY_INTERVAL_SECS and CLIENT_RM_CONNECTION_WAIT_SECS. Verify all clients will wait for RM to restart -- Key: YARN-513 URL: https://issues.apache.org/jira/browse/YARN-513 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
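The waiting logic might look roughly like the following; the configuration keys proposed above are not yet defined, so the retry interval and deadline appear as plain parameters in this sketch:

{code:java}
import java.io.IOException;
import java.net.ConnectException;

// Sketch of client-side retry-until-RM-is-back logic; purely illustrative.
public class RmRetrySketch {
  interface RmCall<T> { T run() throws IOException; }

  static <T> T callWithRetries(RmCall<T> call, long retryIntervalMs,
      long maxWaitMs) throws IOException {
    long deadline = System.currentTimeMillis() + maxWaitMs;
    while (true) {
      try {
        return call.run();
      } catch (ConnectException e) {
        // The RM may be restarting: wait and retry until the deadline.
        if (System.currentTimeMillis() >= deadline) {
          throw e; // give up and surface the original failure
        }
        try {
          Thread.sleep(retryIntervalMs);
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IOException("Interrupted while waiting for the RM", ie);
        }
      }
    }
  }
}
{code}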
[jira] [Commented] (YARN-513) Verify all clients will wait for RM to restart
[ https://issues.apache.org/jira/browse/YARN-513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630677#comment-13630677 ] Bikas Saha commented on YARN-513: - What about other interactions with the RM, such as allocate() or finishApplicationMaster()? Verify all clients will wait for RM to restart -- Key: YARN-513 URL: https://issues.apache.org/jira/browse/YARN-513 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Xuan Gong When the RM is restarting, the NM, AM and Clients should wait for some time for the RM to come back up. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-562) NM should reject containers allocated by previous RM
[ https://issues.apache.org/jira/browse/YARN-562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630678#comment-13630678 ] Hadoop QA commented on YARN-562: {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578512/YARN-562.1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/730//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/730//console This message is automatically generated. NM should reject containers allocated by previous RM Key: YARN-562 URL: https://issues.apache.org/jira/browse/YARN-562 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He Attachments: YARN-562.1.patch It's possible that after RM shutdown, before the AM goes down, the AM still calls startContainer on the NM with containers allocated by the previous RM. When the RM comes back, the NM doesn't know whether this container launch request comes from the previous RM or the current RM. We should reject containers allocated by the previous RM. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-561) Nodemanager should set some key information into the environment of every container that it launches.
[ https://issues.apache.org/jira/browse/YARN-561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630681#comment-13630681 ] Hitesh Shah commented on YARN-561: -- Take a look at ContainerLaunch#sanitizeEnv() and how it handles non-modifiable environment variables. The above mentioned env variables should also fall into the non-modifiable category. Nodemanager should set some key information into the environment of every container that it launches. - Key: YARN-561 URL: https://issues.apache.org/jira/browse/YARN-561 Project: Hadoop YARN Issue Type: Sub-task Reporter: Hitesh Shah Assignee: Xuan Gong Labels: usability Information such as containerId, nodemanager hostname, nodemanager port is not set in the environment when any container is launched. For an AM, the RM does all of this for it but for a container launched by an application, all of the above need to be set by the ApplicationMaster. At the minimum, container id would be a useful piece of information. If the container wishes to talk to its local NM, the nodemanager related information would also come in handy. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
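In the spirit of sanitizeEnv(), the whitelisting could amount to applying the NM-owned values after the application-supplied CLC environment, so they can never be overridden; a sketch with hypothetical variable names:

{code:java}
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of NM-enforced environment variables; the names are hypothetical.
public class EnvWhitelistSketch {
  static final List<String> NM_ONLY_VARS =
      Arrays.asList("CONTAINER_ID", "NM_HOST", "NM_PORT");

  static Map<String, String> buildContainerEnv(Map<String, String> clcEnv,
      Map<String, String> nmValues) {
    Map<String, String> env = new HashMap<String, String>(clcEnv);
    for (String var : NM_ONLY_VARS) {
      // Applied last: silently overrides anything the application set.
      env.put(var, nmValues.get(var));
    }
    return env;
  }
}
{code}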
[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630734#comment-13630734 ] Sandy Ryza commented on YARN-45: Carlo, I'm glad that this is being proposed. Have you considered including how long the grace period is in the response? Scheduler feedback to AM to release containers -- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Attachments: YARN-45.patch, YARN-45.patch The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed- or reserved- to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-547) New resource localization is tried even when Localized Resource is in DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630741#comment-13630741 ] Omkar Vinit Joshi commented on YARN-547: Canceling the patch: it fixed the existing problems but removed parallelization (based on the number of containers, not resources). Making sure this parallelization still exists: * Removing invalid transitions for INIT and LOCALIZED, while not modifying the DOWNLOADING state transition. * Making sure that now in PublicLocalizer as well we acquire the lock before downloading. This will fix the broken signaling. Multiple containers will still try to download, but the download will start/be enqueued only if ** we can acquire the lock on the LocalizedResource, and ** the LocalizedResource is still in the DOWNLOADING state. New resource localization is tried even when Localized Resource is in DOWNLOADING state --- Key: YARN-547 URL: https://issues.apache.org/jira/browse/YARN-547 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: yarn-547-20130411.1.patch, yarn-547-20130411.patch At present when multiple containers try to request a localized resource 1) If the resource is not present then first it is created and Resource Localization starts ( LocalizedResource is in DOWNLOADING state) 2) Now if in this state multiple ResourceRequestEvents come in then ResourceLocalizationEvents are fired for all of them. Most of the time this does not result in a duplicate resource download, but there is a race condition present. Location : ResourceLocalizationService.addResource .. addition of the request into attempts in case an event already exists. The root cause for this is the presence of FetchResourceTransition on receiving ResourceRequestEvent in DOWNLOADING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
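The acquire-then-check guard described above, as a simplified stand-in for LocalizedResource (the real class has its own locking; this only shows the ordering of the two checks):

{code:java}
import java.util.concurrent.Semaphore;

// Simplified stand-in for LocalizedResource, showing only the guard.
public class LocalizedResourceStub {
  enum State { INIT, DOWNLOADING, LOCALIZED, FAILED }

  private final Semaphore lock = new Semaphore(1);
  private volatile State state = State.DOWNLOADING;

  /** Returns true only if the caller should actually start the download. */
  boolean tryAcquireForDownload() {
    if (!lock.tryAcquire()) {
      return false; // another localizer already owns the download
    }
    if (state != State.DOWNLOADING) {
      lock.release(); // raced with completion/failure: back off
      return false;
    }
    return true;
  }

  void finishDownload(State terminal) {
    state = terminal;
    lock.release();
  }
}
{code}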
[jira] [Commented] (YARN-503) DelegationTokens will be renewed forever if multiple jobs share tokens and the first one sets JOB_CANCEL_DELEGATION_TOKEN to false
[ https://issues.apache.org/jira/browse/YARN-503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630749#comment-13630749 ] Siddharth Seth commented on YARN-503: - bq. I don't like sleeps either. 1s is an eternity in this case because the initial renew and cancel timer tasks fire immediately on mocked objects, so it should run in a few ms. I assume you are suggesting using notify in a mock'ed answer method? Multiple timers are expected to fire in some cases, so it would probably require something like a CountDownLatch, which will get tricky to keep swapping in a new one by re-adding mocked responses with the new latch. Let me know if you feel it's worth it to change it. Was actually suggesting doing the post-sleep verify in a check-sleep loop, instead of just sleeping. Passing this step indicates the required execution has completed. Would prefer keeping a sleep out of the tests if we can. Otherwise a longer sleep for sure. bq. It's ok if another task is executing, because it's just trying to abort any pending task. Since there's only one possible pending task per token at any given time, the wrong task can't be cancelled. Did I miss an edge case? I think there's an edge case. Sequence 1. [t1] timerTask is a RenewalTask 2. [t1] timer kicks in and starts executing 3. [t2] scheduleCancelled gets called in a parallel thread [via AppRemovalTask] 4. [t2] scheduleCancelled.abortScheduled called - synchronized but does nothing useful since the current task is already running. 5. [t2] scheduleCancelled runs to completion and creates a cancelTask 6. [t1] completes execution - and calls scheduleTask(new TokenRenewTask(), renewIn) - which effectively destroys the scheduled cancelTask bq. Are you suggesting to move managedToken.add(appId) into the loop in addApplication? I was trying to encapsulate the implementation details of adding/removing the appId within ManagedApp. Is it ok to leave it as-is? I thought it'd be cleaner leaving this outside of ManagedApp - ManagedApp should not be managing ManagedTokens. IAC, don't feel strongly about this; whatever you decide. DelegationTokens will be renewed forever if multiple jobs share tokens and the first one sets JOB_CANCEL_DELEGATION_TOKEN to false -- Key: YARN-503 URL: https://issues.apache.org/jira/browse/YARN-503 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 0.23.3, 3.0.0, 2.0.0-alpha Reporter: Siddharth Seth Assignee: Daryn Sharp Attachments: YARN-503.patch, YARN-503.patch The first Job/App to register a token is the one which DelegationTokenRenewer associates with a specific Token. An attempt to remove/cancel these shared tokens by subsequent jobs doesn't work - since the JobId will not match. As a result, even if subsequent jobs have MRJobConfig.JOB_CANCEL_DELEGATION_TOKEN set to true - tokens will not be cancelled when those jobs complete. Tokens will eventually be removed from the RM / JT when the service that issued them considers them to have expired or via an explicit cancelDelegationTokens call (not implemented yet in 23). A side effect of this is that the same delegation token will end up being renewed multiple times (a separate TimerTask for each job which uses the token). DelegationTokenRenewer could maintain a reference count/list of jobIds for shared tokens. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
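The check-sleep loop suggested here is essentially a bounded poll on the condition under verification; a small sketch (the helper name and poll interval are illustrative):

{code:java}
// Poll the condition being verified instead of relying on one fixed sleep.
public class TestWaitUtil {
  interface Condition { boolean met(); }

  static void waitFor(Condition condition, long timeoutMs)
      throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (!condition.met()) {
      if (System.currentTimeMillis() >= deadline) {
        throw new AssertionError("condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(10); // short poll; total wait is bounded by timeoutMs
    }
  }
}
{code}

In the fast path the mocked timers fire immediately, so the loop exits after a few milliseconds; only a genuinely broken run pays the full timeout.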
[jira] [Updated] (YARN-547) New resource localization is tried even when Localized Resource is in DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Vinit Joshi updated YARN-547: --- Attachment: yarn-547-20130412.patch New resource localization is tried even when Localized Resource is in DOWNLOADING state --- Key: YARN-547 URL: https://issues.apache.org/jira/browse/YARN-547 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: yarn-547-20130411.1.patch, yarn-547-20130411.patch, yarn-547-20130412.patch At present when multiple containers try to request a localized resource 1) If the resource is not present then first it is created and Resource Localization starts ( LocalizedResource is in DOWNLOADING state) 2) Now if in this state multiple ResourceRequestEvents come in then ResourceLocalizationEvents are fired for all of them. Most of the time this does not result in a duplicate resource download, but there is a race condition present. Location : ResourceLocalizationService.addResource .. addition of the request into attempts in case an event already exists. The root cause for this is the presence of FetchResourceTransition on receiving ResourceRequestEvent in DOWNLOADING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-476) ProcfsBasedProcessTree info message confuses users
[ https://issues.apache.org/jira/browse/YARN-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-476: Attachment: YARN-476.patch ProcfsBasedProcessTree info message confuses users -- Key: YARN-476 URL: https://issues.apache.org/jira/browse/YARN-476 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.6 Reporter: Jason Lowe Attachments: YARN-476.patch ProcfsBasedProcessTree has a habit of emitting not-so-helpful messages such as the following: {noformat} 2013-03-13 12:41:51,957 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28747 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28978 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28979 may have finished in the interim. {noformat} As described in MAPREDUCE-4570, this is something that naturally occurs in the process of monitoring processes via procfs. It's uninteresting at best and can confuse users who think it's a reason their job isn't running as expected when it appears in their logs. We should either make this DEBUG or remove it entirely. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-476) ProcfsBasedProcessTree info message confuses users
[ https://issues.apache.org/jira/browse/YARN-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630759#comment-13630759 ] Sandy Ryza commented on YARN-476: - Attached patch that removes the log statement entirely. ProcfsBasedProcessTree info message confuses users -- Key: YARN-476 URL: https://issues.apache.org/jira/browse/YARN-476 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.6 Reporter: Jason Lowe Attachments: YARN-476.patch ProcfsBasedProcessTree has a habit of emitting not-so-helpful messages such as the following: {noformat} 2013-03-13 12:41:51,957 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28747 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28978 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28979 may have finished in the interim. {noformat} As described in MAPREDUCE-4570, this is something that naturally occurs in the process of monitoring processes via procfs. It's uninteresting at best and can confuse users who think it's a reason their job isn't running as expected when it appears in their logs. We should either make this DEBUG or remove it entirely. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (YARN-476) ProcfsBasedProcessTree info message confuses users
[ https://issues.apache.org/jira/browse/YARN-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated YARN-476: Priority: Minor (was: Major) ProcfsBasedProcessTree info message confuses users -- Key: YARN-476 URL: https://issues.apache.org/jira/browse/YARN-476 Project: Hadoop YARN Issue Type: Bug Affects Versions: 0.23.6 Reporter: Jason Lowe Priority: Minor Attachments: YARN-476.patch ProcfsBasedProcessTree has a habit of emitting not-so-helpful messages such as the following: {noformat} 2013-03-13 12:41:51,957 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28747 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28978 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28979 may have finished in the interim. {noformat} As described in MAPREDUCE-4570, this is something that naturally occurs in the process of monitoring processes via procfs. It's uninteresting at best and can confuse users who think it's a reason their job isn't running as expected when it appears in their logs. We should either make this DEBUG or remove it entirely. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-547) New resource localization is tried even when Localized Resource is in DOWNLOADING state
[ https://issues.apache.org/jira/browse/YARN-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630766#comment-13630766 ] Hadoop QA commented on YARN-547: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578544/yarn-547-20130412.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestLocalResourcesTrackerImpl org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestLocalizedResource {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/731//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/731//console This message is automatically generated. New resource localization is tried even when Localized Resource is in DOWNLOADING state --- Key: YARN-547 URL: https://issues.apache.org/jira/browse/YARN-547 Project: Hadoop YARN Issue Type: Sub-task Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi Attachments: yarn-547-20130411.1.patch, yarn-547-20130411.patch, yarn-547-20130412.patch At present when multiple containers try to request a localized resource 1) If the resource is not present then first it is created and Resource Localization starts ( LocalizedResource is in DOWNLOADING state) 2) Now if in this state multiple ResourceRequestEvents come in then ResourceLocalizationEvents are fired for all of them. Most of the time this does not result in a duplicate resource download, but there is a race condition present. Location : ResourceLocalizationService.addResource .. addition of the request into attempts in case an event already exists. The root cause for this is the presence of FetchResourceTransition on receiving ResourceRequestEvent in DOWNLOADING state. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630769#comment-13630769 ] Bikas Saha commented on YARN-45: I like the idea of the RM giving information to the AM about actions that it might take which will affect the AM. However, I am wary of having the action taken in different places. E.g. the KILL to the containers should come from the RM or the AM exclusively, but not from both. Otherwise we open ourselves up to race conditions, unnecessary kills and complex logic in the RM. Preemption is something that, IMO, the RM needs to do at the very last moment, when there is no other alternative for resources to be freed up. If we decide to preempt at time T1 and then actually preempt at time T2, then the cluster conditions may have changed between T1 and T2, which may invalidate the decisions taken at T1. New resources may have freed up that reduce the number of containers to be killed. This sub-optimality is directly proportional to the length of time between T1 and T2. So ideally we want to keep T1=T2. One can argue that things can change after the preemption which may have made the preemption unnecessary, so the above argument of T1=T2 is fallacious. However, preemption policies are usually based on deadlines, such as: the allocation of queue1 must be met within X seconds. So the RM does not have the luxury of waiting for X+1 seconds. The best it can do is to wait up to X seconds in the hope that things will work out, and at X redistribute resources to meet the deficit. At the same time, I can see that there is an argument that the AM knows best how to free up its resources. It will be good to remember that the AM has already informed the RM about the importance of all its containers when it made the requests at different priorities. So the RM knows the order of importance of the containers, and the RM also knows the amount of time each container has been allocated. Assuming container runtime as a proxy for container work done, this data can be used by the RM to preempt in a work-preserving manner without having to talk to the AM. Notifying the AM has the usefulness of allowing the AM to take actions that preserve work, such as checkpointing. However, IMO, the AM should only do checkpointing operations but not kill the containers. That should still happen at the RM, as the very last option at the last moment. If the situation changes in the grace period and the containers do not need to be killed, then there is no point in the AM killing them right now. This also lets us increase the grace period to a longer time, because checkpointing and preserving work usually means persisting data in a stable store and may be slow in practical scenarios. To summarize, I would propose an API in which the RM tells the AM about exactly which containers it might imminently preempt, with the contract being that the AM could take actions to preserve the work done in those containers. The AM can continue to run those containers until the RM actually preempts them, if needed. If we really think that the choice of containers needs to be made at the AM, then the AM needs to checkpoint those containers and inform the RM about the containers it has chosen. But the final decision to send the kill must remain with the RM. 
Scheduler feedback to AM to release containers -- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Attachments: YARN-45.patch, YARN-45.patch The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed- or reserved- to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
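Purely as a thought experiment on the contract proposed above (none of these types exist in the API yet): the RM names the containers in jeopardy, and the AM checkpoints them without killing anything.

{code:java}
import java.util.Set;

// Hypothetical sketch of the proposed contract; no such API exists yet.
public class PreemptionWarningHandler {
  interface Checkpointer { void checkpoint(String containerId); }

  private final Checkpointer checkpointer;

  PreemptionWarningHandler(Checkpointer checkpointer) {
    this.checkpointer = checkpointer;
  }

  /** Invoked when an allocate() response carries a preemption warning. */
  void onPreemptionWarning(Set<String> containersInJeopardy) {
    for (String id : containersInJeopardy) {
      // Preserve work now; the kill, if still needed after the grace
      // period, is issued by the RM, never by the AM.
      checkpointer.checkpoint(id);
    }
  }
}
{code}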
[jira] [Commented] (YARN-476) ProcfsBasedProcessTree info message confuses users
[ https://issues.apache.org/jira/browse/YARN-476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630771#comment-13630771 ] Hadoop QA commented on YARN-476: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12578547/YARN-476.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:red}-1 tests included{color}. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/732//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/732//console This message is automatically generated. ProcfsBasedProcessTree info message confuses users -- Key: YARN-476 URL: https://issues.apache.org/jira/browse/YARN-476 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.0.3-alpha, 0.23.6 Reporter: Jason Lowe Assignee: Sandy Ryza Priority: Minor Attachments: YARN-476.patch ProcfsBasedProcessTree has a habit of emitting not-so-helpful messages such as the following: {noformat} 2013-03-13 12:41:51,957 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28747 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28978 may have finished in the interim. 2013-03-13 12:41:51,958 INFO [communication thread] org.apache.hadoop.yarn.util.ProcfsBasedProcessTree: The process 28979 may have finished in the interim. {noformat} As described in MAPREDUCE-4570, this is something that naturally occurs in the process of monitoring processes via procfs. It's uninteresting at best and can confuse users who think it's a reason their job isn't running as expected when it appears in their logs. We should either make this DEBUG or remove it entirely. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-573) Shared data structures in Public Localizer and Private Localizer are not Thread safe.
Omkar Vinit Joshi created YARN-573: -- Summary: Shared data structures in Public Localizer and Private Localizer are not Thread safe. Key: YARN-573 URL: https://issues.apache.org/jira/browse/YARN-573 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi PublicLocalizer 1) pending accessed by addResource (part of event handling) and run method (as a part of PublicLocalizer.run() ). PrivateLocalizer 1) pending accessed by addResource (part of event handling) and findNextResource (i.remove()). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
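One possible direction for the fix, sketched with simplified stand-in types: back the shared "pending" structure with a concurrent collection, so the event-dispatcher thread (addResource) and the localizer's own thread can touch it safely.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch with stand-in generic types; not the actual localizer classes.
public class PendingResources<K, V> {
  private final Map<K, V> pending = new ConcurrentHashMap<K, V>();

  /** Called from the event-dispatcher thread (addResource). */
  void add(K key, V request) {
    pending.put(key, request);
  }

  /** Called from the localizer's own thread (run/findNextResource). */
  V take(K key) {
    return pending.remove(key); // atomic check-and-remove, no explicit lock
  }
}
{code}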
[jira] [Created] (YARN-574) PrivateLocalizer does not support parallel resource download via ContainerLocalizer
Omkar Vinit Joshi created YARN-574: -- Summary: PrivateLocalizer does not support parallel resource download via ContainerLocalizer Key: YARN-574 URL: https://issues.apache.org/jira/browse/YARN-574 Project: Hadoop YARN Issue Type: Bug Reporter: Omkar Vinit Joshi Assignee: Omkar Vinit Joshi At present private resources will be downloaded in parallel only if multiple containers request the same resource; otherwise downloads are serial. The protocol between PrivateLocalizer and ContainerLocalizer supports multiple downloads; however, it is not used, and only one resource is sent for downloading at a time. I think we can increase / assure parallelism (even for a single container requesting resources) for private/application resources by making multiple downloads per ContainerLocalizer. Total parallelism before = number of threads allotted for PublicLocalizer [public resources] + number of containers [private and application resources]. Total parallelism after = number of threads allotted for PublicLocalizer [public resources] + number of containers * max downloads per container [private and application resources]. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
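A sketch of what multiple downloads per ContainerLocalizer could look like, with the pool size standing in for the assumed "max downloads per container" tunable:

{code:java}
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative only: the real code would stream results back over the
// localizer protocol rather than block on invokeAll().
public class ParallelDownloadSketch {
  static List<Future<String>> downloadAll(List<Callable<String>> downloads,
      int maxDownloadsPerContainer) throws InterruptedException {
    ExecutorService pool =
        Executors.newFixedThreadPool(maxDownloadsPerContainer);
    try {
      return pool.invokeAll(downloads); // runs up to N downloads at once
    } finally {
      pool.shutdown();
    }
  }
}
{code}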
[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630893#comment-13630893 ] Chris Douglas commented on YARN-45: --- [~sandyr]: Yes, but the correct format/semantics for time are a complex discussion in themselves. To keep this easy to review and the discussion focused, we were going to file that separately. But I totally agree: for the AM to respond intelligently, the time before it's forced to give up the container is valuable input. [~bikash]: Agree almost completely. In YARN-569, the hysteresis you cite motivated several design points, including multiple dampers on actions taken by the preemption policy, out-of-band observation/enforcement, and no effort to fine-tune particular allocations. The role of preemption (to summarize what [~curino] discussed in detail in the prenominate JIRA) is to make coarse corrections around the core scheduler invariants (e.g., capacity, fairness). Rather than introducing new races or complexity, one could argue that preemption is a dual of allocation in an inconsistent environment. Your proposal matches case (1) in the above [comment|https://issues.apache.org/jira/browse/YARN-45?focusedCommentId=13628950page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13628950], where the RM specifies the set of containers in jeopardy and a contract (as {{ResourceRequest}}) for avoiding the kills, should the AM have cause to pick different containers. Further, your observation that the RM has enough information in priorities, etc. to make an educated guess at those containers is spot-on. IIRC, the policy uses allocation order when selecting containers, but that should be a secondary key after priority. The disputed point, and I'm not sure we actually disagree, is the claim that the AM should never kill things in response to this message. To be fair, that can be implemented by just ignoring the requests, so it's orthogonal to this particular protocol, but it's certainly an important best practice to discuss to ensure we're capturing the right thing. Certainly there are many cases where ignoring the message is correct; most CDFs of map task execution time show that over 80% finish in less than a minute, so the AM has few reasons to pessimistically kill them. There are a few scenarios where this isn't optimal. Take the case of YARN-415, where the AM is billed cumulatively for cluster time. Assume an AM knows (a) the container will not finish (reinforcing [~sandyr]'s point about including time in the preemption message) and (b) the work done is not worth checkpointing. It can conclude that killing the container is in its best interest, because squatting on the resource could affect its ability to get containers in the future (or simply cost more). Moreover, for long-lived services and speculative container allocation/retention, the AM may actually be holding the container only as an optimization or for a future execution, so it could release it at low cost to itself. Finally, the time allowed before the RM starts killing containers can be extended if AMs typically return resources before the deadline. It's also a mechanism for the RM to advise the AM about constraints that prevent it from granting its pending requests. The AM currently kills reducers if it can't get containers to regenerate lost map output. If the scheduler values some containers more than others, the AM's response to starvation can be improved from random killing. 
This is a case where the current implementation acknowledges the fact that it already runs in an inconsistent environment. Scheduler feedback to AM to release containers -- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Attachments: YARN-45.patch, YARN-45.patch The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed- or reserved- to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on
[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630898#comment-13630898 ] Carlo Curino commented on YARN-45: --

As you pointed out, any decision made in the RM needs to deal with an inconsistent and evolving view of the world, and the preemption actions suffer from an inherent and significant lag. In designing policies around this, one must embrace such chaos, operate conservatively, and try to affect only macroscopic properties (hence the many built-in dampers Chris mentioned). As for what to do with the preemption requests, I think our current implementation for the MapReduce AM/Task is quite aligned with your comments. Here's what we do:

1) Maps are typically short-lived, so it is often worth ignoring the preemption request and trying to make a run for it, as checkpointing and completion times are likely to be comparable, and re-execution costs are low.

2) For reducers, since their state is valuable and their runtimes are often longer, the AM asks the task to checkpoint. In our current implementation, once the state of the reducer has been saved to a checkpoint we exit, as continuing execution is non-trivial (in particular, managing the partial output of reducers). I can envision a future version that tries to continue running after having taken a checkpoint. Note that this (the task exiting) does not introduce any new race condition or complexity in either the RM or the AM, as both already handle failing/killed tasks, and the AM even has logic to kill its own reducers to free up space for maps. More importantly, this setup (in which containers exit as soon as they are done checkpointing) allows us to set rather generous wait-before-kill parameters, since the containers will be reclaimed as soon as the task is done checkpointing anyway. The alternative would have the RM pick a static policy for waiting, which risks being either too long (delaying the rebalancing too much) or too short (interrupting containers while they finish checkpointing, thus wasting work). I expect that no static solution would fare well for a broad range of AMs and job sizes.

3) When the preemption takes the form of a ResourceRequest, we pick reducers over maps (as having reducers running when the maps are killed would simply lead to wasted slot time). Looking forward to YARN's future, this is a key feature: other applications might have evolving priorities for containers which are not exposed to the RM, so we can't rely on the RM to guess which container is best to preempt, and delegating the choice to the AM could be invaluable.
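A rough sketch of this three-point policy follows; {{MrTaskAttempt}} and its methods are hypothetical stand-ins, not the actual MapReduce AM code:

{code:java}
import java.util.Comparator;
import java.util.List;

// Hypothetical handle on a running map or reduce attempt.
interface MrTaskAttempt {
  boolean isMap();
  void checkpointState(); // point 2: save valuable reducer state
  void exitCleanly();     // free the container before the kill deadline
}

class CheckpointingPreemptionPolicy {
  // Point 3: when the RM lets the AM choose (ResourceRequest form),
  // prefer reducers over maps, since reducers running over killed maps
  // would simply waste slot time.
  List<MrTaskAttempt> pickVictims(List<MrTaskAttempt> running, int needed) {
    running.sort(Comparator.comparing(MrTaskAttempt::isMap)); // reducers first
    return running.subList(0, Math.min(needed, running.size()));
  }

  void preempt(MrTaskAttempt t) {
    if (t.isMap()) {
      // Point 1: maps are short-lived; ignore the request and run for it.
      return;
    }
    // Point 2: checkpoint, then exit so the RM reclaims the container
    // well before any generous wait-before-kill deadline expires.
    t.checkpointState();
    t.exitCleanly();
  }
}
{code}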
Scheduler feedback to AM to release containers -- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Attachments: YARN-45.patch, YARN-45.patch The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed, or reserved, to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (YARN-575) ContainerManager APIs should be user accessible
Siddharth Seth created YARN-575: --- Summary: ContainerManager APIs should be user accessible Key: YARN-575 URL: https://issues.apache.org/jira/browse/YARN-575 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.0.4-alpha Reporter: Siddharth Seth Priority: Critical Auth for ContainerManager is based on the containerId being accessed, since this is what is used to launch containers (there's likely another JIRA somewhere to change this to not be containerId-based). What this also means is that the API is effectively not usable with Kerberos credentials. Also, it should be possible to use this API with some generic tokens (RMDelegation?), instead of with container-specific tokens. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
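For illustration, containerId-based authorization amounts to something like the simplified sketch below (hypothetical names, not the actual NM code): the authenticated remote identity must literally be the container id carried in the container token, which no user's Kerberos principal can ever match.

{code:java}
// Simplified illustration of containerId-based auth; not the real
// ContainerManager implementation.
class ContainerAuthSketch {
  void authorizeRequest(String remoteIdentity, String containerIdInRequest) {
    // A user principal such as alice@EXAMPLE.COM never equals a
    // container id string, so direct user calls are rejected.
    if (!remoteIdentity.equals(containerIdInRequest)) {
      throw new SecurityException(
          "Unauthorized request for container " + containerIdInRequest);
    }
  }
}
{code}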
[jira] [Commented] (YARN-45) Scheduler feedback to AM to release containers
[ https://issues.apache.org/jira/browse/YARN-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630925#comment-13630925 ] Alejandro Abdelnur commented on YARN-45: Comments on the patch:
* Reusing ResourceRequest means we have a bunch of properties that are not applicable to the preempt message. Wouldn't it be enough to just return the ContainerIds and a flag indicating whether the set is strict? The AM can reconstruct all the resource information if it needs to.
* Do we need the get*Count() methods? You can get the size from the set itself, or am I missing something?
Scheduler feedback to AM to release containers -- Key: YARN-45 URL: https://issues.apache.org/jira/browse/YARN-45 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Chris Douglas Assignee: Carlo Curino Attachments: YARN-45.patch, YARN-45.patch The ResourceManager strikes a balance between cluster utilization and strict enforcement of resource invariants in the cluster. Individual allocations of containers must be reclaimed, or reserved, to restore the global invariants when cluster load shifts. In some cases, the ApplicationMaster can respond to fluctuations in resource availability without losing the work already completed by that task (MAPREDUCE-4584). Supplying it with this information would be helpful for overall cluster utilization [1]. To this end, we want to establish a protocol for the RM to ask the AM to release containers. [1] http://research.yahoo.com/files/yl-2012-003.pdf -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
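For concreteness, the leaner message shape suggested in the comment above might look like the sketch below (illustrative names only, not a committed API). One trade-off: in the non-strict case, dropping the {{ResourceRequest}} arguably loses the contract that tells the AM what an acceptable substitute container looks like.

{code:java}
import java.util.Set;

// Hypothetical minimal preempt message: just ids plus a strictness flag.
interface PreemptMessageSketch {
  Set<String> containerIds(); // containers the RM wants back
  boolean strict();           // true: exactly these; false: AM may substitute
}
{code}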
[jira] [Commented] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13630939#comment-13630939 ] nemon lou commented on YARN-276: [~tgraves] Here are my initial thoughts on checking the cluster-level AM resource percent in each leaf queue: a leaf queue's capacity is computed based on absoluteMaxCapacity. Considering we have 10 leaf queues, each configured with 100% absoluteMaxCapacity and 10% maxAMResourcePerQueuePercent, there is still a chance that all of the leaf queues' resources are taken up by AMs before any queue reaches its 10% maxAMResourcePerQueuePercent limit. Note that the cluster-level AM resource percent only applies in a leaf queue if no AM resource percent is configured for that leaf queue. As Thomas Graves mentioned, cluster-level checking would cause one queue to restrict another, so I will remove the cluster-level checking.
Capacity Scheduler can hang when submit many jobs concurrently -- Key: YARN-276 URL: https://issues.apache.org/jira/browse/YARN-276 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0, 2.0.1-alpha Reporter: nemon lou Assignee: nemon lou Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch Original Estimate: 24h Remaining Estimate: 24h In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity Scheduler can hang with most resources taken up by AMs, leaving not enough resources for tasks; all applications then hang. The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not checked directly. Instead, this property is only used for maxActiveApplications, and maxActiveApplications is computed from minimumAllocation (not from what the AMs actually use). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
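A worked example of the over-commitment described in the comment above, with made-up numbers (a hypothetical 100 GB cluster; this is arithmetic for illustration, not scheduler code):

{code:java}
public class AmHeadroomExample {
  public static void main(String[] args) {
    double clusterMb = 100_000;       // hypothetical 100 GB cluster
    int leafQueues = 10;
    double absoluteMaxCapacity = 1.0; // each queue may reach 100% of cluster
    double maxAmPercent = 0.10;       // maxAMResourcePerQueuePercent

    // The per-queue AM limit is computed against absoluteMaxCapacity,
    // so each queue may devote 10% of the whole cluster to AMs:
    double perQueueAmLimitMb = clusterMb * absoluteMaxCapacity * maxAmPercent;

    // Ten such queues can hand the entire cluster to AMs before any one
    // of them reaches its own 10% limit, leaving nothing for tasks:
    System.out.printf("per queue: %.0f MB; all queues: %.0f MB of %.0f MB%n",
        perQueueAmLimitMb, leafQueues * perQueueAmLimitMb, clusterMb);
  }
}
{code}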
[jira] [Updated] (YARN-276) Capacity Scheduler can hang when submit many jobs concurrently
[ https://issues.apache.org/jira/browse/YARN-276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nemon lou updated YARN-276: --- Attachment: YARN-276.patch Uploading an interim patch. Capacity Scheduler can hang when submit many jobs concurrently -- Key: YARN-276 URL: https://issues.apache.org/jira/browse/YARN-276 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Affects Versions: 3.0.0, 2.0.1-alpha Reporter: nemon lou Assignee: nemon lou Attachments: YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch, YARN-276.patch Original Estimate: 24h Remaining Estimate: 24h In Hadoop 2.0.1, when I submit many jobs concurrently, the Capacity Scheduler can hang with most resources taken up by AMs, leaving not enough resources for tasks; all applications then hang. The cause is that yarn.scheduler.capacity.maximum-am-resource-percent is not checked directly. Instead, this property is only used for maxActiveApplications, and maxActiveApplications is computed from minimumAllocation (not from what the AMs actually use). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira