[jira] [Commented] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138622#comment-14138622 ] Tsuyoshi OZAWA commented on YARN-2312: -- Basically we can replace {{getId()}} with {{getContainerId()}} straightforwardly. [~jianhe], [~vinodkv], how can we deal with the following exceptions? 1. {{WrappedJvmID}}. In TaskAttemptImpl, the id is created by using {{container.getId}}. If we change a constructor of {{WrappedJvmID}}, we also need to change a constructor of the {{ID}} class itself. One idea is to add an {{id}} field for the upper 32 bits of the container id to the {{ID}} class.
{code}
taskAttempt.jvmID = new WrappedJvmID(taskAttempt.remoteTask.getTaskID().getJobID(),
    taskAttempt.remoteTask.isMapTask(),
    taskAttempt.container.getId().getId());
{code}
2. {{Priority}}. Can we change the definition of the Proto? It's used widely and one concern is backward compatibility. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA After YARN-2229, {{ContainerId#getId}} only returns a partial value of the container id: the sequence number without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
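As a hedged aside, the epoch/sequence split behind this deprecation can be illustrated with a small, self-contained Java sketch. The 40-bit split and all names below are assumptions for illustration only, not the actual ContainerId implementation or the attached patch.
{code}
// Illustration only: assumes the 64-bit value from getContainerId() carries the
// RM epoch in its upper bits and the per-application sequence number in the
// lower bits, so the int-valued getId() can only return a truncated view.
public class ContainerIdLayoutSketch {
  private static final long SEQUENCE_MASK = 0xffffffffffL; // lower 40 bits (assumed split)

  public static void main(String[] args) {
    long epoch = 3;
    long sequence = 17;
    long fullContainerId = (epoch << 40) | sequence;     // what getContainerId() would expose
    long sequenceOnly = fullContainerId & SEQUENCE_MASK; // roughly what the deprecated getId() keeps
    System.out.println("full=" + fullContainerId + ", sequence-only=" + sequenceOnly);
  }
}
{code}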
[jira] [Resolved] (YARN-2474) document the wsce-site.xml keys in hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm
[ https://issues.apache.org/jira/browse/YARN-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2474. Resolution: Implemented Included in .6.patch of YARN-2198 document the wsce-site.xml keys in hadoop-yarn-site/src/site/apt/SecureContainer.apt.vm --- Key: YARN-2474 URL: https://issues.apache.org/jira/browse/YARN-2474 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Priority: Critical Labels: security, windows Attachments: YARN-2474.1.patch document the keys used to configure WSCE -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2198) Remove the need to run NodeManager as privileged account for Windows Secure Container Executor
[ https://issues.apache.org/jira/browse/YARN-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138633#comment-14138633 ] Hadoop QA commented on YARN-2198: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669663/YARN-2198.trunk.6.patch against trunk revision ee21b13. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5023//console This message is automatically generated. Remove the need to run NodeManager as privileged account for Windows Secure Container Executor -- Key: YARN-2198 URL: https://issues.apache.org/jira/browse/YARN-2198 Project: Hadoop YARN Issue Type: Improvement Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows Attachments: YARN-2198.1.patch, YARN-2198.2.patch, YARN-2198.3.patch, YARN-2198.delta.4.patch, YARN-2198.delta.5.patch, YARN-2198.delta.6.patch, YARN-2198.separation.patch, YARN-2198.trunk.4.patch, YARN-2198.trunk.5.patch, YARN-2198.trunk.6.patch YARN-1972 introduces a Secure Windows Container Executor. However, this executor requires the process launching the container to be LocalSystem or a member of the local Administrators group. Since the process in question is the NodeManager, the requirement translates to the entire NM running as a privileged account, a very large surface area to review and protect. This proposal is to move the privileged operations into a dedicated NT service. The NM can run as a low privilege account and communicate with the privileged NT service when it needs to launch a container. This would reduce the surface exposed to high privileges. There has to exist a secure, authenticated and authorized channel of communication between the NM and the privileged NT service. Possible alternatives are a new TCP endpoint, Java RPC etc. My proposal though would be to use Windows LPC (Local Procedure Calls), which is a Windows platform-specific inter-process communication channel that satisfies all requirements and is easy to deploy. The privileged NT service would register and listen on an LPC port (NtCreatePort, NtListenPort). The NM would use JNI to interop with libwinutils which would host the LPC client code. The client would connect to the LPC port (NtConnectPort) and send a message requesting a container launch (NtRequestWaitReplyPort). LPC provides authentication and the privileged NT service can use authorization API (AuthZ) to validate the caller. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
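The JNI bridge described in the proposal (NM-side Java calling into libwinutils, which hosts the LPC client) could take roughly the following shape. This is a hypothetical sketch only: the class name, method signature, and library name are placeholders, not part of any attached patch.
{code}
// Hypothetical shape of the NM-side bridge; none of these names come from the patch.
public class ElevatedOperationsClient {
  static {
    System.loadLibrary("winutils"); // assumed native library hosting the LPC client code
  }

  /** Asks the privileged NT service, over LPC, to launch a container as the given user. */
  public native int launchContainerAsUser(String user, String commandLine, String workDir);
}
{code}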
[jira] [Updated] (YARN-2312) Marking ContainerId#getId as deprecated
[ https://issues.apache.org/jira/browse/YARN-2312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsuyoshi OZAWA updated YARN-2312: - Attachment: YARN-2312-wip.patch Attached a WIP patch as a reference for the changes. Marking ContainerId#getId as deprecated --- Key: YARN-2312 URL: https://issues.apache.org/jira/browse/YARN-2312 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: YARN-2312-wip.patch After YARN-2229, {{ContainerId#getId}} only returns a partial value of the container id: the sequence number without the epoch. We should mark {{ContainerId#getId}} as deprecated and use {{ContainerId#getContainerId}} instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.6.patch Attached ver.6 consolidated patch against latest trunk -- YARN-796.node-label.consolidate.6.patch Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2498: - Attachment: YARN-2498.patch [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers to be preempted. # Notify the scheduler about the to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access the labels of that resource. For #2, when deciding whether to preempt a container, we need to make sure the resource held by this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
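A rough sketch of the stage #1 label constraint described above: a node's available resource contributes to a queue's {{ideal_assigned}} only when the queue can access that node's label. The names and the per-node framing are placeholders, not the policy's actual code.
{code}
// Placeholder names; illustrates only the label gate on ideal_assigned.
import java.util.Set;

class LabelAwareIdealAssignedSketch {
  /** Resource a single node may contribute to a queue's ideal_assigned. */
  static long contribution(long nodeAvailable, String nodeLabel, Set<String> queueAccessibleLabels) {
    return queueAccessibleLabels.contains(nodeLabel) ? nodeAvailable : 0L;
  }
}
{code}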
[jira] [Commented] (YARN-2498) [YARN-796] Respect labels in preemption policy of capacity scheduler
[ https://issues.apache.org/jira/browse/YARN-2498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138712#comment-14138712 ] Hadoop QA commented on YARN-2498: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669680/YARN-2498.patch against trunk revision ee21b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5024//console This message is automatically generated. [YARN-796] Respect labels in preemption policy of capacity scheduler Key: YARN-2498 URL: https://issues.apache.org/jira/browse/YARN-2498 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2498.patch, YARN-2498.patch, YARN-2498.patch, yarn-2498-implementation-notes.pdf There are 3 stages in ProportionalCapacityPreemptionPolicy: # Recursively calculate {{ideal_assigned}} for each queue. This depends on the available resource, the resource used/pending in each queue, and the guaranteed capacity of each queue. # Mark to-be-preempted containers: for each over-satisfied queue, mark some containers to be preempted. # Notify the scheduler about the to-be-preempted containers. We need to respect labels in the cluster for both #1 and #2: For #1, when there is some resource available in the cluster, we shouldn't assign it to a queue (by increasing {{ideal_assigned}}) if the queue cannot access the labels of that resource. For #2, when deciding whether to preempt a container, we need to make sure the resource held by this container is *possibly* usable by a queue which is under-satisfied and has pending resource. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2500: - Description: This patch contains changes in the ResourceManager to support labels [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2500) [YARN-796] Miscellaneous changes in ResourceManager to support labels
[ https://issues.apache.org/jira/browse/YARN-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138720#comment-14138720 ] Wangda Tan commented on YARN-2500: -- Highlighting the major changes for easier review: * Properly set up and check the label expression according to the ResourceRequest, the ApplicationSubmissionContext and the queue's default label expression ** Located in {{ApplicationMasterService}}, {{RMServerUtils#validateResourceRequest}} * In the past, the ResourceRequest of the AM container was created in two places (one in {{RMAppManager}}, another in {{RMAppAttemptImpl}}), and when trying to allocate resource from the YarnScheduler, the 2nd ResourceRequest wasn't normalized. This is now reduced to one place: the ResourceRequest is created in {{RMAppManager}} only, which passes it to RMAppImpl to allocate resource * Check that a queue has permission to access all labels in the label-expression of a ResourceRequest, located in {{SchedulerUtils.validateResourceRequest}} (a sketch of this check follows this message) * Added getLabels() and getDefaultLabelExpression() to Queue * Added get/set NodeLabelManager in RMContext [YARN-796] Miscellaneous changes in ResourceManager to support labels - Key: YARN-2500 URL: https://issues.apache.org/jira/browse/YARN-2500 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2500.patch, YARN-2500.patch, YARN-2500.patch This patch contains changes in the ResourceManager to support labels -- This message was sent by Atlassian JIRA (v6.3.4#6332)
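A hedged illustration of the queue-accessibility check mentioned above for {{SchedulerUtils.validateResourceRequest}}. It assumes the check simply verifies every label in the request's label-expression against the queue's accessible-label set; the method and exception are placeholders, not the patch's code.
{code}
// Placeholder names; not the actual SchedulerUtils code.
import java.util.Set;

class LabelAccessCheckSketch {
  static void checkQueueAccess(Set<String> requestLabels, Set<String> queueAccessibleLabels) {
    for (String label : requestLabels) {
      if (!queueAccessibleLabels.contains(label)) {
        throw new IllegalArgumentException("Queue cannot access label in resource request: " + label);
      }
    }
  }
}
{code}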
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138809#comment-14138809 ] Hudson commented on YARN-2558: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #684 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/684/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138814#comment-14138814 ] Hudson commented on YARN-2559: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #684 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/684/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu updated YARN-2552: --- Attachment: YARN-2552.1.patch This patch adds a ValidateLocalPath function, checks all file operations for valid paths, and adds a yarn.nodemanager.windows-secure-container-executor.local-dirs key to wsce-site.xml. The SecureContainer.apt.vm is updated to describe the new config key. Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as the yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
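The path-containment check that a ValidateLocalPath function would perform can be sketched in plain Java as below. This is a hedged illustration of the idea only, not the native winutils code from the patch; all names are placeholders.
{code}
// Illustration of "is this path under one of the configured local dirs?"
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class LocalDirCheckSketch {
  static boolean isUnderLocalDirs(String candidate, List<String> localDirs) {
    Path target = Paths.get(candidate).toAbsolutePath().normalize();
    for (String dir : localDirs) {
      Path root = Paths.get(dir).toAbsolutePath().normalize();
      if (target.startsWith(root)) {
        return true;
      }
    }
    return false;
  }
}
{code}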
[jira] [Resolved] (YARN-2552) Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only
[ https://issues.apache.org/jira/browse/YARN-2552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Remus Rusanu resolved YARN-2552. Resolution: Implemented Fix is included in YARN-2198.delta.7.patch Windows Secure Container Executor: the privileged file operations of hadoopwinutilsvc should be constrained to localdirs only - Key: YARN-2552 URL: https://issues.apache.org/jira/browse/YARN-2552 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Reporter: Remus Rusanu Assignee: Remus Rusanu Labels: security, windows, wsce Attachments: YARN-2552.1.patch YARN-2458 added file manipulation operations executed in an elevated context by hadoopwinutilsvc. W/o any constraint, the NM (or a hijacker that takes over the NM) can manipulate arbitrary OS files under highest possible privileges, an easy elevation attack vector. The service should only allow operations on files/directories that are under the configured NM localdirs. It should read this value from wsce-site.xml, as the yarn-site.xml cannot be trusted, being writable by Hadoop admins (YARN-2551 ensures wsce-site.xml is only writable by system Administrators, not Hadoop admins). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138942#comment-14138942 ] Hudson commented on YARN-2559: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1900 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1900/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/CHANGES.txt ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138937#comment-14138937 ] Hudson commented on YARN-2558: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #1900 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/1900/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138950#comment-14138950 ] Hudson commented on YARN-2558: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1875 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1875/]) YARN-2558. Updated ContainerTokenIdentifier#read/write to use ContainerId#getContainerId. Contributed by Tsuyoshi OZAWA. (jianhe: rev f4886111aa573ec928de69e8ca9328d480bf673e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/ContainerTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests/src/test/java/org/apache/hadoop/yarn/server/TestContainerManagerSecurity.java * hadoop-yarn-project/CHANGES.txt Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2559) ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher
[ https://issues.apache.org/jira/browse/YARN-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14138955#comment-14138955 ] Hudson commented on YARN-2559: -- FAILURE: Integrated in Hadoop-Hdfs-trunk #1875 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1875/]) YARN-2559. Fixed NPE in SystemMetricsPublisher when retrieving FinalApplicationStatus. Contributed by Zhijie Shen (jianhe: rev ee21b13cbd4654d7181306404174329f12193613) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/SystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/metrics/TestSystemMetricsPublisher.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/RMAppAttemptImpl.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/rmapp/attempt/TestRMAppAttemptTransitions.java ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- Key: YARN-2559 URL: https://issues.apache.org/jira/browse/YARN-2559 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Generic History Service is enabled in the Timeline Server with yarn.resourcemanager.system-metrics-publisher.enabled=true, so that the ResourceManager uses the Timeline Store for recording application history information Reporter: Karam Singh Assignee: Zhijie Shen Fix For: 2.6.0 Attachments: YARN-2559.1.patch, YARN-2559.2.patch, YARN-2559.3.patch, YARN-2559.4.patch ResourceManager sometime become un-responsive due to NPE in SystemMetricsPublisher -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139066#comment-14139066 ] Junping Du commented on YARN-2561: -- Hi [~jlowe], can you help to review the patch again given this is a blocker bug? Thanks! MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1779) Handle AMRMTokens across RM failover
[ https://issues.apache.org/jira/browse/YARN-1779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139186#comment-14139186 ] Vinod Kumar Vavilapalli commented on YARN-1779: --- Looks good, +1. Checking this in.. Handle AMRMTokens across RM failover Key: YARN-1779 URL: https://issues.apache.org/jira/browse/YARN-1779 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.3.0 Reporter: Karthik Kambatla Assignee: Jian He Priority: Blocker Labels: ha Attachments: YARN-1779.1.patch, YARN-1779.2.patch, YARN-1779.3.patch, YARN-1779.6.patch Verify if AMRMTokens continue to work against RM failover. If not, we will have to do something along the lines of YARN-986. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2558) Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId
[ https://issues.apache.org/jira/browse/YARN-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-2558: -- Hadoop Flags: Reviewed (was: Incompatible change,Reviewed) Dropping the incompatibility flag given we aren't really supporting rolling upgrades yet (till YARN-666 is in). Updating ContainerTokenIdentifier#read/write to use ContainerId#getContainerId -- Key: YARN-2558 URL: https://issues.apache.org/jira/browse/YARN-2558 Project: Hadoop YARN Issue Type: Sub-task Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Priority: Blocker Fix For: 2.6.0 Attachments: YARN-2558.1.patch, YARN-2558.2.patch, YARN-2558.3.patch We should update ContainerTokenIdentifier#read/write to use {{getContainerId}} instead of {{getId}} to pass all container information correctly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139256#comment-14139256 ] Jason Lowe commented on YARN-2561: -- We should not assume that the lack of any container status means the NM doesn't support restart. There may have been no containers running on it at the time, but it could still have outstanding applications active (e.g.: still shuffling for the mapreduce_shuffle service), and we definitely don't want AMs to be notified of the node removal in that case. It would be better to check the list of applications on the node. If there are no applications then that should necessarily mean there are also no containers. If the node registers with no applications then it's probably safe to treat the node as a removal and re-add whether that node supports NM restart or not. When a node reconnects with no containers but has the same port, we aren't updating its potentially new totalCapability as we did before. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
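A minimal sketch of the reconnect check suggested above: treat the reconnect as a remove-and-re-add only when the node reports no active applications (which implies it also has no containers). Class and method names are illustrative, not the actual RM code.
{code}
// Placeholder names; captures only the "no applications => safe to replace" rule.
import java.util.List;

class NodeReconnectCheckSketch {
  static boolean safeToRemoveAndReAdd(List<String> runningApplicationsOnNode) {
    return runningApplicationsOnNode == null || runningApplicationsOnNode.isEmpty();
  }
}
{code}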
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139269#comment-14139269 ] Jonathan Eagles commented on YARN-2363: --- +1. This will be a big help to users, [~jlowe]. Submitted applications occasionally lack a tracking URL --- Key: YARN-2363 URL: https://issues.apache.org/jira/browse/YARN-2363 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2363.patch Sometimes when an application is submitted the client receives no tracking URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-415) Capture aggregate memory allocation at the app-level for chargeback
[ https://issues.apache.org/jira/browse/YARN-415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139336#comment-14139336 ] Kendall Thrapp commented on YARN-415: - Thanks [~eepayne], [~aklochkov], [~jianhe], [~wangda], [~kasha], [~sandyr] and [~jlowe] for all your effort on this! Looking forward to being able to use this feature. Capture aggregate memory allocation at the app-level for chargeback --- Key: YARN-415 URL: https://issues.apache.org/jira/browse/YARN-415 Project: Hadoop YARN Issue Type: New Feature Components: resourcemanager Affects Versions: 2.5.0 Reporter: Kendall Thrapp Assignee: Eric Payne Fix For: 2.6.0 Attachments: YARN-415--n10.patch, YARN-415--n2.patch, YARN-415--n3.patch, YARN-415--n4.patch, YARN-415--n5.patch, YARN-415--n6.patch, YARN-415--n7.patch, YARN-415--n8.patch, YARN-415--n9.patch, YARN-415.201405311749.txt, YARN-415.201406031616.txt, YARN-415.201406262136.txt, YARN-415.201407042037.txt, YARN-415.201407071542.txt, YARN-415.201407171553.txt, YARN-415.201407172144.txt, YARN-415.201407232237.txt, YARN-415.201407242148.txt, YARN-415.201407281816.txt, YARN-415.201408062232.txt, YARN-415.201408080204.txt, YARN-415.201408092006.txt, YARN-415.201408132109.txt, YARN-415.201408150030.txt, YARN-415.201408181938.txt, YARN-415.201408181938.txt, YARN-415.201408212033.txt, YARN-415.201409040036.txt, YARN-415.201409092204.txt, YARN-415.201409102216.txt, YARN-415.patch For the purpose of chargeback, I'd like to be able to compute the cost of an application in terms of cluster resource usage. To start out, I'd like to get the memory utilization of an application. The unit should be MB-seconds or something similar and, from a chargeback perspective, the memory amount should be the memory reserved for the application, as even if the app didn't use all that memory, no one else was able to use it. (reserved ram for container 1 * lifetime of container 1) + (reserved ram for container 2 * lifetime of container 2) + ... + (reserved ram for container n * lifetime of container n) It'd be nice to have this at the app level instead of the job level because: 1. We'd still be able to get memory usage for jobs that crashed (and wouldn't appear on the job history server). 2. We'd be able to get memory usage for future non-MR jobs (e.g. Storm). This new metric should be available both through the RM UI and RM Web Services REST API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
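The MB-seconds aggregation described in the YARN-415 summary above can be written down as a small sketch. The container type and field names below are stand-ins for illustration, not the YARN API classes or the patch's metrics code.
{code}
// Sum over containers of (reserved memory in MB) * (container lifetime in seconds).
import java.util.List;

class MemorySecondsSketch {
  static class ReservedContainer {
    final long reservedMb;
    final long lifetimeSeconds;
    ReservedContainer(long reservedMb, long lifetimeSeconds) {
      this.reservedMb = reservedMb;
      this.lifetimeSeconds = lifetimeSeconds;
    }
  }

  static long aggregateMbSeconds(List<ReservedContainer> containers) {
    long total = 0;
    for (ReservedContainer c : containers) {
      total += c.reservedMb * c.lifetimeSeconds;
    }
    return total;
  }
}
{code}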
[jira] [Commented] (YARN-1959) Fix headroom calculation in Fair Scheduler
[ https://issues.apache.org/jira/browse/YARN-1959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139368#comment-14139368 ] Karthik Kambatla commented on YARN-1959: Approach looks good. A couple of comments: # Can we add a single method to SchedulingPolicy that takes the available cluster resources, the queue's fair share and its current usage? Maybe we can call that method computeHeadroom? # The amount of resources available in the cluster can be looked up in FairScheduler.rootMetrics. Fix headroom calculation in Fair Scheduler -- Key: YARN-1959 URL: https://issues.apache.org/jira/browse/YARN-1959 Project: Hadoop YARN Issue Type: Bug Reporter: Sandy Ryza Assignee: Anubhav Dhoot Attachments: YARN-1959.prelim.patch The Fair Scheduler currently always sets the headroom to 0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
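One plausible shape for the single computeHeadroom method suggested above, under the assumption that headroom is the queue's remaining fair share capped by what is actually free in the cluster. The signature and formula are guesses for illustration, not the committed API.
{code}
// Placeholder for the suggested SchedulingPolicy#computeHeadroom; assumed formula:
// headroom = min(cluster available, max(0, fair share - current usage)).
class HeadroomSketch {
  static long computeHeadroom(long clusterAvailable, long queueFairShare, long queueUsage) {
    long remainingFairShare = Math.max(0L, queueFairShare - queueUsage);
    return Math.min(clusterAvailable, remainingFairShare);
  }
}
{code}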
[jira] [Commented] (YARN-2080) Admission Control: Integrate Reservation subsystem with ResourceManager
[ https://issues.apache.org/jira/browse/YARN-2080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139372#comment-14139372 ] Karthik Kambatla commented on YARN-2080: Latest patch looks good to me. PS: Versioned patches are easier to refer to. Admission Control: Integrate Reservation subsystem with ResourceManager --- Key: YARN-2080 URL: https://issues.apache.org/jira/browse/YARN-2080 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Subru Krishnan Assignee: Subru Krishnan Attachments: YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch, YARN-2080.patch This JIRA tracks the integration of Reservation subsystem data structures introduced in YARN-1709 with the YARN RM. This is essentially end2end wiring of YARN-1051. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139376#comment-14139376 ] Jian He commented on YARN-2563: --- Not related to this bug, but I think for the following we can just check the kind name for simplicity.
{code}
TokenIdentifier tokenIdentifier = token.decodeIdentifier();
if (tokenIdentifier instanceof TimelineDelegationTokenIdentifier) {
  return;
}
{code}
On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
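The "check the kind name" simplification can be sketched as below; it avoids decoding the token identifier entirely. This assumes the standard Token#getKind() API and the timeline delegation token kind constant, and is a hedged illustration rather than the committed patch.
{code}
import org.apache.hadoop.security.token.Token;
import org.apache.hadoop.yarn.security.client.TimelineDelegationTokenIdentifier;

class TimelineTokenKindCheckSketch {
  static boolean isTimelineToken(Token<?> token) {
    // Compare the token kind instead of decoding the identifier.
    return TimelineDelegationTokenIdentifier.KIND_NAME.equals(token.getKind());
  }
}
{code}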
[jira] [Updated] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2563: -- Attachment: YARN-2563.2.patch Upload a new patch to simplify the way to check the existence of timeline DT. On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Payne updated YARN-2056: - Attachment: YARN-2056.201409181916.txt [~leftnoteasy] and [~jlowe], Thank you both for your reviews of this jira. The previous patch did not handle all cases in hierarchical queues where the queue is over-capacity and some or all of its children are non-preemptable. This patch has been rewritten and should address all use cases. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2363) Submitted applications occasionally lack a tracking URL
[ https://issues.apache.org/jira/browse/YARN-2363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139395#comment-14139395 ] Jason Lowe commented on YARN-2363: -- Thanks for the reviews, [~mitdesai], [~sjlee0], and [~jeagles]! Committing this. Submitted applications occasionally lack a tracking URL --- Key: YARN-2363 URL: https://issues.apache.org/jira/browse/YARN-2363 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Jason Lowe Assignee: Jason Lowe Attachments: YARN-2363.patch Sometimes when an application is submitted the client receives no tracking URL. More details in the first comment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-2561: - Attachment: YARN-2561-v5.patch MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139440#comment-14139440 ] Hadoop QA commented on YARN-2563: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669779/YARN-2563.2.patch against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5025//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5025//console This message is automatically generated. On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen oozie jobs fail with authentication error to the time line server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139450#comment-14139450 ] Jason Lowe commented on YARN-2561: -- bq. Yes. Update with adding with newNode instead of old rmNode. Ah, I see now. Before I thought the node update event was just going to the scheduler but now I see it's sending an event to itself which will eventually set totalCapability properly. +1 on the latest patch pending Jenkins. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steve Loughran updated YARN-913: Attachment: YARN-913-003.patch Updated patch with ACL build-up for system permissions on secure clusters, testing in-VM with miniKDC and ZK, and showing that doAs registry creation creates an authed registry. Also: patched the build process so that precommit reports can now be found Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up, or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to, and not any others in the cluster. Some kind of service registry, in the RM or in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
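Conceptually, a consumer of such a registry would look up a published service record under a well-known ZK path. The Curator-based sketch below is purely illustrative of that flow under assumed paths and a plain-string record format; it is not the YARN-913 registry API.
{code}
// Illustration only: assumed ZK path layout and payload format.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class RegistryLookupSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();
    byte[] record = zk.getData().forPath("/registry/users/alice/services/myapp/instance-1");
    System.out.println(new String(record, "UTF-8"));
    zk.close();
  }
}
{code}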
[jira] [Commented] (YARN-2056) Disable preemption at Queue level
[ https://issues.apache.org/jira/browse/YARN-2056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139493#comment-14139493 ] Hadoop QA commented on YARN-2056: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669780/YARN-2056.201409181916.txt against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5026//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5026//console This message is automatically generated. Disable preemption at Queue level - Key: YARN-2056 URL: https://issues.apache.org/jira/browse/YARN-2056 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Affects Versions: 2.4.0 Reporter: Mayank Bansal Assignee: Eric Payne Attachments: YARN-2056.201408202039.txt, YARN-2056.201408260128.txt, YARN-2056.201408310117.txt, YARN-2056.201409022208.txt, YARN-2056.201409181916.txt We need to be able to disable preemption at individual queue level -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139498#comment-14139498 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669798/YARN-913-003.patch against trunk revision 474f116. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 34 new or modified test files. {color:red}-1 javac{color}. The patch appears to cause the build to fail. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5028//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up, or on what ports. The services need to work those things out as they come up and then publish them somewhere. Applications need to be able to find the service instance they are to bond to, and not any others in the cluster. Some kind of service registry, in the RM or in ZK, could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.6.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139515#comment-14139515 ] Xuan Gong commented on YARN-2468: - bq. If we want to use day as the unit, shall we allow double interval, such that I can say 0.1 day. I think it would be useful for testing purpose (not unit test, but verifying the feature is working on purpose, and the app is using the feature correctly). Thoughts? Added an extra YarnConfiguration.DEBUG_LOG_AGGREGATION_SPEED_UP_RATIO for test/debug purposes. The new patch also addresses the other comments. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139523#comment-14139523 ] Hadoop QA commented on YARN-2561: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669786/YARN-2561-v5.patch against trunk revision 570b8b4. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5027//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5027//console This message is automatically generated. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2468: Attachment: YARN-2468.6.1.patch Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-2568) TestAMRMClientOnRMRestart test fails
Jian He created YARN-2568: - Summary: TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2563) On secure clusters call to timeline server fails with authentication errors when running a job via oozie
[ https://issues.apache.org/jira/browse/YARN-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139542#comment-14139542 ] Jian He commented on YARN-2563: --- Looks good, +1 On secure clusters call to timeline server fails with authentication errors when running a job via oozie Key: YARN-2563 URL: https://issues.apache.org/jira/browse/YARN-2563 Project: Hadoop YARN Issue Type: Bug Components: timelineserver Affects Versions: 2.6.0 Reporter: Arpit Gupta Assignee: Zhijie Shen Priority: Blocker Attachments: YARN-2563.1.patch, YARN-2563.2.patch During our nightlies on a secure cluster we have seen Oozie jobs fail with authentication errors against the timeline server. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Attachment: YARN-2568.patch TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2561) MR job client cannot reconnect to AM after NM restart.
[ https://issues.apache.org/jira/browse/YARN-2561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139544#comment-14139544 ] Jason Lowe commented on YARN-2561: -- Committing this. MR job client cannot reconnect to AM after NM restart. -- Key: YARN-2561 URL: https://issues.apache.org/jira/browse/YARN-2561 Project: Hadoop YARN Issue Type: Bug Affects Versions: 2.6.0 Reporter: Tassapol Athiapinya Assignee: Junping Du Priority: Blocker Attachments: YARN-2561-v2.patch, YARN-2561-v3.patch, YARN-2561-v4.patch, YARN-2561-v5.patch, YARN-2561.patch Work-preserving NM restart is disabled. Submit a job. Restart the only NM and found that Job will hang with connect retries. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Description: testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1857) CapacityScheduler headroom doesn't account for other AM's running
[ https://issues.apache.org/jira/browse/YARN-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139605#comment-14139605 ] Jian He commented on YARN-1857: --- thanks [~airbots] and [~cwelch], the patch looks good overall; a few comments and questions:
- Indentation of the last line seems incorrect.
{code}
Resource headroom = Resources.min(resourceCalculator, clusterResource,
    Resources.subtract(
        Resources.min(resourceCalculator, clusterResource, userLimit, queueMaxCap),
        userConsumed),
    Resources.subtract(queueMaxCap, usedResources));
{code}
- Test case 2: could you check app2's headroom as well?
- Test case 3: could you check app_1's headroom as well?
- Could you explain why in test case 4 {{assertEquals(5*GB, app_4.getHeadroom().getMemory());}}, app4 still has 5GB headroom?
CapacityScheduler headroom doesn't account for other AM's running - Key: YARN-1857 URL: https://issues.apache.org/jira/browse/YARN-1857 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Affects Versions: 2.3.0 Reporter: Thomas Graves Assignee: Chen He Priority: Critical Attachments: YARN-1857.1.patch, YARN-1857.2.patch, YARN-1857.3.patch, YARN-1857.patch, YARN-1857.patch, YARN-1857.patch It's possible to get an application to hang forever (or for a long time) in a cluster with multiple users. The reason is that the headroom sent to the application is based on the user limit, but it doesn't account for other application masters using space in that queue. So the headroom (user limit - user consumed) can be > 0 even though the cluster is 100% full, because the remaining space is being used by application masters from other users. For instance, suppose you have a cluster with one queue, the user limit is 100%, and multiple users are submitting applications. One very large application by user 1 starts up, runs most of its maps, and starts running reducers. Other users try to start applications and get their application masters started, but not tasks. The very large application then gets to the point where it has consumed the rest of the cluster resources with its reduces, but it still needs to finish a few maps. The headroom sent to this application is based only on the user limit (which is 100% of the cluster capacity): it is using, say, 95% of the cluster for reduces, and the other 5% is being used by other users running application masters. The MRAppMaster thinks it still has 5%, so it doesn't know that it should kill a reduce in order to run a map. This can happen in other scenarios as well. Generally, in a large cluster with multiple queues, this shouldn't cause a hang forever, but it could make the application take much longer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
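For readers following the review discussion, a small worked example of the difference between the two headroom computations may help. The numbers are illustrative only, and scalar min/subtract stand in for the Resources.* helpers in the quoted snippet.
{code}
// Illustrative numbers only, not taken from the patch or its tests.
public class HeadroomExample {
  public static void main(String[] args) {
    long queueMaxCap   = 100;  // GB, single queue
    long userLimit     = 100;  // GB, user limit = 100% of the cluster
    long userConsumed  = 95;   // GB used by user 1's reducers
    long usedResources = 100;  // GB: 95 GB of reducers + 5 GB of other users' AMs

    // Old headroom: user limit minus user consumption -> reports 5 GB free
    // even though the cluster is completely full.
    long oldHeadroom = userLimit - userConsumed;

    // Headroom per the formula quoted in the review comment above: also bounded
    // by what the queue actually has left, so the AM sees 0 and knows to
    // preempt its own reduces to run the remaining maps.
    long newHeadroom = Math.min(Math.min(userLimit, queueMaxCap) - userConsumed,
                                queueMaxCap - usedResources);

    System.out.println("old headroom = " + oldHeadroom + " GB");  // 5
    System.out.println("new headroom = " + newHeadroom + " GB");  // 0
  }
}
{code}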
[jira] [Commented] (YARN-2468) Log handling for LRS
[ https://issues.apache.org/jira/browse/YARN-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139619#comment-14139619 ] Hadoop QA commented on YARN-2468: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669809/YARN-2468.6.1.patch against trunk revision a337f0e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager: org.apache.hadoop.yarn.client.cli.TestLogsCLI org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5030//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5030//console This message is automatically generated. Log handling for LRS Key: YARN-2468 URL: https://issues.apache.org/jira/browse/YARN-2468 Project: Hadoop YARN Issue Type: Sub-task Components: log-aggregation, nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2468.1.patch, YARN-2468.2.patch, YARN-2468.3.patch, YARN-2468.3.rebase.2.patch, YARN-2468.3.rebase.patch, YARN-2468.4.1.patch, YARN-2468.4.patch, YARN-2468.5.1.patch, YARN-2468.5.1.patch, YARN-2468.5.2.patch, YARN-2468.5.3.patch, YARN-2468.5.4.patch, YARN-2468.5.patch, YARN-2468.6.1.patch, YARN-2468.6.patch Currently, when application is finished, NM will start to do the log aggregation. But for Long running service applications, this is not ideal. The problems we have are: 1) LRS applications are expected to run for a long time (weeks, months). 2) Currently, all the container logs (from one NM) will be written into a single file. The files could become larger and larger. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-913) Add a way to register long-lived services in a YARN cluster
[ https://issues.apache.org/jira/browse/YARN-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139675#comment-14139675 ] Hadoop QA commented on YARN-913: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669814/YARN-913-004.patch against trunk revision fe2f54d. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 34 new or modified test files. {color:red}-1 javac{color}. The applied patch generated 1267 javac compiler warnings (more than the trunk's current 1266 warnings). {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 10 warning messages. See https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavadocWarnings.txt for details. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 11 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-common-project/hadoop-minikdc hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-registry hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-tests: org.apache.hadoop.yarn.applications.distributedshell.TestDistributedShell org.apache.hadoop.yarn.registry.secure.TestSecureLogins org.apache.hadoop.yarn.server.TestMiniYARNClusterRegistry {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5031//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-registry.html Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5031//artifact/PreCommit-HADOOP-Build-patchprocess/diffJavacWarnings.txt Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5031//console This message is automatically generated. Add a way to register long-lived services in a YARN cluster --- Key: YARN-913 URL: https://issues.apache.org/jira/browse/YARN-913 Project: Hadoop YARN Issue Type: New Feature Components: api, resourcemanager Affects Versions: 2.5.0, 2.4.1 Reporter: Steve Loughran Assignee: Steve Loughran Attachments: 2014-09-03_Proposed_YARN_Service_Registry.pdf, 2014-09-08_YARN_Service_Registry.pdf, RegistrationServiceDetails.txt, YARN-913-001.patch, YARN-913-002.patch, YARN-913-003.patch, YARN-913-003.patch, YARN-913-004.patch, yarnregistry.pdf, yarnregistry.tla In a YARN cluster you can't predict where services will come up -or on what ports. The services need to work those things out as they come up and then publish them somewhere. 
Applications need to be able to find the service instance they are to bond to - and not any others in the cluster. Some kind of service registry - in the RM or in ZK - could do this. If the RM held the write access to the ZK nodes, it would be more secure than having apps register with ZK themselves. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
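The registry sketched in that description boils down to a publish/lookup pattern over ZooKeeper. Below is a minimal illustration of that pattern using plain Apache Curator; the paths, record format, and quorum address are assumptions for illustration, this is not the yarn-registry API added by the patch, and it shows the direct app-to-ZK variant that the description argues is less secure than routing writes through the RM.
{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.zookeeper.CreateMode;

// Sketch only: publish a service endpoint under an assumed registry path and
// resolve it again from a client.
public class RegistrySketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework zk = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    zk.start();

    // AM side: publish where this service instance actually came up.
    String path = "/registry/users/alice/services/hbase-on-yarn/instance-01";
    byte[] record = "host=node17.example.com,port=8020".getBytes("UTF-8");
    zk.create().creatingParentsIfNeeded()
        .withMode(CreateMode.EPHEMERAL)     // entry disappears if the publisher dies
        .forPath(path, record);

    // Client side: resolve the one instance it should bond to.
    byte[] resolved = zk.getData().forPath(path);
    System.out.println(new String(resolved, "UTF-8"));

    zk.close();
  }
}
{code}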
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139680#comment-14139680 ] Hadoop QA commented on YARN-2568: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669810/YARN-2568.patch against trunk revision 2c3da25. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5032//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5032//console This message is automatically generated. TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139731#comment-14139731 ] Vinod Kumar Vavilapalli commented on YARN-668: -- One more thing. Let's version each of the tokens separately and validate the versions on the NodeManager during startContainer() to catch incompatible changes. The Token is redesigned anyway, so we can start with version 0. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
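To make the description's approach concrete, here is a rough sketch of a TokenIdentifier that keeps the Writable wire format but stores all its fields in a versioned protobuf payload. The payload message (ContainerTokenPayloadProto) and its fields are assumptions for illustration, not the classes the eventual patch defines.
{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.TokenIdentifier;

// Sketch only. ContainerTokenPayloadProto is a hypothetical generated protobuf
// message carrying the real token fields plus an explicit version number.
public class VersionedTokenIdentifier extends TokenIdentifier {
  public static final Text KIND = new Text("VersionedToken");

  private ContainerTokenPayloadProto payload;  // hypothetical PB message

  @Override
  public void write(DataOutput out) throws IOException {
    byte[] bytes = payload.toByteArray();      // payload serialized by PB
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    byte[] bytes = new byte[in.readInt()];
    in.readFully(bytes);
    // PB preserves unknown fields, so older and newer identifiers can round-trip.
    payload = ContainerTokenPayloadProto.parseFrom(bytes);
  }

  // The NM could compare this against the versions it supports in
  // startContainer() and reject incompatible tokens, as suggested above.
  public int getVersion() {
    return payload.getVersion();
  }

  @Override
  public Text getKind() {
    return KIND;
  }

  @Override
  public UserGroupInformation getUser() {
    return UserGroupInformation.createRemoteUser(payload.getUser());
  }
}
{code}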
[jira] [Commented] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139746#comment-14139746 ] Jian He commented on YARN-2565: --- - The exception message still seems to use FileSystemApplicationHistoryStore: {code} "Could not instantiate ApplicationHistoryWriter: " + conf.get(YarnConfiguration.APPLICATION_HISTORY_STORE, FileSystemApplicationHistoryStore.class.getName()); {code} - testDefaultStoreSetup: missing @Test annotation RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139748#comment-14139748 ] Junping Du commented on YARN-668: - Thanks [~vinodkv] for the comments. I also tested it offline and this approach should work. Will work on this JIRA. TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Vinod Kumar Vavilapalli Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-668) TokenIdentifier serialization should consider Unknown fields
[ https://issues.apache.org/jira/browse/YARN-668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-668: Assignee: Junping Du (was: Vinod Kumar Vavilapalli) TokenIdentifier serialization should consider Unknown fields Key: YARN-668 URL: https://issues.apache.org/jira/browse/YARN-668 Project: Hadoop YARN Issue Type: Sub-task Reporter: Siddharth Seth Assignee: Junping Du Priority: Blocker Attachments: YARN-668-demo.patch This would allow changing of the TokenIdentifier between versions. The current serialization is Writable. A simple way to achieve this would be to have a Proto object as the payload for TokenIdentifiers, instead of individual fields. TokenIdentifier continues to implement Writable to work with the RPC layer - but the payload itself is serialized using PB. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2568: -- Attachment: YARN-2568.patch TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhijie Shen updated YARN-2565: -- Attachment: YARN-2565.3.patch Fix the patch according to Jian's comments RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch, YARN-2565.3.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139811#comment-14139811 ] Zhijie Shen commented on YARN-2568: --- +1, pending on jenkins TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-1372: Attachment: YARN-1372.006.patch Addressed the feedback except for two things (transferring all finishedContainers from the previous attempt irrespective of wp-restart being enabled for the AM, and adding AM containers to the finished containers). Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
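The description above is essentially an acknowledgement protocol: the NM must keep each completed-container report until the RM confirms the AM has pulled it. A minimal sketch of that NM-side bookkeeping follows; the class and method names are illustrative, not the ones used by the patch, which wires this through the NM-RM heartbeat and the NM state store.
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.yarn.api.records.ContainerId;
import org.apache.hadoop.yarn.api.records.ContainerStatus;

// Sketch of the NM-side bookkeeping described above (illustrative names only).
public class CompletedContainerTracker {
  // Completed containers the AM has not yet been confirmed to have pulled.
  private final Map<ContainerId, ContainerStatus> pendingAck =
      new ConcurrentHashMap<ContainerId, ContainerStatus>();

  // Called when a container finishes: remember it until the RM acks it.
  public void containerCompleted(ContainerStatus status) {
    pendingAck.put(status.getContainerId(), status);
  }

  // Reported on every NM->RM heartbeat, and resent in full when the NM
  // re-registers after an RM restart, so no completion is lost.
  public List<ContainerStatus> statusesForHeartbeat() {
    return new ArrayList<ContainerStatus>(pendingAck.values());
  }

  // Called when the RM reports that the AM has pulled these containers; only
  // then may the NM forget them (duplicate reports to the AM are possible).
  public void containersAckedByRM(List<ContainerId> acked) {
    for (ContainerId id : acked) {
      pendingAck.remove(id);
    }
  }
}
{code}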
[jira] [Commented] (YARN-2568) TestAMRMClientOnRMRestart test fails
[ https://issues.apache.org/jira/browse/YARN-2568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139847#comment-14139847 ] Hadoop QA commented on YARN-2568: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669871/YARN-2568.patch against trunk revision fe38d2e. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 3 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5033//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5033//console This message is automatically generated. TestAMRMClientOnRMRestart test fails Key: YARN-2568 URL: https://issues.apache.org/jira/browse/YARN-2568 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He Attachments: YARN-2568.patch, YARN-2568.patch testAMRMClientResendsRequestsOnRMRestart(org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart) Time elapsed: 10.807 sec FAILURE! java.lang.AssertionError: Number of container should be 3 expected:3 but was:0 at org.junit.Assert.fail(Assert.java:88) at org.junit.Assert.failNotEquals(Assert.java:743) at org.junit.Assert.assertEquals(Assert.java:118) at org.junit.Assert.assertEquals(Assert.java:555) at org.apache.hadoop.yarn.client.api.impl.TestAMRMClientOnRMRestart.testAMRMClientResendsRequestsOnRMRestart(TestAMRMClientOnRMRestart.java:290) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2565) RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore
[ https://issues.apache.org/jira/browse/YARN-2565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139873#comment-14139873 ] Hadoop QA commented on YARN-2565: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669873/YARN-2565.3.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5035//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5035//console This message is automatically generated. RM shouldn't use the old RMApplicationHistoryWriter unless explicitly setting FileSystemApplicationHistoryStore --- Key: YARN-2565 URL: https://issues.apache.org/jira/browse/YARN-2565 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager, timelineserver Affects Versions: 2.6.0 Environment: Secure cluster with ATS (timeline server enabled) and yarn.resourcemanager.system-metrics-publisher.enabled=true so that RM can send Application history to Timeline Store Reporter: Karam Singh Assignee: Zhijie Shen Attachments: YARN-2565.1.patch, YARN-2565.2.patch, YARN-2565.3.patch Observed that RM fails to start in Secure mode when GenericeHistoryService is enabled and ResourceManager is set to use Timeline Store -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1372) Ensure all completed containers are reported to the AMs across RM restart
[ https://issues.apache.org/jira/browse/YARN-1372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139887#comment-14139887 ] Hadoop QA commented on YARN-1372: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669882/YARN-1372.006.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 6 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5036//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5036//console This message is automatically generated. Ensure all completed containers are reported to the AMs across RM restart - Key: YARN-1372 URL: https://issues.apache.org/jira/browse/YARN-1372 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Bikas Saha Assignee: Anubhav Dhoot Attachments: YARN-1372.001.patch, YARN-1372.001.patch, YARN-1372.002_NMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.002_RMHandlesCompletedApp.patch, YARN-1372.003.patch, YARN-1372.004.patch, YARN-1372.005.patch, YARN-1372.005.patch, YARN-1372.006.patch, YARN-1372.prelim.patch, YARN-1372.prelim2.patch Currently the NM informs the RM about completed containers and then removes those containers from the RM notification list. The RM passes on that completed container information to the AM and the AM pulls this data. If the RM dies before the AM pulls this data then the AM may not be able to get this information again. To fix this, NM should maintain a separate list of such completed container notifications sent to the RM. After the AM has pulled the containers from the RM then the RM will inform the NM about it and the NM can remove the completed container from the new list. Upon re-register with the RM (after RM restart) the NM should send the entire list of completed containers to the RM along with any other containers that completed while the RM was dead. This ensures that the RM can inform the AM's about all completed containers. Some container completions may be reported more than once since the AM may have pulled the container but the RM may die before notifying the NM about the pull. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu updated YARN-2566: Attachment: YARN-2566.000.patch IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at 
org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED APPID=application_1410663092546_0004 CONTAINERID=container_1410663092546_0004_01_01 2014-09-13 23:33:25,187 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container:
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139894#comment-14139894 ] zhihai xu commented on YARN-2566: - I attached a patch YARN-2566.000.patch for review. I have a test case in the patch which need Mock the FileContext class, so I need remove final in FileContext class. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344) at org.apache.hadoop.fs.ChecksumFs.createInternal(ChecksumFs.java:390) at org.apache.hadoop.fs.AbstractFileSystem.create(AbstractFileSystem.java:577) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:677) at org.apache.hadoop.fs.FileContext$3.next(FileContext.java:673) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at 
org.apache.hadoop.fs.FileContext.create(FileContext.java:673) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:2021) at org.apache.hadoop.fs.FileContext$Util.copy(FileContext.java:1963) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:102) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,186 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.Container: Container container_1410663092546_0004_01_01 transitioned from LOCALIZING to LOCALIZATION_FAILED 2014-09-13 23:33:25,187 WARN org.apache.hadoop.yarn.server.nodemanager.NMAuditLogger: USER=cloudera OPERATION=Container Finished - Failed TARGET=ContainerImpl RESULT=FAILURE DESCRIPTION=Container failed with state: LOCALIZATION_FAILED
[jira] [Commented] (YARN-2566) IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir.
[ https://issues.apache.org/jira/browse/YARN-2566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139927#comment-14139927 ] Hadoop QA commented on YARN-2566: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669893/YARN-2566.000.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-common-project/hadoop-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5037//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5037//artifact/PreCommit-HADOOP-Build-patchprocess/newPatchFindbugsWarningshadoop-yarn-server-nodemanager.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5037//console This message is automatically generated. IOException happen in startLocalizer of DefaultContainerExecutor due to not enough disk space for the first localDir. - Key: YARN-2566 URL: https://issues.apache.org/jira/browse/YARN-2566 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.5.0 Reporter: zhihai xu Assignee: zhihai xu Attachments: YARN-2566.000.patch startLocalizer in DefaultContainerExecutor will only use the first localDir to copy the token file, if the copy is failed for first localDir due to not enough disk space in the first localDir, the localization will be failed even there are plenty of disk space in other localDirs. 
We see the following error for this case: {code} 2014-09-13 23:33:25,171 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Unable to create app directory /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 java.io.IOException: mkdir of /hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 failed at org.apache.hadoop.fs.FileSystem.primitiveMkdir(FileSystem.java:1062) at org.apache.hadoop.fs.DelegateToFileSystem.mkdir(DelegateToFileSystem.java:157) at org.apache.hadoop.fs.FilterFs.mkdir(FilterFs.java:189) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:721) at org.apache.hadoop.fs.FileContext$4.next(FileContext.java:717) at org.apache.hadoop.fs.FSLinkResolver.resolve(FSLinkResolver.java:90) at org.apache.hadoop.fs.FileContext.mkdir(FileContext.java:717) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createDir(DefaultContainerExecutor.java:426) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.createAppDirs(DefaultContainerExecutor.java:522) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.startLocalizer(DefaultContainerExecutor.java:94) at org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerRunner.run(ResourceLocalizationService.java:987) 2014-09-13 23:33:25,185 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer failed java.io.FileNotFoundException: File file:/hadoop/d1/usercache/cloudera/appcache/application_1410663092546_0004 does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:511) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:724) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:501) at org.apache.hadoop.fs.DelegateToFileSystem.getFileStatus(DelegateToFileSystem.java:111) at org.apache.hadoop.fs.DelegateToFileSystem.createInternal(DelegateToFileSystem.java:76) at org.apache.hadoop.fs.ChecksumFs$ChecksumFSOutputSummer.init(ChecksumFs.java:344)
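The failure mode above comes from startLocalizer copying the token file into the single, first localDir. One natural fix, sketched roughly here rather than taken from the attached YARN-2566.000.patch, is to try each configured localDir in turn and move on when a copy fails, for example because that directory is out of space. The directory layout below is simplified for illustration.
{code}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.fs.FileContext;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Sketch only: fall back to the next localDir instead of failing on the first
// one that cannot hold the token file.
public class TokenCopyFallback {
  public static Path copyTokenToAnyDir(FileContext lfs, Path nmPrivateTokens,
      String user, List<String> localDirs) throws IOException {
    IOException lastFailure = null;
    for (String localDir : localDirs) {
      try {
        Path appDir = new Path(localDir, user);            // layout simplified
        lfs.mkdir(appDir, new FsPermission((short) 0710), true);
        Path dst = new Path(appDir, nmPrivateTokens.getName());
        lfs.util().copy(nmPrivateTokens, dst);
        return dst;                      // first dir with enough space wins
      } catch (IOException e) {
        lastFailure = e;                 // e.g. "No space left on device"; try next dir
      }
    }
    throw lastFailure != null ? lastFailure
        : new IOException("No local directories available");
  }
}
{code}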
[jira] [Created] (YARN-2569) Log Handling for LRS API Changes
Xuan Gong created YARN-2569: --- Summary: Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Reporter: Xuan Gong Assignee: Xuan Gong -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xuan Gong updated YARN-2569: Attachment: YARN-2569.1.patch Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139973#comment-14139973 ] Xuan Gong commented on YARN-2569: - API changes required by log handling: 1. Add LogContext into the ASC. 2. Add LogContext into ContainerTokenIdentifier. When the RM schedules the containers, the LogContext will be set, which guarantees that all the containers of the same application have the same LogContext. 3. When the NM launches the containers, it will read the LogContext from the ContainerTokenIdentifier. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-796) Allow for (admin) labels on nodes and resource-requests
[ https://issues.apache.org/jira/browse/YARN-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-796: Attachment: YARN-796.node-label.consolidate.7.patch Updated latest consolidated patch against trunk. Set patch available to Kick jenkins Allow for (admin) labels on nodes and resource-requests --- Key: YARN-796 URL: https://issues.apache.org/jira/browse/YARN-796 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.4.1 Reporter: Arun C Murthy Assignee: Wangda Tan Attachments: LabelBasedScheduling.pdf, Node-labels-Requirements-Design-doc-V1.pdf, Node-labels-Requirements-Design-doc-V2.pdf, YARN-796-Diagram.pdf, YARN-796.node-label.consolidate.1.patch, YARN-796.node-label.consolidate.2.patch, YARN-796.node-label.consolidate.3.patch, YARN-796.node-label.consolidate.4.patch, YARN-796.node-label.consolidate.5.patch, YARN-796.node-label.consolidate.6.patch, YARN-796.node-label.consolidate.7.patch, YARN-796.node-label.demo.patch.1, YARN-796.patch, YARN-796.patch4 It will be useful for admins to specify labels for nodes. Examples of labels are OS, processor architecture etc. We should expose these labels and allow applications to specify labels on resource-requests. Obviously we need to support admin operations on adding/removing node labels. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2569) Log Handling for LRS API Changes
[ https://issues.apache.org/jira/browse/YARN-2569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14140036#comment-14140036 ] Hadoop QA commented on YARN-2569: - {color:green}+1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12669912/YARN-2569.1.patch against trunk revision 6434572. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 5 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. {color:green}+1 contrib tests{color}. The patch passed contrib unit tests. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5038//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5038//console This message is automatically generated. Log Handling for LRS API Changes Key: YARN-2569 URL: https://issues.apache.org/jira/browse/YARN-2569 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager, resourcemanager Reporter: Xuan Gong Assignee: Xuan Gong Attachments: YARN-2569.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)