[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270347#comment-14270347 ] Chengbing Liu commented on YARN-2997: - Thanks [~jianhe] ! NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.7.0 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
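For context, the quoted RM log line comes from the scheduler's completed-container path: when the NM re-reports a completion for a container the RM has already released, the {{getRMContainer}} lookup returns null and the event is logged and dropped. A minimal sketch of that check (the method shape and logging are illustrative, not the exact FairScheduler code):
{code}
// Illustrative RM-side handling of a completed-container report.
// A re-reported completion finds no live RMContainer and is ignored,
// producing the "Null container completed..." log line seen above.
private void completedContainer(ContainerStatus completedStatus) {
  ContainerId containerId = completedStatus.getContainerId();
  RMContainer rmContainer = getRMContainer(containerId);
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;  // already released; nothing to do for this duplicate report
  }
  // ... normal release/accounting logic for a live container ...
}
{code}
The NM-side fix tracked here is to stop resending completions the RM has already acknowledged, rather than relying on the RM to discard the duplicates.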
[jira] [Commented] (YARN-3016) (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270518#comment-14270518 ] Rohith commented on YARN-3016: -- It make sense to me. (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager - Key: YARN-3016 URL: https://issues.apache.org/jira/browse/YARN-3016 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Now we have separated but similar implementations for add/remove/replace labels on node in CommonNodeLabelsManager, we should merge it to a single one for easier modify them and better readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
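A sketch of the proposed shape: one internal method parameterized by the operation, so add/remove/replace share the same node lookup and bookkeeping (the enum and method names below are illustrative, not the final CommonNodeLabelsManager API):
{code}
// Illustrative merge of internalAddLabels/internalRemoveLabels/
// internalReplaceLabels into a single operation-driven helper.
enum NodeLabelOp { ADD, REMOVE, REPLACE }

private void internalUpdateLabelsOnNodes(
    Map<NodeId, Set<String>> nodeToLabels, NodeLabelOp op) {
  for (Map.Entry<NodeId, Set<String>> entry : nodeToLabels.entrySet()) {
    NodeId nodeId = entry.getKey();
    Set<String> requested = entry.getValue();
    Set<String> current = getLabelsByNode(nodeId);  // shared lookup, assumed helper
    switch (op) {
      case ADD:
        current.addAll(requested);
        break;
      case REMOVE:
        current.removeAll(requested);
        break;
      case REPLACE:
        current.clear();
        current.addAll(requested);
        break;
    }
    // shared bookkeeping: persist the new mapping and fire the update event
  }
}
{code}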
[jira] [Commented] (YARN-2807) Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive
[ https://issues.apache.org/jira/browse/YARN-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270522#comment-14270522 ] Akira AJISAKA commented on YARN-2807: - +1, thank you [~iwasakims]. Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive Key: YARN-2807 URL: https://issues.apache.org/jira/browse/YARN-2807 Project: Hadoop YARN Issue Type: Sub-task Components: documentation, resourcemanager Reporter: Wangda Tan Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-2807.1.patch, YARN-2807.2.patch, YARN-2807.3.patch Currently the help message of yarn rmadmin -transitionToActive is: {code} transitionToActive: incorrect number of arguments Usage: HAAdmin [-transitionToActive <serviceId> [--forceactive]] {code} But --forceactive does not work as expected. When transitioning the RM state with --forceactive: {code} yarn rmadmin -transitionToActive rm2 --forceactive Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@64c9f31e Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag. {code} As shown above, we still cannot transitionToActive when automatic failover is enabled, even with --forceactive. The option that does work is {{--forcemanual}}, but the usage message does not describe this option. I think we should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2956) Some yarn-site index linked pages are difficult to discover because are not in the side bar
[ https://issues.apache.org/jira/browse/YARN-2956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270255#comment-14270255 ] Jian He commented on YARN-2956: --- [~iwasakims], thanks for working on this. Maybe add a link in the side bar to hadoop-yarn-site/index.html too? The link could be located in the YARN section of the side bar and be called Overview. I think it's fine to keep a full list of document indexes in the main index too? Some yarn-site index linked pages are difficult to discover because are not in the side bar --- Key: YARN-2956 URL: https://issues.apache.org/jira/browse/YARN-2956 Project: Hadoop YARN Issue Type: Bug Components: documentation Affects Versions: 2.6.0 Reporter: Remus Rusanu Assignee: Masatake Iwasaki Priority: Minor Labels: documentation Attachments: YARN-2956.1.patch The yarn-site index.apt.vm page is difficult to 'stumble upon' because the hadoop.apache.org/ sidebar navigation does not link to it. One needs to know the URL http://hadoop.apache.org/docs/r2.6.0/hadoop-yarn/hadoop-yarn-site/ to land on it. The links from the index page do not match the links from the side bar, so only some pages are quickly accessible (from the sidebar). I propose that the links from the index.apt.vm match the links from the YARN side bar subsection (ideally through one single definition file, but I don't understand the APT generation process well enough to call out how this can be achieved). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270274#comment-14270274 ] Yi Liu commented on YARN-2996: -- Thanks [~zjshen] for review and commit. Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places that invoke {{fs.exists}} and then {{fs.getFileStatus}}; we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + ".new"); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} method is not optimal either: it writes the file to _output\_file_.tmp, then renames it to _output\_file_.new, then renames that to _output\_file_; we can reduce one rename operation. Also there is one unnecessary import, which we can remove. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
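For point 1, the two RPCs can usually be collapsed into a single {{getFileStatus}} call plus a {{FileNotFoundException}} catch — a sketch of the idea, not the exact patch:
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// One RPC instead of two: getFileStatus() both checks existence and
// returns the status; a missing path surfaces as FileNotFoundException.
static FileStatus getFileStatusIfExists(FileSystem fs, Path path)
    throws IOException {
  try {
    return fs.getFileStatus(path);
  } catch (FileNotFoundException e) {
    return null;  // caller treats null as "path does not exist"
  }
}
{code}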
[jira] [Created] (YARN-3022) Expose Container resource information from NodeManager for monitoring
Anubhav Dhoot created YARN-3022: --- Summary: Expose Container resource information from NodeManager for monitoring Key: YARN-3022 URL: https://issues.apache.org/jira/browse/YARN-3022 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Along with exposing the resource consumption of each container (as in YARN-2141), it's worth exposing the actual resource limit associated with them, to get better insight into YARN allocation and consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wei Yan updated YARN-810: - Attachment: YARN-810-6.patch Update a patch to fix the test failures. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810-5.patch, YARN-810-6.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us 10 [criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us -1 {noformat} Oddly, it appears that the cfs_period_us is set to .1s, not 1s. We can place processes in hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage. {noformat} CPU 4370 criccomi 20 0 1157m 551m 14m S 240.3 0.8 87:10.91 ... 
{noformat} When I set the CFS quota: {noformat} echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us CPU 4370 criccomi 20 0 1157m 563m 14m S 1.0 0.8 90:08.39 ... {noformat} It drops to 1% usage, and you can see the box has room to spare: {noformat} Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st {noformat} Turning the quota back to -1: {noformat} echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us {noformat} Burns the cores again: {noformat} Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st CPU 4370 criccomi 20 0 1157m 563m 14m S 253.9 0.8 89:32.31 ... {noformat} On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit.
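For reference, the ceiling described above amounts to giving each container its vcore fraction of every CFS period. A rough sketch of the arithmetic (the method is illustrative, not the actual CgroupsLCEResourcesHandler code):
{code}
// Illustrative CFS quota computation for a container's cgroup.
// period is cpu.cfs_period_us (e.g. 100000 = 0.1s); the quota written to
// cpu.cfs_quota_us caps the container at its share of the node's cores.
static long cfsQuotaMicros(long periodMicros, int containerVcores,
                           int nodeVcores, int nodePhysicalCores) {
  double vcoreFraction = (double) containerVcores / nodeVcores;
  return (long) (periodMicros * vcoreFraction * nodePhysicalCores);
}

// Example: period=100000us, 1 of 4 vcores, 1 physical core
//   -> quota = 25000us, i.e. a hard 25% ceiling on that core,
//      instead of the soft minimum guarantee cpu.shares gives today.
{code}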
[jira] [Updated] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-313: Attachment: YARN-313-v3.patch Resync the patch to latest trunk. Add Admin API for supporting node resource configuration in command line Key: YARN-313 URL: https://issues.apache.org/jira/browse/YARN-313 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-313-sample.patch, YARN-313-v1.patch, YARN-313-v2.patch, YARN-313-v3.patch We should provide some admin interface, e.g. yarn rmadmin -refreshResources to support changes of node's resource specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Attachment: YARN-3014-2.patch Thanks [~jianhe]'s review, addressed comments and also updated test cases to cover them. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch, YARN-3014-2.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
zhihai xu created YARN-3023: --- Summary: Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. {code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. 
Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270475#comment-14270475 ] Hadoop QA commented on YARN-3014: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691003/YARN-3014-2.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6287//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6287//console This message is automatically generated. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch, YARN-3014-2.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-810) Support CGroup ceiling enforcement on CPU
[ https://issues.apache.org/jira/browse/YARN-810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270492#comment-14270492 ] Hadoop QA commented on YARN-810: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691002/YARN-810-6.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 8 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-applications-distributedshell hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6286//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6286//console This message is automatically generated. Support CGroup ceiling enforcement on CPU - Key: YARN-810 URL: https://issues.apache.org/jira/browse/YARN-810 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Affects Versions: 2.1.0-beta, 2.0.5-alpha Reporter: Chris Riccomini Assignee: Sandy Ryza Attachments: YARN-810-3.patch, YARN-810-4.patch, YARN-810-5.patch, YARN-810-6.patch, YARN-810.patch, YARN-810.patch Problem statement: YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in the yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's using, provided that no other container is also using it. This happens, even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th). There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if it's available. 
Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/ Solution: As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups: {noformat} cpu.cfs_quota_us cpu.cfs_period_us {noformat} The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html Testing: I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN. First, you can see that CFS is in use in the CGroup, based on the file names: {noformat} [criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/ total 0 -r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02 -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us -rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares -r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat -rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release -rw-r--r-- 1 app app 0 Jun 13 16:46 tasks [criccomi@eat1-qa464 ~]$ sudo -u app cat
[jira] [Commented] (YARN-313) Add Admin API for supporting node resource configuration in command line
[ https://issues.apache.org/jira/browse/YARN-313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270510#comment-14270510 ] Hadoop QA commented on YARN-313: {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691018/YARN-313-v3.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 4 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6288//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6288//console This message is automatically generated. Add Admin API for supporting node resource configuration in command line Key: YARN-313 URL: https://issues.apache.org/jira/browse/YARN-313 Project: Hadoop YARN Issue Type: Sub-task Components: client Reporter: Junping Du Assignee: Junping Du Priority: Critical Attachments: YARN-313-sample.patch, YARN-313-v1.patch, YARN-313-v2.patch, YARN-313-v3.patch We should provide some admin interface, e.g. yarn rmadmin -refreshResources to support changes of node's resource specified in a config file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3022) Expose Container resource information from NodeManager for monitoring
[ https://issues.apache.org/jira/browse/YARN-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anubhav Dhoot updated YARN-3022: Attachment: YARN-3022.001.patch Initial patch based on YARN-2984, which adds metrics for containers. Expose Container resource information from NodeManager for monitoring - Key: YARN-3022 URL: https://issues.apache.org/jira/browse/YARN-3022 Project: Hadoop YARN Issue Type: Bug Reporter: Anubhav Dhoot Assignee: Anubhav Dhoot Attachments: YARN-3022.001.patch Along with exposing the resource consumption of each container (as in YARN-2141), it's worth exposing the actual resource limit associated with them, to get better insight into YARN allocation and consumption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
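Conceptually, the data to surface is just the configured limit next to the measured usage for each container. A minimal, framework-agnostic sketch of that pairing (field names are illustrative; the actual patch builds on the YARN-2984 container metrics):
{code}
// Illustrative holder pairing a container's allocated limit with its
// observed usage, so monitoring can compare allocation vs. consumption.
class ContainerResourceInfo {
  final long memoryLimitMB;      // from the container's Resource allocation
  final int vcoreLimit;
  volatile long memoryUsedMB;    // sampled, e.g. from the process tree
  volatile float vcoresUsed;

  ContainerResourceInfo(long memoryLimitMB, int vcoreLimit) {
    this.memoryLimitMB = memoryLimitMB;
    this.vcoreLimit = vcoreLimit;
  }

  double memoryUtilization() {
    return memoryLimitMB == 0 ? 0.0 : (double) memoryUsedMB / memoryLimitMB;
  }
}
{code}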
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270502#comment-14270502 ] Hadoop QA commented on YARN-2637: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12691012/YARN-2637.36.patch against trunk revision ae91b13. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 9 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServicesCapacitySched Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6289//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6289//console This message is automatically generated. maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (IteratorFiCaSchedulerApp i=pendingApplications.iterator(); i.hasNext(); ) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() = getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info(Application + application.getApplicationId() + from user: + application.getUser() + activated in queue: + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM ( 
minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
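The direction of the fix, per the summary, is to gate activation on AM resource for both the queue and the user, not on an application count. A hedged sketch of a resource-based check (the names and the Resources helper calls are illustrative, not the actual LeafQueue patch):
{code}
// Illustrative: activate pending applications only while the sum of their
// AM container resources stays within the queue's AM limit (a per-user
// share of that limit would be checked the same way).
Resource amUsed = Resources.createResource(0, 0);
for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) {
  FiCaSchedulerApp application = i.next();
  Resource amDemand = application.getAMResource();  // size of the AM container
  if (!Resources.fitsIn(Resources.add(amUsed, amDemand), maxAMResourceLimit)) {
    break;  // queue-level maximum-am-resource-percent would be exceeded
  }
  // a per-user check against a userAMResourceLimit would go here as well
  Resources.addTo(amUsed, amDemand);
  activeApplications.add(application);
  i.remove();
}
{code}
With this shape, the 5M AMs in the example above stop activating once their summed AM resource reaches the 200M limit, instead of all being admitted.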
[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270526#comment-14270526 ] Rohith commented on YARN-3023: -- Which version of Hadoop are you using? In trunk this is handled, If node already exists then ZKRMStateStore wont throw NodeExists {code} catch (KeeperException ke) { if (ke.code() == Code.NODEEXISTS) { LOG.info(znode already exists!); return null; } // other code } {code} Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! 
{code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. {code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
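Expanded a little, the handling Rohith quotes makes the create retry idempotent: a NodeExists hit on a retry is treated as success, because the first (connection-lost) attempt actually wrote the znode. A self-contained sketch of the pattern (not the exact ZKRMStateStore code):
{code}
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Retry a znode create, treating NodeExists as success: if an earlier
// attempt lost its connection after the server applied the create, the
// retry would otherwise fail even though the data is safely stored.
static void createWithRetries(ZooKeeper zk, String path, byte[] data,
                              int maxRetries) throws Exception {
  for (int retry = 0; ; retry++) {
    try {
      zk.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
      return;
    } catch (KeeperException.NodeExistsException e) {
      return;  // a previous, connection-lost attempt already created it
    } catch (KeeperException.ConnectionLossException e) {
      if (retry >= maxRetries) {
        throw e;  // maxed out retries; the caller decides whether this is fatal
      }
      Thread.sleep(1000);  // matches the one-second retry interval in the log
    }
  }
}
{code}
Treating NodeExists as success is safe here because the path is unique per application attempt, so an existing znode can only mean an earlier write of the same data.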
[jira] [Updated] (YARN-2807) Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive
[ https://issues.apache.org/jira/browse/YARN-2807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Akira AJISAKA updated YARN-2807: Hadoop Flags: Reviewed Option --forceactive not works as described in usage of yarn rmadmin -transitionToActive Key: YARN-2807 URL: https://issues.apache.org/jira/browse/YARN-2807 Project: Hadoop YARN Issue Type: Sub-task Components: documentation, resourcemanager Reporter: Wangda Tan Assignee: Masatake Iwasaki Priority: Minor Attachments: YARN-2807.1.patch, YARN-2807.2.patch, YARN-2807.3.patch Currently the help message of yarn rmadmin -transitionToActive is: {code} transitionToActive: incorrect number of arguments Usage: HAAdmin [-transitionToActive <serviceId> [--forceactive]] {code} But --forceactive does not work as expected. When transitioning the RM state with --forceactive: {code} yarn rmadmin -transitionToActive rm2 --forceactive Automatic failover is enabled for org.apache.hadoop.yarn.client.RMHAServiceTarget@64c9f31e Refusing to manually manage HA state, since it may cause a split-brain scenario or other incorrect state. If you are very sure you know what you are doing, please specify the forcemanual flag. {code} As shown above, we still cannot transitionToActive when automatic failover is enabled, even with --forceactive. The option that does work is {{--forcemanual}}, but the usage message does not describe this option. I think we should fix this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270581#comment-14270581 ] zhihai xu commented on YARN-3023: - Yes, you are right. The issue is the same as YARN-2721. Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. 
{code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (YARN-3023) Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash
[ https://issues.apache.org/jira/browse/YARN-3023?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhihai xu resolved YARN-3023. - Resolution: Duplicate Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash - Key: YARN-3023 URL: https://issues.apache.org/jira/browse/YARN-3023 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: zhihai xu Assignee: zhihai xu Race condition in ZKRMStateStore#createWithRetries from ZooKeeper cause RM crash. The sequence for the Race condition is the following: 1, RM Store attempt state to ZK by calling createWithRetries {code} 2015-01-06 12:37:35,343 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Storing attempt: AppId: application_1418914202950_42363 AttemptId: appattempt_1418914202950_42363_01 MasterContainer: Container: [ContainerId: container_1418914202950_42363_01_01, {code} 2. unluckily ConnectionLoss for the ZK session happened at the same time as RM Stored attempt state to ZK. The ZooKeeper server created the node and store the data successfully, But due to ConnectionLoss, RM didn't know the operation (createWithRetries) is succeeded. {code} 2015-01-06 12:37:36,102 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss {code} 3.RM did retry to store attempt state to ZK after one second {code} 2015-01-06 12:37:36,104 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Retrying operation on ZK. Retry no. 1 {code} 4. during the one second interval, the ZK session is reconnected. {code} 2015-01-06 12:37:36,210 INFO org.apache.zookeeper.ClientCnxn: Socket connection established initiating session 2015-01-06 12:37:36,213 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server, sessionid = 0x44a9166eb2d12cb, negotiated timeout = 1 {code} 5. Because the node was created successfully at ZooKeeper in the first try(runWithCheck), For the second try, it will fail with NodeExists KeeperException {code} 2015-01-06 12:37:37,116 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Exception while executing a ZK operation. org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,118 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore: Maxed out ZK retries. Giving up! {code} 6.This NodeExists KeeperException will cause Storing AppAttempt failure in RMStateStore {code} 2015-01-06 12:37:37,118 ERROR org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Error storing appAttempt: appattempt_1418914202950_42363_01 org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists {code} 7.RMStateStore will send RMFatalEventType.STATE_STORE_OP_FAILED event to ResourceManager {code} protected void notifyStoreOperationFailed(Exception failureCause) { RMFatalEventType type; if (failureCause instanceof StoreFencedException) { type = RMFatalEventType.STATE_STORE_FENCED; } else { type = RMFatalEventType.STATE_STORE_OP_FAILED; } rmDispatcher.getEventHandler().handle(new RMFatalEvent(type, failureCause)); } {code} 8.ResoureManager will kill itself after received STATE_STORE_OP_FAILED RMFatalEvent. 
{code} 2015-01-06 12:37:37,128 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED. Cause: org.apache.zookeeper.KeeperException$NodeExistsException: KeeperErrorCode = NodeExists 2015-01-06 12:37:37,138 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 1 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3024) LocalizerRunner should give DIE action when all resources are localized
[ https://issues.apache.org/jira/browse/YARN-3024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chengbing Liu updated YARN-3024: Attachment: YARN-3024.02.patch Fixed tests accordingly. LocalizerRunner should give DIE action when all resources are localized --- Key: YARN-3024 URL: https://issues.apache.org/jira/browse/YARN-3024 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-3024.01.patch, YARN-3024.02.patch We have observed that {{LocalizerRunner}} always gives a LIVE action at the end of localization process. The problem is {{findNextResource()}} can return null even when {{pending}} was not empty prior to the call. This method removes localized resources from {{pending}}, therefore we should check the return value, and gives DIE action when it returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
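The gist of the fix: when {{findNextResource()}} returns null there is nothing left to hand out, so the heartbeat response should be DIE rather than LIVE. A simplified sketch of that decision (the enum and surrounding method are illustrative, not the exact LocalizerRunner code):
{code}
// Simplified heartbeat-action decision for a localizer.
// findNextResource() prunes already-localized entries from `pending` and
// may therefore return null even if `pending` was non-empty before the
// call, so the return value itself must be checked.
LocalizerAction decideAction() {
  LocalResource next = findNextResource();
  if (next == null) {
    return LocalizerAction.DIE;   // nothing left to localize: tell it to exit
  }
  return LocalizerAction.LIVE;    // keep the localizer alive and hand out `next`
}
{code}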
[jira] [Commented] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270535#comment-14270535 ] Vinod Kumar Vavilapalli commented on YARN-2141: --- One other related effort is YARN-2928 which is also planning to obtain and send information about container resource-usage to a per-application aggregator. We should try to unify these.. [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs or directly users via webapps), can facilitate several performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270536#comment-14270536 ] Vinod Kumar Vavilapalli commented on YARN-2984: --- Linking related efforts. One other related effort is YARN-2928 which is also planning to obtain and send information about container resource-usage to a per-application aggregator. We should try to unify these.. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Craig Welch updated YARN-2637: -- Attachment: YARN-2637.36.patch Should be down to one failing test, let's see maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.36.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM (> minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270087#comment-14270087 ] Allen Wittenauer commented on YARN-3019: I'm not in favor of this going into branch-2. It's a fundamental change to operating expectations that may have a significant impact on capacity. Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270100#comment-14270100 ] Jian He commented on YARN-3019: --- To clarify: this jira is to flip recovery mode to work-preserving recovery from non-work-preserving recovery. The feature itself remains disabled. i.e. yarn.resourcemanager.recovery.enabled remains false. updating the description. Further, I'm also thinking to enable the feature itself by default and use the local FS as the default file system. I'm OK to do this only on trunk. That will uncover bugs if any. I can open a separate jira for this. Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
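For clarity, the two settings involved, shown set explicitly (property names as given in the description and in the comment above; the snippet only illustrates the flags and is not part of the patch):
{code}
import org.apache.hadoop.conf.Configuration;

// Recovery itself stays opt-in; this JIRA only flips the *mode* used once
// recovery is enabled, from non-work-preserving to work-preserving.
Configuration conf = new Configuration();
conf.setBoolean("yarn.resourcemanager.recovery.enabled", true);                  // still false by default
conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);  // proposed new default
{code}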
[jira] [Commented] (YARN-1680) availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory.
[ https://issues.apache.org/jira/browse/YARN-1680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270110#comment-14270110 ] Chen He commented on YARN-1680: --- Thank you for the comments, [~jlowe]. [~cwelch] created YARN-2848, which discusses blacklisted nodes and label scheduling. I will work on a patch that fixes the blacklisted node case. availableResources sent to applicationMaster in heartbeat should exclude blacklistedNodes free memory. -- Key: YARN-1680 URL: https://issues.apache.org/jira/browse/YARN-1680 Project: Hadoop YARN Issue Type: Sub-task Affects Versions: 2.2.0, 2.3.0 Environment: SuSE 11 SP2 + Hadoop-2.3 Reporter: Rohith Assignee: Chen He Attachments: YARN-1680-WIP.patch, YARN-1680-v2.patch, YARN-1680-v2.patch, YARN-1680.patch There are 4 NodeManagers with 8GB each. Total cluster capacity is 32GB. Cluster slow start is set to 1. A job is running whose reducer tasks occupy 29GB of the cluster. One NodeManager (NM-4) became unstable (3 map tasks got killed), so the MRAppMaster blacklisted the unstable NodeManager (NM-4). All reducer tasks are now running in the cluster. The MRAppMaster does not preempt the reducers because the headroom used for the reducer-preemption calculation still includes the blacklisted node's memory. This makes jobs hang forever (the ResourceManager does not assign any new containers on blacklisted nodes, but the availableResources it returns still counts the cluster's free memory). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
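The direction for the patch is to subtract blacklisted nodes' free resources when computing the headroom returned to the AM. A rough sketch of that adjustment (the blacklist and node lookups are assumptions; Resources is YARN's standard resource-arithmetic helper):
{code}
// Illustrative headroom correction: remove the available capacity of nodes
// the application has blacklisted, so the AM no longer counts memory it can
// never be allocated there (letting MR reducer preemption trigger correctly).
Resource headroom = Resources.clone(baseHeadroom);
for (NodeId nodeId : application.getBlacklistedNodes()) {   // assumed accessor
  SchedulerNode node = getSchedulerNode(nodeId);            // assumed lookup
  if (node != null) {
    Resources.subtractFrom(headroom, node.getAvailableResource());
  }
}
// clamp at zero so a large blacklist cannot produce a negative headroom
headroom = Resources.componentwiseMax(headroom, Resources.none());
{code}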
[jira] [Created] (YARN-3021) YARN's delegation-token handling disallows certain trust setups to operate properly
Harsh J created YARN-3021: - Summary: YARN's delegation-token handling disallows certain trust setups to operate properly Key: YARN-3021 URL: https://issues.apache.org/jira/browse/YARN-3021 Project: Hadoop YARN Issue Type: Bug Components: security Affects Versions: 2.3.0 Reporter: Harsh J Consider this scenario of 3 realms: A, B and COMMON, where A trusts COMMON, and B trusts COMMON (one way trusts both), and both A and B run HDFS + YARN clusters. Now if one logs in with a COMMON credential, and runs a job on A's YARN that needs to access B's HDFS (such as a DistCp), the operation fails in the RM, as it attempts a renewDelegationToken(…) synchronously during application submission (to validate the managed token before it adds it to a scheduler for automatic renewal). The call obviously fails cause B realm will not trust A's credentials (here, the RM's principal is the renewer). In the 1.x JobTracker the same call is present, but it is done asynchronously and once the renewal attempt failed we simply ceased to schedule any further attempts of renewals, rather than fail the job immediately. We should change the logic such that we attempt the renewal but go easy on the failure and skip the scheduling alone, rather than bubble back an error to the client, failing the app submission. This way the old behaviour is retained. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
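The behavioural change being proposed is to make the submission-time renewal best-effort: attempt it, and on failure skip scheduling future renewals instead of failing the submission. A sketch of that shape (the renewal and scheduling methods are illustrative, not the DelegationTokenRenewer API):
{code}
// Illustrative best-effort renewal at app-submission time. A token the RM
// cannot renew (e.g. one issued by a realm that does not trust the RM's
// credentials) is logged and excluded from periodic renewal, instead of
// bubbling an error back to the client and failing the submission.
for (Token<?> token : applicationTokens) {
  try {
    long nextExpirationTime = renewToken(token);   // may fail across one-way trusts
    scheduleRenewal(token, nextExpirationTime);    // keep auto-renewing this token
  } catch (IOException e) {
    LOG.warn("Unable to renew token " + token + " for " + applicationId
        + "; skipping automatic renewal for it", e);
    // mirrors the 1.x JobTracker behaviour: cease renewals, don't fail the job
  }
}
{code}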
[jira] [Commented] (YARN-2637) maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications.
[ https://issues.apache.org/jira/browse/YARN-2637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14270116#comment-14270116 ] Jian He commented on YARN-2637: --- Quick thing: YARN-3010 fixed the findbugs warning, so the findbugs exclusion in the patch may not be needed. maximum-am-resource-percent could be respected for both LeafQueue/User when trying to activate applications. Key: YARN-2637 URL: https://issues.apache.org/jira/browse/YARN-2637 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.6.0 Reporter: Wangda Tan Assignee: Craig Welch Priority: Critical Attachments: YARN-2637.0.patch, YARN-2637.1.patch, YARN-2637.12.patch, YARN-2637.13.patch, YARN-2637.15.patch, YARN-2637.16.patch, YARN-2637.17.patch, YARN-2637.18.patch, YARN-2637.19.patch, YARN-2637.2.patch, YARN-2637.20.patch, YARN-2637.21.patch, YARN-2637.22.patch, YARN-2637.23.patch, YARN-2637.25.patch, YARN-2637.26.patch, YARN-2637.27.patch, YARN-2637.28.patch, YARN-2637.29.patch, YARN-2637.30.patch, YARN-2637.31.patch, YARN-2637.32.patch, YARN-2637.6.patch, YARN-2637.7.patch, YARN-2637.9.patch Currently, number of AM in leaf queue will be calculated in following way: {code} max_am_resource = queue_max_capacity * maximum_am_resource_percent #max_am_number = max_am_resource / minimum_allocation #max_am_number_for_each_user = #max_am_number * userlimit * userlimit_factor {code} And when submit new application to RM, it will check if an app can be activated in following way: {code} for (Iterator<FiCaSchedulerApp> i = pendingApplications.iterator(); i.hasNext();) { FiCaSchedulerApp application = i.next(); // Check queue limit if (getNumActiveApplications() >= getMaximumActiveApplications()) { break; } // Check user limit User user = getUser(application.getUser()); if (user.getActiveApplications() < getMaximumActiveApplicationsPerUser()) { user.activateApplication(); activeApplications.add(application); i.remove(); LOG.info("Application " + application.getApplicationId() + " from user: " + application.getUser() + " activated in queue: " + getQueueName()); } } {code} An example is, If a queue has capacity = 1G, max_am_resource_percent = 0.2, the maximum resource that AM can use is 200M, assuming minimum_allocation=1M, #am can be launched is 200, and if user uses 5M for each AM (> minimum_allocation). All apps can still be activated, and it will occupy all resource of a queue instead of only a max_am_resource_percent of a queue. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3019: -- Description: The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default to flip recovery mode to work-preserving recovery from non-work-preserving recovery. (was: The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. ) Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default to flip recovery mode to work-preserving recovery from non-work-preserving recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
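For clarity, a small illustrative snippet of what opting in looks like today, before the default flips; the property key is the literal string quoted in the description, and the surrounding class is just scaffolding.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Sketch only: enable work-preserving recovery explicitly until the
// default is changed to true.
public class EnableWorkPreservingRecovery {
  public static Configuration create() {
    Configuration conf = new YarnConfiguration();
    conf.setBoolean("yarn.resourcemanager.work-preserving-recovery.enabled", true);
    return conf;
  }
}
{code}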
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270179#comment-14270179 ] Jian He commented on YARN-3014: --- addLabel and removeLabel on the Host should not do replaceLabel on the NM ? Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
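A hypothetical sketch of the intended semantics (the maps and method name are illustrative stand-ins, not the CommonNodeLabelsManager data structures): replacing labels on a host also overwrites the labels of every NM registered on that host.
{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public class HostLabelPropagation {
  // Illustrative state: NMs known per host, and labels per NM.
  private final Map<String, Set<NodeId>> nodeIdsByHost = new HashMap<>();
  private final Map<NodeId, Set<String>> labelsByNode = new HashMap<>();

  void replaceLabelsOnHost(String hostName, Set<String> newLabels) {
    Set<NodeId> nms = nodeIdsByHost.get(hostName);
    if (nms == null) {
      return;
    }
    for (NodeId nm : nms) {
      // Host-level replace wins over any label previously set on the NM directly.
      labelsByNode.put(nm, newLabels);
    }
  }
}
{code}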
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269217#comment-14269217 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
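As background on this class of findbugs warning, a generic sketch of the usual remedy (not the literal YARN-3010 patch): make every access to the shared field go through methods that take the same lock, so reads never happen outside synchronization.
{code}
// Hedged sketch; Object stands in for the real RMContext type.
public class SchedulerContextHolder {
  private Object rmContext;

  public synchronized void setRMContext(Object rmContext) {
    this.rmContext = rmContext;
  }

  public synchronized Object getRMContext() {
    // readers take the same lock as writers, so findbugs sees consistent locking
    return rmContext;
  }
}
{code}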
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269223#comment-14269223 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
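The SchedulerUtils excerpt in the description above lost its comparison operators and string quotes in the e-mail formatting; reconstructed, the check reads roughly as follows (a close paraphrase, not a verbatim copy of trunk).
{code}
if (resReq.getCapability().getVirtualCores() < 0
    || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) {
  throw new InvalidResourceRequestException("Invalid resource request"
      + ", requested virtual cores < 0"
      + ", or requested virtual cores > max configured"
      + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores()
      + ", maxVirtualCores=" + maximumResource.getVirtualCores());
}
{code}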
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269221#comment-14269221 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269220#comment-14269220 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Yarn-trunk-Java8 #67 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/67/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
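A hedged sketch of the direction the fix takes per the commit message above (the builder field and helper name are assumptions): populate the proto builder from the identifier's fields inside getProto() itself, so identifiers built without the removed setters still serialize their contents.
{code}
// Illustrative fragment only; setBuilderFields() is an assumed helper that
// copies owner/renewer/realUser and the other identifier fields into builder.
public YARNDelegationTokenIdentifierProto getProto() {
  setBuilderFields();
  return builder.build();
}
{code}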
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269248#comment-14269248 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269250#comment-14269250 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269247#comment-14269247 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269244#comment-14269244 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Yarn-trunk #801 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/801/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] nijel reassigned YARN-3018: --- Assignee: nijel Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Assignee: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
[ https://issues.apache.org/jira/browse/YARN-3018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269058#comment-14269058 ] nijel commented on YARN-3018: - Please give your opinion. I prefer to have the value as -1 in the file as well. If that sounds good, I can upload a patch. Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3018) Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file
nijel created YARN-3018: --- Summary: Unify the default value for yarn.scheduler.capacity.node-locality-delay in code and default xml file Key: YARN-3018 URL: https://issues.apache.org/jira/browse/YARN-3018 Project: Hadoop YARN Issue Type: Bug Components: capacityscheduler Reporter: nijel Priority: Trivial For the configuration item yarn.scheduler.capacity.node-locality-delay, the default value given in code is -1 (public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;), while in the default capacity-scheduler.xml file in the resource manager config directory it is 40. Can the two be unified to avoid confusion when the user creates the file without this configuration? If the user expects the values in the file to be the defaults, they will be wrong. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
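To make the mismatch concrete, a small illustrative fragment (the getInt call is a generic Configuration read, not the CapacityScheduler code): with the property absent the code default of -1 applies, while the shipped capacity-scheduler.xml sets 40.
{code}
import org.apache.hadoop.conf.Configuration;

// Code default quoted in the description:
public static final int DEFAULT_NODE_LOCALITY_DELAY = -1;

// Illustrative read: yields -1 when the file omits the property, but 40 when
// the shipped capacity-scheduler.xml is present, hence the confusion.
static int nodeLocalityDelay(Configuration conf) {
  return conf.getInt("yarn.scheduler.capacity.node-locality-delay",
      DEFAULT_NODE_LOCALITY_DELAY);
}
{code}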
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269728#comment-14269728 ] Zhijie Shen commented on YARN-2996: --- +1. The test failure seems not to be related. Will commit the patch. Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places invoke {{fs.exists}}, then {{fs.getFileStatus}}, we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + .new); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good too, it write file to _output\_file_.tmp, then rename to _output\_file_.new, then rename it to _output\_file_, we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
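A minimal sketch of refinement *1* above, assuming standard FileSystem semantics rather than the exact patch: a single getFileStatus() call replaces the exists() + getFileStatus() pair, saving one RPC.
{code}
import java.io.FileNotFoundException;
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleRpcStatus {
  // Same information exists() + getFileStatus() would give, with one RPC instead of two.
  static FileStatus statusOrNull(FileSystem fs, Path versionNodePath) throws IOException {
    try {
      return fs.getFileStatus(versionNodePath);
    } catch (FileNotFoundException e) {
      return null;
    }
  }
}
{code}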
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269720#comment-14269720 ] Zhijie Shen commented on YARN-2880: --- Is https://builds.apache.org/job/PreCommit-YARN-Build/6274/testReport/org.apache.hadoop.yarn.server.resourcemanager.applicationsmanager/TestAMRestart/testShouldNotCountFailureToMaxAttemptRetry/ related to the change here? Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2996) Refine fs operations in FileSystemRMStateStore and few fixes
[ https://issues.apache.org/jira/browse/YARN-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269754#comment-14269754 ] Hudson commented on YARN-2996: -- FAILURE: Integrated in Hadoop-trunk-Commit #6830 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6830/]) YARN-2996. Improved synchronization and I/O operations of FS- and Mem- RMStateStore. Contributed by Yi Liu. (zjshen: rev dc2eaa26b20cfbbcdd5784bb8761d08a42f29605) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/FileSystemRMStateStore.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/recovery/MemoryRMStateStore.java Refine fs operations in FileSystemRMStateStore and few fixes Key: YARN-2996 URL: https://issues.apache.org/jira/browse/YARN-2996 Project: Hadoop YARN Issue Type: Improvement Components: resourcemanager Reporter: Yi Liu Assignee: Yi Liu Fix For: 2.7.0 Attachments: YARN-2996.001.patch, YARN-2996.002.patch, YARN-2996.003.patch, YARN-2996.004.patch In {{FileSystemRMStateStore}}, we can refine some fs operations to improve performance: *1.* There are several places invoke {{fs.exists}}, then {{fs.getFileStatus}}, we can merge them to save one RPC call {code} if (fs.exists(versionNodePath)) { FileStatus status = fs.getFileStatus(versionNodePath); {code} *2.* {code} protected void updateFile(Path outputPath, byte[] data) throws Exception { Path newPath = new Path(outputPath.getParent(), outputPath.getName() + .new); // use writeFile to make sure .new file is created atomically writeFile(newPath, data); replaceFile(newPath, outputPath); } {code} The {{updateFile}} is not good too, it write file to _output\_file_.tmp, then rename to _output\_file_.new, then rename it to _output\_file_, we can reduce one rename operation. Also there is one unnecessary import, we can remove it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269784#comment-14269784 ] Mayank Bansal commented on YARN-2933: - Thanks [~wangda] and Sunil for review. bq. In addition to previously comment, I think we put incorrect #container for each application when setLabelContainer=true. The usedResource or current in TestProportionalPreemptionPolicy actually means used resource of nodes without label. So if we want to have labeled container in an application, we should make it stay outside of usedResource. I don't think thats needed as the basic functionality for the test is to demonstrate we can skip labeled container, So I think it does not mater. bq. And testSkipLabeledContainer is fully covered by testIdealAllocationForLabels. Since we have already checked #container preempted in each application in testIdealAllocationForLabels, which implies labeled containers are ignored. Agreed bq. A minor suggest is rename setLabelContainer to setLabeledContainer Agreed bq. An application's(if not specified any labels during submission time) containers, may fall in to nodes where it can be labelled or not labelled. Am I correct? No , As of now containers with no labels can not go to labeled nodes. Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
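A hypothetical sketch of the short-term rule being discussed (the class and method are illustrative, not the ProportionalCapacityPreemptionPolicy code): containers running on labeled nodes are simply not considered as preemption victims.
{code}
import java.util.Set;

public class LabeledContainerFilter {
  // Temporary policy sketch: only resources on nodes without labels are
  // candidates for preemption.
  boolean isPreemptionCandidate(Set<String> labelsOnContainersNode) {
    return labelsOnContainersNode == null || labelsOnContainersNode.isEmpty();
  }
}
{code}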
[jira] [Updated] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mayank Bansal updated YARN-2933: Attachment: YARN-2933-6.patch Attaching patch Thanks, Mayank Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chen He updated YARN-2421: -- Target Version/s: 2.7.0 (was: 2.6.0) CapacityScheduler still allocates containers to an app in the FINISHING state - Key: YARN-2421 URL: https://issues.apache.org/jira/browse/YARN-2421 Project: Hadoop YARN Issue Type: Bug Components: scheduler Affects Versions: 2.4.1 Reporter: Thomas Graves Assignee: chang li Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
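A hypothetical sketch of the guard suggested in the description (the enum and method are illustrative, not RMAppAttempt's actual state machine): stop handing out containers once the attempt is in a finishing or terminal state.
{code}
import java.util.EnumSet;

public class AllocationGuard {
  enum AttemptState { RUNNING, FINISHING, FINISHED, FAILED, KILLED }

  private static final EnumSet<AttemptState> NO_ALLOCATION =
      EnumSet.of(AttemptState.FINISHING, AttemptState.FINISHED,
          AttemptState.FAILED, AttemptState.KILLED);

  // Check before giving the attempt any new containers.
  boolean mayAllocate(AttemptState state) {
    return !NO_ALLOCATION.contains(state);
  }
}
{code}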
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: (was: YARN-2800-20141118-1.patch) Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2786-20150108-1-full.patch Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2800-20141118-1.patch Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-2786: - Attachment: YARN-2786-20150108-1-without-yarn.cmd.patch Updated patch addressed comments from [~aw], and fixed findbugs warning. Please kindly review, thanks. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3016) (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager
[ https://issues.apache.org/jira/browse/YARN-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269824#comment-14269824 ] Wangda Tan commented on YARN-3016: -- I meant the methods in CommonNodeLabelsManager that start with internal; we should have a way to make them simpler. (Refactoring) Merge internalAdd/Remove/ReplaceLabels to one method in CommonNodeLabelsManager - Key: YARN-3016 URL: https://issues.apache.org/jira/browse/YARN-3016 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Now we have separate but similar implementations for add/remove/replace labels on node in CommonNodeLabelsManager; we should merge them into a single one to make them easier to modify and to improve readability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
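A hypothetical sketch of what the merge could look like (the enum and helper names are illustrative, not the eventual patch): one internal entry point parameterized by the operation, replacing the three near-duplicate internal methods.
{code}
import java.util.Map;
import java.util.Set;
import org.apache.hadoop.yarn.api.records.NodeId;

public class MergedLabelUpdate {
  enum LabelOp { ADD, REMOVE, REPLACE }

  void internalUpdateLabelsOnNodes(Map<NodeId, Set<String>> labelsByNode, LabelOp op) {
    for (Map.Entry<NodeId, Set<String>> e : labelsByNode.entrySet()) {
      switch (op) {
        case ADD:     addToExisting(e.getKey(), e.getValue());      break;
        case REMOVE:  removeFromExisting(e.getKey(), e.getValue()); break;
        case REPLACE: replaceExisting(e.getKey(), e.getValue());    break;
      }
    }
  }

  // Assumed helpers standing in for the per-operation bookkeeping.
  void addToExisting(NodeId node, Set<String> labels) { }
  void removeFromExisting(NodeId node, Set<String> labels) { }
  void replaceExisting(NodeId node, Set<String> labels) { }
}
{code}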
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269359#comment-14269359 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() 0 || resReq.getCapability().getVirtualCores() maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException(Invalid resource request + , requested virtual cores 0 + , or requested virtual cores max configured + , requestedVirtualCores= + resReq.getCapability().getVirtualCores() + , maxVirtualCores= + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} property descriptionThe maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value./description nameyarn.scheduler.maximum-allocation-vcores/name value32/value /property {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks that it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores 0, or requested virtual cores max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to client. Otherwise, it is non obvious to discover why a job does not make any progress. The same looks to be related to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269353#comment-14269353 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269356#comment-14269356 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269357#comment-14269357 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk #1999 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1999/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2893) AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream
[ https://issues.apache.org/jira/browse/YARN-2893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269373#comment-14269373 ] Andrew Johnson commented on YARN-2893: -- Yeah, that definitely seems like it's worth a look. Is there anything specific I should look out for? AMLaucher: sporadic job failures due to EOFException in readTokenStorageStream -- Key: YARN-2893 URL: https://issues.apache.org/jira/browse/YARN-2893 Project: Hadoop YARN Issue Type: Bug Components: resourcemanager Affects Versions: 2.4.0 Reporter: Gera Shegalov MapReduce jobs on our clusters experience sporadic failures due to corrupt tokens in the AM launch context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269380#comment-14269380 ] Hudson commented on YARN-2880: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269376#comment-14269376 ] Hudson commented on YARN-3010: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbug issues reported recently in latest trunk: {quote} ISInconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269382#comment-14269382 ] Hudson commented on YARN-2230: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
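One possible client-side mitigation, until the documentation/code mismatch is resolved, is for the AM to clamp its ask to the maximum capability the RM advertises at registration. A minimal sketch under that assumption (the host name, memory size and priority below are illustrative, not taken from the issue):
{code}
// Sketch: cap requested vcores at the scheduler maximum advertised to the AM.
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class VcoreClampSketch {
  public static void main(String[] args) throws Exception {
    AMRMClient<ContainerRequest> amClient = AMRMClient.createAMRMClient();
    amClient.init(new YarnConfiguration());
    amClient.start();

    RegisterApplicationMasterResponse reg =
        amClient.registerApplicationMaster("am-host", 0, "");   // illustrative values
    Resource max = reg.getMaximumResourceCapability();

    int wantedVcores = 32;                                       // what the job asked for
    int vcores = Math.min(wantedVcores, max.getVirtualCores());  // clamp to the cluster max
    Resource capability = Resource.newInstance(1024, vcores);

    amClient.addContainerRequest(
        new ContainerRequest(capability, null, null, Priority.newInstance(0)));
  }
}
{code}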
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269411#comment-14269411 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
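For readers unfamiliar with the warning, the findbugs IS category flags a field that is usually accessed while holding a lock but is also touched on at least one unsynchronized path. A minimal, self-contained illustration of the pattern (not the actual AbstractYarnScheduler code):
{code}
// Illustration of the "inconsistent synchronization" (IS) pattern findbugs reports:
// most accesses to the field hold the instance lock, but one read path does not.
public class InconsistentSyncExample {
  private Object rmContextLike; // shared mutable field, stand-in for the flagged one

  public synchronized void set(Object value) {   // synchronized write
    this.rmContextLike = value;
  }

  public synchronized Object getLocked() {       // synchronized read
    return rmContextLike;
  }

  public Object getUnlocked() {                  // unsynchronized read -> IS warning
    return rmContextLike;
  }
}
{code}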
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269415#comment-14269415 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have such a test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269417#comment-14269417 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2230) Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code)
[ https://issues.apache.org/jira/browse/YARN-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269432#comment-14269432 ] Hudson commented on YARN-2230: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-2230. Fixed few configs description in yarn-default.xml. Contributed by Vijay Bhat (jianhe: rev fe8d2bd74175e7ad521bc310c41a367c0946d6ec) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/resources/yarn-default.xml * hadoop-yarn-project/CHANGES.txt Fix description of yarn.scheduler.maximum-allocation-vcores in yarn-default.xml (or code) - Key: YARN-2230 URL: https://issues.apache.org/jira/browse/YARN-2230 Project: Hadoop YARN Issue Type: Bug Components: client, documentation, scheduler Affects Versions: 2.4.0 Reporter: Adam Kawa Assignee: Vijay Bhat Priority: Minor Fix For: 2.7.0 Attachments: YARN-2230.001.patch, YARN-2230.002.patch When a user requests more vcores than the allocation limit (e.g. mapreduce.map.cpu.vcores is larger than yarn.scheduler.maximum-allocation-vcores), then InvalidResourceRequestException is thrown - https://svn.apache.org/repos/asf/hadoop/common/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerUtils.java {code} if (resReq.getCapability().getVirtualCores() < 0 || resReq.getCapability().getVirtualCores() > maximumResource.getVirtualCores()) { throw new InvalidResourceRequestException("Invalid resource request" + ", requested virtual cores < 0" + ", or requested virtual cores > max configured" + ", requestedVirtualCores=" + resReq.getCapability().getVirtualCores() + ", maxVirtualCores=" + maximumResource.getVirtualCores()); } {code} According to documentation - yarn-default.xml http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-common/yarn-default.xml, the request should be capped to the allocation limit. {code} <property> <description>The maximum allocation for every container request at the RM, in terms of virtual CPU cores. Requests higher than this won't take effect, and will get capped to this value.</description> <name>yarn.scheduler.maximum-allocation-vcores</name> <value>32</value> </property> {code} This means that: * Either documentation or code should be corrected (unless this exception is handled elsewhere accordingly, but it looks like it is not). This behavior is confusing, because when such a job (with mapreduce.map.cpu.vcores larger than yarn.scheduler.maximum-allocation-vcores) is submitted, it does not make any progress. The warnings/exceptions are thrown at the scheduler (RM) side e.g. {code} 2014-06-29 00:34:51,469 WARN org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService: Invalid resource ask by application appattempt_1403993411503_0002_01 org.apache.hadoop.yarn.exceptions.InvalidResourceRequestException: Invalid resource request, requested virtual cores < 0, or requested virtual cores > max configured, requestedVirtualCores=32, maxVirtualCores=3 at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.validateResourceRequest(SchedulerUtils.java:237) at org.apache.hadoop.yarn.server.resourcemanager.RMServerUtils.validateResourceRequests(RMServerUtils.java:80) at org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService.allocate(ApplicationMasterService.java:420) . 
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1026) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1986) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1982) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1980) {code} * IMHO, such an exception should be forwarded to the client. Otherwise, it is not obvious why a job does not make any progress. The same appears to apply to memory. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3010) Fix recent findbug issue in AbstractYarnScheduler
[ https://issues.apache.org/jira/browse/YARN-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269426#comment-14269426 ] Hudson commented on YARN-3010: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-3010. Fixed findbugs warning in AbstractYarnScheduler. Contributed by Yi Liu (jianhe: rev e13a484a2be64fb781c5eca5ae7056cbe194ac5e) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/fair/FairScheduler.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/AbstractYarnScheduler.java Fix recent findbug issue in AbstractYarnScheduler - Key: YARN-3010 URL: https://issues.apache.org/jira/browse/YARN-3010 Project: Hadoop YARN Issue Type: Bug Reporter: Yi Liu Assignee: Yi Liu Priority: Minor Fix For: 2.7.0 Attachments: YARN-3010.001.patch, YARN-3010.002.patch A new findbugs issue was reported recently in the latest trunk: {quote} IS: Inconsistent synchronization of org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.rmContext; locked 91% of time {quote} https://issues.apache.org/jira/browse/YARN-2996?focusedCommentId=14265760page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14265760 https://builds.apache.org/job/PreCommit-YARN-Build/6249//artifact/patchprocess/newPatchFindbugsWarningshadoop-yarn-server-resourcemanager.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269379#comment-14269379 ] Hudson commented on YARN-2936: -- SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #64 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/64/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing an object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() on it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using the proto to persist the DT in the state store, when generating the password. I think the setters were removed to avoid duplicating the setting of the fields when getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed in secretManager. For example, in the test case of YARN-2837, I spent time figuring out that we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
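To make the ordering dependency concrete, here is a self-contained sketch of the pattern described above, with a hypothetical identifier class (not the real YARNDelegationTokenIdentifier) whose builder is only populated inside getBytes():
{code}
// Hypothetical illustration: fields live on the identifier and are copied into the
// builder only when getBytes() runs, so getProto() before getBytes() sees nothing.
public class SketchTokenIdentifier {
  private final StringBuilder protoBuilder = new StringBuilder(); // stand-in for proto.builder
  private final String owner;

  public SketchTokenIdentifier(String owner) {
    this.owner = owner;                           // no setter pushes this into the builder
  }

  public byte[] getBytes() {
    protoBuilder.setLength(0);
    protoBuilder.append("owner=").append(owner);  // fields are materialized here
    return protoBuilder.toString().getBytes();
  }

  public String getProto() {
    return protoBuilder.toString();               // empty unless getBytes() was called first
  }

  public static void main(String[] args) {
    SketchTokenIdentifier id = new SketchTokenIdentifier("alice");
    System.out.println("before getBytes(): '" + id.getProto() + "'"); // prints ''
    id.getBytes();
    System.out.println("after getBytes():  '" + id.getProto() + "'"); // prints 'owner=alice'
  }
}
{code}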
[jira] [Commented] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269548#comment-14269548 ] Steve Loughran commented on YARN-2571: -- Vinod, having the RM create the user paths allows the registry to be set up with the correct permissions as YARN jobs are created. Without that, if there is no path for that user set up, the application is likely to fail post-launch with some error. For cleanup, automatic purging of records keeps the registry data somewhat under control, without applications having to go to the effort of writing these not-yet-implemented cleanup containers. It's not a particularly complex piece of code; there's tests for the distributed shell that verify it works in YARN-2646 RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
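As a concrete illustration of the user-path setup being discussed, here is a minimal Curator-based sketch; the /users/$username layout comes from the issue description, while the SASL principals, ACL bits and helper names are illustrative assumptions rather than the patch's actual code:
{code}
// Sketch: create /users/<username> so the user can create/read/delete child nodes.
// Paths and ACLs are illustrative assumptions, not the patch's implementation.
import java.util.Arrays;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.data.ACL;
import org.apache.zookeeper.data.Id;

public class RegistryUserPathSketch {
  public static void createUserPath(CuratorFramework client, String user) throws Exception {
    String path = "/users/" + user;
    List<ACL> acls = Arrays.asList(
        new ACL(ZooDefs.Perms.ALL, new Id("sasl", "yarn")),                      // system principal
        new ACL(ZooDefs.Perms.CREATE | ZooDefs.Perms.READ | ZooDefs.Perms.DELETE,
                new Id("sasl", user)));                                          // CRD for the user
    try {
      client.create().creatingParentsIfNeeded().withMode(CreateMode.PERSISTENT)
          .withACL(acls).forPath(path);
    } catch (KeeperException.NodeExistsException alreadyThere) {
      // idempotent: an existing path is fine
    }
  }
}
{code}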
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269414#comment-14269414 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #68 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/68/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2936) YARNDelegationTokenIdentifier doesn't set proto.builder now
[ https://issues.apache.org/jira/browse/YARN-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269429#comment-14269429 ] Hudson commented on YARN-2936: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) YARN-2936. Changed YARNDelegationTokenIdentifier to set proto fields on getProto method. Contributed by Varun Saxena (jianhe: rev 2638f4d0f0da375b0dd08f3188057637ed0f32d5) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/test/java/org/apache/hadoop/yarn/security/TestYARNTokenIdentifier.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/security/client/YARNDelegationTokenIdentifier.java YARNDelegationTokenIdentifier doesn't set proto.builder now --- Key: YARN-2936 URL: https://issues.apache.org/jira/browse/YARN-2936 Project: Hadoop YARN Issue Type: Bug Reporter: Zhijie Shen Assignee: Varun Saxena Fix For: 2.7.0 Attachments: YARN-2936.001.patch, YARN-2936.002.patch, YARN-2936.003.patch, YARN-2936.004.patch, YARN-2936.005.patch, YARN-2936.006.patch After YARN-2743, the setters are removed from YARNDelegationTokenIdentifier, such that when constructing a object which extends YARNDelegationTokenIdentifier, proto.builder is not set at all. Later on, when we call getProto() of it, we will just get an empty proto object. It seems to do no harm to the production code path, as we will always call getBytes() before using proto to persist the DT in the state store, when we generating the password. I think the setter is removed to avoid duplicating setting the fields why getBytes() is called. However, YARNDelegationTokenIdentifier doesn't work properly alone. YARNDelegationTokenIdentifier is tightly coupled with the logic in secretManager. It's vulnerable if something is changed at secretManager. For example, in the test case of YARN-2837, I spent time to figure out we need to execute getBytes() first to make sure the testing DTs can be properly put into the state store. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2880) Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled
[ https://issues.apache.org/jira/browse/YARN-2880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269430#comment-14269430 ] Hudson commented on YARN-2880: -- FAILURE: Integrated in Hadoop-Mapreduce-trunk #2018 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2018/]) Moved YARN-2880 to improvement section in CHANGES.txt (jianhe: rev ef237bd52fc570292a7e608b373b51dd6d1590b8) * hadoop-yarn-project/CHANGES.txt Add a test in TestRMRestart to make sure node labels will be recovered if it is enabled --- Key: YARN-2880 URL: https://issues.apache.org/jira/browse/YARN-2880 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Wangda Tan Assignee: Rohith Fix For: 2.7.0 Attachments: 0001-YARN-2880.patch, YARN-2880.1.patch, YARN-2880.1.patch, YARN-2880.2.patch As suggested by [~ozawa], [link|https://issues.apache.org/jira/browse/YARN-2800?focusedCommentId=14217569page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14217569]. We should have a such test to make sure there will be no regression -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2571) RM to support YARN registry
[ https://issues.apache.org/jira/browse/YARN-2571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269556#comment-14269556 ] Steve Loughran commented on YARN-2571: -- Xuan, 1. If you can show me an example of an active service to start with, I'll gladly make it active only. 2. We're relying on operations to be idempotent: whoever creates last wins, whoever deletes last wins. There are some race conditions on cleanup if there's a change between a read and a delete, but that's what you get in a world without transactions. RM to support YARN registry Key: YARN-2571 URL: https://issues.apache.org/jira/browse/YARN-2571 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Steve Loughran Assignee: Steve Loughran Attachments: YARN-2571-001.patch, YARN-2571-002.patch, YARN-2571-003.patch, YARN-2571-005.patch, YARN-2571-007.patch, YARN-2571-008.patch, YARN-2571-009.patch The RM needs to (optionally) integrate with the YARN registry: # startup: create the /services and /users paths with system ACLs (yarn, hdfs principals) # app-launch: create the user directory /users/$username with the relevant permissions (CRD) for them to create subnodes. # attempt, container, app completion: remove service records with the matching persistence and ID -- This message was sent by Atlassian JIRA (v6.3.4#6332)
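The idempotent, last-writer-wins behaviour described above can be sketched on the cleanup side as well; this uses the same Curator types as the earlier sketch and is illustrative, not the patch's code:
{code}
// Sketch: removing a service record is a no-op if the node is already gone.
public static void removeRecord(org.apache.curator.framework.CuratorFramework client,
                                String path) throws Exception {
  try {
    client.delete().deletingChildrenIfNeeded().forPath(path);
  } catch (org.apache.zookeeper.KeeperException.NoNodeException alreadyGone) {
    // idempotent: another actor deleted it between our read and our delete; nothing to do
  }
}
{code}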
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269836#comment-14269836 ] Wangda Tan commented on YARN-2933: -- [~mayank_bansal], Thanks for updating the patch, bq. I don't think that's needed as the basic functionality for the test is to demonstrate we can skip labeled containers, so I think it does not matter. I think it matters, and this test should not only verify that we can skip labeled containers, but also demonstrate that ideal_allocation is computed correctly. I still suggest adding the one-line change I mentioned in https://issues.apache.org/jira/browse/YARN-2933?focusedCommentId=14268631page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14268631 to verify ideal_allocation is correctly computed. And for [~sunilg]'s comment, I think Mayank has already answered you. Thanks, Wangda Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269858#comment-14269858 ] Jian He commented on YARN-2933: --- one other comment: getNodeLabels is not used anywhere, we can remove. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269868#comment-14269868 ] Sunil G commented on YARN-2933: --- Hi [~mayank_bansal] Thank you for the clarification. I have one more small nit in a test case {code} if (setAMContainer && i == 0) { cLive.add(mockContainer(appAttId, cAlloc, unit, 0)); } else if (setLabeledContainer && i == 1) { cLive.add(mockContainer(appAttId, cAlloc, unit, 2)); } else { cLive.add(mockContainer(appAttId, cAlloc, unit, 1)); } {code} For *mockContainer*, the last parameter is an integer: 0 for an AM container, 1 for a normal container, and 2 for a labelled container. Could we make this more generic with named constants, for readability and so that new container types are easy to add in future? We can have an array of different container types that can be extended as needed later, and the array index can be used with an Enum to create a mock container. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
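One possible shape for the change [~sunilg] is suggesting, with the container kinds named by an enum instead of magic integers; the mockContainer signature and the surrounding flags are taken from the snippet above, while the enum itself is illustrative:
{code}
// Sketch of the suggestion: name the mock container kinds rather than passing 0/1/2.
enum MockContainerType {
  AM_CONTAINER(0), NORMAL_CONTAINER(1), LABELED_CONTAINER(2);

  private final int code;
  MockContainerType(int code) { this.code = code; }
  int code() { return code; }
}

// ...inside the test loop, replacing the if/else chain from the patch:
MockContainerType type;
if (setAMContainer && i == 0) {
  type = MockContainerType.AM_CONTAINER;
} else if (setLabeledContainer && i == 1) {
  type = MockContainerType.LABELED_CONTAINER;
} else {
  type = MockContainerType.NORMAL_CONTAINER;
}
cLive.add(mockContainer(appAttId, cAlloc, unit, type.code()));
{code}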
[jira] [Commented] (YARN-2933) Capacity Scheduler preemption policy should only consider capacity without labels temporarily
[ https://issues.apache.org/jira/browse/YARN-2933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269875#comment-14269875 ] Hadoop QA commented on YARN-2933: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690901/YARN-2933-6.patch against trunk revision 20625c8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:red}-1 findbugs{color}. The patch appears to introduce 29 new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-tools/hadoop-gridmix. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6279//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6279//artifact/patchprocess/newPatchFindbugsWarningshadoop-gridmix.html Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6279//console This message is automatically generated. Capacity Scheduler preemption policy should only consider capacity without labels temporarily - Key: YARN-2933 URL: https://issues.apache.org/jira/browse/YARN-2933 Project: Hadoop YARN Issue Type: Sub-task Components: capacityscheduler Reporter: Wangda Tan Assignee: Mayank Bansal Attachments: YARN-2933-1.patch, YARN-2933-2.patch, YARN-2933-3.patch, YARN-2933-4.patch, YARN-2933-5.patch, YARN-2933-6.patch Currently, we have capacity enforcement on each queue for each label in CapacityScheduler, but we don't have preemption policy to support that. YARN-2498 is targeting to support preemption respect node labels, but we have some gaps in code base, like queues/FiCaScheduler should be able to get usedResource/pendingResource, etc. by label. These items potentially need to refactor CS which we need spend some time carefully think about. For now, what immediately we can do is allow calculate ideal_allocation and preempt containers only for resources on nodes without labels, to avoid regression like: A cluster has some nodes with labels and some not, assume queueA isn't satisfied for resource without label, but for now, preemption policy may preempt resource from nodes with labels for queueA, that is not correct. Again, it is just a short-term enhancement, YARN-2498 will consider preemption respecting node-labels for Capacity Scheduler which is our final target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269878#comment-14269878 ] Hadoop QA commented on YARN-2786: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690906/YARN-2786-20150108-1-without-yarn.cmd.patch against trunk revision 20625c8. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6281//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6281//console This message is automatically generated. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are acked
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2997: -- Summary: NM keeps sending already-sent completed containers to RM until containers are acked (was: NM keeps sending finished containers to RM until app is finished) NM keeps sending already-sent completed containers to RM until containers are acked --- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-2997: -- Summary: NM keeps sending already-sent completed containers to RM until containers are removed from context (was: NM keeps sending already-sent completed containers to RM until containers are acked) NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269885#comment-14269885 ] Jian He commented on YARN-2997: --- patch looks good to me. NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2997) NM keeps sending already-sent completed containers to RM until containers are removed from context
[ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269918#comment-14269918 ] Hudson commented on YARN-2997: -- FAILURE: Integrated in Hadoop-trunk-Commit #6833 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6833/]) YARN-2997. Fixed NodeStatusUpdater to not send already-sent completed container statuses on heartbeat. Contributed by Chengbing Liu (jianhe: rev cc2a745f7e82c9fa6de03242952347c54c52dccc) * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/test/java/org/apache/hadoop/yarn/server/nodemanager/TestNodeStatusUpdater.java * hadoop-yarn-project/CHANGES.txt * hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/NodeStatusUpdaterImpl.java NM keeps sending already-sent completed containers to RM until containers are removed from context -- Key: YARN-2997 URL: https://issues.apache.org/jira/browse/YARN-2997 Project: Hadoop YARN Issue Type: Bug Components: nodemanager Affects Versions: 2.6.0 Reporter: Chengbing Liu Assignee: Chengbing Liu Fix For: 2.7.0 Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.4.patch, YARN-2997.5.patch, YARN-2997.patch We have seen in RM log a lot of {quote} INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed... {quote} It is caused by NM sending completed containers repeatedly until the app is finished. On the RM side, the container is already released, hence {{getRMContainer}} returns null. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Summary: Replaces labels on a host should update all NM's labels on that host (was: Changing labels on a host should update all NM's labels on that host) Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3014) Changing labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wangda Tan updated YARN-3014: - Attachment: YARN-3014-1.patch Attached patch, and kick jenkins. This patch contains refactoring described in YARN-3016, which merged internalAdd/Remove/ReplaceLabels to one method. Changing labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-3015) yarn classpath command should support same options as hadoop classpath.
[ https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269973#comment-14269973 ] Hadoop QA commented on YARN-3015: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690927/YARN-3015.001.patch against trunk revision cc2a745. {color:red}-1 patch{color}. The patch command could not apply the patch. Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6285//console This message is automatically generated. yarn classpath command should support same options as hadoop classpath. --- Key: YARN-3015 URL: https://issues.apache.org/jira/browse/YARN-3015 Project: Hadoop YARN Issue Type: Bug Components: scripts Reporter: Chris Nauroth Assignee: Varun Saxena Priority: Minor Attachments: YARN-3015.001.patch HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional expansion of the wildcards and bundling the classpath into a jar file containing a manifest with the Class-Path attribute. The other classpath commands should do the same for consistency. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
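For reference, the HADOOP-10903 options that this issue proposes mirroring are used roughly as follows; the flag spellings here are recalled from that change and should be confirmed against the shell scripts in the target release:
{code}
# default: print the classpath with wildcard entries left unexpanded
hadoop classpath

# expand wildcard entries into the individual jar paths
hadoop classpath --glob

# write a jar whose manifest Class-Path attribute carries the classpath, then print its location
hadoop classpath --jar /tmp/hadoop-classpath.jar
{code}
The proposal is simply that {{yarn classpath}} (and the other classpath subcommands) accept the same options.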
[jira] [Resolved] (YARN-712) RMDelegationTokenSecretManager shouldn't start in non-secure mode
[ https://issues.apache.org/jira/browse/YARN-712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He resolved YARN-712. -- Resolution: Won't Fix RMDelegationTokenSecretManager shouldn't start in non-secure mode -- Key: YARN-712 URL: https://issues.apache.org/jira/browse/YARN-712 Project: Hadoop YARN Issue Type: Bug Reporter: Jian He Assignee: Jian He RM will just be doing useless work as no tokens are issued. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3019) Enable work-preserving restart by default
Jian He created YARN-3019: - Summary: Enable work-preserving restart by default Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
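For reference, the change amounts to flipping this yarn-site.xml / yarn-default.xml property (name taken from the description above); work-preserving behaviour only takes effect once RM recovery itself is enabled via yarn.resourcemanager.recovery.enabled and a configured state store:
{code}
<property>
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
{code}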
[jira] [Updated] (YARN-3019) Enable RM work-preserving restart by default
[ https://issues.apache.org/jira/browse/YARN-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jian He updated YARN-3019: -- Summary: Enable RM work-preserving restart by default (was: Enable work-preserving restart by default ) Enable RM work-preserving restart by default - Key: YARN-3019 URL: https://issues.apache.org/jira/browse/YARN-3019 Project: Hadoop YARN Issue Type: Sub-task Components: resourcemanager Reporter: Jian He Assignee: Jian He The proposal is to set yarn.resourcemanager.work-preserving-recovery.enabled to true by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (YARN-3020) n similar addContainerRequest()s produce n*(n+1)/2 containers
Peter D Kirchner created YARN-3020: -- Summary: n similar addContainerRequest()s produce n*(n+1)/2 containers Key: YARN-3020 URL: https://issues.apache.org/jira/browse/YARN-3020 Project: Hadoop YARN Issue Type: Bug Components: client Affects Versions: 2.5.2, 2.5.1, 2.6.0, 2.5.0 Reporter: Peter D Kirchner BUG: If the application master calls addContainerRequest() n times, but with the same priority, I get 1+2+3+...+n = n*(n+1)/2 containers. If the application master calls addContainerRequest() n times, but with a unique priority each time, I get n containers (as I intended). Analysis: There is a logic problem in AMRMClientImpl.java. Although AMRMClientImpl.allocate() does an ask.clear(), on subsequent calls to addContainerRequest(), addResourceRequest() finds the previous matching remoteRequest and increments the container count rather than starting anew, and does an addResourceRequestToAsk() which defeats the ask.clear(). From documentation and code comments, it was hard for me to discern the intended behavior of the API, but the inconsistency reported in this issue suggests one case or the other is implemented incorrectly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
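A minimal sketch of the scenario being reported, using the public AMRMClient API (the resource size and priority values are illustrative):
{code}
// Sketch of the report: n addContainerRequest() calls at the *same* priority.
// Per the bug, these asks are merged into one ResourceRequest whose container count
// keeps growing across heartbeats (1+2+...+n), whereas a unique priority per request
// yields the intended n containers.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class SamePriorityAskSketch {
  public static void addRequests(AMRMClient<ContainerRequest> amClient, int n) {
    Resource capability = Resource.newInstance(1024, 1);  // illustrative sizing
    Priority samePriority = Priority.newInstance(1);
    for (int i = 0; i < n; i++) {
      amClient.addContainerRequest(
          new ContainerRequest(capability, null, null, samePriority));
    }
  }
}
{code}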
[jira] [Commented] (YARN-3014) Replaces labels on a host should update all NM's labels on that host
[ https://issues.apache.org/jira/browse/YARN-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270069#comment-14270069 ] Hadoop QA commented on YARN-3014: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690922/YARN-3014-1.patch against trunk revision cc2a745. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager: org.apache.hadoop.yarn.server.resourcemanager.security.TestRMDelegationTokens Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6284//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6284//console This message is automatically generated. Replaces labels on a host should update all NM's labels on that host Key: YARN-3014 URL: https://issues.apache.org/jira/browse/YARN-3014 Project: Hadoop YARN Issue Type: Sub-task Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-3014-1.patch Admin can either specify labels on a host (by running {{yarn rmadmin -replaceLabelsOnNode host1,label1}}) OR on a single NM (by running {{yarn rmadmin -replaceLabelsOnNode host1:port,label1}}). If user has specified label=x on a NM (instead of host), and later set the label=y on host of the NM. NM's label should update to y as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2141) [Umbrella] Capture container and node resource consumption
[ https://issues.apache.org/jira/browse/YARN-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269919#comment-14269919 ] Vinod Kumar Vavilapalli commented on YARN-2141: --- Related to, and very likely a dup of, YARN-1012, which is part of a larger effort. [Umbrella] Capture container and node resource consumption -- Key: YARN-2141 URL: https://issues.apache.org/jira/browse/YARN-2141 Project: Hadoop YARN Issue Type: Improvement Components: nodemanager Reporter: Carlo Curino Priority: Minor Collecting per-container and per-node resource consumption statistics in a fairly granular manner, and making them available to both infrastructure code (e.g., schedulers) and users (e.g., AMs, or users directly via webapps), can facilitate several kinds of performance work. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2984) Metrics for container's actual memory usage
[ https://issues.apache.org/jira/browse/YARN-2984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269944#comment-14269944 ] Anubhav Dhoot commented on YARN-2984: - Seems good. I tested it and it shows up on the JMX page with its information. I would rename ContainerUsageMetrics to just ContainerMetrics, as I think it's a good place for all container-related metrics instead of having multiple ContainerMetrics classes. Metrics for container's actual memory usage --- Key: YARN-2984 URL: https://issues.apache.org/jira/browse/YARN-2984 Project: Hadoop YARN Issue Type: Sub-task Components: nodemanager Affects Versions: 2.6.0 Reporter: Karthik Kambatla Assignee: Karthik Kambatla Attachments: yarn-2984-prelim.patch It would be nice to capture resource usage per container, for a variety of reasons. This JIRA is to track memory usage. YARN-2965 tracks the resource usage on the node, and the two implementations should reuse code as much as possible. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14270004#comment-14270004 ] Hadoop QA commented on YARN-1798: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12640845/YARN-1798.1.patch against trunk revision cc2a745. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager. Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6283//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6283//console This message is automatically generated. TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux - Key: YARN-1798 URL: https://issues.apache.org/jira/browse/YARN-1798 Project: Hadoop YARN Issue Type: Test Reporter: Tsuyoshi OZAWA Assignee: Tsuyoshi OZAWA Attachments: TestContainerLaunch-output.txt, TestContainerLaunch.txt, YARN-1798.1.patch -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2786) Create yarn cluster CLI to enable list node labels collection
[ https://issues.apache.org/jira/browse/YARN-2786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269905#comment-14269905 ] Hadoop QA commented on YARN-2786: - {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12690906/YARN-2786-20150108-1-without-yarn.cmd.patch against trunk revision 708b1aa. {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 javadoc{color}. There were no new javadoc warning messages. {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-client: org.apache.hadoop.yarn.client.api.impl.TestAMRMClient Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6282//testReport/ Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6282//console This message is automatically generated. Create yarn cluster CLI to enable list node labels collection - Key: YARN-2786 URL: https://issues.apache.org/jira/browse/YARN-2786 Project: Hadoop YARN Issue Type: Sub-task Components: api, client, resourcemanager Reporter: Wangda Tan Assignee: Wangda Tan Attachments: YARN-2786-20141031-1.patch, YARN-2786-20141031-2.patch, YARN-2786-20141102-2.patch, YARN-2786-20141102-3.patch, YARN-2786-20141103-1-full.patch, YARN-2786-20141103-1-without-yarn.cmd.patch, YARN-2786-20141104-1-full.patch, YARN-2786-20141104-1-without-yarn.cmd.patch, YARN-2786-20141104-2-full.patch, YARN-2786-20141104-2-without-yarn.cmd.patch, YARN-2786-20150107-1-full.patch, YARN-2786-20150107-1-without-yarn.cmd.patch, YARN-2786-20150108-1-full.patch, YARN-2786-20150108-1-without-yarn.cmd.patch With YARN-2778, we can list node labels on existing RM nodes. But it is not enough, we should be able to: 1) list node labels collection The command should start with yarn cluster ..., in the future, we can add more functionality to the yarnClusterCLI -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (YARN-2421) CapacityScheduler still allocates containers to an app in the FINISHING state
[ https://issues.apache.org/jira/browse/YARN-2421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269938#comment-14269938 ] Hadoop QA commented on YARN-2421:

{color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12663584/yarn2421.patch against trunk revision 20625c8.

{color:green}+1 @author{color}. The patch does not contain any @author tags.
{color:green}+1 tests included{color}. The patch appears to include 1 new or modified test files.
{color:red}-1 javac{color}. The applied patch generated 1215 javac compiler warnings (more than the trunk's current 1214 warnings).
{color:green}+1 javadoc{color}. There were no new javadoc warning messages.
{color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
{color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.
{color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
{color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager:
org.apache.hadoop.yarn.server.resourcemanager.TestMoveApplication
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.TestRMContainerImpl
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestFairScheduler

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/6280//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/6280//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/6280//console

This message is automatically generated.

CapacityScheduler still allocates containers to an app in the FINISHING state
Key: YARN-2421
URL: https://issues.apache.org/jira/browse/YARN-2421
Project: Hadoop YARN
Issue Type: Bug
Components: scheduler
Affects Versions: 2.4.1
Reporter: Thomas Graves
Assignee: chang li
Attachments: yarn2421.patch, yarn2421.patch, yarn2421.patch

I saw an instance of a bad application master where it unregistered with the RM but then continued to call into allocate. The RMAppAttempt went to the FINISHING state, but the capacity scheduler kept allocating it containers. We should probably have the capacity scheduler check that the application isn't in one of the terminal states before giving it containers.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
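The fix direction suggested in the description (skip allocation for apps whose attempt has reached a terminal state) can be illustrated with a small, self-contained guard. The state names come from RMAppAttemptState, but where and how yarn2421.patch actually applies the check may differ.

{code}
import java.util.EnumSet;

import org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptState;

public final class TerminalAttemptGuard {
  // States in which an attempt should not receive any new containers.
  private static final EnumSet<RMAppAttemptState> TERMINAL = EnumSet.of(
      RMAppAttemptState.FINISHING, RMAppAttemptState.FINISHED,
      RMAppAttemptState.FAILED, RMAppAttemptState.KILLED);

  private TerminalAttemptGuard() {
  }

  /** Returns true if the scheduler should skip allocation for this attempt. */
  public static boolean shouldSkipAllocation(RMAppAttemptState state) {
    return TERMINAL.contains(state);
  }
}
{code}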
[jira] [Commented] (YARN-1798) TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
[ https://issues.apache.org/jira/browse/YARN-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269940#comment-14269940 ] Jian He commented on YARN-1798:

I think the issue is outdated now; maybe we can close it?

TestContainerLaunch, TestContainersMonitor, TestNodeManagerShutdown, TestNodeStatusUpdater fails on Linux
Key: YARN-1798
URL: https://issues.apache.org/jira/browse/YARN-1798
Project: Hadoop YARN
Issue Type: Test
Reporter: Tsuyoshi OZAWA
Assignee: Tsuyoshi OZAWA
Attachments: TestContainerLaunch-output.txt, TestContainerLaunch.txt, YARN-1798.1.patch

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-3015) yarn classpath command should support same options as hadoop classpath.
[ https://issues.apache.org/jira/browse/YARN-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Varun Saxena updated YARN-3015:

Attachment: YARN-3015.001.patch

yarn classpath command should support same options as hadoop classpath.
Key: YARN-3015
URL: https://issues.apache.org/jira/browse/YARN-3015
Project: Hadoop YARN
Issue Type: Bug
Components: scripts
Reporter: Chris Nauroth
Assignee: Varun Saxena
Priority: Minor
Attachments: YARN-3015.001.patch

HADOOP-10903 enhanced the {{hadoop classpath}} command to support optional expansion of the wildcards and bundling the classpath into a jar file containing a manifest with the Class-Path attribute. The other classpath commands should do the same for consistency.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
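The "classpath jar" mechanism that HADOOP-10903 brought to the {{hadoop classpath}} command can be shown with plain java.util.jar: an otherwise empty jar whose manifest carries the expanded classpath in its Class-Path attribute. This is a generic sketch with made-up entry names, not the Hadoop shell code the patch touches.

{code}
import java.io.FileOutputStream;
import java.util.jar.Attributes;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class ClasspathJarSketch {
  public static void main(String[] args) throws Exception {
    Manifest manifest = new Manifest();
    Attributes main = manifest.getMainAttributes();
    main.put(Attributes.Name.MANIFEST_VERSION, "1.0");
    // Class-Path entries are space-separated and resolved relative to the jar.
    main.put(Attributes.Name.CLASS_PATH, "lib/example-a.jar lib/example-b.jar conf/");
    // No jar entries are needed; the manifest alone carries the classpath.
    try (JarOutputStream out =
        new JarOutputStream(new FileOutputStream("classpath.jar"), manifest)) {
      out.flush();
    }
  }
}
{code}

Putting classpath.jar on the JVM classpath then behaves as if every listed entry were there directly, which is what makes the bundling option useful on platforms with tight command-line length limits.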
[jira] [Commented] (YARN-2965) Enhance Node Managers to monitor and report the resource usage on machines
[ https://issues.apache.org/jira/browse/YARN-2965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269921#comment-14269921 ] Vinod Kumar Vavilapalli commented on YARN-2965:

Linking all related efforts. Related to and very likely a dup of YARN-1012, which is part of a larger effort.

Enhance Node Managers to monitor and report the resource usage on machines
Key: YARN-2965
URL: https://issues.apache.org/jira/browse/YARN-2965
Project: Hadoop YARN
Issue Type: Sub-task
Components: nodemanager, resourcemanager
Reporter: Robert Grandl
Assignee: Robert Grandl
Attachments: ddoc_RT.docx

This JIRA is about augmenting Node Managers to monitor the resource usage on the machine, aggregate these reports, and expose them to the RM.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
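As a rough, generic illustration of the "monitor and report machine-level usage" idea (not the design in ddoc_RT.docx or YARN-1012, which use the NM's own resource-calculator plumbing), the sketch below samples host utilization on a schedule using standard JDK beans; in the real feature such a snapshot would be attached to the NM-to-RM heartbeat.

{code}
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class NodeUtilizationSampler {
  public static void main(String[] args) {
    OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      // One-minute system load average; -1.0 if the platform cannot report it.
      double load = os.getSystemLoadAverage();
      int cores = os.getAvailableProcessors();
      // A real NM report would also cover memory, disk and network counters.
      System.out.printf("load=%.2f cores=%d%n", load, cores);
    }, 0, 5, TimeUnit.SECONDS);
  }
}
{code}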
[jira] [Commented] (YARN-2745) Extend YARN to support multi-resource packing of tasks
[ https://issues.apache.org/jira/browse/YARN-2745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269926#comment-14269926 ] Vinod Kumar Vavilapalli commented on YARN-2745:

Haven't read the design doc yet. Linking all related efforts so there are no duplicates. Related to and very likely a dup of YARN-1011.

Extend YARN to support multi-resource packing of tasks
Key: YARN-2745
URL: https://issues.apache.org/jira/browse/YARN-2745
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager, resourcemanager, scheduler
Reporter: Robert Grandl
Assignee: Robert Grandl
Attachments: sigcomm_14_tetris_talk.pptx, tetris_design_doc.docx, tetris_paper.pdf

In this umbrella JIRA we propose an extension to existing scheduling techniques that accounts for all resources used by a task (CPU, memory, disk, network) and is able to achieve three competing objectives: fairness, improved cluster utilization, and reduced average job completion time.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
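The packing heuristic in the attached Tetris material is commonly summarized as an alignment score: the dot product between a task's demand vector and a machine's free-resource vector, with the best-aligned feasible placement preferred. The vector layout and method below are illustrative assumptions, not the API of any attached patch.

{code}
public final class PackingScore {
  private PackingScore() {
  }

  /**
   * Alignment score for placing a task on a machine. Both vectors use the same
   * ordering, e.g. {cpu, memory, disk, network}, in comparable units. Returns a
   * negative value when the task does not fit on the machine at all.
   */
  public static double alignment(double[] demand, double[] free) {
    double score = 0.0;
    for (int i = 0; i < demand.length; i++) {
      if (demand[i] > free[i]) {
        return -1.0; // infeasible placement
      }
      score += demand[i] * free[i];
    }
    return score;
  }
}
{code}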