Re: Distributed shell with host affinity and relaxLocality
Any ideas?

On 3/16/15 1:31 PM, Chris Riccomini criccom...@linkedin.com wrote:

> + Navina
>
> Hey Karthik,
>
> YARN 2.6.0, FairScheduler.
>
> Cheers,
> Chris
>
> On 3/16/15 1:28 PM, Karthik Kambatla ka...@cloudera.com wrote:
>
>> Hey Chris,
>>
>> What scheduler/version is this?
>>
>> On Mon, Mar 16, 2015 at 12:01 PM, Chris Riccomini criccom...@linkedin.com.invalid wrote:
>>
>>> Hey all,
>>>
>>> We have been testing YARN with host-specific ContainerRequests. For our tests, we've been using the DistributedShell example. We've applied YARN-1974, which allows us to specify node lists, relax locality, etc.
>>>
>>> Everything seems to work as expected when we have relaxLocality set to false and we request a specific host. When we set relaxLocality to true, things get weird.
>>>
>>> We run three nodes: node1, node2, and node3. When we start DistributedShell, we configure it (via CLI params) to use two containers, with a host-level request for node3. What we observe is that the AM and one container both end up on node2, and a third container ends up on node3. There are enough resources for node3 to handle both containers, but the second one doesn't end up there. We also notice that the DistributedShell app wedges because the container on node3 never completes.
>>>
>>> What is the expected behavior here? This seems to be broken.
>>>
>>> Cheers,
>>> Chris
>>
>> --
>> Karthik Kambatla
>> Software Engineer, Cloudera Inc.
>> http://five.sentenc.es
Re: Distributed shell with host affinity and relaxLocality
+ Navina

Hey Karthik,

YARN 2.6.0, FairScheduler.

Cheers,
Chris

On 3/16/15 1:28 PM, Karthik Kambatla ka...@cloudera.com wrote:

> Hey Chris,
>
> What scheduler/version is this?
>
> On Mon, Mar 16, 2015 at 12:01 PM, Chris Riccomini criccom...@linkedin.com.invalid wrote:
>
>> Hey all,
>>
>> We have been testing YARN with host-specific ContainerRequests. For our tests, we've been using the DistributedShell example. We've applied YARN-1974, which allows us to specify node lists, relax locality, etc.
>>
>> Everything seems to work as expected when we have relaxLocality set to false and we request a specific host. When we set relaxLocality to true, things get weird.
>>
>> We run three nodes: node1, node2, and node3. When we start DistributedShell, we configure it (via CLI params) to use two containers, with a host-level request for node3. What we observe is that the AM and one container both end up on node2, and a third container ends up on node3. There are enough resources for node3 to handle both containers, but the second one doesn't end up there. We also notice that the DistributedShell app wedges because the container on node3 never completes.
>>
>> What is the expected behavior here? This seems to be broken.
>>
>> Cheers,
>> Chris
>
> --
> Karthik Kambatla
> Software Engineer, Cloudera Inc.
> http://five.sentenc.es
Distributed shell with host affinity and relaxLocality
Hey all,

We have been testing YARN with host-specific ContainerRequests. For our tests, we've been using the DistributedShell example. We've applied YARN-1974, which allows us to specify node lists, relax locality, etc.

Everything seems to work as expected when we have relaxLocality set to false and we request a specific host. When we set relaxLocality to true, things get weird.

We run three nodes: node1, node2, and node3. When we start DistributedShell, we configure it (via CLI params) to use two containers, with a host-level request for node3. What we observe is that the AM and one container both end up on node2, and a third container ends up on node3. There are enough resources for node3 to handle both containers, but the second one doesn't end up there. We also notice that the DistributedShell app wedges because the container on node3 never completes.

What is the expected behavior here? This seems to be broken.

Cheers,
Chris
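For context, the node-level request described above can also be issued directly against the AMRMClient API. Below is a minimal Java sketch (hostname, memory, and core values are illustrative, not taken from the thread) using the ContainerRequest constructor that takes node names and a relaxLocality flag; with relaxLocality set to false the request may only be satisfied on node3, while true allows fallback to the rack or to any node.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

public class Node3Request {
  // Build a node-level request for node3. With relaxLocality=true the scheduler
  // may fall back to the rack or to any node; with false it must use node3.
  static ContainerRequest requestOnNode3(boolean relaxLocality) {
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);      // illustrative container size
    capability.setVirtualCores(1);

    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);

    String[] nodes = new String[] { "node3" };  // hostname from the test cluster above
    String[] racks = null;                      // let YARN derive the rack

    return new ContainerRequest(capability, nodes, racks, priority, relaxLocality);
  }
}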
Host-specific ContainerRequest ignored
Hey Guys,

I am creating a container request:

  protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
    info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))
    val capability = Records.newRecord(classOf[Resource])
    val priority = Records.newRecord(classOf[Priority])
    priority.setPriority(0)
    capability.setMemory(memMb)
    capability.setVirtualCores(cpuCores)
    (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
  }

This pretty closely mirrors the distributed shell example.

If I put an array with a host string in the ContainerRequest, YARN seems to completely ignore the request and continues to put all containers on one or two nodes in the grid, which aren't the ones I requested, even though the grid is completely empty and there are 15 nodes available. This also holds true if I pass false for relaxLocality.

I'm running the CapacityScheduler with node-locality-delay set to 40. Previously, I tried the FifoScheduler, and it exhibited the same behavior. All NMs are just using /default-rack for their rack. The strings that I'm putting in the hosts String[] parameter of ContainerRequest are hard-coded to exactly match the NodeIds listed for the NMs.

What am I doing wrong? I feel like I'm missing some configuration on the capacity scheduler or the NMs.

Cheers,
Chris
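For comparison with the method above, here is a minimal Java sketch (hostnames and the helper shape are illustrative, not from the mail) of the same request with an explicit hosts array. One thing worth noting: the node names YARN matches against are the NMs' hostnames, so a NodeId printed as host:port would need the :port stripped before being passed in.

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

public class HostConstrainedRequests {
  // Same shape as requestContainers above, but with an explicit hosts array
  // instead of nulls. Entries should be bare hostnames (e.g. "node1.example.com"),
  // not the "host:port" form that NodeId.toString() produces.
  static void requestContainersOn(AMRMClient<ContainerRequest> amClient, String[] hosts,
                                  int memMb, int cpuCores, int containers) {
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(memMb);
    capability.setVirtualCores(cpuCores);

    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);

    for (int i = 0; i < containers; i++) {
      amClient.addContainerRequest(new ContainerRequest(capability, hosts, null, priority));
    }
  }
}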
Re: Capacity scheduler puts all containers on one box
Hey Guys,

@Vinod: We aren't overriding the default, so we must be using -1 as the setting.

@Sandy: We aren't specifying any racks/hosts when sending the resource requests. +1 regarding introducing a similar limit in the capacity scheduler.

Any recommended work-arounds in the meantime? Our utilization of the grid is very low because we're having to force high memory requests for the containers in order to guarantee a maximum number of containers on a single node (e.g., setting container memory to 17GB to disallow more than two containers from being assigned to any one 48GB node).

Cheers,
Chris

On 3/21/14 11:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

> yarn.scheduler.capacity.node-locality-delay will help if the app is requesting containers at particular locations, but won't help spread things out evenly otherwise.
>
> The Fair Scheduler attempts an even spread. By default, it only schedules a single container each time it considers a node. Decoupling scheduling from node heartbeats (YARN-1010) makes it so that a high node heartbeat interval doesn't result in this being slow. Now that the Capacity Scheduler has similar capabilities (YARN-1512), it might make sense to introduce a similar limit.
>
> -Sandy
>
> On Fri, Mar 21, 2014 at 4:42 PM, Vinod Kumar Vavilapalli vino...@apache.org wrote:
>
>> What's the value for yarn.scheduler.capacity.node-locality-delay? It is -1 by default in 2.2. We fixed the default to be a reasonable 40 (nodes in a rack) in 2.3.0, which should spread containers a bit.
>>
>> Thanks,
>> +Vinod
>>
>> On Mar 21, 2014, at 12:48 PM, Chris Riccomini criccom...@linkedin.com wrote:
>>
>>> Hey Guys,
>>>
>>> We're running YARN 2.2 with the capacity scheduler. Each NM is running with 40G of memory capacity. When we request a series of containers with 2G of memory each from a single AM, we see the RM assigning them entirely to one NM until that NM is full, then moving on to the next, and so on. Essentially, we have a grid with 20 nodes, and two are completely full while the rest are completely empty. This is problematic because our containers use disk heavily and are completely saturating the disks on those two nodes, which slows down all of the containers on those NMs.
>>>
>>> 1. Is this expected behavior of the capacity scheduler? What about the fifo scheduler?
>>>
>>> 2. Is the recommended work-around just to increase the memory allocation per container as a proxy for the disk capacity that's required? Given that there's no disk-level isolation and no disk-level resource, I don't see another way around this.
>>>
>>> Cheers,
>>> Chris
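For reference, the setting discussed in this thread lives in capacity-scheduler.xml. A minimal sketch with the value 40 that Vinod mentions becoming the default in 2.3.0 (the property name is taken from the thread; check the release you run for the exact default and semantics):

<!-- capacity-scheduler.xml -->
<property>
  <name>yarn.scheduler.capacity.node-locality-delay</name>
  <!-- Number of missed scheduling opportunities before the CapacityScheduler
       relaxes a node-local request; typically set to the number of nodes in a rack. -->
  <value>40</value>
</property>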
Capacity scheduler puts all containers on one box
Hey Guys,

We're running YARN 2.2 with the capacity scheduler. Each NM is running with 40G of memory capacity. When we request a series of containers with 2G of memory each from a single AM, we see the RM assigning them entirely to one NM until that NM is full, then moving on to the next, and so on. Essentially, we have a grid with 20 nodes, and two are completely full while the rest are completely empty. This is problematic because our containers use disk heavily and are completely saturating the disks on those two nodes, which slows down all of the containers on those NMs.

1. Is this expected behavior of the capacity scheduler? What about the fifo scheduler?

2. Is the recommended work-around just to increase the memory allocation per container as a proxy for the disk capacity that's required? Given that there's no disk-level isolation and no disk-level resource, I don't see another way around this.

Cheers,
Chris
[jira] [Created] (YARN-1079) Fix progress bar for long-lived services in YARN
Chris Riccomini created YARN-1079:
-------------------------------------

             Summary: Fix progress bar for long-lived services in YARN
                 Key: YARN-1079
                 URL: https://issues.apache.org/jira/browse/YARN-1079
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 2.1.0-beta
            Reporter: Chris Riccomini


YARN currently shows a progress bar for jobs in its web UI. This is nonsensical for long-lived services, which have no concept of progress. For example, with Samza, we have stream processors which run for an indefinite amount of time (sometimes forever). YARN should support jobs without a concept of progress. Some discussion about this is on YARN-896.
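For background on where that progress bar comes from: the RM only knows the float the AM passes on each allocate heartbeat. A minimal Java sketch (class and method names are illustrative) of the heartbeat loop a long-lived AM ends up with today, reporting an arbitrary constant because the service has no real notion of progress:

{noformat}
import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

public class LongLivedHeartbeat {
  // The float passed to allocate() is the "progress" the RM web UI renders as a
  // bar; a service that never finishes has no meaningful value to put here.
  static void heartbeatForever(AMRMClient<ContainerRequest> amClient) throws Exception {
    while (true) {
      AllocateResponse response = amClient.allocate(0.5f); // arbitrary constant "progress"
      // react to response.getAllocatedContainers() and completed containers here
      Thread.sleep(1000);
    }
  }
}
{noformat}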
[jira] [Created] (YARN-810) Support CGroup ceiling enforcement on CPU
Chris Riccomini created YARN-810:
-------------------------------------

             Summary: Support CGroup ceiling enforcement on CPU
                 Key: YARN-810
                 URL: https://issues.apache.org/jira/browse/YARN-810
             Project: Hadoop YARN
          Issue Type: Bug
          Components: nodemanager
    Affects Versions: 2.1.0-beta, 2.0.5-alpha
            Reporter: Chris Riccomini


Problem statement:

YARN currently lets you define an NM's pcore count and a pcore:vcore ratio. Containers are then allowed to request vcores between the minimum and maximum defined in yarn-site.xml. In the case where a single-threaded container requests 1 vcore, with a pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of the core it's on, provided that no other container is also using it. This happens even though the only guarantee that YARN/CGroups is making is that the container will get at least 1/4th of the core. If a second container then comes along, the second container can take resources from the first, provided that the first container is still getting at least its fair share (1/4th).

There are certain cases where this is desirable. There are also certain cases where it might be desirable to have a hard limit on CPU usage, and not allow the process to go above the specified resource requirement, even if CPU is available. Here's an RFC that describes the problem in more detail: http://lwn.net/Articles/336127/

Solution:

As it happens, when CFS is used in combination with CGroups, you can enforce a ceiling using two files in cgroups:

{noformat}
cpu.cfs_quota_us
cpu.cfs_period_us
{noformat}

The usage of these two files is documented in more detail here: https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html

Testing:

I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it behaves as described above (it is a soft cap, and allows containers to use more than they asked for). I then tested CFS CPU quotas manually with YARN.

First, you can see that CFS is in use in the CGroup, based on the file names:

{noformat}
[criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
total 0
-r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
-r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
-rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
-rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
100000
[criccomi@eat1-qa464 ~]$ sudo -u app cat /cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
-1
{noformat}

Oddly, it appears that the cfs_period_us is set to .1s, not 1s.

We can place processes under hard limits. I have process 4370 running YARN container container_1371141151815_0003_01_03 on a host. By default, it's running at ~300% cpu usage.

{noformat}
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
{noformat}

When I set the CFS quota:

{noformat}
echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4370 criccomi  20   0 1157m 563m  14m S   1.0  0.8  90:08.39 ...
{noformat}

It drops to 1% usage, and you can see the box has room to spare:

{noformat}
Cpu(s): 2.4%us, 1.0%sy, 0.0%ni, 92.2%id, 4.2%wa, 0.0%hi, 0.1%si, 0.0%st
{noformat}

Turning the quota back to -1:

{noformat}
echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
{noformat}

Burns the cores again:

{noformat}
Cpu(s): 11.1%us, 1.7%sy, 0.0%ni, 83.9%id, 3.1%wa, 0.0%hi, 0.2%si, 0.0%st

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
{noformat}

On my dev box, I was testing CGroups by running a python process eight times, to burn through all the cores, since it was doing as described above (giving extra CPU to the process, even with a cpu.shares limit). Toggling the cfs_quota_us seems to enforce a hard limit.

Implementation:

What do you guys think about introducing a variable to YarnConfiguration:

bq. yarn.nodemanager.linux-container.executor.cgroups.cpu-ceiling-enforcement

The default would be false. Setting it to true would cause YARN's LCE to set:

{noformat}
cpu.cfs_quota_us=(container-request-vcores/nm-vcore-to-pcore-ratio) * 100
cpu.cfs_period_us=100
{noformat}

For example, if a container asks for 2 vcores, and the vcore:pcore ratio
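The description above is cut off mid-example. Purely to illustrate the arithmetic being proposed (a sketch, not the actual LCE change; the class and method names and the 100000us period are assumptions based on the value observed earlier):

{noformat}
import java.io.FileWriter;
import java.io.IOException;

public class CfsCeilingSketch {
  static final int CFS_PERIOD_US = 100000; // 0.1s, as observed on the test host above

  // Derive a CFS quota from the container's vcore request and the NM's
  // vcore:pcore ratio, then write it into the container's cgroup directory.
  static void enforceCeiling(String containerCgroupDir, int requestedVcores,
                             float vcoreToPcoreRatio) throws IOException {
    // e.g. 2 vcores with a 4:1 vcore:pcore ratio -> half of one physical core
    int quotaUs = (int) (CFS_PERIOD_US * (requestedVcores / vcoreToPcoreRatio));

    try (FileWriter period = new FileWriter(containerCgroupDir + "/cpu.cfs_period_us");
         FileWriter quota = new FileWriter(containerCgroupDir + "/cpu.cfs_quota_us")) {
      period.write(Integer.toString(CFS_PERIOD_US));
      quota.write(Integer.toString(quotaUs));
    }
  }
}
{noformat}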
[jira] [Created] (YARN-799) CgroupsLCEResourcesHandler tries to write to cgroup.procs
Chris Riccomini created YARN-799:
-------------------------------------

             Summary: CgroupsLCEResourcesHandler tries to write to cgroup.procs
                 Key: YARN-799
                 URL: https://issues.apache.org/jira/browse/YARN-799
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Chris Riccomini


The implementation of

bq. ./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java

tells the container-executor to write PIDs to cgroup.procs:

{noformat}
public String getResourcesOption(ContainerId containerId) {
  String containerName = containerId.toString();
  StringBuilder sb = new StringBuilder("cgroups=");
  if (isCpuWeightEnabled()) {
    sb.append(pathForCgroup(CONTROLLER_CPU, containerName) + "/cgroup.procs");
    sb.append(",");
  }
  if (sb.charAt(sb.length() - 1) == ',') {
    sb.deleteCharAt(sb.length() - 1);
  }
  return sb.toString();
}
{noformat}

Apparently, this file has not always been writeable:

https://patchwork.kernel.org/patch/116146/
http://lkml.indiana.edu/hypermail/linux/kernel/1004.1/00536.html
https://lists.linux-foundation.org/pipermail/containers/2009-July/019679.html

The RHEL version of the Linux kernel that I'm using has a CGroup module with a non-writeable cgroup.procs file.

{noformat}
$ uname -a
Linux criccomi-ld 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
{noformat}

As a result, when the container-executor tries to run, it fails with this error message:

bq. fprintf(LOGFILE, "Failed to write pid %s (%d) to file %s - %s\n",

This is because the executor is given a resource by the CgroupsLCEResourcesHandler that includes cgroup.procs, which is non-writeable:

{noformat}
$ pwd
/cgroup/cpu/hadoop-yarn/container_1370986842149_0001_01_01
$ ls -l
total 0
-r--r--r-- 1 criccomi eng 0 Jun 11 14:43 cgroup.procs
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_period_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_runtime_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.shares
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 notify_on_release
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 tasks
{noformat}

I patched CgroupsLCEResourcesHandler to use /tasks instead of /cgroup.procs, and this appears to have fixed the problem.

I can think of several potential resolutions to this ticket:

1. Ignore the problem, and make people patch YARN when they hit this issue.
2. Write to /tasks instead of /cgroup.procs for everyone.
3. Check permissions on /cgroup.procs prior to writing to it, and fall back to /tasks.
4. Add a config to yarn-site.xml that lets admins specify which file to write to.

Thoughts?
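A rough Java sketch of what option 3 could look like (the class and method names are hypothetical; the real logic would live in CgroupsLCEResourcesHandler):

{noformat}
import java.io.File;

public class CgroupTaskFileChooser {
  // Prefer cgroup.procs when it is writeable; otherwise fall back to the older
  // tasks file, which is what the manual patch described above does.
  static String chooseTaskFile(String cgroupPath) {
    File procs = new File(cgroupPath, "cgroup.procs");
    return procs.canWrite() ? cgroupPath + "/cgroup.procs" : cgroupPath + "/tasks";
  }
}
{noformat}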
[jira] [Created] (YARN-481) Add AM Host and RPC Port to ApplicationCLI Status Output
Chris Riccomini created YARN-481:
-------------------------------------

             Summary: Add AM Host and RPC Port to ApplicationCLI Status Output
                 Key: YARN-481
                 URL: https://issues.apache.org/jira/browse/YARN-481
             Project: Hadoop YARN
          Issue Type: Bug
          Components: client
    Affects Versions: 0.23.6, 2.0.3-alpha
            Reporter: Chris Riccomini


Hey Guys,

I noticed that the ApplicationCLI is just randomly not printing some of the values in the ApplicationReport. I've added getHost and getRpcPort. These are useful for me, since I want to make an RPC call to the AM (not the tracker call).

Thanks!
Chris
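For reference, both values come straight off the ApplicationReport. A small client-side sketch (using the newer org.apache.hadoop.yarn.client.api.YarnClient API rather than the 0.23/2.0.3-alpha client classes the ticket targets) of fetching and printing them:

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;

public class PrintAmAddress {
  // Print the AM host and RPC port for an application, i.e. the two fields the
  // ticket asks the CLI status output to include.
  static void printAmAddress(ApplicationId appId) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new Configuration());
    yarnClient.start();
    try {
      ApplicationReport report = yarnClient.getApplicationReport(appId);
      System.out.println("AM Host  : " + report.getHost());
      System.out.println("RPC Port : " + report.getRpcPort());
    } finally {
      yarnClient.stop();
    }
  }
}
{noformat}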