Re: Distributed shell with host affinity and relaxLocality

2015-03-17 Thread Chris Riccomini
Any ideas?

On 3/16/15 1:31 PM, Chris Riccomini criccom...@linkedin.com wrote:

+ Navina

Hey Karthik,

YARN 2.6.0 FairShare.

Cheers,
Chris

On 3/16/15 1:28 PM, Karthik Kambatla ka...@cloudera.com wrote:

Hey Chris

What scheduler/version is this?

On Mon, Mar 16, 2015 at 12:01 PM, Chris Riccomini 
criccom...@linkedin.com.invalid wrote:

 Hey all,

 We have been testing YARN with host-specific ContainerRequests. For our
 tests, we've been using the DistributedShell example. We've applied
 YARN-1974, which allows us to specify node lists, relax locality, etc.
 Everything seems to work as expected when we have relaxLocality set to
 false, and we request a specific host.

 When we set relaxLocality to true, things get weird. We run three nodes:
 node1, node2, and node3. When we start DistributedShell, we configure it
 (via CLI params) to use two containers, and have a host-level request for
 node3. What we observe is that the AM and one container both end up on
 node2, and a third container ends up on node3. There are enough resources
 for node3 to handle both containers, but the second one doesn't end up
 there. We also notice that the DistributedShell app wedges because the
 container on node3 never completes.

 What is the expected behavior here? This seems to be broken.

 Cheers,
 Chris




-- 
Karthik Kambatla
Software Engineer, Cloudera Inc.

http://five.sentenc.es




Re: Distributed shell with host affinity and relaxLocality

2015-03-16 Thread Chris Riccomini
+ Navina

Hey Karthik,

YARN 2.6.0 FairShare.

Cheers,
Chris

On 3/16/15 1:28 PM, Karthik Kambatla ka...@cloudera.com wrote:

Hey Chris

What scheduler/version is this?

On Mon, Mar 16, 2015 at 12:01 PM, Chris Riccomini 
criccom...@linkedin.com.invalid wrote:

 Hey all,

 We have been testing YARN with host-specific ContainerRequests. For our
 tests, we've been using the DistributedShell example. We've applied
 YARN-1974, which allows us to specify node lists, relax locality, etc.
 Everything seems to work as expected when we have relaxLocality set to
 false, and we request a specific host.

 When we set relaxLocality to true, things get weird. We run three nodes:
 node1, node2, and node3. When we start DistributedShell, we configure it
 (via CLI params) to use two containers, and have a host-level request for
 node3. What we observe is that the AM and one container both end up on
 node2, and a third container ends up on node3. There are enough resources
 for node3 to handle both containers, but the second one doesn't end up
 there. We also notice that the DistributedShell app wedges because the
 container on node3 never completes.

 What is the expected behavior here? This seems to be broken.

 Cheers,
 Chris




-- 
Karthik Kambatla
Software Engineer, Cloudera Inc.

http://five.sentenc.es



Distributed shell with host affinity and relaxLocality

2015-03-16 Thread Chris Riccomini
Hey all,

We have been testing YARN with host-specific ContainerRequests. For our tests, 
we've been using the DistributedShell example. We've applied YARN-1974, which 
allows us to specify node lists, relax locality, etc. Everything seems to work 
as expected when we have relaxLocality set to false, and we request a specific 
host.
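
For reference, here is roughly what a host-specific request looks like through
the AMRMClient API (a sketch, not our exact code; "node3", the sizes, and
amClient are placeholders):

  Resource capability = Records.newRecord(Resource.class);
  capability.setMemory(1024);
  capability.setVirtualCores(1);
  Priority priority = Priority.newInstance(0);

  // nodes={"node3"}, racks=null, relaxLocality=false: the scheduler must wait
  // for node3 rather than falling back to the rack or to any node.
  ContainerRequest onNode3 = new ContainerRequest(
      capability, new String[] { "node3" }, null, priority, false);
  amClient.addContainerRequest(onNode3);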

When we set relaxLocality to true, things get weird. We run three nodes: node1, 
node2, and node3. When we start DistributedShell, we configure it (via CLI 
params) to use two containers, and have a host-level request for node3. What we 
observe is that the AM and one container both end up on node2, and a third 
container ends up on node3. There are enough resources for node3 to handle both 
containers, but the second one doesn't end up there. We also notice that the 
DistributedShell app wedges because the container on node3 never completes.

What is the expected behavior here? This seems to be broken.

Cheers,
Chris


Host-specific ContainerRequest ignored

2014-05-15 Thread Chris Riccomini
Hey Guys,

I am creating a container request:


  protected def requestContainers(memMb: Int, cpuCores: Int, containers: Int) {
    info("Requesting %d container(s) with %dmb of memory" format (containers, memMb))

    val capability = Records.newRecord(classOf[Resource])
    val priority = Records.newRecord(classOf[Priority])
    priority.setPriority(0)
    capability.setMemory(memMb)
    capability.setVirtualCores(cpuCores)

    (0 until containers).foreach(idx => amClient.addContainerRequest(new ContainerRequest(capability, null, null, priority)))
  }

This pretty closely mirrors the distributed shell example.

If I put an array with a host string in the ContainerRequest, YARN seems to 
completely ignore this request, and continues to put all containers on one or 
two nodes in the grid, which aren't the ones I requested, even though the grid 
is completely empty, and there are 15 nodes available. This also holds true if 
I put false for relax locality. I'm running the CapacityScheduler with a 
node-locality-delay set to 40. Previously, I tried the FifoScheduler, and it 
exhibited the same behavior.

All NMs are just using the /default-rack for their rack. The strings that I'm 
putting in the hosts String[] parameter in ContainerRequest are hard coded to 
exactly match the NodeIds being listed in the NMs.
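
A quick way to see what the scheduler actually did with those requests is to
log the host for every allocated container, e.g. in the AMRMClientAsync
callback handler (a rough Java sketch; LOG and the handler class are assumed):

  @Override
  public void onContainersAllocated(List<Container> containers) {
    for (Container container : containers) {
      // Compare this host against the hosts passed into the ContainerRequest.
      LOG.info("Allocated " + container.getId() + " on host "
          + container.getNodeId().getHost());
    }
  }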

What am I doing wrong? I feel like I'm missing some configuration on the 
capacity scheduler or NMs or something.

Cheers,
Chris


Re: Capacity scheduler puts all containers on one box

2014-03-22 Thread Chris Riccomini
Hey Guys,

@Vinod: We aren't overriding the default, so we must be using -1 as the
setting.

@Sandy: We aren't specifying any racks/hosts when sending the resource
requests. +1 regarding introducing a similar limit in capacity scheduler.

Any recommended workarounds in the meantime? Our utilization of the grid
is very low because we're having to force high memory requests for the
containers in order to guarantee a maximum number of containers on a
single node (e.g. setting container memory to 17GB to disallow more
than 2 containers from being assigned to any one 48GB node).
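
To spell out the arithmetic behind that workaround (a rough sketch using our
numbers; nothing here is YARN code):

  int nodeCapacityMb = 48 * 1024;   // 48GB NodeManager
  int maxContainersPerNode = 2;     // desired cap, as a proxy for disk
  // Requesting just over capacity / (maxContainersPerNode + 1) guarantees that
  // at most maxContainersPerNode containers fit on one node:
  int requestMb = nodeCapacityMb / (maxContainersPerNode + 1) + 1;  // ~16GB; we round up to 17GB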

Cheers,
Chris

On 3/21/14 11:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote:

yarn.scheduler.capacity.node-locality-delay will help if the app is
requesting containers at particular locations, but won't help spread
things out evenly otherwise.

The Fair Scheduler attempts an even spread.  By default, it only schedules
a single container each time it considers a node.  Decoupling scheduling
from node heartbeats (YARN-1010) makes it so that a high node heartbeat
interval doesn't result in this being slow.  Now that the Capacity
Scheduler has similar capabilities (YARN-1512), it might make sense to
introduce a similar limit?

-Sandy


On Fri, Mar 21, 2014 at 4:42 PM, Vinod Kumar Vavilapalli
vino...@apache.org
 wrote:

 What's the value for yarn.scheduler.capacity.node-locality-delay? It is
 -1 by default in 2.2.

 We fixed the default to be a reasonable 40 (nodes in a rack) in 2.3.0
 that should spread containers a bit.

 Thanks,
 +Vinod

 On Mar 21, 2014, at 12:48 PM, Chris Riccomini criccom...@linkedin.com
 wrote:

  Hey Guys,
 
  We're running YARN 2.2 with the capacity scheduler. Each NM is running
 with 40G of memory capacity. When we request a series of containers with
 2G of memory from a single AM, we see the RM assigning them entirely to
 one NM until that NM is full, and then moving on to the next, and so on.
 Essentially, we have a grid with 20 nodes, and two are completely full,
 and the rest are completely empty. This is problematic because our
 containers use disk heavily, and are completely saturating the disks on
 the two nodes, which slows all of the containers down on these NMs.

   1.  Is this expected behavior of the capacity scheduler? What about
 the fifo scheduler?
   2.  Is the recommended work around just to increase memory allocation
 per-container as a proxy for the disk capacity that's required? Given
 that there's no disk-level isolation, and no disk-level resource, I don't
 see another way around this.

  Cheers,
  Chris






Capacity scheduler puts all containers on one box

2014-03-21 Thread Chris Riccomini
Hey Guys,

We're running YARN 2.2 with the capacity scheduler. Each NM is running with 40G 
of memory capacity. When we request a series of containers with 2G of memory from 
a single AM, we see the RM assigning them entirely to one NM until that NM is 
full, and then moving on to the next, and so on. Essentially, we have a grid 
with 20 nodes, and two are completely full, and the rest are completely empty. 
This is problematic because our containers use disk heavily, and are completely 
saturating the disks on the two nodes, which slows all of the containers down 
on these NMs.

  1.  Is this expected behavior of the capacity scheduler? What about the fifo 
scheduler?
  2.  Is the recommended work around just to increase memory allocation 
per-container as a proxy for the disk capacity that's required? Given that 
there's no disk-level isolation, and no disk-level resource, I don't see 
another way around this.

Cheers,
Chris


[jira] [Created] (YARN-1079) Fix progress bar for long-lived services in YARN

2013-08-19 Thread Chris Riccomini (JIRA)
Chris Riccomini created YARN-1079:
-

 Summary: Fix progress bar for long-lived services in YARN
 Key: YARN-1079
 URL: https://issues.apache.org/jira/browse/YARN-1079
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.1.0-beta
Reporter: Chris Riccomini


YARN currently shows a progress bar for jobs in its web UI. This is 
nonsensical for long-lived services, which have no concept of progress. For 
example, with Samza, we have stream processors which run for an indefinite 
amount of time (sometimes forever).

YARN should support jobs without a concept of progress. Some discussion about 
this is on YARN-896.
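
For context, with the async AM client the value behind that progress bar is
whatever the AM's callback handler returns from getProgress(). A long-lived
service can only stub it out, e.g. (a sketch, assuming AMRMClientAsync is in
use):

  @Override
  public float getProgress() {
    // A stream processor has no meaningful notion of progress; a constant is
    // the best we can do today, hence the odd-looking progress bar.
    return 0.0f;
  }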



[jira] [Created] (YARN-810) Support CGroup ceiling enforcement on CPU

2013-06-13 Thread Chris Riccomini (JIRA)
Chris Riccomini created YARN-810:


 Summary: Support CGroup ceiling enforcement on CPU
 Key: YARN-810
 URL: https://issues.apache.org/jira/browse/YARN-810
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.1.0-beta, 2.0.5-alpha
Reporter: Chris Riccomini


Problem statement:

YARN currently lets you define an NM's pcore count, and a pcore:vcore ratio. 
Containers are then allowed to request vcores between the minimum and maximum 
defined in the yarn-site.xml.

In the case where a single-threaded container requests 1 vcore, with a 
pcore:vcore ratio of 1:4, the container is still allowed to use up to 100% of 
the core it's using, provided that no other container is also using it. This 
happens, even though the only guarantee that YARN/CGroups is making is that the 
container will get at least 1/4th of the core.

If a second container then comes along, the second container can take resources 
from the first, provided that the first container is still getting at least its 
fair share (1/4th).

There are certain cases where this is desirable. There are also certain cases 
where it might be desirable to have a hard limit on CPU usage, and not allow 
the process to go above the specified resource requirement, even if it's 
available.

Here's an RFC that describes the problem in more detail:

http://lwn.net/Articles/336127/

Solution:

As it happens, when CFS is used in combination with CGroups, you can enforce a 
ceiling using two files in cgroups:

{noformat}
cpu.cfs_quota_us
cpu.cfs_period_us
{noformat}

The usage of these two files is documented in more detail here:

https://access.redhat.com/site/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Resource_Management_Guide/sec-cpu.html

Testing:

I have tested YARN CGroups using the 2.0.5-alpha implementation. By default, it 
behaves as described above (it is a soft cap, and allows containers to use more 
than they asked for). I then tested CFS CPU quotas manually with YARN.

First, you can see that CFS is in use in the CGroup, based on the file names:

{noformat}
[criccomi@eat1-qa464 ~]$ sudo -u app ls -l /cgroup/cpu/hadoop-yarn/
total 0
-r--r--r-- 1 app app 0 Jun 13 16:46 cgroup.procs
drwxr-xr-x 2 app app 0 Jun 13 17:08 container_1371141151815_0004_01_02
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.cfs_quota_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_period_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.rt_runtime_us
-rw-r--r-- 1 app app 0 Jun 13 16:46 cpu.shares
-r--r--r-- 1 app app 0 Jun 13 16:46 cpu.stat
-rw-r--r-- 1 app app 0 Jun 13 16:46 notify_on_release
-rw-r--r-- 1 app app 0 Jun 13 16:46 tasks
[criccomi@eat1-qa464 ~]$ sudo -u app cat
/cgroup/cpu/hadoop-yarn/cpu.cfs_period_us
100000
[criccomi@eat1-qa464 ~]$ sudo -u app cat
/cgroup/cpu/hadoop-yarn/cpu.cfs_quota_us
-1
{noformat}

Oddly, it appears that the cfs_period_us is set to .1s, not 1s.

We can place hard limits on processes. I have process 4370
running YARN container container_1371141151815_0003_01_03 on a host. By
default, it's running at ~300% cpu usage.

{noformat}
CPU
4370 criccomi  20   0 1157m 551m  14m S 240.3  0.8  87:10.91 ...
{noformat}

When I set the CFS quota:

{noformat}
echo 1000 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
 CPU
4370 criccomi  20   0 1157m 563m  14m S  1.0  0.8  90:08.39 ...
{noformat}

It drops to 1% usage, and you can see the box has room to spare:

{noformat}
Cpu(s):  2.4%us,  1.0%sy,  0.0%ni, 92.2%id,  4.2%wa,  0.0%hi,  0.1%si,
0.0%st
{noformat}

Turning the quota back to -1:

{noformat}
echo -1 > /cgroup/cpu/hadoop-yarn/container_1371141151815_0003_01_03/cpu.cfs_quota_us
{noformat}

Burns the cores again:

{noformat}
Cpu(s): 11.1%us,  1.7%sy,  0.0%ni, 83.9%id,  3.1%wa,  0.0%hi,  0.2%si, 
0.0%st
CPU
4370 criccomi  20   0 1157m 563m  14m S 253.9  0.8  89:32.31 ...
{noformat}

On my dev box, I was testing CGroups by running a python process eight
times, to burn through all the cores, and it behaved as described above
(giving extra CPU to the process, even with a cpu.shares limit). Toggling
the cfs_quota_us seems to enforce a hard limit.

Implementation:

What do you guys think about introducing a variable to YarnConfiguration:

bq. yarn.nodemanager.linux-container.executor.cgroups.cpu-ceiling-enforcement

The default would be false. Setting it to true would cause YARN's LCE to set:

{noformat}
cpu.cfs_quota_us=(container-request-vcores/nm-vcore-to-pcore-ratio) * 1000000
cpu.cfs_period_us=1000000
{noformat}
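
To make that computation concrete, here is a rough Java sketch (names and
numbers are illustrative only, not from a patch):

{noformat}
// Assume the NM knows its configured vcore:pcore ratio and a fixed CFS period.
long periodUs = 1000000L;   // written to cpu.cfs_period_us
int containerVCores = 2;    // what the container requested
int vcoreToPcoreRatio = 4;  // NM-wide ratio
long quotaUs = (long) (((double) containerVCores / vcoreToPcoreRatio) * periodUs);
// quotaUs == 500000 here, i.e. the container is ceiling-limited to half a pcore.
{noformat}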

For example, if a container asks for 2 vcores, and the vcore:pcore ratio

[jira] [Created] (YARN-799) CgroupsLCEResourcesHandler tries to write to cgroup.procs

2013-06-12 Thread Chris Riccomini (JIRA)
Chris Riccomini created YARN-799:


 Summary: CgroupsLCEResourcesHandler tries to write to cgroup.procs
 Key: YARN-799
 URL: https://issues.apache.org/jira/browse/YARN-799
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Chris Riccomini


The implementation of

bq. 
./hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/util/CgroupsLCEResourcesHandler.java

tells the container-executor to write PIDs to cgroup.procs:

  public String getResourcesOption(ContainerId containerId) {
    String containerName = containerId.toString();

    StringBuilder sb = new StringBuilder("cgroups=");

    if (isCpuWeightEnabled()) {
      sb.append(pathForCgroup(CONTROLLER_CPU, containerName) + "/cgroup.procs");
      sb.append(",");
    }

    if (sb.charAt(sb.length() - 1) == ',') {
      sb.deleteCharAt(sb.length() - 1);
    }

    return sb.toString();
  }

Apparently, this file has not always been writeable:

https://patchwork.kernel.org/patch/116146/
http://lkml.indiana.edu/hypermail/linux/kernel/1004.1/00536.html
https://lists.linux-foundation.org/pipermail/containers/2009-July/019679.html

The RHEL version of the Linux kernel that I'm using has a CGroup module that 
has a non-writeable cgroup.procs file.

bq. $ uname -a
Linux criccomi-ld 2.6.32-131.4.1.el6.x86_64 #1 SMP Fri Jun 10 10:54:26 EDT 2011 
x86_64 x86_64 x86_64 GNU/Linux

As a result, when the container-executor tries to run, it fails with this error 
message:

bq. fprintf(LOGFILE, "Failed to write pid %s (%d) to file %s - %s\n",

This is because the executor is given a resource by the 
CgroupsLCEResourcesHandler that includes cgroup.procs, which is non-writeable:

bq. $ pwd 
/cgroup/cpu/hadoop-yarn/container_1370986842149_0001_01_01
$ ls -l
total 0
-r--r--r-- 1 criccomi eng 0 Jun 11 14:43 cgroup.procs
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_period_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.rt_runtime_us
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 cpu.shares
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 notify_on_release
-rw-r--r-- 1 criccomi eng 0 Jun 11 14:43 tasks

I patched CgroupsLCEResourcesHandler to use /tasks instead of /cgroup.procs, 
and this appears to have fixed the problem.

I can think of several potential resolutions to this ticket:

1. Ignore the problem, and make people patch YARN when they hit this issue.
2. Write to /tasks instead of /cgroup.procs for everyone
3. Check permissions on /cgroup.procs prior to writing to it, and fall back 
to /tasks (a rough sketch of this is below).
4. Add a config to yarn-site that lets admins specify which file to write to.
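
A rough sketch of what option 3 might look like (not a patch; note that checking
writeability from the NM process is only an approximation, since the
container-executor writes the PID as a different user):

  private String pidFileForCgroup(String cgroupPath) {
    java.io.File procs = new java.io.File(cgroupPath, "cgroup.procs");
    // Prefer cgroup.procs when the kernel exposes it as writeable; otherwise
    // fall back to the older tasks file.
    if (procs.canWrite()) {
      return cgroupPath + "/cgroup.procs";
    }
    return cgroupPath + "/tasks";
  }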

Thoughts?



[jira] [Created] (YARN-481) Add AM Host and RPC Port to ApplicationCLI Status Output

2013-03-15 Thread Chris Riccomini (JIRA)
Chris Riccomini created YARN-481:


 Summary: Add AM Host and RPC Port to ApplicationCLI Status Output
 Key: YARN-481
 URL: https://issues.apache.org/jira/browse/YARN-481
 Project: Hadoop YARN
  Issue Type: Bug
  Components: client
Affects Versions: 0.23.6, 2.0.3-alpha
Reporter: Chris Riccomini


Hey Guys,

I noticed that the ApplicationCLI is just randomly not printing some of the 
values in the ApplicationReport. I've added the getHost and getRpcPort. These 
are useful for me, since I want to make an RPC call to the AM (not the tracker 
call).
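
For reference, a client-side sketch of why these two fields matter (method names
are from ApplicationReport; yarnClient and appId are assumed to already be set
up):

  ApplicationReport report = yarnClient.getApplicationReport(appId);
  String amHost = report.getHost();
  int amRpcPort = report.getRpcPort();
  // With the CLI printing these two values, a script can make the same RPC
  // connection without writing a YarnClient program at all.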

Thanks!
Chris
