[ https://issues.apache.org/jira/browse/YARN-8320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16501927#comment-16501927 ]
Weiwei Yang commented on YARN-8320:
-----------------------------------

Hi [~miklos.szeg...@cloudera.com]/[~bibinchundatt], thanks for the comments; here are my thoughts.

{quote}I think is that cpu and cpuset are not the same resource in cgroups.{quote}

Actually we don't want to treat *cpuset* as a resource; it is an isolation technique applied to a certain amount of cpu resource. It is not straightforward to express it as a resource in a request, as the example you pointed out shows. The purpose of this Jira is to give users a way to isolate or partially share cpu resources between containers. As a first step, when we support the exclusive type: once a container is started, we will set cpu quota/shares on it to limit resource usage, and bind the container to a corresponding number of processors for isolation. So we still prefer the "simplified" path; the "resource" path doesn't seem clear to me.

{quote}If we don't have slots to bind container will be rejecting the container start request. Which will be considered as failed. Scheduler could again allocate container to same nodemanager rt ??{quote}

Well, when a container fails to start like this, in most cases it will not be able to start on any node, because it means the #vcores in the request is too small. We are adding some pre-checks for such conditions so we can fail fast; that should help.

{quote}When nm processors/ nm vcores < 1 and share mode have you considered strictness per containers ?? ie using the periods and quota also along with Cpuset assignment ?? If no other process is using cpu then process will be consuming more than what its supposed to rt ??{quote}

Yes, cpu quota and shares will also be set for containers; that is what we depend on to limit the actual cpu usage of each container. We just need to make sure the values set for them are reasonable once containers are bound to certain processors. Thanks for pointing this out, a very good concern.
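To make the quota/shares point concrete, here is a minimal sketch of how a container's requested vcores might be translated into cpu.cfs_quota_us, cpu.shares, and a number of processors to pin via cpuset.cpus. This is an illustration, not YARN's actual code: the function name and the exact vcore-to-processor mapping are assumptions, though the constants (100 ms CFS period, 1024 shares per vcore) follow common cgroup conventions.

```python
import math

CFS_PERIOD_US = 100_000   # default cpu.cfs_period_us (100 ms)
SHARES_PER_VCORE = 1024   # conventional cpu.shares weight per vcore

def cgroup_cpu_settings(vcores, nm_vcores, nm_processors):
    """Illustrative mapping from a container's requested vcores to cgroup values."""
    # How many physical CPUs this request is worth on the node.
    cpus_worth = vcores * nm_processors / nm_vcores
    # Processors to pin via cpuset.cpus (at least one).
    bound = max(1, math.ceil(cpus_worth))
    # Hard cap via cpu.cfs_quota_us: without it, a container pinned to a
    # whole processor could consume more than its share when the node
    # is idle (the "nm processors / nm vcores < 1" case above).
    quota_us = int(cpus_worth * CFS_PERIOD_US)
    # Relative weight via cpu.shares.
    shares = vcores * SHARES_PER_VCORE
    return bound, quota_us, shares
```

For example, with 16 vcores configured on an 8-processor node, a 1-vcore container is worth half a processor: it gets pinned to 1 processor but capped at a 50 ms quota per 100 ms period, so it cannot monopolize that processor.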
{quote}Using fixed set of folders for assignment in Allocator (Reduce overload of creation and deletion on containers.{quote}

That makes sense.

{quote}Resource calculation could go wrong incase of preemption of containers rt . kill reject could get processed after container start.{quote}

Resource calculation won't go wrong, since we don't count cpuset as a resource. But when a container is killed, we will need to make sure this case is handled and the cgroups get updated correctly. Thanks for pointing this out; I will add more details in the next version of the design doc. Thanks.

> [Umbrella] Support CPU isolation for latency-sensitive (LS) service
> -------------------------------------------------------------------
>
> Key: YARN-8320
> URL: https://issues.apache.org/jira/browse/YARN-8320
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager
> Reporter: Jiandan Yang
> Priority: Major
> Attachments: CPU-isolation-for-latency-sensitive-services-v1.pdf, CPU-isolation-for-latency-sensitive-services-v2.pdf, YARN-8320.001.patch
>
> Currently NodeManager uses “cpu.cfs_period_us”, “cpu.cfs_quota_us” and “cpu.shares” to isolate cpu resource. However:
> * Linux Completely Fair Scheduling (CFS) is a throughput-oriented scheduler with no support for differentiated latency.
> * Request latency of services running in containers can fluctuate frequently when all containers share cpus, which latency-sensitive services cannot afford in our production environment.
> So we need more fine-grained cpu isolation.
> Here we propose a solution that uses cgroup cpuset to bind containers to different processors; this is inspired by the isolation technique in the [Borg system|http://schd.ws/hosted_files/lcccna2016/a7/CAT%20@%20Scale.pdf].
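Regarding the allocator with a fixed set of slots and the preemption/kill concern discussed above, a minimal sketch (hypothetical names, not the actual patch) of an exclusive-processor allocator could look like the following: it rejects a start request when no slots are free (the fail-fast path), and releases processors on kill so the cgroups stay consistent.

```python
class CpusetAllocator:
    """Illustrative exclusive-processor allocator over a fixed pool."""

    def __init__(self, processors):
        self.free = set(processors)   # processors not bound exclusively
        self.assigned = {}            # container_id -> frozenset(processors)

    def allocate_exclusive(self, container_id, count):
        # Not enough free processors: reject the container start request.
        if len(self.free) < count:
            return None
        picked = frozenset(sorted(self.free)[:count])
        self.free -= picked
        self.assigned[container_id] = picked
        return picked

    def release(self, container_id):
        # Must also run on preemption/kill, even if the kill races with
        # container start, so the cpuset cgroup is updated correctly.
        picked = self.assigned.pop(container_id, frozenset())
        self.free |= picked
```

Reusing a fixed pool of slots this way also matches the suggestion above to avoid repeated creation and deletion of cgroup directories per container.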
-- This message was sent by Atlassian JIRA (v7.6.3#76005)