[ 
https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13726380#comment-13726380
 ] 

Steve Loughran commented on YARN-972:
-------------------------------------


bq. Nodes in probably the majority of clusters are configured with more slots 
than cores. This is sensible because many types of task do a lot of IO and do 
not even saturate half of a single core.

IO-intensive workloads mean the cost of context switching oversubscribed 
CPU-intensive threads is rarely visible. What is visible is the cost of 
swapping RAM, which is why RAM is used as the primary slot metric, even though 
CPUs are sometimes idling. (That idle time triggers reduced power use, so it 
actually has some unintended benefit.)

bq. In what way are we optimising for 4-8 cores?

By assuming CPU architectures will continue to consist of a small number of 
homogeneous, powerful cores in a small number of sockets. If you look at 
future systems like [Intel Xeon 
Phi|http://www.intel.com/content/www/us/en/processors/xeon/xeon-phi-detail.html]
 & HP Moonshot, the trend appears to be towards many less powerful parts, such 
as the classic P5 core or ARM parts. We should be planning for more cores/box, 
and even less uniform memory access, rather than fractional allocation of 
today's "achieve speed through watt-hungry clock speeds" parts.

The other trend is towards heterogeneous CPU parts (GPUs as well as Phi 
co-processors), so Allen's goal of "ask for specific ISAs" may become more 
relevant, especially as you start upgrading the cluster.

bq. "I recently added machines with more or beefier CPUs to my cluster. I would 
like to run more concurrent tasks on these machines than on other machines."

I've seen this: on MR workloads the faster boxes finish their work sooner, so 
they ask for more tasks as they report in. It can actually lead to unbalanced 
HDFS data, as more blocks get generated on the faster machines.

If you are using RAM as your slot allocation metric, and are also getting more 
RAM per server (which the cost curve of RAM makes a no-brainer), RAM-based 
container allocation will give the new boxes more work anyway.
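
To make that arithmetic concrete, here is a minimal sketch (illustrative only, 
not scheduler code; the memory sizes and container size are made up):

{code:java}
// Minimal sketch: with memory as the scheduling unit, a node's container
// count scales with its RAM, so newer, bigger-memory boxes pick up more work
// without any per-CPU tuning. The figures below are invented for illustration.
public class MemorySlots {
  static int containersFor(int nodeMemoryMB, int containerMemoryMB) {
    return nodeMemoryMB / containerMemoryMB;
  }

  public static void main(String[] args) {
    int containerMB = 2048;                                  // a typical task container request
    System.out.println(containersFor(49152, containerMB));   // old 48GB box -> 24 containers
    System.out.println(containersFor(98304, containerMB));   // new 96GB box -> 48 containers
  }
}
{code}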

bq. "I recently added machines with more or beefier CPUs to my cluster. I would 
like my jobs to run at predictable speeds."

Determinism isn't something you can get with other workloads on the same 
system, not if they are IO or Net intensive, even with a 1:1 mapping of 
threads to physical cores. You can verify this by running the same query at 
different times of day on the same cluster.

bq.  "CPUs vary widely in the world, but I would like to be able to take my job 
to another cluster and have it run at a similar speed."

see above.

Regarding EC2 machine size requests: if you spend time there you end up 
noticing that performance varies based both on the CPU part you actually get 
and on what else the box is up to. Detecting and releasing nodes where some 
other workload is hurting yours is a common technique, and a paper last year 
[discussed doing the same for CPU 
versions|https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final40.pdf].
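
A rough, purely hypothetical sketch of that "benchmark, then keep or release" 
technique (the 400ms cut-off is a made-up placeholder, and the release call is 
left as a comment since it depends on whatever provisioning API you use):

{code:java}
// Hypothetical sketch: time a fixed chunk of CPU work on a freshly allocated
// instance and hand the instance back if it is too slow, i.e. an older CPU
// part or a noisy-neighbour victim.
public class NodeProbe {

  /** Time a fixed amount of arithmetic; a lower result means a faster CPU part. */
  static long benchmarkMillis() {
    long start = System.nanoTime();
    double sink = 0;
    for (int i = 0; i < 50_000_000; i++) {
      sink += Math.sqrt(i);                      // fixed, purely CPU-bound work
    }
    System.out.println("benchmark checksum: " + sink);
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    long millis = benchmarkMillis();
    long acceptableMillis = 400;                 // assumed threshold, tune per fleet
    if (millis > acceptableMillis) {
      System.out.println("Too slow (" + millis + " ms): release this instance");
      // call your cloud provisioning layer here to give the node back
    } else {
      System.out.println("Keeping this instance (" + millis + " ms)");
    }
  }
}
{code}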

For that reason, even an abstract core-cost for use across clusters or across 
a heterogeneous single cluster is unlikely to work; it's just a diversion of 
R&D time and a source of more long-term maintenance costs. It's the latter 
that I fear the most, as we all share that cost.

What could work is for a YARN app to be able to say "IO intensive | CPU 
intensive | Net intensive" when requesting a node and have that used as a hint 
in the schedulers (a rough sketch of the idea follows). So AW can deploy 
Giraph nodes that are CPU & Net hungry, and the scheduler will know that some 
IO-heavy work can also go there, but not other Net-heavy code.
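
Nothing like this exists in the YARN APIs today; the following is just to make 
the idea concrete, with all names invented:

{code:java}
// Purely hypothetical sketch of the hint idea: an app tags its container
// requests with the resource it will stress, and the scheduler uses the tags
// as a soft packing rule, avoiding stacking two requests that hammer the same
// resource on one node.
enum WorkloadHint { CPU_INTENSIVE, IO_INTENSIVE, NET_INTENSIVE }

class HintedScheduler {
  /** Soft check: co-locating two tasks is fine unless they stress the same resource. */
  static boolean canColocate(WorkloadHint incoming, Iterable<WorkloadHint> alreadyOnNode) {
    for (WorkloadHint running : alreadyOnNode) {
      if (running == incoming) {
        return false;   // e.g. don't put two NET_INTENSIVE Giraph workers together
      }
    }
    return true;
  }
}
{code}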
                
> Allow requests and scheduling for fractional virtual cores
> ----------------------------------------------------------
>
>                 Key: YARN-972
>                 URL: https://issues.apache.org/jira/browse/YARN-972
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, scheduler
>    Affects Versions: 2.0.5-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go 
> deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first 
> is that a cluster might have heterogeneous hardware and that the processing 
> power of different makes of cores can vary wildly.  The second is that 
> different (combinations of) workloads can require different levels of 
> granularity.  E.g. one admin might want every task on their cluster to use at 
> least a core, while another might want applications to be able to request 
> quarters of cores.  The former would configure a single vcore per core.  The 
> latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal. 
>  Having virtual cores refer to different magnitudes of processing power on 
> different clusters will make the difficult problem of deciding how many cores 
> to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a 
> machine based on measured CPU-consumption, should work as a complement to 
> fine-granularity scheduling.  Dynamic oversubscription is never going to be 
> perfect, as the amount of CPU a process consumes can vary widely over its 
> lifetime.  A task that first loads a bunch of data over the network and then 
> performs complex computations on it will suffer if additional CPU-heavy tasks 
> are scheduled on the same node because its initial CPU-utilization was low.  
> To guard against this, we will need to be conservative with how we 
> dynamically oversubscribe.  If a user wants to explicitly hint to the 
> scheduler that their task will not use much CPU, the scheduler should be able 
> to take this into account.
> On YARN-2, there are concerns that including floating point arithmetic in the 
> scheduler will slow it down.  I question this assumption, and it is perhaps 
> worth debating, but I think we can sidestep the issue by multiplying 
> CPU-quantities inside the scheduler by a decently sized number like 1000 and 
> keep doing the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change 
> to delay 2.1.0-beta.
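
To make the integer-scaling idea in the description above concrete, here is a 
minimal sketch (the factor of 1000 is the one suggested there; the class and 
method names are mine):

{code:java}
// Minimal illustration of the fixed-point idea: keep CPU quantities as
// integer milli-vcores (scaled by 1000) so the scheduler's comparisons and
// arithmetic never touch floating point.
public class MilliVcores {
  static final int SCALE = 1000;                 // 1 vcore = 1000 milli-vcores

  static int toMilliVcores(double vcores) {
    return (int) Math.round(vcores * SCALE);     // 0.25 vcores -> 250
  }

  static boolean fits(int requestedMilli, int availableMilli) {
    return requestedMilli <= availableMilli;     // pure integer comparison
  }

  public static void main(String[] args) {
    int request = toMilliVcores(0.25);           // a quarter of a core
    int nodeCapacity = 8 * SCALE;                // an 8-core node
    System.out.println(fits(request, nodeCapacity));   // true
  }
}
{code}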

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
