[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037335#comment-15037335 ]

Konstantinos Karanasos commented on YARN-2877:
----------------------------------------------

Thank you for the detailed comments, [~leftnoteasy].

Regarding #1:
- Indeed, the AM-LocalRM communication should be much more frequent than the 
LocalRM-RM (and subsequently AM-RM) communication, in order to achieve 
millisecond-latency allocations.
We are planning to address this by using a smaller heartbeat interval for the 
AM-LocalRM communication than for the LocalRM-RM one. For instance, the 
AM-LocalRM heartbeat interval can be set to 50ms and the LocalRM-RM interval 
to 200ms (in other words, we will propagate only one in every four heartbeats 
to the RM); see the first sketch after this list.
We will soon create a sub-JIRA for this.
- Each NM will periodically estimate its expected queue wait time (YARN-2886). 
This can simply be based on the number of tasks currently in its queue, or 
(even better) on the estimated execution times of those tasks, in case they 
are available. This expected queue wait time is then pushed through the NM-RM 
heartbeats to the ClusterMonitor (YARN-4412), which runs as a service in the 
RM. The ClusterMonitor gathers this information from all nodes, periodically 
computes the least loaded ones (i.e., those with the smallest queue wait 
times), and adds that list to the heartbeat response, so that all nodes (and 
in turn all LocalRMs) receive it and can use it during scheduling; see the 
second sketch after this list.
Note that simpler solutions (such as the power of two choices used in Sparrow) 
could be employed, but our experiments have shown that the above "top-k node 
list" leads to considerably better placement (and thus load balancing), 
especially when task durations are heterogeneous.
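
To make the first point concrete, here is a minimal Java sketch of the 
heartbeat gating (class and method names are hypothetical, not part of the 
actual patch): the LocalRM heartbeats with the AM every 50ms but forwards 
only one in every four heartbeats to the RM.

{code:java}
/**
 * Hypothetical sketch, not the YARN-2877 API: decides which AM->LocalRM
 * heartbeats are also propagated to the central RM.
 */
public class LocalRMHeartbeatGate {
  // Example intervals from the comment above.
  static final int AM_INTERVAL_MS = 50;   // AM <-> LocalRM
  static final int RM_INTERVAL_MS = 200;  // LocalRM <-> RM
  // Propagate one in every four AM heartbeats (200 / 50 = 4).
  static final int RATIO = RM_INTERVAL_MS / AM_INTERVAL_MS;

  private int heartbeatsSeen = 0;

  /** Called on every AM->LocalRM heartbeat; returns true when this
   *  heartbeat should also be forwarded to the RM. */
  public synchronized boolean shouldForwardToRM() {
    return ++heartbeatsSeen % RATIO == 0;
  }
}
{code}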

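And a rough sketch of the ClusterMonitor side of the "top-k node list" 
(again with hypothetical names; YARN-4412 defines the real service): each NM 
reports its estimated queue wait time on the NM-RM heartbeat, and the monitor 
periodically extracts the k nodes with the smallest estimates to ship back in 
the heartbeat responses.

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of the "top-k node list" computation. */
public class TopKNodeMonitor {
  // Latest estimated queue wait time (ms) reported by each node.
  private final Map<String, Long> waitTimeMs = new ConcurrentHashMap<>();

  /** Invoked when an NM-RM heartbeat carries a fresh estimate. */
  public void onNodeHeartbeat(String nodeId, long estimatedWaitMs) {
    waitTimeMs.put(nodeId, estimatedWaitMs);
  }

  /** The k nodes with the smallest expected queue wait times; this is
   *  the list added to heartbeat responses for the LocalRMs. */
  public List<String> leastLoadedNodes(int k) {
    List<Map.Entry<String, Long>> entries =
        new ArrayList<>(waitTimeMs.entrySet());
    entries.sort(Map.Entry.comparingByValue());
    List<String> topK = new ArrayList<>();
    for (int i = 0; i < Math.min(k, entries.size()); i++) {
      topK.add(entries.get(i).getKey());
    }
    return topK;
  }
}
{code}
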
Regarding #2:
This is a valid concern. The best way to minimize preemption is through the 
"top-k node list" technique described above: since the LocalRM will be placing 
QUEUEABLE containers on the least loaded nodes, preemptions should be rare.
Additional techniques can further mitigate the problem. For instance, we can 
"promote" a QUEUEABLE container to a GUARANTEED one once it has been preempted 
more than k times (a sketch of this policy is given below).
Moreover, we can dynamically limit the number of QUEUEABLE containers accepted 
by a node in case of excessive load due to GUARANTEED containers.
That said, as you also mention, QUEUEABLE containers are more suitable for 
short-running tasks, where the probability of a container being preempted is 
smaller.
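
The promotion policy could look roughly like the following (hypothetical 
names; the real execution types and APIs will be defined in the sub-JIRAs): 
count the preemptions of each QUEUEABLE container and re-request it as 
GUARANTEED once the count crosses the threshold k.

{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of promoting QUEUEABLE containers to GUARANTEED. */
public class QueueablePromotionPolicy {
  enum ExecutionType { QUEUEABLE, GUARANTEED }

  private final int maxPreemptions;  // the "k" from the comment above
  private final Map<String, Integer> preemptionCounts =
      new ConcurrentHashMap<>();

  public QueueablePromotionPolicy(int maxPreemptions) {
    this.maxPreemptions = maxPreemptions;
  }

  /** Called whenever a QUEUEABLE container is preempted; returns the
   *  execution type to use when the container is re-requested. */
  public ExecutionType onPreempted(String containerId) {
    int count = preemptionCounts.merge(containerId, 1, Integer::sum);
    return count >= maxPreemptions
        ? ExecutionType.GUARANTEED : ExecutionType.QUEUEABLE;
  }
}
{code}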

> Extend YARN to support distributed scheduling
> ---------------------------------------------
>
>                 Key: YARN-2877
>                 URL: https://issues.apache.org/jira/browse/YARN-2877
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, resourcemanager
>            Reporter: Sriram Rao
>            Assignee: Konstantinos Karanasos
>         Attachments: distributed-scheduling-design-doc_v1.pdf
>
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed 
> scheduling.  Briefly, some of the motivations for distributed scheduling are 
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on 
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency for tasks where the scheduling time dominates 
> (i.e., the task execution time is much shorter than the time required for 
> obtaining a container from the RM).
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
