[ https://issues.apache.org/jira/browse/YARN-2877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15037335#comment-15037335 ]
Konstantinos Karanasos commented on YARN-2877:
----------------------------------------------

Thank you for the detailed comments, [~leftnoteasy].

Regarding #1:
- Indeed, the AM-LocalRM communication should be much more frequent than the LocalRM-RM (and subsequently AM-RM) communication in order to achieve millisecond-latency allocations. We plan to address this by using a smaller heartbeat interval for the AM-LocalRM communication than for the LocalRM-RM one. For instance, the AM-LocalRM heartbeat interval can be set to 50ms and the LocalRM-RM interval to 200ms (in other words, only one in every four heartbeats is propagated to the RM). We will soon create a sub-JIRA for this.
- Each NM will periodically estimate its expected queue wait time (YARN-2886). This can simply be based on the number of tasks currently in its queue, or (even better) on the estimated execution times of those tasks, when available. This expected queue wait time is then pushed through the NM-RM heartbeats to the ClusterMonitor (YARN-4412), which runs as a service in the RM. The ClusterMonitor gathers this information from all nodes, periodically computes the least loaded nodes (i.e., those with the smallest queue wait times), and adds that list to the heartbeat response, so that all nodes (and in turn LocalRMs) receive it. The LocalRM then uses this list during scheduling. Note that simpler techniques (such as the power of two choices used in Sparrow) could be employed, but our experiments have shown that the above "top-k node list" leads to considerably better placement (and thus load balancing), especially when task durations are heterogeneous.

Regarding #2: This is a valid concern. The best way to minimize preemption is through the "top-k node list" technique described above: as the LocalRM places QUEUEABLE containers on the least loaded nodes, preemption is minimized.
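To make the "top-k node list" idea concrete, here is a minimal sketch of the two pieces described above: the per-node queue wait estimate (YARN-2886) and the ClusterMonitor-style top-k selection (YARN-4412). The class and method names (TopKNodeSelector, estimatedWaitMs, leastLoadedNodes) are illustrative only, not the actual YARN interfaces:

```java
import java.util.*;

// Illustrative sketch; not the actual YARN-2886/YARN-4412 classes.
public class TopKNodeSelector {

    // A node's expected queue wait time, estimated here as the sum of the
    // estimated execution times of the tasks already queued on it.
    public static long estimatedWaitMs(List<Long> queuedTaskDurationsMs) {
        long total = 0;
        for (long d : queuedTaskDurationsMs) {
            total += d;
        }
        return total;
    }

    // The ClusterMonitor periodically picks the k nodes with the smallest
    // expected queue wait times; this list is piggybacked on heartbeat
    // responses so every LocalRM can place QUEUEABLE containers on
    // lightly loaded nodes.
    public static List<String> leastLoadedNodes(Map<String, Long> waitTimesMs, int k) {
        List<String> nodes = new ArrayList<>(waitTimesMs.keySet());
        nodes.sort(Comparator.comparingLong(waitTimesMs::get));
        return new ArrayList<>(nodes.subList(0, Math.min(k, nodes.size())));
    }

    public static void main(String[] args) {
        Map<String, Long> waits = new HashMap<>();
        waits.put("nm1", 120L);
        waits.put("nm2", 40L);
        waits.put("nm3", 300L);
        waits.put("nm4", 15L);
        System.out.println(leastLoadedNodes(waits, 2)); // prints [nm4, nm2]
    }
}
```

Note that unlike the power-of-two-choices sampling, this keeps a global (if slightly stale) view of load, which is what yields the better placement under heterogeneous task durations.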
More techniques can be used to further mitigate the problem. For instance, we can "promote" a QUEUEABLE container to a GUARANTEED one once it has been preempted more than k times. Moreover, we can dynamically limit the number of QUEUEABLE containers a node accepts in case of excessive load due to GUARANTEED containers. That said, as you also mention, QUEUEABLE containers are better suited to short-running tasks, where the probability of a container being preempted is smaller.

> Extend YARN to support distributed scheduling
> ---------------------------------------------
>
> Key: YARN-2877
> URL: https://issues.apache.org/jira/browse/YARN-2877
> Project: Hadoop YARN
> Issue Type: New Feature
> Components: nodemanager, resourcemanager
> Reporter: Sriram Rao
> Assignee: Konstantinos Karanasos
> Attachments: distributed-scheduling-design-doc_v1.pdf
>
>
> This is an umbrella JIRA that proposes to extend YARN to support distributed
> scheduling. Briefly, some of the motivations for distributed scheduling are
> the following:
> 1. Improve cluster utilization by opportunistically executing tasks on
> otherwise idle resources on individual machines.
> 2. Reduce allocation latency for tasks where the scheduling time dominates
> (i.e., task execution time is much less than the time required for
> obtaining a container from the RM).
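The promotion mitigation mentioned in the comment above (re-requesting a QUEUEABLE container as GUARANTEED after repeated preemptions) could be sketched as follows; QueueablePromotionPolicy and onPreempted are hypothetical names, not actual YARN classes:

```java
import java.util.*;

// Illustrative sketch only; not an actual YARN class.
public class QueueablePromotionPolicy {
    private final int maxPreemptions;
    private final Map<String, Integer> preemptionCounts = new HashMap<>();

    public QueueablePromotionPolicy(int maxPreemptions) {
        this.maxPreemptions = maxPreemptions;
    }

    // Record one preemption of the given container. Returns true once the
    // container has been preempted maxPreemptions times, signalling that it
    // should be re-requested as GUARANTEED rather than QUEUEABLE.
    public boolean onPreempted(String containerId) {
        int count = preemptionCounts.merge(containerId, 1, Integer::sum);
        return count >= maxPreemptions;
    }
}
```

A per-container counter like this keeps the policy local and cheap; the threshold k trades off preemption churn against the risk of QUEUEABLE containers starving under sustained GUARANTEED load.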