[ https://issues.apache.org/jira/browse/YARN-569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13682571#comment-13682571 ]
Chris Douglas commented on YARN-569:
------------------------------------

Thanks for the feedback; we revised the patch. We comment below on the questions that required explanation; all the small ones are addressed directly in the code, following your suggestions.

bq. This doesn't seem to affect the fair scheduler, or does it? If not, then it can be misleading for users.
bq. How do we envisage multiple policies working together without stepping on each other? Better off limiting to 1?

The intent was for orthogonal policies to interact with the scheduler or, if conflicting, be coordinated by a composite policy. Though you're right, the naming toward preemption is confusing; the patch renames the properties to refer to monitors only. Because the only example is the {{ProportionalCapacityPreemptionPolicy}}, {{null}} seemed like the correct default. As for limiting to one monitor or not: we are experimenting with other policies that focus on different aspects of the schedule (e.g., deadlines and automatic tuning of queue capacity), and they seem able to play nicely with policies such as the {{ProportionalCapacityPreemptionPolicy}}, so we would prefer the mechanism to remain capable of loading multiple monitors.
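For concreteness, a minimal sketch of what wiring this up could look like, assuming the renamed properties land along the lines of {{yarn.resourcemanager.scheduler.monitor.enable}} and {{yarn.resourcemanager.scheduler.monitor.policies}}; the property names and the policy's package path are illustrative and may differ from what the patch finally commits:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class MonitorConfigSketch {
  public static Configuration enablePreemptionMonitor() {
    Configuration conf = new YarnConfiguration();
    // Turn the optional monitor mechanism on in the RM.
    // NOTE: property names are assumed, not confirmed by the committed patch.
    conf.setBoolean("yarn.resourcemanager.scheduler.monitor.enable", true);
    // Comma-separated list of monitor policies; more than one can be loaded,
    // which is why we would like the mechanism to stay plural.
    conf.set("yarn.resourcemanager.scheduler.monitor.policies",
        "org.apache.hadoop.yarn.server.resourcemanager"
            + ".monitor.capacity.ProportionalCapacityPreemptionPolicy");
    return conf;
  }
}
{code}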
bq. Not joining the thread to make sure it's cleaned up?

The contract for shutting down a monitor is not baked into the API yet. While the proportional policy runs quickly, it's not obvious whether other policies would be both long-running and responsive to interrupts. By way of illustration, other monitors we've experimented with call into third-party code for CPU-intensive calculations. Since YARN-117 went in a few hours ago, that might be a chance to define this more crisply. Thoughts?

bq. Why no lock here when the other new methods have a lock? Do we not care that the app remains in applications during the duration of the operations?

The semantics of the {{@Lock}} annotation were not entirely clear from the examples in the code, so the inconsistency may be our application of it. Since we were probably making the situation worse, we omitted the annotations in the updated patch. To answer your question: we don't care, because the selected container has already exited (part of the natural termination factor in the policy).

bq. There is one critical difference between old and new behavior. The new code will not send the finish event to the container if it's not part of the liveContainers. This probably is wrong.
bq. FicaSchedulerNode.unreserveResource(). Checks have been added for the reserved container, but will the code reach that point if there was no reservation actually left on that node? In the same vein, can it happen that the node has a new reservation that was made out of band of the preemption logic cycle? Hence, the reserved container on the node would exist but could be from a different application.

Good catch; these are related. The change to a boolean return was necessary because we're calling the {{unreserve}} logic from a new context. Since only one application can hold a reservation on a node, and because we're freeing it through that application, we won't accidentally free another application's reservation. However, calling {{unreserve}} on a reservation that has since converted into a container will fail, so we need to know whether the state actually changed before updating the metric.

bq. Couldn't quite grok this. What is delta? What is 0.5? A percentage? What's the math behind the calculation? Should it be "even absent preemption" instead of "even absent natural termination"? Is this applied before or after TOTAL_PREEMPTION_PER_ROUND?

The delta is the difference between the computed ideal capacity and the actual capacity. A value of 0.5 would preempt only 50% of the containers the policy thinks should be preempted, as the rest are expected to exit "naturally". The comment is saying that, even without any containers exiting on their own, the policy will geometrically push capacity into the deadzone: at 50% per round, after 5 rounds the remaining imbalance (0.5^5, roughly 3%) falls within a 5% deadzone of the ideal capacity. It's applied before the total preemption per round; the latter proportionally affects all preemption targets. Because some containers will complete while the policy runs, it may make sense to tune it aggressively (or drive it with observed completion rates), but we'll want to get some experience running with this first.
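To make the geometric argument concrete, a throwaway sketch (class and method names invented for illustration, not code from the patch) that just iterates the per-round scaling:

{code:java}
public final class NaturalTerminationSketch {
  /**
   * How many rounds until the remaining imbalance falls inside the deadzone,
   * when only preemptedFraction of the delta is preempted each round and no
   * container exits naturally.
   */
  public static int roundsToDeadzone(double preemptedFraction, double deadzone) {
    double remaining = 1.0; // start with the full delta (100%)
    int rounds = 0;
    while (remaining > deadzone) {
      remaining *= (1.0 - preemptedFraction); // the unpreempted part carries over
      rounds++;
    }
    return rounds;
  }

  public static void main(String[] args) {
    // fraction = 0.5, deadzone = 5%: prints 5, since 0.5^5 ~= 3.1% < 5%
    System.out.println(roundsToDeadzone(0.5, 0.05));
  }
}
{code}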
> CapacityScheduler: support for preemption (using a capacity monitor)
> ---------------------------------------------------------------------
>
>                 Key: YARN-569
>                 URL: https://issues.apache.org/jira/browse/YARN-569
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacityscheduler
>            Reporter: Carlo Curino
>            Assignee: Carlo Curino
>         Attachments: 3queues.pdf, CapScheduler_with_preemption.pdf, preemption.2.patch, YARN-569.1.patch, YARN-569.2.patch, YARN-569.3.patch, YARN-569.4.patch, YARN-569.5.patch, YARN-569.6.patch, YARN-569.patch, YARN-569.patch
>
>
> There is a tension between the fast-paced, reactive role of the CapacityScheduler, which needs to respond quickly to application resource requests and node updates, and the more introspective, time-based considerations needed to observe and correct for capacity balance. For this purpose, instead of hacking the delicate mechanisms of the CapacityScheduler directly, we opted to add support for preemption by means of a "Capacity Monitor", which can optionally be run as a separate service (much like the NMLivelinessMonitor).
> The capacity monitor (similar to equivalent functionality in the fair scheduler) runs at intervals (e.g., every 3 seconds), observes the state of the assignment of resources to queues from the capacity scheduler, performs off-line computation to determine whether preemption is needed and how best to "edit" the current schedule to improve capacity, and generates events that produce four possible actions:
> # Container de-reservations
> # Resource-based preemptions
> # Container-based preemptions
> # Container killing
> The actions listed above are progressively more costly, and it is up to the policy to use them as desired to achieve the rebalancing goals.
> Note that, due to the "lag" in the effect of these actions, the policy should operate at the macroscopic level (e.g., preempt tens of containers from a queue) and not try to tightly and consistently micromanage container allocations.
> ------------- Preemption policy (ProportionalCapacityPreemptionPolicy): -------------
> Preemption policies are by design pluggable; in the following we present an initial policy (ProportionalCapacityPreemptionPolicy) we have been experimenting with. The ProportionalCapacityPreemptionPolicy behaves as follows:
> # it gathers from the scheduler the state of the queues, in particular their current capacity, guaranteed capacity and pending requests (*)
> # if there are pending requests from queues that are under capacity, it computes a new ideal balanced state (**)
> # it computes the set of preemptions needed to repair the current schedule and achieve capacity balance (accounting for natural completion rates, and respecting bounds on the amount of preemption we allow for each round)
> # it selects which applications to preempt from each over-capacity queue (the last one in FIFO order)
> # it removes reservations from the most recently assigned app until the amount of resources to reclaim is obtained, or until no more reservations exist
> # (if not enough) it issues preemptions for containers from the same application (reverse chronological order, last assigned container first), again until the target is met or until no containers except the AM container are left
> # (if not enough) it moves on to unreserve and preempt from the next application
> # containers that have been marked for preemption are tracked across executions; if a container remains among those to be preempted for more than a certain time, it is moved to the list of containers to be forcibly killed
> Notes:
> (*) at the moment, in order to avoid double-counting of the requests, we only look at the "ANY" part of pending resource requests, which means we might not preempt on behalf of AMs that ask only for specific locations but not ANY.
> (**) The ideal balanced state is one in which each queue has at least its guaranteed capacity, and the spare capacity is distributed among the queues (that want some) as a weighted fair share, where the weighting is based on the guaranteed capacity of each queue and the computation runs to a fixed point.
> Tunables of the ProportionalCapacityPreemptionPolicy (a sketch of how they can combine follows this description):
> # observe-only mode (i.e., log the actions it would take, but behave as read-only)
> # how frequently to run the policy
> # how long to wait between preemption and kill of a container
> # which fraction of the containers I would like to obtain should I preempt (has to do with the natural rate at which containers are returned)
> # deadzone size, i.e., what % of over-capacity should I ignore (if we are off perfect balance by some small % we ignore it)
> # overall amount of preemption we can afford for each run of the policy (in terms of total cluster capacity)
> In our current experiments this set of tunables seems to be a good start to shape the preemption action properly. More sophisticated preemption policies could take into account the different types of applications running, job priorities, cost of preemption, or the integral of capacity imbalance. This is very much a control-theory kind of problem, and some of the lessons on designing and tuning controllers are likely to apply.
> Generality:
> The monitor-based scheduler edits and the preemption mechanisms we introduce here are designed to be more general than enforcing capacity/fairness; in fact, we are considering other monitors that leverage the same idea of "schedule edits" to target different global properties (e.g., allocating enough resources to guarantee deadlines for important jobs, data-locality optimizations, IO-balancing among nodes, etc.).
> Note that by default the preemption policy we describe is disabled in the patch.
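As a rough illustration of how the tunables above could combine (all names invented; this is not code from the patch): per-queue deltas inside the deadzone are ignored, the remainder is scaled by the natural-termination fraction, and the per-round cap then proportionally scales all targets, following the ordering described earlier in this comment.

{code:java}
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative sketch only: sizes one round of preemption across queues.
 * Quantities are fractions of total cluster capacity; the exact bookkeeping
 * in the policy differs, but the ordering of the tunables is the same.
 */
public final class PreemptionSizingSketch {
  public static Map<String, Double> targetsForRound(
      Map<String, Double> current, Map<String, Double> ideal,
      double clusterCapacity, double deadzone,
      double naturalTermination, double maxPerRound) {
    Map<String, Double> targets = new HashMap<>();
    double total = 0;
    for (Map.Entry<String, Double> e : current.entrySet()) {
      double delta = e.getValue() - ideal.getOrDefault(e.getKey(), 0.0);
      if (delta <= deadzone * clusterCapacity) {
        continue; // within the deadzone: ignore this queue
      }
      double t = delta * naturalTermination; // expect the rest to exit naturally
      targets.put(e.getKey(), t);
      total += t;
    }
    double cap = maxPerRound * clusterCapacity;
    if (total > cap) {
      // Scale all targets proportionally so the round stays within the cap.
      for (Map.Entry<String, Double> e : targets.entrySet()) {
        e.setValue(e.getValue() * cap / total);
      }
    }
    return targets;
  }
}
{code}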
> Depends on YARN-45 and YARN-567; is related to YARN-568.