[ https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818835#comment-15818835 ]
Sunil G commented on YARN-5864:
-------------------------------

Thanks [~leftnoteasy] for the detailed proposal and the patch. I think this will really help cut many of the corner cases present in the scheduler today. Overall the approach looks fine.

*A few doubts in the document as well as the code:*

+PriorityUtilizationQueueOrderingPolicy+

1. bq. service queue has 66.7% configured resource (200G), each container needs 90G memory; Batch queue has 33.3% configured resource (100G), each container needs 20G memory.
One doubt here: if the *service* queue has used+reserved more than 66.7%, I think we will not be considering it the higher-priority queue here, right?

2. For the normal *utilization* policy we also use {{PriorityUtilizationQueueOrderingPolicy}}, in {{respectPriority=false}} mode. Maybe we can pick a better name, since we handle both priority and utilization ordering in the same policy implementation. Alternatively, we could pull out an {{AbstractUtilizationQueueOrderingPolicy}} that supports normal resource-utilization ordering, and an extended priority policy could do the priority handling.

3. Does {{PriorityUtilizationQueueOrderingPolicy#getAssignmentIterator}} need a readLock for *queues*?

+QueuePriorityContainerCandidateSelector+

4. Could we use Guava libs in Hadoop (ref: HashBasedTable)?

5. {{intializePriorityDigraph}}: since queue priority is set only at initialize or reinitialize time, I think we are recalculating and recreating the {{PriorityDigraph}} every time. It is not specifically a preemption entity; it is a scheduler entity as well. Could we create and cache it in CS so that this recomputation can be avoided?

6. {{intializePriorityDigraph}}: {{preemptionContext.getLeafQueueNames()}} returns the queue names in random order. For better performance, I think we need these names in a BFS order, walking from one side to the other. Would that help?

7. An exit condition could be added at the beginning of {{selectCandidates}} for the cases where queue priorities are not configured, or the digraph has no queues in which containers are reserved.

8.
bq. Collections.sort(reservedContainers, CONTAINER_CREATION_TIME_COMPARATOR);
Why are we sorting by container creation time? Do we first need the reserved container from the highest-priority queue?

9. In {{selectCandidates}}:
{noformat}
431 if (currentTime - reservedContainer.getCreationTime() < minTimeout) {
432   break;
433 }
{noformat}
I think we need to {{continue}} here instead, right?

10. In {{selectCandidates}}, all the TempQueuePerPartition instances are still taken from the context. I think the IntraQueue preemption selector makes some changes to TempQueue; I will confirm soon. If so, we might need another look there.

11. In {{selectCandidates}}, while looping over {{newlySelectedToBePreemptContainers}}, it is possible that a container is already present in {{selectedCandidates}}. Currently we still deduct from {{totalPreemptedResourceAllowed}} in such cases as well, which does not look correct.

12. {{tryToMakeBetterReservationPlacement}} is a very big loop over all {{allSchedulerNodes}}, which does not look very optimal.

I will give one more pass once some of these points are clarified.

> YARN Capacity Scheduler - Queue Priorities
> ------------------------------------------
>
>                 Key: YARN-5864
>                 URL: https://issues.apache.org/jira/browse/YARN-5864
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-5864.001.patch, YARN-5864.002.patch, YARN-5864.003.patch, YARN-5864.poc-0.patch, YARN-CapacityScheduler-Queue-Priorities-design-v1.pdf
>
>
> Currently, Capacity Scheduler at every parent-queue level uses relative
> used-capacities of the child-queues to decide which queue can get the next
> available resource first.
> For example,
> - Q1 & Q2 are child queues under queueA
> - Q1 has 20% of configured capacity and 5% of used-capacity, and
> - Q2 has 80% of configured capacity and 8% of used-capacity.
> In this situation, the relative used-capacities are calculated as below:
> - Relative used-capacity of Q1 is 5/20 = 0.25
> - Relative used-capacity of Q2 is 8/80 = 0.10
> In the above example, per today's Capacity Scheduler algorithm, Q2 is
> selected by the scheduler first to receive the next available resource.
> Simply ordering queues according to relative used-capacities sometimes causes
> trouble, because scarce resources could be assigned to less-important apps first.
> # Latency sensitivity: This can be a problem for latency-sensitive
> applications, where waiting until the 'other' queue gets full is not going to
> cut it. The delay in scheduling directly reflects in the response times of
> these applications.
> # Resource fragmentation for large-container apps: Today's algorithm also
> causes issues for applications that need very large containers. It is
> possible that existing queues are all within their resource guarantees, but
> their current allocation distribution on each node may be such that an
> application which needs a large container simply cannot fit on those nodes.
> Services:
> # The above problem (2) gets worse with long-running applications. With
> short-running apps, previous containers may eventually finish and make enough
> space for the apps with large containers. But with long-running services in
> the cluster, the large-container application may never get resources on any
> node, even if its demands are not yet met.
> # Long-running services are sometimes more picky w.r.t. placement than normal
> batch apps. For example, a long-running service in a separate queue (say
> queue=service) may want, during peak hours, to launch instances on 50% of the
> cluster nodes, with a large container on each node, say 200G memory per
> container.
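To make the description's arithmetic concrete, here is a minimal, self-contained Java sketch of ordering child queues by relative used-capacity. The class and field names here are hypothetical stand-ins, not the actual CapacityScheduler code:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a child queue; real CS queues carry far more state.
class QueueInfo {
    final String name;
    final double configuredCapacity; // fraction of parent, e.g. 0.20 for 20%
    final double usedCapacity;       // fraction of parent in use, e.g. 0.05 for 5%

    QueueInfo(String name, double configuredCapacity, double usedCapacity) {
        this.name = name;
        this.configuredCapacity = configuredCapacity;
        this.usedCapacity = usedCapacity;
    }

    // Relative used-capacity = used / configured, as in the example above.
    double relativeUsedCapacity() {
        return usedCapacity / configuredCapacity;
    }
}

public class RelativeUsageOrdering {
    public static void main(String[] args) {
        List<QueueInfo> queues = Arrays.asList(
            new QueueInfo("Q1", 0.20, 0.05),  // 5/20 = 0.25
            new QueueInfo("Q2", 0.80, 0.08)); // 8/80 = 0.10

        // The least-relatively-used queue is offered the next resource first.
        QueueInfo next = queues.stream()
            .min(Comparator.comparingDouble(QueueInfo::relativeUsedCapacity))
            .get();

        System.out.println(next.name); // prints "Q2"
    }
}
```

This reproduces the example's outcome: Q2's relative usage (0.10) is lower than Q1's (0.25), so Q2 is picked, regardless of any priority between the queues.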
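Returning to doubt 9 above: whether {{break}} or {{continue}} is correct depends on whether {{reservedContainers}} is guaranteed to be sorted oldest-first. A small self-contained sketch (hypothetical data, not the patch's code) showing how the two behave when an older container appears after a younger one:

```java
import java.util.Arrays;
import java.util.List;

public class TimeoutScan {
    // Counts reserved-container ages that pass the timeout check when we
    // 'continue' past too-young entries, so later entries are still examined.
    static int countEligibleWithContinue(List<Long> agesMillis, long minTimeout) {
        int eligible = 0;
        for (long age : agesMillis) {
            if (age < minTimeout) {
                continue; // skip this container, keep scanning the rest
            }
            eligible++;
        }
        return eligible;
    }

    // Same check with 'break': scanning stops at the first too-young entry.
    static int countEligibleWithBreak(List<Long> agesMillis, long minTimeout) {
        int eligible = 0;
        for (long age : agesMillis) {
            if (age < minTimeout) {
                break; // everything after this point is ignored
            }
            eligible++;
        }
        return eligible;
    }

    public static void main(String[] args) {
        // Not sorted oldest-first: an old container follows a young one.
        List<Long> ages = Arrays.asList(5000L, 500L, 8000L);
        long minTimeout = 1000L;
        System.out.println(countEligibleWithBreak(ages, minTimeout));    // prints 1
        System.out.println(countEligibleWithContinue(ages, minTimeout)); // prints 2
    }
}
```

If the list is sorted oldest-first (which the {{CONTAINER_CREATION_TIME_COMPARATOR}} sort in doubt 8 may be intended to guarantee), the two are equivalent and {{break}} is just an early exit; otherwise {{break}} silently drops eligible containers.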
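On doubt 11, a hedged sketch of the fix I have in mind: deduct from the preemption budget only for containers not already selected. The types are deliberately simplified (string IDs and memory in GB; the real selector works with RMContainer and Resource objects):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class PreemptionBudget {

    /**
     * Deducts each newly selected container's memory from the remaining
     * preemption budget, but only when the container was not already in
     * selectedCandidates; otherwise its cost would be counted twice.
     * Returns the remaining budget.
     */
    static long deductWithoutDoubleCounting(Set<String> selectedCandidates,
                                            Set<String> newlySelected,
                                            Map<String, Long> containerMemGB,
                                            long remainingBudgetGB) {
        for (String containerId : newlySelected) {
            // Set.add() returns false when the element is already present;
            // in that case the container was selected in an earlier round
            // and must not be charged against the budget a second time.
            if (selectedCandidates.add(containerId)) {
                remainingBudgetGB -= containerMemGB.get(containerId);
            }
        }
        return remainingBudgetGB;
    }

    public static void main(String[] args) {
        Map<String, Long> mem = new HashMap<>();
        mem.put("c1", 20L); // already selected in an earlier round
        mem.put("c2", 30L); // genuinely new selection

        Set<String> selected = new LinkedHashSet<>(Arrays.asList("c1"));
        Set<String> newly = new LinkedHashSet<>(Arrays.asList("c1", "c2"));

        long remaining = deductWithoutDoubleCounting(selected, newly, mem, 100L);
        System.out.println(remaining); // prints 70: only c2 is deducted
    }
}
```

Without the membership check, c1's 20G would be deducted again and the budget would read 50 instead of 70, under-preempting in later rounds.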
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)