[ 
https://issues.apache.org/jira/browse/YARN-5864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15818835#comment-15818835
 ] 

Sunil G commented on YARN-5864:
-------------------------------

Thanks [~leftnoteasy] for the detailed proposal and the patch.

I think this will really help eliminate many corner cases that are present in 
the scheduler today. The overall approach looks fine.

*A few questions on the document as well as the code:*

+PriorityUtilizationQueueOrderingPolicy+
1.
bq.service queue has 66.7% configured resource (200G), each container needs 90G 
memory; Batch queue has 33.3% configured resource (100G), each container needs 
20G memory.
One question here: if the *service* queue has used+reserved more than 66.7%, 
I think we will not be considering the higher-priority queue here, right?

2. For the normal *utilization* policy we also use 
{{PriorityUtilizationQueueOrderingPolicy}}, in {{respectPriority=false}} mode. 
Maybe we can pick a better name, since we handle both priority and utilization 
ordering in the same policy implementation. Or we could extract an 
{{AbstractUtilizationQueueOrderingPolicy}} that supports normal resource 
utilization, and let an extended priority policy do the priority handling.

3. Does {{PriorityUtilizationQueueOrderingPolicy#getAssignmentIterator}} need a 
readLock for *queues*?
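
To make the suggestion in item 2 concrete, here is a minimal sketch of one ordering with a {{respectPriority}} switch. It is not the code from the patch: {{QueueInfo}}, {{utilizationOrder}}, {{priorityThenUtilizationOrder}} and {{order}} are hypothetical names, and it assumes that a larger number means a higher priority. The real policy may apply further conditions, for example around queues that exceed their configured capacity (see item 1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class QueueOrderingSketch {
    // Hypothetical stand-in for a child queue; not the scheduler's CSQueue.
    static final class QueueInfo {
        final String name;
        final int priority;                 // assumption: larger value = higher priority
        final double relativeUsedCapacity;  // used capacity / configured capacity

        QueueInfo(String name, int priority, double relativeUsedCapacity) {
            this.name = name;
            this.priority = priority;
            this.relativeUsedCapacity = relativeUsedCapacity;
        }
    }

    // Plain utilization ordering: least relatively-used queue first.
    static Comparator<QueueInfo> utilizationOrder() {
        return Comparator.comparingDouble((QueueInfo q) -> q.relativeUsedCapacity);
    }

    // Priority ordering layered on top: highest priority first,
    // ties broken by the utilization ordering.
    static Comparator<QueueInfo> priorityThenUtilizationOrder() {
        return Comparator.comparingInt((QueueInfo q) -> q.priority)
                .reversed()
                .thenComparing(utilizationOrder());
    }

    // Returns queue names in the order they would be offered resources.
    static List<String> order(List<QueueInfo> queues, boolean respectPriority) {
        List<QueueInfo> sorted = new ArrayList<>(queues);
        sorted.sort(respectPriority ? priorityThenUtilizationOrder() : utilizationOrder());
        List<String> names = new ArrayList<>();
        for (QueueInfo q : sorted) {
            names.add(q.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<QueueInfo> queues = Arrays.asList(
                new QueueInfo("service", 2, 0.5),
                new QueueInfo("batch", 1, 0.2));
        System.out.println(order(queues, false)); // [batch, service]
        System.out.println(order(queues, true));  // [service, batch]
    }
}
```

Splitting the two comparators this way is roughly what a base utilization policy plus a priority extension would look like; whether that split is worth it for the patch is a judgment call.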

+QueuePriorityContainerCandidateSelector+
4. Could we use Guava libs in Hadoop (ref: HashBasedTable)?
5. {{intializePriorityDigraph}}: since queue priority is set only at 
initialize or reinitialize time, I think we are recalculating and recreating 
{{PriorityDigraph}} every time. I don't think it is specifically a preemption 
entity; it is a scheduler entity as well. Could we create and cache it in CS so 
that such recomputation can be avoided?
6. In {{intializePriorityDigraph}}, {{preemptionContext.getLeafQueueNames()}} 
returns the queue names in arbitrary order. For better performance, I think we 
need these names in BFS order, traversing from one side to the other. Would 
that help?
7. An early exit condition can be added at the beginning of 
{{selectCandidates}} for the cases where queue priorities are not configured 
or the digraph does not contain any queues in which some containers are 
reserved.
8. 
bq.Collections.sort(reservedContainers, CONTAINER_CREATION_TIME_COMPARATOR);
Why are we sorting by container creation time? Should we first take the 
reserved container from the highest-priority queue?
9. In {{selectCandidates}} 
{noformat}
if (currentTime - reservedContainer.getCreationTime() < minTimeout) {
  break;
}
{noformat}
I think we need to {{continue}} here, right?
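
Whether {{break}} or {{continue}} is right depends on the order of the list being scanned. The sketch below uses plain creation timestamps instead of real containers, and {{selectWithBreak}} / {{selectWithContinue}} are hypothetical helpers, not methods from the patch. It shows that the two agree when the list is sorted oldest-first and differ when it is not. Whether the sort in item 8 really is oldest-first is an assumption that would need to be checked against the comparator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TimeoutFilterSketch {
    // Stops at the first entry that has not yet waited minTimeout.
    static List<Long> selectWithBreak(List<Long> creationTimes, long now, long minTimeout) {
        List<Long> selected = new ArrayList<>();
        for (long created : creationTimes) {
            if (now - created < minTimeout) {
                break;
            }
            selected.add(created);
        }
        return selected;
    }

    // Skips entries that have not yet waited minTimeout, but keeps scanning.
    static List<Long> selectWithContinue(List<Long> creationTimes, long now, long minTimeout) {
        List<Long> selected = new ArrayList<>();
        for (long created : creationTimes) {
            if (now - created < minTimeout) {
                continue;
            }
            selected.add(created);
        }
        return selected;
    }

    public static void main(String[] args) {
        long now = 1000L;
        long minTimeout = 500L;

        // Sorted oldest-first: both variants select the same entries.
        List<Long> sorted = Arrays.asList(100L, 200L, 900L);
        System.out.println(selectWithBreak(sorted, now, minTimeout));      // [100, 200]
        System.out.println(selectWithContinue(sorted, now, minTimeout));   // [100, 200]

        // Unsorted: break misses the entry created at 200.
        List<Long> unsorted = Arrays.asList(100L, 900L, 200L);
        System.out.println(selectWithBreak(unsorted, now, minTimeout));    // [100]
        System.out.println(selectWithContinue(unsorted, now, minTimeout)); // [100, 200]
    }
}
```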

10. In {{selectCandidates}}, all TempQueuePerPartition instances are still 
taken from the context. I think the IntraQueue preemption selector makes some 
changes to TempQueue. I will confirm soon. If so, we might need to take 
another look there.

11. In {{selectCandidates}}, while looping over 
{{newlySelectedToBePreemptContainers}}, it is possible that a container is 
already present in {{selectedCandidates}}. Currently we still deduct from 
{{totalPreemptedResourceAllowed}} in such cases as well, which does not look 
correct.
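
A minimal sketch of the behaviour suggested in item 11: deduct from the budget only for containers that were not already selected. It uses container id strings and {{long}} sizes instead of the real container and {{Resource}} types, and {{deductNewlySelected}} is a hypothetical helper, not a method in the patch.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class PreemptionBudgetSketch {
    // Adds newly selected containers to alreadySelected and returns the
    // remaining budget. A container that an earlier selector already picked
    // is skipped, so its size is not deducted a second time.
    static long deductNewlySelected(Set<String> alreadySelected,
                                    Map<String, Long> newlySelected,
                                    long totalAllowed) {
        long remaining = totalAllowed;
        for (Map.Entry<String, Long> entry : newlySelected.entrySet()) {
            // Set.add returns false when the id was already present.
            if (!alreadySelected.add(entry.getKey())) {
                continue;
            }
            remaining -= entry.getValue();
        }
        return remaining;
    }

    public static void main(String[] args) {
        Set<String> alreadySelected = new HashSet<>();
        alreadySelected.add("container_1");

        Map<String, Long> newlySelected = new LinkedHashMap<>();
        newlySelected.put("container_1", 10L); // duplicate, must not be deducted
        newlySelected.put("container_2", 20L);

        System.out.println(deductNewlySelected(alreadySelected, newlySelected, 100L)); // 80
    }
}
```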

12. {{tryToMakeBetterReservationPlacement}} is a very big loop over all of 
{{allSchedulerNodes}}. That does not look optimal.

I think I will do one more pass once some of these are clarified.

> YARN Capacity Scheduler - Queue Priorities
> ------------------------------------------
>
>                 Key: YARN-5864
>                 URL: https://issues.apache.org/jira/browse/YARN-5864
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>         Attachments: YARN-5864.001.patch, YARN-5864.002.patch, 
> YARN-5864.003.patch, YARN-5864.poc-0.patch, 
> YARN-CapacityScheduler-Queue-Priorities-design-v1.pdf
>
>
> Currently, Capacity Scheduler at every parent-queue level uses relative 
> used-capacities of the child-queues to decide which queue can get next 
> available resource first.
> For example,
> - Q1 & Q2 are child queues under queueA
> - Q1 has 20% of configured capacity, 5% of used-capacity and
> - Q2 has 80% of configured capacity, 8% of used-capacity.
> In this situation, the relative used-capacities are calculated as below
> - Relative used-capacity of Q1 is 5/20 = 0.25
> - Relative used-capacity of Q2 is 8/80 = 0.10
> In the above example, per today’s Capacity Scheduler’s algorithm, Q2 is 
> selected by the scheduler first to receive next available resource.
> Simply ordering queues according to relative used-capacities sometimes causes 
> a few troubles because scarce resources could be assigned to less-important 
> apps first.
> # Latency sensitivity: This can be a problem with latency sensitive 
> applications where waiting till the ‘other’ queue gets full is not going to 
> cut it. The delay in scheduling directly reflects in the response times of 
> these applications.
> # Resource fragmentation for large-container apps: Today’s algorithm also 
> causes issues with applications that need very large containers. It is 
> possible that existing queues are all within their resource guarantees but 
> their current allocation distribution on each node may be such that an 
> application which needs large container simply cannot fit on those nodes.
> Services:
> # The above problem (2) gets worse with long running applications. With short 
> running apps, previous containers may eventually finish and make enough space 
> for the apps with large containers. But with long running services in the 
> cluster, the large containers’ application may never get resources on any 
> nodes even if its demands are not yet met.
> # Long running services are sometimes more picky w.r.t placement than normal 
> batch apps. For example, for a long running service in a separate queue (say 
> queue=service), during peak hours it may want to launch instances on 50% of 
> the cluster nodes. On each node, it may want to launch a large container, say 
> 200G memory per container.
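
The relative used-capacity ordering described in the quoted issue text can be sketched as follows. This is only an illustration of the arithmetic in the Q1/Q2 example, not scheduler code: {{QueueInfo}} and {{orderByRelativeUsedCapacity}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class RelativeUsedCapacityExample {
    // Hypothetical stand-in for a child queue; capacities are percentages.
    static final class QueueInfo {
        final String name;
        final double configuredCapacityPct;
        final double usedCapacityPct;

        QueueInfo(String name, double configuredCapacityPct, double usedCapacityPct) {
            this.name = name;
            this.configuredCapacityPct = configuredCapacityPct;
            this.usedCapacityPct = usedCapacityPct;
        }

        // Relative used-capacity = used / configured, e.g. 5 / 20 = 0.25.
        double relativeUsedCapacity() {
            return usedCapacityPct / configuredCapacityPct;
        }
    }

    // Queue with the lowest relative used-capacity comes first.
    static List<String> orderByRelativeUsedCapacity(List<QueueInfo> queues) {
        List<QueueInfo> sorted = new ArrayList<>(queues);
        sorted.sort(Comparator.comparingDouble(QueueInfo::relativeUsedCapacity));
        List<String> names = new ArrayList<>();
        for (QueueInfo q : sorted) {
            names.add(q.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<QueueInfo> queues = Arrays.asList(
                new QueueInfo("Q1", 20, 5),   // 5 / 20 = 0.25
                new QueueInfo("Q2", 80, 8));  // 8 / 80 = 0.10
        System.out.println(orderByRelativeUsedCapacity(queues)); // [Q2, Q1]
    }
}
```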


