[ 
https://issues.apache.org/jira/browse/YARN-4390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15252803#comment-15252803
 ] 

Eric Payne commented on YARN-4390:
----------------------------------

Thanks, [~leftnoteasy], for the detailed work on this issue. Overall, I think 
the approach looks good. One thing I still wonder about: even if the preemption 
policy kills a perfectly sized container, will the scheduler know that it needs 
to assign those freed resources to the same app that requested them? I think 
this JIRA gets us closer to that goal, but it still seems possible for the 
freed resources to go someplace else. Is that right? Even if that's the case, I 
still like this approach.

Here are a few comments about the patch.
* I would rename "select_candidates_for_reserved_containers" to 
"select_based_on_reserved_containers"
* {{CapacitySchedulerPreemptionUtils#deductPreemptableResourcesBasedSelectedCandidates}}:
** {{res}} is never used. Was it intended to be passed to 
{{tq.deductActuallyToBePreempted}}, or is it needed only to test if 
{{c.getReservedResource()}} and {{c.getAllocatedResource()}} are not null?
* Just a very minor thing in {{FifoCandidatesSelector#selectCandidates}}:
** {{... could already select containers ...}} could be changed to 
{{... could have already selected containers ...}}, for clarity.
* {{ReservedContainerCandidateSelector#getPreemptionCandidatesOnNode}}:
{code}
    Map<ContainerId, RMContainer> killableContainers =
        node.getKillableContainers();
{code}
** Even though {{killableContainers}} is an {{unmodifiableMap}}, I think it can 
still change underneath, can't it? (See the small JDK example after this list.) 
It looks like {{killableContainers}} is only used once in this method. Would it 
make more sense to wait until {{killableContainers}} is actually needed before 
calling {{node.getKillableContainers()}}?
* {{ReservedContainerCandidateSelector#getNodesForPreemption}}:
** I am a little concerned about calling 
{{preemptionContext.getScheduler().getAllNodes()}} to get the list of all of 
the nodes on every iteration of the preemption monitor. I can't think of a 
better way to handle this, but it could be expensive on the RM, since 
{{getAllNodes()}} will lock the {{ClusterNodeTracker}} while it creates the 
list of nodes, and anything trying to use the {{ClusterNodeTracker}}'s 
resources will have to wait. It may not be a problem, but I know that we 
sometimes see the RM getting bogged down, and I am concerned about adding 
another long wait every 15 seconds (the default), or however long the 
preemption monitor interval is configured for. The nodes list doesn't change 
all that often. I wonder if it would make sense to cache it and only update it 
periodically (every {{n-th}} iteration?); see the caching sketch after this 
list.
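
On the {{unmodifiableMap}} question above: my understanding is that 
{{Collections.unmodifiableMap}} only wraps a read-only view around the live 
backing map, so entries can still appear or disappear underneath it. A tiny 
self-contained example of that behavior (plain JDK, nothing YARN-specific, 
names made up for illustration):
{code}
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class UnmodifiableViewDemo {
  public static void main(String[] args) {
    Map<String, String> backing = new HashMap<>();
    backing.put("container_1", "killable");

    // Read-only view, but still backed by the live map.
    Map<String, String> view = Collections.unmodifiableMap(backing);
    System.out.println(view.size()); // 1

    // A change to the backing map shows through the view.
    backing.put("container_2", "killable");
    System.out.println(view.size()); // 2
  }
}
{code}
That is why I'd rather take the snapshot right where it is consumed.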
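
To make the caching idea concrete, here is a minimal sketch of the 
every-{{n-th}}-iteration refresh I have in mind. {{CachedNodeList}} and the 
{{Supplier}} hook are made-up names for illustration, not anything in the 
patch; the real call it would wrap is 
{{preemptionContext.getScheduler().getAllNodes()}}:
{code}
import java.util.List;
import java.util.function.Supplier;

// Made-up helper, only to sketch the "refresh every N monitor iterations" idea.
class CachedNodeList<N> {
  private final int refreshEvery;        // monitor iterations between refreshes
  private int iterationsSinceRefresh;
  private List<N> cachedNodes;           // last snapshot of the node list

  CachedNodeList(int refreshEvery) {
    this.refreshEvery = refreshEvery;
  }

  // fetchAllNodes wraps the expensive call, e.g.
  // () -> preemptionContext.getScheduler().getAllNodes()
  List<N> get(Supplier<List<N>> fetchAllNodes) {
    if (cachedNodes == null || iterationsSinceRefresh >= refreshEvery) {
      // Only here do we pay for getAllNodes(), which locks the
      // ClusterNodeTracker while it builds the list.
      cachedNodes = fetchAllNodes.get();
      iterationsSinceRefresh = 0;
    }
    iterationsSinceRefresh++;
    return cachedNodes;
  }
}
{code}
The snapshot could be a little stale between refreshes, but since the node list 
doesn't change all that often, that seems acceptable for picking preemption 
candidates.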


> Consider container request size during CS preemption
> ----------------------------------------------------
>
>                 Key: YARN-4390
>                 URL: https://issues.apache.org/jira/browse/YARN-4390
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: capacity scheduler
>    Affects Versions: 3.0.0, 2.8.0, 2.7.3
>            Reporter: Eric Payne
>            Assignee: Wangda Tan
>         Attachments: YARN-4390-design.1.pdf, YARN-4390-test-results.pdf, 
> YARN-4390.1.patch, YARN-4390.2.patch, YARN-4390.3.branch-2.patch, 
> YARN-4390.3.patch
>
>
> There are multiple reasons why preemption could unnecessarily preempt 
> containers. One is that an app could be requesting a large container (say 
> 8-GB), and the preemption monitor could conceivably preempt multiple 
> containers (say 8, 1-GB containers) in order to fill the large container 
> request. These smaller containers would then be rejected by the requesting AM 
> and potentially given right back to the preempted app.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
