[ https://issues.apache.org/jira/browse/YARN-9432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834446#comment-16834446 ]
Tao Yang commented on YARN-9432:
--------------------------------

Thanks [~cheersyang] for the suggestion!
Yes, there's a redundant loop which I can't understand either; perhaps I was dizzy at that time. :(
Attached the v4 patch to correct this logic.

> Reserved containers leak after its request has been cancelled or satisfied when multi-nodes enabled
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9432
>                 URL: https://issues.apache.org/jira/browse/YARN-9432
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9432.001.patch, YARN-9432.002.patch, YARN-9432.003.patch, YARN-9432.004.patch
>
>
> A reserved container may become excess after its request has been cancelled or satisfied. Excess reserved containers need to be unreserved quickly to release resources for others.
> When multi-nodes is disabled, excess reserved containers are released quickly on the next node heartbeat; the call stack is CapacityScheduler#nodeUpdate --> CapacityScheduler#allocateContainersToNode --> CapacityScheduler#allocateContainerOnSingleNode.
> When multi-nodes is enabled, however, excess reserved containers can only be released during the allocation process; the key part of the call stack is LeafQueue#assignContainers --> LeafQueue#allocateFromReservedContainer.
> As a result, excess reserved containers may not be released until their queue has a pending request and gets a chance to allocate. In the worst case, they are never released and keep holding resources if the queue has no further pending requests.
> To solve this problem, I propose directly killing excess reserved containers when the request is satisfied (in FiCaSchedulerApp#apply) or when the allocation number of resource-requests/scheduling-requests is updated to 0 (in SchedulerApplicationAttempt#updateResourceRequests / SchedulerApplicationAttempt#updateSchedulingRequests).
> Please feel free to give your suggestions. Thanks.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
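
For readers who want the proposal in the issue description in concrete form, here is a minimal sketch of the idea: drop a request's reservations as soon as that request is satisfied or its container count is updated to 0, instead of waiting for a later allocation pass. This is a toy model, not the patch itself. The class ExcessReservationSketch, its methods, and the string keys and node names are all hypothetical stand-ins and do not correspond to YARN classes or APIs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: none of these names are YARN classes or methods.
class ExcessReservationSketch {
    // Outstanding container count per request key (hypothetical stand-in for a scheduler key).
    private final Map<String, Integer> pending = new HashMap<>();
    // Nodes holding a reservation for each request key.
    private final Map<String, List<String>> reserved = new HashMap<>();

    // Called when the application updates a request; a count of 0 means it was cancelled.
    void updateRequest(String key, int numContainers) {
        pending.put(key, numContainers);
        if (numContainers == 0) {
            unreserveAll(key); // reservations for a cancelled request are now excess
        }
    }

    // Record a reservation made on a node for the given request.
    void reserve(String key, String node) {
        reserved.computeIfAbsent(key, k -> new ArrayList<>()).add(node);
    }

    // Called when an allocation satisfies one container of the request.
    void onAllocated(String key) {
        int left = Math.max(0, pending.getOrDefault(key, 0) - 1);
        pending.put(key, left);
        if (left == 0) {
            unreserveAll(key); // request fully satisfied: remaining reservations are excess
        }
    }

    // Drop every reservation for the key and return the nodes that were released.
    List<String> unreserveAll(String key) {
        List<String> released = reserved.remove(key);
        return released == null ? new ArrayList<>() : released;
    }

    // Total number of reservations still held across all keys.
    int reservedCount() {
        int n = 0;
        for (List<String> nodes : reserved.values()) {
            n += nodes.size();
        }
        return n;
    }

    public static void main(String[] args) {
        ExcessReservationSketch app = new ExcessReservationSketch();
        app.updateRequest("k1", 1);
        app.reserve("k1", "node1");
        app.reserve("k1", "node2");
        app.onAllocated("k1"); // satisfies the request, so both reservations are dropped
        System.out.println("reservations left: " + app.reservedCount());
    }
}
```

The point of the sketch is that the cleanup is triggered by the event that makes a reservation excess, so it does not depend on the queue receiving another pending request.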