[ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rohith Sharma K S updated YARN-4862:
------------------------------------
    Attachment: 0003-YARN-4862.patch

Updated the patch: it is rebased and addresses the review comments from Jason. This JIRA depends on YARN-5279 to avoid a leak in the new set completedContainers.

I rechecked the 2 scenarios mainly discussed in earlier comments:
# RM (scheduler) forgets container details (YARN-5279). In this case, for any unknown completed container reported by the NodeManager, the scheduler notifies RMNodeImpl that it no longer maintains these containers, so that the NodeManager is told to remove them from NMContext. This event avoids a leak in the new set completedContainer, from which containers are cleared once they are acknowledged to the NM in the heartbeat response.
# NM forgets to send completedContainer
## NM does not send the completedContainer even once. This is exactly YARN-5197.
## NM sends a completed container in one heartbeat and then omits it from the next heartbeat. In this case, {{RMNodeImpl#completedContainer}} need not worry about a leak, because once a completed container has been sent, the RM keeps track of it. Even if the NM later stops sending it, the entry in completedContainer is cleared when the RM acknowledges back to the NM in the heartbeat response to remove it from its context.

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch,
> 0003-YARN-4862.patch
>
>
> As per
> [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
> from [~sharadag], there should be a safeguard against duplicated container statuses
> in RMNodeImpl before creating UpdatedContainerInfo.
> Otherwise, in a heavily loaded cluster where event processing gradually slows down,
> if any duplicated containers are sent to the RM (possibly due to a bug in the NM), there is
> a significant impact: RMNodeImpl always creates UpdatedContainerInfo for the
> duplicated containers. This results in increased heap memory usage and causes
> problems like YARN-4852.
> This is an optimization for issues of the kind seen in YARN-4852.
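
For illustration, the safeguard described in this issue could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the attached patch: the class CompletedContainerTracker and its methods are hypothetical names, and container IDs are plain strings rather than YARN's own types. It only shows the idea of remembering reported completed containers in a set, ignoring duplicates on later heartbeats, and clearing entries once they are acknowledged back to the NM.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical, simplified stand-in for the duplicate check discussed in
 * YARN-4862. This is NOT the real RMNodeImpl code.
 */
class CompletedContainerTracker {
    // Completed containers already reported by the NM and not yet acknowledged.
    private final Set<String> completedContainers = new HashSet<>();

    /**
     * Filters one heartbeat's completed-container list.
     * Returns only the IDs that were not reported before, so that a caller
     * would build update info for each completed container at most once.
     */
    List<String> onHeartbeat(List<String> reportedCompleted) {
        List<String> newlyCompleted = new ArrayList<>();
        for (String id : reportedCompleted) {
            // Set.add returns false when the ID is already tracked (a duplicate).
            if (completedContainers.add(id)) {
                newlyCompleted.add(id);
            }
        }
        return newlyCompleted;
    }

    /**
     * Drops IDs that have been acknowledged back to the NM, which keeps
     * the set from growing without bound.
     */
    void onAcknowledged(List<String> acknowledgedIds) {
        completedContainers.removeAll(acknowledgedIds);
    }

    /** Number of completed containers currently tracked. */
    int trackedCount() {
        return completedContainers.size();
    }
}
```

In this sketch, a container reported in two consecutive heartbeats is returned by onHeartbeat only the first time, and the set shrinks again once onAcknowledged is called for it.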