[ https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222235#comment-17222235 ]
Jim Brennan commented on YARN-10467:
------------------------------------

Thanks for reporting this and for the solution [~haibochen]! Everything looks good to me.

I hesitate to mention one minor nit, a typo in this comment:
{quote}// there might be some completed containers that *are have* not been pulled
{quote}
It's up to you whether you want to fix this.

[~jhung] were you planning to commit this?

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -------------------------------------------------------------------------
>
>                 Key: YARN-10467
>                 URL: https://issues.apache.org/jira/browse/YARN-10467
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Major
>         Attachments: YARN-10467.00.patch, YARN-10467.01.patch,
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch,
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analyses, we found that the majority of the heap
> is occupied by {{RMNodeImpl.completedContainers}}<ContainerIdPBImpl>, which
> accounts for 19 GB out of 24.3 GB. There are over 86 million
> ContainerIdPBImpl objects but only 161,601 RMContainerImpl objects, which
> represent the number of active containers that the RM is still tracking.
> The ContainerIdPBImpl objects we inspected belong to applications that
> finished long ago. This indicates a memory leak of ContainerIdPBImpl
> objects in RMNodeImpl.
>
> Right now, when a container is reported by an NM as completed, it is
> immediately added to RMNodeImpl.completedContainers and later cleaned up
> after the AM has been notified of its completion in the AM-RM heartbeat. The
> cleanup can be broken into a few steps.
> * Step 1: The completed container is first added to
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added
> to {{RMNodeImpl.completedContainers}}).
> * Step 2: During the AM-RM heartbeat, the container is removed from
> RMAppAttemptImpl.justFinishedContainers and added to
> RMAppAttemptImpl.finishedContainersSentToAM.
>
> Once a completed container gets added to
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned
> up from {{RMNodeImpl.completedContainers}}.
>
> However, if the AM exits (whether it fails or succeeds) before some
> recently completed containers have been added to
> RMAppAttemptImpl.finishedContainersSentToAM by an earlier heartbeat, there
> won’t be any future AM-RM heartbeat to perform step 2. Hence,
> these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that an AM can decide to exit once all its tasks
> have succeeded, without waiting for notification of the completion of every
> container, or an AM may just die suddenly (e.g. OOM). Spark and other
> frameworks may behave similarly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
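
The two cleanup steps quoted in the issue description, and the way a container can miss step 2, can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `CompletedContainerLeakSketch`, its methods, and the `drainUnpulled` flag are hypothetical names invented for illustration, plain string sets stand in for the real RM collections, and the "drain on exit" branch is one plausible mitigation, not necessarily what the attached patches do.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model of the bookkeeping described in YARN-10467; not the real RM classes. */
public class CompletedContainerLeakSketch {
    // Stands in for RMNodeImpl.completedContainers.
    final Set<String> completedContainers = new HashSet<>();
    // Stand in for the two RMAppAttemptImpl collections named in the issue.
    final Set<String> justFinishedContainers = new HashSet<>();
    final Set<String> finishedContainersSentToAM = new HashSet<>();

    /** NM reports a completed container: the node and the attempt both record it (step 1). */
    void containerCompleted(String containerId) {
        completedContainers.add(containerId);
        justFinishedContainers.add(containerId);
    }

    /** AM-RM heartbeat: just-finished containers move to the sent-to-AM set (step 2). */
    void amHeartbeat() {
        finishedContainersSentToAM.addAll(justFinishedContainers);
        justFinishedContainers.clear();
    }

    /** Only containers that reached the sent-to-AM set are removed from the node. */
    void cleanUpSentContainers() {
        completedContainers.removeAll(finishedContainersSentToAM);
        finishedContainersSentToAM.clear();
    }

    /**
     * The AM exits, so no further heartbeat will arrive.
     * drainUnpulled is a hypothetical switch: when true, containers the AM
     * never pulled are treated as sent so that they are still cleaned up.
     */
    void amExits(boolean drainUnpulled) {
        if (drainUnpulled) {
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }
        cleanUpSentContainers();
    }

    /** Runs one scenario and returns how many entries remain on the node. */
    public static int leakedAfterRun(boolean drainUnpulled) {
        CompletedContainerLeakSketch rm = new CompletedContainerLeakSketch();
        rm.containerCompleted("container_1");
        rm.amHeartbeat();
        rm.cleanUpSentContainers();
        // This container completes after the AM's last heartbeat.
        rm.containerCompleted("container_2");
        rm.amExits(drainUnpulled);
        return rm.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("left on node without drain: " + leakedAfterRun(false));
        System.out.println("left on node with drain:    " + leakedAfterRun(true));
    }
}
```

In this model the run without the drain leaves `container_2` in `completedContainers` indefinitely, which mirrors the leak reported above; with the drain, the set ends up empty. The real ResourceManager does this work asynchronously through events, so the sketch shows only the ordering problem, not the actual code paths.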