[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

Jim Brennan (Jira) Wed, 28 Oct 2020 08:57:26 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-10467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17222235#comment-17222235
 ]


Jim Brennan commented on YARN-10467:
------------------------------------

Thanks for reporting this and for the solution [~haibochen]!    Everything 
looks good to me.

I hesitate to mention one minor nit, a typo in this comment:
{quote}// there might be some completed containers that *are have* not been 
pulled
{quote}
It's up to you whether you want to fix this.

[~jhung] were you planning to commit this?

> ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers
> -------------------------------------------------------------------------
>
>                 Key: YARN-10467
>                 URL: https://issues.apache.org/jira/browse/YARN-10467
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.10.0, 3.0.3, 3.2.1, 3.1.4
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Major
>         Attachments: YARN-10467.00.patch, YARN-10467.01.patch, 
> YARN-10467.branch-2.10.00.patch, YARN-10467.branch-2.10.01.patch, 
> YARN-10467.branch-2.10.02.patch
>
>
> In one of our recent heap analysis, we found that the majority of the heap is 
> occupied by {{RMNodeImpl.completedContainers}}<ContainerIdPBImp>, which 
> accounts for 19GB, out of 24.3 GB.  There are over 86 million 
> ContainerIdPBImpl objects, in contrast, only 161,601 RMContainerImpl objects 
> which represent the # of active containers that RM is still tracking.  
> Inspecting some ContainerIdPBImpl objects, they belong to applications that 
> have long finished. This indicates some sort of memory leak of 
> ContainerIdPBImpl objects in RMNodeImpl.
>  
> Right now, when a container is reported by a NM as completed, it is 
> immediately added to RMNodeImpl.completedContainers and later cleaned up 
> after the AM has been notified of its completion in the AM-RM heartbeat. The 
> cleanup can be broken into a few steps.
>  * Step 1:  the completed container is first added to 
> RMAppAttemptImpl.justFinishedContainers (this is asynchronous to being added 
> to {{RMNodeImpl.completedContainers}}).
>  * Step 2: During the heartbeat AM-RM heartbeat, the container is removed 
> from RMAppAttemptImpl.justFinishedContainers and added to 
> RMAppAttemptImpl.finishedContainersSentToAM
> Once a completed container gets added to 
> RMAppAttemptImpl.finishedContainersSentToAM, it is guaranteed to be cleaned 
> up from {{RMNodeImpl.completedContainers}}
>  
> However, if the AM exits (regardless of failure or success) before some 
> recently completed containers can be added to  
> RMAppAttemptImpl.finishedContainersSentToAM in previous heartbeats, there 
> won’t be any future AM-RM heartbeat to perform aforementioned step 2. Hence, 
> these objects stay in RMNodeImpl.completedContainers forever.
> We have observed in MR that AMs can decide to exit upon success of all it 
> tasks without waiting for notification of the completion of every container, 
> or AM may just die suddenly (e.g. OOM).  Spark and other framework may just 
> be similar.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org

[jira] [Commented] (YARN-10467) ContainerIdPBImpl objects can be leaked in RMNodeImpl.completedContainers

Reply via email to