[ 
https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040734#comment-18040734
 ] 

ASF GitHub Bot commented on YARN-10895:
---------------------------------------

github-actions[bot] closed pull request #3327: YARN-10895. ContainerIdPBImpl 
objects still can be leaked in RMNodeImpl.completedContainers
URL: https://github.com/apache/hadoop/pull/3327




> ContainerIdPBImpl objects still can be leaked in 
> RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10895
>                 URL: https://issues.apache.org/jira/browse/YARN-10895
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Jeongin Ju
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: YARN-10895.001.patch
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> YARN-10467 fixed a ContainerIdPBImpl object leak in 
> RMNodeImpl.completedContainers.
> After applying the YARN-10467 patch and operating a cluster with a large 
> number of nodes, we found that a similar heap leak still exists.
> In a heap dump taken after failover (so it is not the active RM), about 
> 4.5G is used by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.
>  
> There are two cases.
>  
> 1. Apps with 'KeepContainersAcrossApplicationAttempts' set are not cleared 
> when they fail
> Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear 
> RMAppAttemptImpl.justFinishedContainers.
> If an app attempt fails and is retried by a next attempt, we may not need to 
> clear RMAppAttemptImpl.justFinishedContainers, because the related 
> ContainerIdPBImpl objects will be handed over to the next attempt and 
> eventually cleared.
> However, when the app itself fails, there is no next attempt and the heap 
> leak occurs.
> (We found this case when a YARN Service application failed over multiple 
> attempts because of OOM in the AM.)
>  
> 2. Apps killed explicitly by the user
> When an app is killed by the user through the 'yarn application -kill' CLI 
> or the web UI, RMAppAttemptImpl.amContainerFinished is not called because 
> the app and app attempt states have already changed.
>  
> To handle this, we added sendFinishedContainersToNMs calls for both 
> RMAppAttemptImpl.finishedContainersSentToAm and 
> RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
>  
> We found this on our 3.1.2 cluster and patched it there, but trunk seems to 
> have the same problem.
> I attached a patch based on trunk.
>  
> Thanks!
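
The leak described above can be illustrated with a small, self-contained
model. This is only a sketch under stated assumptions, not the actual
Hadoop code or the attached patch: the classes Node and Attempt are
hypothetical stand-ins for RMNodeImpl and RMAppAttemptImpl, container IDs
are plain strings, and the methods acknowledge, amPulledFinishedContainers
and finishWithCleanup are invented for illustration. It shows that IDs held
in either attempt-side list stay in the node's completedContainers set
unless both lists are flushed back to the node when the attempt reaches a
terminal state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CompletedContainersLeakSketch {

    /** Hypothetical stand-in for RMNodeImpl: keeps IDs until they are acknowledged. */
    static final class Node {
        final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String containerId) {
            completedContainers.add(containerId);
        }

        void acknowledge(List<String> containerIds) {
            completedContainers.removeAll(containerIds);
        }
    }

    /** Hypothetical stand-in for RMAppAttemptImpl, reduced to the two lists named in the issue. */
    static final class Attempt {
        final Node node;
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAm = new ArrayList<>();

        Attempt(Node node) {
            this.node = node;
        }

        void containerFinished(String containerId) {
            node.containerCompleted(containerId);
            justFinishedContainers.add(containerId);
        }

        /** Assumed heartbeat behaviour: ack the previous batch, then hand over the new one. */
        void amPulledFinishedContainers() {
            node.acknowledge(finishedContainersSentToAm);
            finishedContainersSentToAm.clear();
            finishedContainersSentToAm.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        /** Illustrative cleanup on a terminal state: flush BOTH lists back to the node. */
        void finishWithCleanup() {
            node.acknowledge(finishedContainersSentToAm);
            node.acknowledge(justFinishedContainers);
            finishedContainersSentToAm.clear();
            justFinishedContainers.clear();
        }
    }

    /** Returns how many IDs remain on the node after the attempt ends. */
    static int leakedAfter(boolean cleanup) {
        Node node = new Node();
        Attempt attempt = new Attempt(node);
        attempt.containerFinished("container_1");
        attempt.amPulledFinishedContainers();     // container_1 moves to finishedContainersSentToAm
        attempt.containerFinished("container_2"); // container_2 stays in justFinishedContainers
        if (cleanup) {
            attempt.finishWithCleanup();
        }
        // Without cleanup the attempt simply ends here and nothing acknowledges the IDs.
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + leakedAfter(false));
        System.out.println("with cleanup: " + leakedAfter(true));
    }
}
```

In this model, ending the attempt without cleanup leaves both IDs on the
node, while flushing both lists leaves none. In a long-running RM the same
pattern would accumulate across every failed or killed app.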



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
