[
https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18040734#comment-18040734
]
ASF GitHub Bot commented on YARN-10895:
---------------------------------------
github-actions[bot] closed pull request #3327: YARN-10895. ContainerIdPBImpl
objects still can be leaked in RMNodeImpl.completedContainers
URL: https://github.com/apache/hadoop/pull/3327
> ContainerIdPBImpl objects still can be leaked in
> RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
> Key: YARN-10895
> URL: https://issues.apache.org/jira/browse/YARN-10895
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 3.1.2
> Reporter: Jeongin Ju
> Priority: Major
> Labels: pull-request-available
> Attachments: YARN-10895.001.patch
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> YARN-10467 fixed ContainerIdPBImpl Object Leakage in
> RMNodeImpl.completedContainers.
> After applying the YARN-10467 patch and operating a cluster with a large
> number of nodes, we found that a similar heap leak still exists.
> In a heap dump taken after failover (so this RM is no longer the active one),
> about 4.5G is used by ContainerIdPBImpl objects in
> RMNodeImpl.completedContainers.
>
> There are two cases.
>
> 1. Apps with 'KeepContainersAcrossApplicationAttempts' set are not cleared
> when they fail
> Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear
> RMAppAttemptImpl.justFinishedContainers.
> If an app attempt fails and is retried by a next attempt, we may not need to
> clear RMAppAttemptImpl.justFinishedContainers, because the related
> ContainerIdPBImpl objects are handed over to the next attempt and eventually
> cleared.
> However, when the app itself fails, there is no next attempt and the heap
> leak occurs.
> (We found this case when a YARN Service application failed across multiple
> attempts because of OOM in the AM.)
>
> 2. Apps are killed explicitly by a user
> When an app is killed by a user through the 'yarn application -kill' CLI or
> the web UI, RMAppAttemptImpl.amContainerFinished is not called because the
> app and app attempt states have already changed.
>
> To handle this, we added a sendFinishedContainersToNMs call for both
> RMAppAttemptImpl.finishedContainersSentToAm and
> RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
>
> We found this on our 3.1.2 cluster and patched it there, but trunk seems to
> have the same problem.
> I attached a patch based on trunk.
>
> Thanks!
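
The kill-path leak described in case 2 above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Hadoop implementation or the attached patch: the classes Node and Attempt, the releaseOnKill flag, and the helper leakedAfterKill are hypothetical stand-ins, and only the field and method names quoted from the issue (completedContainers, justFinishedContainers, finishedContainersSentToAm, sendFinishedContainersToNMs) are borrowed from it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the bookkeeping described in YARN-10895; this is not Hadoop code. */
class CompletedContainerLeakSketch {

    /** Hypothetical stand-in for one RMNodeImpl and its completedContainers collection. */
    static final class Node {
        final List<String> completedContainers = new ArrayList<>();
    }

    /** Hypothetical stand-in for one RMAppAttemptImpl. */
    static final class Attempt {
        final Map<Node, List<String>> justFinishedContainers = new HashMap<>();
        final Map<Node, List<String>> finishedContainersSentToAm = new HashMap<>();

        /** A container finished: the node remembers it until the attempt releases it. */
        void containerFinished(Node node, String containerId) {
            node.completedContainers.add(containerId);
            justFinishedContainers
                .computeIfAbsent(node, n -> new ArrayList<>())
                .add(containerId);
        }

        /** The AM pulled the finished containers: move them to the "sent" map. */
        void pullJustFinishedContainers() {
            justFinishedContainers.forEach((node, ids) ->
                finishedContainersSentToAm
                    .computeIfAbsent(node, n -> new ArrayList<>())
                    .addAll(ids));
            justFinishedContainers.clear();
        }

        /** Releases both maps so that each node can drop its completed container ids. */
        void sendFinishedContainersToNMs() {
            release(finishedContainersSentToAm);
            release(justFinishedContainers);
        }

        private static void release(Map<Node, List<String>> pending) {
            pending.forEach((node, ids) -> node.completedContainers.removeAll(ids));
            pending.clear();
        }

        /** Kill path: without the release step, nothing ever clears the node's ids. */
        void kill(boolean releaseOnKill) {
            if (releaseOnKill) {
                sendFinishedContainersToNMs();
            }
        }
    }

    /** Returns how many container ids remain on the node after the attempt is killed. */
    static int leakedAfterKill(boolean releaseOnKill) {
        Node node = new Node();
        Attempt attempt = new Attempt();
        attempt.containerFinished(node, "container_1");
        attempt.pullJustFinishedContainers();           // container_1 is now in the "sent" map
        attempt.containerFinished(node, "container_2"); // container_2 is still "just finished"
        attempt.kill(releaseOnKill);
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("ids left without release on kill: " + leakedAfterKill(false));
        System.out.println("ids left with release on kill: " + leakedAfterKill(true));
    }
}
```

In this model, leakedAfterKill(false) returns 2 and leakedAfterKill(true) returns 0, which mirrors why the issue releases both maps, not just one, when the attempt reaches 'KILLED'.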
--
This message was sent by Atlassian Jira
(v8.20.10#820010)