[ 
https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263747#comment-14263747
 ] 

Chengbing Liu commented on YARN-2997:
-------------------------------------

{quote}
I think we can simplify the logic in getContainerStatuses as such:
{quote}
It seems that if we do not remove the containers whose app has already stopped, 
we will rely solely on the heartbeat response from the RM to remove containers 
acked by the AM. If something goes wrong on the AM or RM side, the NM will never 
remove these containers from its context. In my opinion, that is a potential leak.
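
To make the concern concrete, here is a minimal sketch of the two removal paths. It is not the patch code: {{ContextCleanupSketch}}, its field and its method names are hypothetical, and plain strings stand in for container and application IDs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the NM-side bookkeeping; not the actual NodeManager code.
final class ContextCleanupSketch {
    // containerId -> appId for completed containers still held in the NM context
    final Map<String, String> completed = new LinkedHashMap<>();

    // Path 1: remove only what the RM reports back as acked by the AM.
    void onHeartbeatResponse(List<String> ackedContainerIds) {
        for (String id : ackedContainerIds) {
            completed.remove(id);
        }
    }

    // Path 2: fallback that drops every container whose app has stopped,
    // so a lost ack cannot keep an entry in the context forever.
    void removeForStoppedApps(Set<String> stoppedAppIds) {
        completed.values().removeIf(stoppedAppIds::contains);
    }
}
```

With only the first path, an entry whose ack never arrives stays in the map indefinitely; the second path bounds its lifetime by the lifetime of the app.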

{quote}
the sub class has the equal method.
{quote}
Yes, you are right. However, I'm still not sure it is a good idea to use 
{{Set<ContainerStatus>}} instead of {{Map<ContainerId, ContainerStatus>}}, for 
the following reasons:
* {{ContainerId}} is a unique identifier for a container, while 
{{ContainerStatus}} can change over time, even for the same container.
* We want to ensure that no duplicate container status is reported to the RM. 
{{ContainerStatus}} contains not only the container ID but also the container 
state, exit status and diagnostic message, so we may run into a situation where 
we report two different {{ContainerStatus}} objects with the same ID but 
different states or other fields.
* {{ContainerId}} has an {{equals}} method and is annotated as public and stable, 
while {{ContainerStatus}} has no {{equals}} method and 
{{ContainerStatusPBImpl}} is annotated as private and unstable. It may not be a 
good idea to rely on the implementation of {{ContainerStatus}}.
* {{Set<ContainerStatus>}} is not used anywhere in the current code base.
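
The duplicate-report concern in the list above can be illustrated with a small self-contained sketch. The {{Status}} class below is a hypothetical stand-in, not the Hadoop {{ContainerStatus}}; it assumes an {{equals}} that compares every field, and uses strings in place of {{ContainerId}}.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-ins for illustration only; these are NOT the Hadoop classes.
final class StatusSketch {
    // A status record: a stable ID plus a field that changes over time.
    static final class Status {
        final String containerId;
        final String state;

        Status(String containerId, String state) {
            this.containerId = containerId;
            this.state = state;
        }

        // Assumption: equality covers all fields, as a value-style equals would.
        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Status)) {
                return false;
            }
            Status other = (Status) o;
            return containerId.equals(other.containerId) && state.equals(other.state);
        }

        @Override
        public int hashCode() {
            return Objects.hash(containerId, state);
        }
    }

    // A set keyed on the whole status keeps both entries for the same container.
    static int countInSet() {
        Set<Status> pending = new HashSet<>();
        pending.add(new Status("container_1", "RUNNING"));
        pending.add(new Status("container_1", "COMPLETE"));
        return pending.size();
    }

    // A map keyed on the ID keeps one entry per container: the latest status.
    static Map<String, Status> collectInMap() {
        Map<String, Status> pending = new HashMap<>();
        pending.put("container_1", new Status("container_1", "RUNNING"));
        pending.put("container_1", new Status("container_1", "COMPLETE"));
        return pending;
    }
}
```

Under these assumptions the set holds two statuses for one container, while the map holds a single entry that the later {{put}} overwrote.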

{quote}
that's limitation of the test, we should fix the tests.
{quote}
Yes, I see. I will fix them.

> NM keeps sending finished containers to RM until app is finished
> ----------------------------------------------------------------
>
>                 Key: YARN-2997
>                 URL: https://issues.apache.org/jira/browse/YARN-2997
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>         Attachments: YARN-2997.2.patch, YARN-2997.patch
>
>
> We have seen in RM log a lot of
> {quote}
> INFO 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: 
> Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is 
> finished. On the RM side, the container is already released, hence 
> {{getRMContainer}} returns null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
