Tao Yang created HDFS-14689:
-------------------------------

             Summary: AM container might leak
                 Key: HDFS-14689
                 URL: https://issues.apache.org/jira/browse/HDFS-14689
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Tao Yang
            Assignee: Tao Yang


There is a risk that the AM container leaks when the NM exits unexpectedly
while the AM container is localizing, if the AM expiry interval (conf-key:
yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM expiry
interval (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
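
To make the precondition concrete, here is a minimal sketch that checks whether a configuration falls into the risky range. It reads the two keys quoted above from a plain Map rather than from a Hadoop Configuration object; the class and method names are hypothetical and exist only for illustration.

```java
import java.util.Map;

// Hypothetical helper, not part of Hadoop: it only compares the two
// intervals named in this report.
class ExpiryIntervalCheck {
    static final String AM_EXPIRY_KEY = "yarn.am.liveness-monitor.expiry-interval-ms";
    static final String NM_EXPIRY_KEY = "yarn.nm.liveness-monitor.expiry-interval-ms";

    /**
     * Returns true when the AM expiry interval is strictly less than the
     * NM expiry interval, which is the precondition described above.
     * Throws NumberFormatException if either key is missing or not a number.
     */
    static boolean amExpiresBeforeNm(Map<String, String> conf) {
        long amExpiryMs = Long.parseLong(conf.get(AM_EXPIRY_KEY));
        long nmExpiryMs = Long.parseLong(conf.get(NM_EXPIRY_KEY));
        return amExpiryMs < nmExpiryMs;
    }
}
```

With illustrative values of 300000 for the AM key and 600000 for the NM key the check returns true; with equal values it returns false.
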
The RMAppAttempt state changes as follows:
{noformat}
LAUNCHED/RUNNING
  -- event: EXPIRED (FinalSavingTransition) -->
FINAL_SAVING
  -- event: ATTEMPT_UPDATE_SAVED (FinalStateSavedTransition / ExpiredTransition:
     sends AMLauncherEventType.CLEANUP) -->
FAILED
{noformat}
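
The transitions above can be modelled with a small, self-contained state machine. This is only a sketch of the two steps listed in this report, not the real RMAppAttemptImpl; the class name and the string-based event list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the two transitions listed above.
class AttemptStateSketch {
    enum State { LAUNCHED, RUNNING, FINAL_SAVING, FAILED }
    enum Event { EXPIRED, ATTEMPT_UPDATE_SAVED }

    private State state;
    // Records events that would be sent to the AM launcher.
    private final List<String> launcherEvents = new ArrayList<>();

    AttemptStateSketch(State initial) {
        this.state = initial;
    }

    void handle(Event event) {
        if ((state == State.LAUNCHED || state == State.RUNNING)
                && event == Event.EXPIRED) {
            // FinalSavingTransition: the final state is being persisted.
            state = State.FINAL_SAVING;
        } else if (state == State.FINAL_SAVING
                && event == Event.ATTEMPT_UPDATE_SAVED) {
            // ExpiredTransition: request cleanup of the AM container.
            launcherEvents.add("CLEANUP");
            state = State.FAILED;
        } else {
            throw new IllegalStateException(
                "Transition not modelled: " + state + " on " + event);
        }
    }

    State getState() {
        return state;
    }

    List<String> getLauncherEvents() {
        return launcherEvents;
    }
}
```

In this model the CLEANUP request is emitted exactly once, on the way into FAILED, so cleanup is not requested again if that single request has no effect.
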
AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which internally
calls ContainerManagementProtocol#stopContainer to stop the AM container by
communicating with the NM. If the NM cannot be reached, the cleanup is skipped
without any log output.

I think that in this case, when stopping the AM container fails, we can
complete it in the scheduler so that it has a chance to be stopped when the NM
reconnects with the RM.
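
A rough sketch of that idea follows. It does not use the real AMLauncher or scheduler classes: NodeManagerClient and Scheduler are hypothetical stand-ins, and the only point is the control flow, which logs the failure and then falls back to completing the container in the scheduler.

```java
import java.io.IOException;

// Hypothetical sketch of the proposed fallback; none of these types are
// actual Hadoop classes.
class AmCleanupFallbackSketch {
    /** Stand-in for the stop-container call made to the NM. */
    interface NodeManagerClient {
        void stopContainer(String containerId) throws IOException;
    }

    /** Stand-in for the scheduler-side completion of a container. */
    interface Scheduler {
        void completeContainer(String containerId);
    }

    /**
     * Tries to stop the AM container through the NM. If that fails, the
     * failure is logged and the container is completed in the scheduler.
     * Returns true if the NM call succeeded, false if the fallback ran.
     */
    static boolean cleanup(String containerId, NodeManagerClient nm,
                           Scheduler scheduler) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (IOException e) {
            System.err.println("Failed to stop AM container " + containerId
                + " (" + e.getMessage() + "), completing it in the scheduler");
            scheduler.completeContainer(containerId);
            return false;
        }
    }
}
```

The fallback only runs when the stop call fails, so the normal cleanup path stays unchanged.
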
I would like to hear your thoughts. Thank you!



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
