Tao Yang created HDFS-14689:
-------------------------------

             Summary: AM container might leak
                 Key: HDFS-14689
                 URL: https://issues.apache.org/jira/browse/HDFS-14689
             Project: Hadoop HDFS
          Issue Type: Bug
            Reporter: Tao Yang
            Assignee: Tao Yang

There is a risk that the AM container might leak when the NM exits unexpectedly while the AM container is localizing, if the AM expiry interval (conf-key: yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM expiry interval (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).

The RMAppAttempt state changes as follows:
{noformat}
LAUNCHED/RUNNING -- event:EXPIRED(FinalSavingTransition) --> FINAL_SAVING -- event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition / ExpiredTransition: send AMLauncherEventType.CLEANUP) --> FAILED
{noformat}

AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which internally calls ContainerManagementProtocol#stopContainer to stop the AM container by communicating with the NM. If the NM can't be reached, the stop is simply skipped and nothing is logged.

I think in this case we can complete the AM container in the scheduler when stopping it fails, so that it will have a chance to be stopped when the NM reconnects with the RM.

Hope to hear your thoughts. Thank you!
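
To make the proposal concrete, here is a minimal, self-contained sketch in Java of the fallback described above. It does not use the real YARN classes: NodeManagerClient, SchedulerHook and AmCleanupSketch are hypothetical stand-ins (for the RM-to-NM stopContainer call, the scheduler's container-completion path, and AMLauncher#cleanup respectively), and the in-memory log list stands in for real logging. It only illustrates the idea of logging the failure and completing the container in the scheduler instead of silently skipping it; it is not the actual patch.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the RM-to-NM stop call (not the real YARN interface). */
interface NodeManagerClient {
    void stopContainer(String containerId) throws Exception;
}

/** Hypothetical stand-in for the scheduler path that marks a container completed. */
interface SchedulerHook {
    void completeContainer(String containerId, String diagnostics);
}

final class AmCleanupSketch {
    private final NodeManagerClient nm;
    private final SchedulerHook scheduler;
    /** In-memory log so the sketch stays dependency-free. */
    final List<String> log = new ArrayList<>();

    AmCleanupSketch(NodeManagerClient nm, SchedulerHook scheduler) {
        this.nm = nm;
        this.scheduler = scheduler;
    }

    /** The risky configuration from the report: AM expiry shorter than NM expiry. */
    static boolean isLeakRisk(long amExpiryMs, long nmExpiryMs) {
        return amExpiryMs < nmExpiryMs;
    }

    /**
     * Tries to stop the AM container on the NM. If the NM cannot be reached,
     * records the failure and completes the container in the scheduler
     * rather than skipping silently. Returns true if the stop request succeeded.
     */
    boolean cleanup(String containerId) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (Exception e) {
            log.add("Failed to stop AM container " + containerId + ": " + e.getMessage());
            scheduler.completeContainer(containerId, "NM unreachable during AM cleanup");
            return false;
        }
    }

    public static void main(String[] args) {
        List<String> completed = new ArrayList<>();
        // Simulate an NM that exited unexpectedly: every stop request fails.
        NodeManagerClient unreachable = id -> {
            throw new IOException("connection refused");
        };
        AmCleanupSketch sketch =
            new AmCleanupSketch(unreachable, (id, diag) -> completed.add(id));

        boolean stopped = sketch.cleanup("container_01");
        System.out.println("stopped=" + stopped + ", completedInScheduler=" + completed);
        System.out.println(sketch.log);
    }
}
```

In this sketch the scheduler is the only component that learns about the failed stop. Whether completing the container there is enough for the NM to kill it after reconnecting depends on how the RM reconciles container state on NM re-registration, which I am assuming here rather than demonstrating.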