[ https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862480#comment-13862480 ]
Jian He commented on YARN-1490:
-------------------------------

To summarize the problem: the basic assumption that containers belong to an attempt is broken. Ideally, containers should be tied to the application.

bq. Can the data that we want to share across app attempts in this patch be moved to the app itself?

I'm not sure we want to do this. Today all container events are sent to the RMAppAttempt instead of the RMApp, so moving the data would break much of the existing attempt transition logic.

bq. Can we have the old attempt (now in a terminal state) to just hold onto the events and do nothing with them.

We can do this, but it puts more overhead on the scheduler, because every failed attempt has to store all the incoming container events before the new attempt is created. It also raises one more question: when do we release these events? If we wait until the AM container is allocated, the interval can be arbitrarily long, depending on the resource consumption of the cluster, and events can keep coming in. We would probably have to change almost every attempt state to accept these container events.

bq. Having more than 1 active attempt object will be have even more race conditions

IMO, the patch doesn't make the attempt active. The only change made in this patch is to make the attempt's Failed state accept two kinds of container events:
- CONTAINER_ACQUIRED, for tracking the ranNodes where the containers ran. (I forgot to remove this. It should not happen, because the failed attempt kills all the allocated containers regardless of whether the AM restart is work-preserving or not.)
- CONTAINER_FINISHED, for tracking all the finishedContainers.

So the next, corrected patch will only make the failed attempt accept the CONTAINER_FINISHED event.

Even without the purpose of this patch, we might also want the terminal states of the attempt to accept CONTAINER_FINISHED and keep track of the finished containers. Otherwise the information about those finished containers is lost.
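A minimal sketch of the last point, assuming a heavily simplified attempt with only two states. The names AttemptSketch, State, EventType and handle are hypothetical and exist only for illustration; this is not the actual RMAppAttemptImpl state machine or the code in the attached patches. It only shows a terminal Failed state that still accepts CONTAINER_FINISHED (and records the container) while rejecting CONTAINER_ACQUIRED.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of an app attempt (illustration only).
class AttemptSketch {
    enum State { RUNNING, FAILED }
    enum EventType { CONTAINER_ACQUIRED, CONTAINER_FINISHED, ATTEMPT_FAILED }

    private State state = State.RUNNING;
    private final List<String> finishedContainers = new ArrayList<>();

    State getState() { return state; }
    List<String> getFinishedContainers() { return finishedContainers; }

    // Returns true if the event is accepted in the current state.
    boolean handle(EventType type, String containerId) {
        switch (state) {
            case RUNNING:
                if (type == EventType.ATTEMPT_FAILED) {
                    state = State.FAILED;
                } else if (type == EventType.CONTAINER_FINISHED) {
                    finishedContainers.add(containerId);
                }
                return true;
            case FAILED:
                // Terminal state: keep tracking finished containers so
                // their information is not lost; reject everything else.
                if (type == EventType.CONTAINER_FINISHED) {
                    finishedContainers.add(containerId);
                    return true;
                }
                return false;
            default:
                return false;
        }
    }

    public static void main(String[] args) {
        AttemptSketch attempt = new AttemptSketch();
        attempt.handle(EventType.CONTAINER_FINISHED, "c1");
        attempt.handle(EventType.ATTEMPT_FAILED, null);
        attempt.handle(EventType.CONTAINER_FINISHED, "c2"); // accepted
        attempt.handle(EventType.CONTAINER_ACQUIRED, "c3"); // rejected
        System.out.println(attempt.getFinishedContainers()); // prints [c1, c2]
    }
}
```

In this sketch the failed attempt never leaves its terminal state; it only appends to finishedContainers, so no second attempt object becomes active.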
> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>         Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to
> reconnect with old running containers, some may not want to. This should be
> an option.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)