[ 
https://issues.apache.org/jira/browse/YARN-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862480#comment-13862480
 ] 

Jian He commented on YARN-1490:
-------------------------------

To summarize the problem: the basic assumption that containers belong to an 
attempt is broken. Ideally, containers should be tied to the application.

bq. Can the data that we want to share across app attempts in this patch be 
moved to the app itself? 
Not sure we want to do this. Today all container events are sent to the 
RMAppAttempt instead of the RMApp, so moving the data would break much of the 
existing attempt transition logic.

bq. Can we have the old attempt (now in a terminal state) to just hold onto the 
events and do nothing with them.
We can do this, but it puts more overhead on the scheduler: every failed 
attempt has to store all incoming container events until the new attempt 
is created. It also raises the question of when to release these events. If we 
wait until the AM container is allocated, the interval can be arbitrarily 
long, depending on the resource consumption of the cluster, and events can 
keep coming in. We would probably have to change almost every attempt state 
to accept these container events.
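
To illustrate the buffering alternative and its cost, here is a minimal sketch. 
The class and method names (BufferingFailedAttempt, onContainerEvent, 
releaseTo) are hypothetical and are not part of the YARN code base; events are 
modeled as plain strings for brevity.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch, not the real RMAppAttempt implementation.
class BufferingFailedAttempt {
    // Grows without bound for as long as no new attempt exists.
    private final Queue<String> pendingEvents = new ArrayDeque<>();

    // A failed attempt does nothing with the event except hold onto it.
    void onContainerEvent(String event) {
        pendingEvents.add(event);
    }

    int pendingCount() {
        return pendingEvents.size();
    }

    // Called once the new attempt is created: hands over the buffered
    // events in arrival order and clears the buffer.
    List<String> releaseTo(List<String> newAttemptInbox) {
        newAttemptInbox.addAll(pendingEvents);
        pendingEvents.clear();
        return newAttemptInbox;
    }
}
```

The buffer is only drained in releaseTo, so if the new AM container takes a 
long time to be allocated, pendingEvents keeps growing in the meantime.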

bq. Having more than 1 active attempt object will be have even more race 
conditions
IMO, the patch doesn't make the attempt active. The only change in this patch 
is to make the attempt's Failed state accept 2 kinds of container events:
- CONTAINER_ACQUIRED, for tracking the ranNodes where the containers ran. 
(Forgot to remove this. This should not happen, because the failed attempt 
kills all the allocated containers regardless of whether the AM is restarting 
in a work-preserving or non-work-preserving way.)
- CONTAINER_FINISHED, for tracking all the finishedContainers.

So the next, corrected patch will only make the failed attempt accept the 
CONTAINER_FINISHED event.
Even apart from the purpose of this patch, we might want the terminal 
states of an attempt to accept CONTAINER_FINISHED and keep track of the 
finished containers. Otherwise the info about those finished containers is 
lost.
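
A minimal sketch of that idea, assuming a heavily simplified state machine. 
The types below (AttemptState, AttemptEventType, AttemptSketch) are 
hypothetical stand-ins rather than the real YARN classes, and container IDs 
are plain strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins; these are not the real YARN classes.
enum AttemptState { RUNNING, FAILED }

enum AttemptEventType { CONTAINER_ACQUIRED, CONTAINER_FINISHED }

class AttemptSketch {
    private AttemptState state = AttemptState.RUNNING;
    private final List<String> finishedContainers = new ArrayList<>();

    void fail() {
        state = AttemptState.FAILED;
    }

    // Returns true if the event is accepted in the current state.
    boolean handle(AttemptEventType type, String containerId) {
        switch (type) {
            case CONTAINER_FINISHED:
                // Accepted in every state, including the terminal FAILED
                // state, so the finished-container info is not lost.
                finishedContainers.add(containerId);
                return true;
            case CONTAINER_ACQUIRED:
                // Only accepted while the attempt is still running.
                return state == AttemptState.RUNNING;
            default:
                return false;
        }
    }

    // Returns a copy so callers cannot modify the attempt's bookkeeping.
    List<String> getFinishedContainers() {
        return new ArrayList<>(finishedContainers);
    }
}
```

In this sketch a failed attempt still records CONTAINER_FINISHED but rejects 
CONTAINER_ACQUIRED, which matches the behavior proposed for the next patch.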

> RM should optionally not kill all containers when an ApplicationMaster exits
> ----------------------------------------------------------------------------
>
>                 Key: YARN-1490
>                 URL: https://issues.apache.org/jira/browse/YARN-1490
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Jian He
>         Attachments: YARN-1490.1.patch, YARN-1490.2.patch, YARN-1490.3.patch
>
>
> This is needed to enable work-preserving AM restart. Some apps can choose to 
> reconnect with old running containers, some may not want to. This should be 
> an option.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
