[ 
https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862359#comment-13862359
 ] 

Bikas Saha commented on YARN-1489:
----------------------------------

Here is an idea:
The RM allows the app to send it some data during registration. This data could 
include the AM port information etc. The RM could then sync this data with the 
NM during the NM heartbeat. The NM already maintains per-app-attempt info, and 
this data would be added to that. The containers running on an NM could query 
it for this attempt data and get the information about the new app attempt. This would 
be a scalable and efficient solution.
The data per NM will be small since the data would be size-checked and 
proportional to the number of app attempts. The NM could give access to an attempt's data 
only to the containers that belong to that attempt. Only local containers 
should be able to communicate with their NM for such information. This could be 
done via a local access token that is supplied by the NM whenever it launches a 
container.
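
A rough sketch of what the NM side of this could look like, purely to make the 
idea concrete. Everything here is hypothetical: the class AttemptDataStore, its 
method names, the byte limit, and the token format are illustrative assumptions 
and are not existing YARN APIs. It shows the three pieces described above: a 
size check on data synced from the RM, a local token handed out at container 
launch, and a query that only returns the data of the attempt the token was 
issued for.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical NM-side store for per-attempt data pushed by the RM. Not a YARN class. */
final class AttemptDataStore {
    private final int maxBytes; // assumed per-attempt size limit
    private final Map<String, byte[]> dataByAttempt = new ConcurrentHashMap<>();
    private final Map<String, String> attemptByToken = new ConcurrentHashMap<>();
    private final SecureRandom random = new SecureRandom();

    AttemptDataStore(int maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Would be called when a heartbeat response carries attempt data; rejects oversized payloads. */
    boolean syncFromHeartbeat(String attemptId, byte[] data) {
        if (data.length > maxBytes) {
            return false;
        }
        dataByAttempt.put(attemptId, data.clone());
        return true;
    }

    /** Would be called when the NM launches a container; the token is bound to one attempt. */
    String issueTokenOnLaunch(String attemptId) {
        byte[] raw = new byte[16];
        random.nextBytes(raw);
        String token = Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
        attemptByToken.put(token, attemptId);
        return token;
    }

    /** Returns data only for the attempt the token was issued for; empty for unknown tokens. */
    Optional<byte[]> query(String token) {
        String attemptId = attemptByToken.get(token);
        if (attemptId == null) {
            return Optional.empty();
        }
        byte[] data = dataByAttempt.get(attemptId);
        if (data == null) {
            return Optional.empty();
        }
        return Optional.of(data.clone());
    }
}
```

In this sketch a container would receive its token at launch and later call 
query(token) against its local NM to read whatever the AM registered with the 
RM. Token expiry and cleanup when an attempt finishes are left out.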

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are 
> running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two 
> potentially can be done at the app level, but it is good to have a common 
> solution for all apps wherever possible.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
