[ 
https://issues.apache.org/jira/browse/YARN-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077178#comment-14077178
 ] 

Jason Lowe commented on YARN-1354:
----------------------------------

Thanks for taking a look, Junping!

bq. what would happen if storeApplication(), finishApplication(), or 
removeApplication() failed, leaving the application-related information 
inconsistent after restart?

If storeApplication fails then it will throw an IOException which will bubble 
up and fail the container start request on the client.  As long as we're unable 
to store a new application, containers for that application will not start, 
which I believe is the desired behavior.  That prevents the state store from 
being inconsistent in this particular scenario.
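
As a rough illustration of that ordering (not the actual NM code: the 
AppStateStore interface, the ContainerManager class and the method names below 
are hypothetical stand-ins), the application is persisted before it is tracked 
in memory, so a store failure surfaces to the caller and leaves nothing behind:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class StoreFailureSketch {
    /** Hypothetical stand-in for the NM state store; not the real Hadoop interface. */
    interface AppStateStore {
        void storeApplication(String appId) throws IOException;
    }

    /** Hypothetical container manager: persists a new application before tracking it. */
    static class ContainerManager {
        private final AppStateStore store;
        final Set<String> activeApps = new HashSet<>();

        ContainerManager(AppStateStore store) {
            this.store = store;
        }

        void startContainer(String appId) throws IOException {
            if (!activeApps.contains(appId)) {
                // Store first. An IOException here propagates to the caller, so
                // the app is never tracked in memory without being persisted.
                store.storeApplication(appId);
                activeApps.add(appId);
            }
            // ... the container launch itself would happen here ...
        }
    }

    /** Returns true if a failing store rejects the start and leaves no in-memory trace. */
    static boolean failedStoreRejectsStart() {
        ContainerManager cm = new ContainerManager(appId -> {
            throw new IOException("state store unavailable");
        });
        try {
            cm.startContainer("application_1");
            return false; // the start request should not have succeeded
        } catch (IOException expected) {
            return cm.activeApps.isEmpty();
        }
    }

    public static void main(String[] args) {
        System.out.println("start rejected cleanly: " + failedStoreRejectsStart());
    }
}
```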

If finishApplication fails then the NM will proceed as if it had succeeded, but 
the state store will still have the application present.  This should be 
corrected when the NM restarts and registers with the RM, reporting those 
applications as still running.  The RM should correct the situation by telling 
the NM that the application has finished (see YARN-1885), and the NM will 
proceed to perform application finish processing (e.g.: log aggregation).  I 
think worst-case it will upload all of the app container logs again, but the 
rename to the final destination name will then fail because that name already 
exists.  Thus there could be some wasted work, but it should sort itself out 
and not do something catastrophic.
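
The "upload again, then fail on rename" case can be sketched as below.  This 
is only an analogy: it uses the local filesystem through java.nio rather than 
the NM's real log aggregation path, and the class name, the uploadLogs helper 
and the ".tmp" suffix are assumptions made for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

class DuplicateAggregationSketch {
    /**
     * Writes the logs under a temporary name, then renames to the final name.
     * Returns true if this call published the file, false if one already existed.
     */
    static boolean uploadLogs(Path dir, String appId, String logs) throws IOException {
        Path tmp = dir.resolve(appId + ".tmp");
        Path dest = dir.resolve(appId);
        Files.write(tmp, logs.getBytes(StandardCharsets.UTF_8));
        try {
            // Without REPLACE_EXISTING, Files.move fails if the target exists.
            Files.move(tmp, dest);
            return true;
        } catch (FileAlreadyExistsException alreadyAggregated) {
            // The second upload was wasted work; discard it and keep the first copy.
            Files.delete(tmp);
            return false;
        }
    }

    /** Aggregates the same application twice, as could happen after a restart. */
    static boolean[] aggregateTwice() {
        try {
            Path dir = Files.createTempDirectory("agg-sketch");
            boolean first = uploadLogs(dir, "application_1", "container logs");
            boolean second = uploadLogs(dir, "application_1", "container logs");
            return new boolean[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] published = aggregateTwice();
        System.out.println("first published: " + published[0]
            + ", second published: " + published[1]);
    }
}
```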

If removeApplication fails then the NM will proceed as if it had succeeded, but 
the state store will still have the application present.  This should be 
corrected when the NM next finishes application processing (per above, or if it 
was already recorded as finished), at which point it will again try to remove 
the application from the state store.  As above there could be some unnecessary 
work performed, but in the end the application should be removed from the NM on 
restart.  It could still remain in the state store if the second removal also 
fails, but each subsequent restart should retry the removal in the same way.
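
A small simulation of that retry-on-restart behavior, with hypothetical names 
throughout (FlakyStore, finishProcessing and restart are invented for the 
example and are not NM classes or methods):

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

class RemoveRetrySketch {
    /** Hypothetical in-memory state store whose removals can be made to fail. */
    static class FlakyStore {
        final Set<String> apps = new HashSet<>();
        int failuresLeft;

        void removeApplication(String appId) throws IOException {
            if (failuresLeft > 0) {
                failuresLeft--;
                throw new IOException("remove failed");
            }
            apps.remove(appId);
        }
    }

    /** Best-effort removal: a failure is swallowed and the NM moves on. */
    static void finishProcessing(FlakyStore store, String appId) {
        try {
            store.removeApplication(appId);
        } catch (IOException e) {
            // Proceed as if the removal succeeded; the entry stays in the store.
        }
    }

    /** Simulated restart: every app still in the store is finished again. */
    static void restart(FlakyStore store) {
        for (String appId : new HashSet<>(store.apps)) {
            finishProcessing(store, appId);
        }
    }

    /** Number of apps left in the store after the given failures and restarts. */
    static int appsLeftAfter(int failures, int restarts) {
        FlakyStore store = new FlakyStore();
        store.apps.add("application_1");
        store.failuresLeft = failures;
        finishProcessing(store, "application_1");
        for (int i = 0; i < restarts; i++) {
            restart(store);
        }
        return store.apps.size();
    }

    public static void main(String[] args) {
        System.out.println("left after 1 failure, 0 restarts: " + appsLeftAfter(1, 0));
        System.out.println("left after 1 failure, 1 restart:  " + appsLeftAfter(1, 1));
    }
}
```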

bq. Do we need a special warning if deserializing the credentials fails here?

I'm not sure how credential processing is fundamentally all that different from 
protocol buffer parsing which could also fail.  If the credentials can't be 
read then we can't recover the application.  Currently recovery errors are 
fatal to NM startup.  Do you have something specific in mind for handling the 
credentials if the writable changes (e.g.: some pseudocode to show the 
approach)?
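
For reference, the current "any recovery error is fatal" behavior looks 
roughly like the sketch below.  It illustrates the status quo only, not a 
proposed change, and the length-prefixed record layout, class name and method 
names are assumptions made for the example rather than the real credential 
format.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

class FatalRecoverySketch {
    /** Hypothetical record layout: a 4-byte length followed by that many bytes. */
    static byte[] readCredentials(byte[] stored) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(stored))) {
            int len = in.readInt(); // throws EOFException if fewer than 4 bytes
            if (len < 0 || len > stored.length - 4) {
                throw new IOException("corrupt credential record");
            }
            byte[] creds = new byte[len];
            in.readFully(creds);
            return creds;
        }
    }

    /** Recovery stops at the first unreadable record, so startup fails as a whole. */
    static boolean startupSucceeds(byte[]... storedApps) {
        try {
            for (byte[] stored : storedApps) {
                readCredentials(stored);
            }
            return true;
        } catch (IOException e) {
            return false; // any recovery error is treated as fatal
        }
    }

    public static void main(String[] args) {
        byte[] good = {0, 0, 0, 2, 1, 2};
        byte[] truncated = {0, 0, 0, 5, 1};
        System.out.println("good only: " + startupSucceeds(good));
        System.out.println("good + truncated: " + startupSucceeds(good, truncated));
    }
}
```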

> Recover applications upon nodemanager restart
> ---------------------------------------------
>
>                 Key: YARN-1354
>                 URL: https://issues.apache.org/jira/browse/YARN-1354
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1354-v1.patch, 
> YARN-1354-v2-and-YARN-1987-and-YARN-1362.patch, YARN-1354-v3.patch, 
> YARN-1354-v4.patch, YARN-1354-v5.patch
>
>
> The set of active applications in the nodemanager context needs to be 
> recovered for work-preserving nodemanager restart.



--
This message was sent by Atlassian JIRA
(v6.2#6252)