[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788978#comment-13788978
 ] 

Zhijie Shen commented on MAPREDUCE-5547:
----------------------------------------

When an AM shuts down, it has the following work to do:

1. Finish OutputCommitter
2. Move the history file to AHS (may be moved to after unregister in this Jira)
3. Unregister
4. Delete staging dir
5. Send end job notifier
6. The implicit step of returning the final state to the client

Ideally, the six steps should be consistent. However, each step may fail, and 
it does not seem possible to make them a transaction that succeeds or fails as 
a whole. Nevertheless, IMHO, we should do as much as we can to keep the steps 
consistent with each other.

Among the six steps, the most critical one is unregistration (correct me if I'm 
wrong), because it is the only step that syncs with the RM. The most harmful 
outcome is for the AM and the RM to disagree on the conclusion of the 
application. For this reason, unregister should be considered the principal 
step, and how the other steps behave should depend on its result. Therefore, 
IMHO, unregister should be the first step to complete. On unregistration 
success, the following steps execute their ordinary logic, while on 
unregistration failure, the following steps handle the exception (e.g. not 
moving the job history file, not sending the job end notification, etc.).
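To make the proposed ordering concrete, here is a minimal sketch, not the 
actual AM code. The class, interface and step names are all hypothetical. It 
leaves out the OutputCommitter step, and skipping the staging dir deletion on 
unregistration failure is my assumption rather than something stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: none of these names come from the MapReduce AM code.
class ShutdownSketch {

    /** Stand-in for the call that unregisters the AM from the RM. */
    interface Unregistrar {
        boolean unregister() throws Exception;
    }

    /** Runs the steps in the proposed order and returns the steps that ran. */
    static List<String> shutdown(Unregistrar rm) {
        List<String> executed = new ArrayList<>();

        // Unregistration goes first; its result drives everything after it.
        boolean unregistered;
        try {
            unregistered = rm.unregister();
        } catch (Exception e) {
            unregistered = false;
        }
        executed.add(unregistered ? "unregister:ok" : "unregister:failed");

        if (unregistered) {
            // Ordinary logic: the RM agrees that the application is done.
            executed.add("moveHistoryFile");
            executed.add("deleteStagingDir");
            executed.add("sendJobEndNotification");
        }
        // On failure: no history move and no job end notification.
        // Keeping the staging dir in that case is an assumption of this sketch.
        return executed;
    }

    public static void main(String[] args) {
        System.out.println(shutdown(() -> true));
        System.out.println(shutdown(() -> { throw new Exception("RM unreachable"); }));
    }
}
```

The point of the sketch is only that the branch on the unregistration result 
comes before any step whose effect is visible outside the AM.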

As [~jlowe] mentioned, moving the job history file may fail. That's right, but 
the failure is independent of whether the move happens before or after 
unregistration. Currently, the job history file is moved before unregistration. 
If the move fails, unregistration will not be invoked, and the application may 
be concluded as FAILED. This is not reasonable. Similarly, no step other than 
unregistration should be the reason for failing an application. Their failures 
should be isolated, so that the AM can proceed to the end.
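One way to read "isolated" is sketched below, under the assumption that each 
non-critical step can be wrapped in a small runner that records the failure 
instead of propagating it. IsolatedSteps, Step and runIsolated are hypothetical 
names, not existing Hadoop classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: runs a step and records its failure instead of rethrowing.
class IsolatedSteps {

    /** A shutdown step that may fail. */
    interface Step {
        void run() throws Exception;
    }

    private final List<String> failures = new ArrayList<>();

    /** Runs one step; an exception is recorded and does not stop the caller. */
    void runIsolated(String name, Step step) {
        try {
            step.run();
        } catch (Exception e) {
            failures.add(name + ": " + e.getMessage());
        }
    }

    /** Names and messages of the steps that failed so far. */
    List<String> failures() {
        return failures;
    }

    public static void main(String[] args) {
        IsolatedSteps steps = new IsolatedSteps();
        steps.runIsolated("moveHistoryFile", () -> { throw new Exception("move failed"); });
        steps.runIsolated("sendJobEndNotification", () -> System.out.println("notified"));
        System.out.println(steps.failures());
    }
}
```

With this shape a failed history move is logged in failures() while the 
notification step still runs, so the AM reaches the end of its shutdown.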

To sum up, IMHO, unregistration should be completed first, and be the step that 
determines the final state of the application. Given the result of 
unregistration, the other steps decide what they should do, and the client sees 
the final state. The other steps may or may not fail, but any failure should be 
isolated. If none of the steps fails (which I guess is the most common case), 
the final state is consistent across all channels. If one step fails, it 
impacts only one part.

Moreover, I'm not sure whether we'd like to add one more state for the AM: 
unregistering. The job would move to unregistering before calling unregister, 
and then move to the final state after all the steps have been gone through.
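A rough sketch of what such an extra state could look like follows. It is a 
standalone toy, not the job state machine in the AM, and every name in it 
(AmStateSketch, State, beginUnregister, finish) is hypothetical.

```java
// Hypothetical toy state holder with an intermediate UNREGISTERING state.
class AmStateSketch {

    enum State { RUNNING, UNREGISTERING, SUCCEEDED, FAILED, KILLED }

    private State state = State.RUNNING;

    State state() {
        return state;
    }

    /** Called right before the unregister call is sent to the RM. */
    void beginUnregister() {
        if (state != State.RUNNING) {
            throw new IllegalStateException("cannot unregister from " + state);
        }
        state = State.UNREGISTERING;
    }

    /** Called after all shutdown steps have been gone through. */
    void finish(State finalState) {
        if (state != State.UNREGISTERING) {
            throw new IllegalStateException("cannot finish from " + state);
        }
        if (finalState == State.RUNNING || finalState == State.UNREGISTERING) {
            throw new IllegalArgumentException(finalState + " is not a final state");
        }
        state = finalState;
    }

    public static void main(String[] args) {
        AmStateSketch am = new AmStateSketch();
        am.beginUnregister();
        am.finish(State.SUCCEEDED);
        System.out.println(am.state());
    }
}
```

The intermediate state means the client never sees a final state before the 
unregister call has been attempted.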


> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5547
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>




