[
https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788978#comment-13788978
]
Zhijie Shen commented on MAPREDUCE-5547:
----------------------------------------
When an AM shuts down, it performs the following work:
1. Finish OutputCommitter
2. Move the history file to AHS (possibly moved to after unregistration in this JIRA)
3. Unregister
4. Delete staging dir
5. Send end job notifier
6. The implicit step of returning the final state to the client
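A minimal sketch of the current ordering, assuming the steps run strictly in sequence and the first failure stops the rest. The step names and `shutdown_current` are hypothetical labels, not MapReduce AM code. It shows why a failure in step 2 means step 3, unregister, is never reached:

```python
from typing import List, Optional, Tuple

# Hypothetical labels for the six steps, in the order listed above.
SHUTDOWN_STEPS = [
    "commit_output",       # 1. finish the OutputCommitter
    "move_history",        # 2. move the history file
    "unregister",          # 3. unregister from the RM
    "delete_staging_dir",  # 4. delete the staging dir
    "notify_job_end",      # 5. send the job end notification
    "report_final_state",  # 6. implicit: the client sees the final state
]


def shutdown_current(failing_step: Optional[str] = None) -> Tuple[List[str], Optional[str]]:
    """Run the steps in order; the first failure stops the sequence.

    Returns the steps that completed and the step that failed (if any).
    """
    executed: List[str] = []
    for step in SHUTDOWN_STEPS:
        if step == failing_step:
            return executed, step  # every later step is skipped
        executed.append(step)
    return executed, None


print(shutdown_current(failing_step="move_history"))
# (['commit_output'], 'move_history')
```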
Ideally, the outcomes of the six steps should be consistent. However, each step
may fail, and it does not seem possible to make them a single transaction that
either succeeds or fails as a whole. Nevertheless, IMHO, we should do as much as
we can to keep the steps consistent with each other.
Among the six steps, the most critical one is unregistration (correct me if I'm
wrong), because it is the only step that syncs with the RM. The most harmful
outcome is for the AM and the RM to reach different conclusions about the
application. For this reason, unregistration should be treated as the principal
step, and the behavior of the other steps should depend on its result.
Therefore, IMHO, unregistration should be the first step to complete. If
unregistration succeeds, the following steps execute their ordinary logic; if it
fails, the following steps handle the exception (e.g., not moving the job
history file, not sending the job end notification, etc.).
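A sketch of the proposed ordering, assuming a hypothetical `unregister` callable that returns whether the RM accepted the unregistration. It covers only the steps that follow unregistration, the action names are illustrative, and the failure branch reflects only the two examples given above (no history move, no job end notification):

```python
from typing import Callable, List


def shutdown_unregister_first(unregister: Callable[[], bool]) -> List[str]:
    """Unregister first, then let its result drive the remaining steps.

    Hypothetical sketch: returns the names of the actions taken after
    unregistration instead of performing real work.
    """
    if unregister():
        # Ordinary path: the RM has recorded the application's conclusion.
        return ["move_history", "delete_staging_dir",
                "notify_job_end", "report_final_state"]
    # Exception path: do not publish the history file or send the job end
    # notification. Staging dir handling on this path is not modeled here.
    return ["skip_history_move", "skip_job_end_notification",
            "report_unregister_failure"]


print(shutdown_unregister_first(lambda: True))
print(shutdown_unregister_first(lambda: False))
```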
As [~jlowe] mentioned, moving the job history file may fail. That's right, but
the move can fail regardless of whether it happens before or after
unregistration. Currently, the job history file is moved before unregistration.
If the move fails, unregistration is never invoked, and the application may be
concluded as FAILED. This is not reasonable. Similarly, no step other than
unregistration should be the reason an application fails. Failures of the other
steps should be isolated, so that the AM can proceed to the end.
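One way to isolate the failures of the non-critical steps is to run each one in its own try/except, log the error, and keep going. A minimal sketch; `run_isolated` and the step names are hypothetical:

```python
import logging
from typing import Callable, Dict, List, Tuple

log = logging.getLogger("am-shutdown")


def run_isolated(steps: List[Tuple[str, Callable[[], None]]]) -> Dict[str, bool]:
    """Run every step even if earlier ones fail; record success per step."""
    results: Dict[str, bool] = {}
    for name, step in steps:
        try:
            step()
            results[name] = True
        except Exception:
            # Log and continue: one failed step must not stop the others
            # or change the final state already decided by unregistration.
            log.exception("shutdown step %s failed", name)
            results[name] = False
    return results


def failing_move() -> None:
    raise OSError("could not move history file")


results = run_isolated([
    ("move_history", failing_move),
    ("delete_staging_dir", lambda: None),
    ("notify_job_end", lambda: None),
])
print(results)
# {'move_history': False, 'delete_staging_dir': True, 'notify_job_end': True}
```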
To sum up, IMHO, unregistration should be completed first and should determine
the final state of the application. Given the result of unregistration, the
other steps decide what to do, and the client sees the final state. The other
steps may or may not fail, but their failures should be isolated. If, fortunately,
none of the steps fails (which I guess should be the common case), the final
state is consistent across every channel. If one step fails, it only impacts
one part.
Moreover, I'm not sure whether we'd like to add one more state for the AM:
UNREGISTERING. The job would move to UNREGISTERING before unregister is called,
and then move to its final state after all the steps have gone through.
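A sketch of the suggested extra state, assuming a simplified AM state set (`RUNNING`, `UNREGISTERING`, `SUCCEEDED`, `FAILED`) that is hypothetical and much smaller than the real MapReduce job state machine. The final state can only be entered from UNREGISTERING:

```python
from enum import Enum


class AmState(Enum):
    RUNNING = "RUNNING"
    UNREGISTERING = "UNREGISTERING"  # proposed intermediate state
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


# Legal transitions: the final state is entered only from UNREGISTERING.
ALLOWED = {
    AmState.RUNNING: {AmState.UNREGISTERING},
    AmState.UNREGISTERING: {AmState.SUCCEEDED, AmState.FAILED},
    AmState.SUCCEEDED: set(),
    AmState.FAILED: set(),
}


class AmStateMachine:
    def __init__(self) -> None:
        self.state = AmState.RUNNING

    def transition(self, target: AmState) -> None:
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}")
        self.state = target


am = AmStateMachine()
am.transition(AmState.UNREGISTERING)  # before calling unregister
# ... unregister and the remaining shutdown steps would run here ...
am.transition(AmState.SUCCEEDED)      # after all steps have gone through
print(am.state.name)  # SUCCEEDED
```

Rejecting a direct RUNNING to SUCCEEDED transition is what keeps the client from seeing a final state before the shutdown steps have finished.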
> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
> Key: MAPREDUCE-5547
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
> Project: Hadoop Map/Reduce
> Issue Type: Sub-task
> Reporter: Zhijie Shen
> Assignee: Zhijie Shen
>
--
This message was sent by Atlassian JIRA
(v6.1#6144)