[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13787380#comment-13787380
 ] 

Vinod Kumar Vavilapalli commented on MAPREDUCE-5547:
----------------------------------------------------

We finally sat down and reasoned about all things in MR App that are broken 
because of various race conditions during RM restart. While fixing that is a 
larger effort, looking at this specific problem, I think we should stick to the 
invariant that if RM sees the app as failed, we should make sure clients also 
see the same.

In cases where the AM has successfully finished the 'job' but failed to 
unregister or failed to write the history file (we ran into this too), the client 
will still see the job as running until the last attempt. And if the last attempt 
also fails for the same reason, it sees the job as failed. In corner cases, we 
will lose work, but that's better than clients struggling with successful jobs 
that have no history files or have failure information on the RM.

That said, to really fix this issue, we should change the order of things in 
the AM - unregister should be the first thing to happen. Previously we 
moved JobHistory flush/close to be before unregister because we didn't have the AM 
grace period that we have now. Given that we now have the AM grace period, we can 
do the following:
 - Flush history events & close the current history file
 - unregister
 - If unregister fails, don't do anything more, irrespective of whether 
this is the last retry or not. This is done in MAPREDUCE-5562.
 - Otherwise, if this is not the last retry, then
    -- let the client loop (safeTermination flag)
    -- Don't copy the history file
    -- Don't send the job-end notification
    -- Don't delete the staging directory
    -- Exit
 - Otherwise, this is the last retry
    -- copy the history file to intermediate done directory
    -- send the job-end notification
    -- let the client know job succeeded/failed/killed
    -- remove the staging directory.
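
The ordering above can be sketched as a small decision routine. This is only 
an illustration, not the actual MR AM code: the class name, method name, and 
step labels are made up for this example, and the two boolean parameters stand 
in for whatever the AM really uses to detect unregister failure and the last 
retry.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; none of these names exist in Hadoop. */
public class AmShutdownSketch {

    /**
     * Returns, in order, the steps the AM would perform on shutdown.
     * Both parameters are hypothetical stand-ins for the real checks.
     */
    public static List<String> shutdownSteps(boolean unregisterSucceeds,
                                             boolean isLastRetry) {
        List<String> steps = new ArrayList<>();

        // Always done first: flush events and close the current history file.
        steps.add("flush-and-close-history");

        // Then attempt to unregister from the RM.
        steps.add("unregister");
        if (!unregisterSucceeds) {
            // Unregister failed: do nothing more, last retry or not.
            return steps;
        }

        if (!isLastRetry) {
            // Not the last retry: let the client keep looping and exit,
            // without copying history, notifying, or deleting staging.
            steps.add("let-client-loop");
            steps.add("exit");
            return steps;
        }

        // Last retry: publish the results.
        steps.add("copy-history-to-intermediate-done-dir");
        steps.add("send-job-end-notification");
        steps.add("report-final-state-to-client");
        steps.add("remove-staging-dir");
        return steps;
    }

    public static void main(String[] args) {
        // Print the step sequence for each of the three outcomes.
        System.out.println(shutdownSteps(false, true));
        System.out.println(shutdownSteps(true, false));
        System.out.println(shutdownSteps(true, true));
    }
}
```

The point of the sketch is that the history file is copied and the staging 
directory removed only on the one path where unregister succeeded and this is 
the last retry.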

> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5547
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>




--
This message was sent by Atlassian JIRA
(v6.1#6144)
