[ https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915015#comment-13915015 ]

Jason Lowe commented on MAPREDUCE-5547:
---------------------------------------

The client can still miss the history, and the realProxy cache only applies to 
existing clients.  Here's the scenario:

- Job finishes and unregisters with the RM, then begins copying the history file
to done_intermediate
- While that occurs, a client comes along to check the job's counters.  To do
this, it must first ask the RM for the job state to decide whether it should
contact the AM or the history server.
- RM reports the job has finished, so the client goes to the history server
- History server doesn't have the file yet because the AM hasn't finished
copying it
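The race above can be sketched as a small, single-threaded Java model. This is a hypothetical illustration, not Hadoop code: `Step`, `clientMissesHistory` and `hasRaceWindow` are invented names, the history copy is treated as one atomic step, and the model assumes the RM only sends the client to the history server once the AM has unregistered. It checks every point at which a client could poll and reports whether that client would be redirected to a history server that does not have the file yet.

```java
import java.util.List;

/**
 * Hypothetical, single-threaded model of the unregister/copy race.
 * Step, clientMissesHistory and hasRaceWindow are illustrative names,
 * not Hadoop APIs.
 */
public class HistoryRaceModel {
    // The two AM shutdown actions whose relative order is under discussion.
    enum Step { UNREGISTER_WITH_RM, COPY_HISTORY_TO_DONE_INTERMEDIATE }

    /**
     * Returns true if a client that polls after the first {@code completed}
     * steps is sent to the history server (the RM already reports the job as
     * finished) while the history file is not there yet.
     */
    static boolean clientMissesHistory(List<Step> order, int completed) {
        boolean unregistered = false;
        boolean historyCopied = false;
        for (int i = 0; i < completed; i++) {
            if (order.get(i) == Step.UNREGISTER_WITH_RM) {
                unregistered = true;
            } else {
                historyCopied = true;
            }
        }
        // Before unregistering, the RM still points the client at the running AM.
        return unregistered && !historyCopied;
    }

    /** Returns true if any polling point in this ordering exposes the race. */
    static boolean hasRaceWindow(List<Step> order) {
        for (int k = 0; k <= order.size(); k++) {
            if (clientMissesHistory(order, k)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Step> unregisterFirst =
            List.of(Step.UNREGISTER_WITH_RM, Step.COPY_HISTORY_TO_DONE_INTERMEDIATE);
        List<Step> copyFirst =
            List.of(Step.COPY_HISTORY_TO_DONE_INTERMEDIATE, Step.UNREGISTER_WITH_RM);
        // prints "unregister first, race window: true"
        System.out.println("unregister first, race window: " + hasRaceWindow(unregisterFirst));
        // prints "copy first, race window: false"
        System.out.println("copy first, race window: " + hasRaceWindow(copyFirst));
    }
}
```

In this model, ordering the copy before the unregister leaves no polling point where the client is sent to the history server early, because until the unregister the RM keeps directing the client to the AM.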

Besides this race, my other concern is that we're piling up tasks that are
important to the user, namely providing history, in the non-fault-tolerant
portion of the job.  Copying the history file is an operation that can take
substantial time (e.g., a slow datanode), and the AM can fail before or during
that operation.  If we do this after we unregister, the RM will not retry and
there will be no history.  If we do it before we unregister and the AM fails,
the RM will retry; the new attempt will realize there's nothing left to do but
resume copying the history over to the history server, so we have some fault
tolerance there.
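The fault-tolerance argument can likewise be sketched as a hypothetical Java model. Again this is not Hadoop code: `Step`, `historyDelivered` and `alwaysDelivered` are invented names, a failure during the copy is modeled as a failure before the copy step completes, and the model assumes the RM launches another AM attempt only if the previous attempt failed before unregistering, and allows at least one retry. It tries every failure point of the first attempt and asks whether the history still ends up in done_intermediate.

```java
import java.util.List;

/**
 * Hypothetical model of the retry argument. Step, historyDelivered and
 * alwaysDelivered are illustrative names, not Hadoop APIs.
 * Assumption: the RM starts another AM attempt only if the previous
 * attempt failed before unregistering.
 */
public class HistoryRetryModel {
    enum Step { COPY_HISTORY, UNREGISTER_WITH_RM }

    /**
     * Runs the AM's final steps. The first attempt fails just before step
     * {@code failBefore} (pass order.size() for no failure); any later attempt
     * runs to completion. Returns true if the history reaches done_intermediate.
     */
    static boolean historyDelivered(List<Step> order, int failBefore, int maxAttempts) {
        boolean historyCopied = false;
        boolean unregistered = false;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            int stepsRun = (attempt == 1) ? failBefore : order.size();
            for (int i = 0; i < stepsRun; i++) {
                // Setting a flag that is already set changes nothing, so re-running
                // a finished step is harmless; only the remaining work has an effect.
                if (order.get(i) == Step.COPY_HISTORY) {
                    historyCopied = true;
                } else {
                    unregistered = true;
                }
            }
            boolean completed = stepsRun == order.size();
            if (completed || unregistered) {
                // Once the AM has unregistered, the RM will not start another attempt.
                break;
            }
        }
        return historyCopied;
    }

    /** Returns true if history is delivered wherever the first attempt fails (one retry allowed). */
    static boolean alwaysDelivered(List<Step> order) {
        for (int failBefore = 0; failBefore <= order.size(); failBefore++) {
            if (!historyDelivered(order, failBefore, 2)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Step> unregisterFirst = List.of(Step.UNREGISTER_WITH_RM, Step.COPY_HISTORY);
        List<Step> copyFirst = List.of(Step.COPY_HISTORY, Step.UNREGISTER_WITH_RM);
        // prints "unregister first, always delivered: false"
        System.out.println("unregister first, always delivered: " + alwaysDelivered(unregisterFirst));
        // prints "copy first, always delivered: true"
        System.out.println("copy first, always delivered: " + alwaysDelivered(copyFirst));
    }
}
```

Under these assumptions, the only losing case is a failure after the unregister but before the copy completes, which can happen only when the unregister comes first.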

> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5547
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)