[ https://issues.apache.org/jira/browse/MAPREDUCE-5547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13915015#comment-13915015 ]
Jason Lowe commented on MAPREDUCE-5547:
---------------------------------------

The client can still miss the history, and the realProxy cache only applies to existing clients. Here's the scenario:

- Job finishes and unregisters with the RM, then begins copying the history file to done_intermediate
- While that occurs, a client comes along to check the counters of the job. To do this, it must first contact the RM to check the job state and see whether it should contact the AM or the history server.
- RM reports the job has finished, so the client goes to the history server
- History server doesn't have the file yet since the AM hasn't completed copying it

Besides this race, my other concern is that we're piling up tasks that are important to the user, namely providing history, in the non-fault-tolerant portion of the job. Copying the history file is an operation that can take substantial time (e.g., a slow datanode), and the AM can fail before or during that operation. If we do this after we unregister, then the RM will not retry and there will be no history. If we do it before we unregister, then if the AM fails the RM will retry it, the retry will realize there's nothing left to do but resume attempting to copy the history over to the history server, and we have some fault tolerance there.

> Job history should not be flushed to JHS until AM gets unregistered
> -------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5547
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5547
>             Project: Hadoop Map/Reduce
>          Issue Type: Sub-task
>            Reporter: Zhijie Shen
>            Assignee: Zhijie Shen
>

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)