[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504646#comment-13504646
 ] 

Jason Lowe commented on MAPREDUCE-4819:
---------------------------------------

bq. Maybe final client notification should be the last thing after all post 
processing is done.

No, moving the client notification later just creates a different set of 
problems, like the client never being notified *at all* because the AM crashes 
after unregistering with the RM but before it notifies the client.  The RM 
won't restart the app because it unregistered successfully, but the client is 
never notified.

bq. In general it seems like we need to come up with a set of markers that 
previous AM's leave behind that can tell the next retry if the previous one 
failed/succeeded and so the current AM should exit or continue to run.

Exactly, and the AM is already doing this in the job history file, which is how 
it supports recovery.  We should extend this so that even if the output 
committer doesn't support recovery the AM will check for markers in the job 
history file and prevent the job from executing tasks and committing output if 
final job status has been determined by previous attempts.  That way we prevent 
the AM from re-committing job output or changing the final job status after 
notifying the client.  We just need to make sure those markers are flushed to 
persistent storage, where future AM attempts can locate them, before attempting 
to notify the client.  If subsequent attempts see the final job status marker 
then they should skip straight to the client notification process instead of 
running tasks.
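
To make the proposed ordering concrete, here is a minimal sketch of the idea, 
not the actual AM code.  It assumes a plain local file stands in for the job 
history file, and every name in it (FinalStatusMarkerSketch, MARKER_PREFIX, 
recordFinalStatus, findFinalStatus, startAttempt) is hypothetical and does not 
come from the Hadoop code base.  The marker is synced to storage before any 
client notification would happen, and a later attempt checks for it before 
deciding whether to run tasks.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Optional;

/** Hypothetical sketch; none of these names come from the Hadoop code base. */
public class FinalStatusMarkerSketch {

    /** Assumed prefix of the marker line in the stand-in history file. */
    public static final String MARKER_PREFIX = "FINAL_JOB_STATUS=";

    /** Persist the final status; call this BEFORE notifying the client. */
    public static void recordFinalStatus(Path historyFile, String status)
            throws IOException {
        String line = MARKER_PREFIX + status + System.lineSeparator();
        // SYNC requests that each write reach the storage device before
        // returning, so a crash right after this call cannot lose the marker.
        Files.write(historyFile, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE,
                StandardOpenOption.APPEND,
                StandardOpenOption.SYNC);
    }

    /** Return the first final-status marker left by an earlier attempt. */
    public static Optional<String> findFinalStatus(Path historyFile)
            throws IOException {
        if (!Files.exists(historyFile)) {
            return Optional.empty();
        }
        for (String line : Files.readAllLines(historyFile, StandardCharsets.UTF_8)) {
            if (line.startsWith(MARKER_PREFIX)) {
                // First marker wins: the final status must never change.
                return Optional.of(line.substring(MARKER_PREFIX.length()));
            }
        }
        return Optional.empty();
    }

    /** Decide what a new AM attempt should do when it starts. */
    public static String startAttempt(Path historyFile) throws IOException {
        Optional<String> previous = findFinalStatus(historyFile);
        if (previous.isPresent()) {
            // Skip tasks and output commit; go straight to client notification.
            return "NOTIFY_CLIENT:" + previous.get();
        }
        return "RUN_TASKS";
    }

    public static void main(String[] args) throws IOException {
        Path history = Files.createTempFile("job-history", ".log");
        System.out.println(startAttempt(history)); // first attempt, no marker
        recordFinalStatus(history, "SUCCEEDED");   // flushed before notifying
        System.out.println(startAttempt(history)); // re-attempt after a crash
        Files.delete(history);
    }
}
```

In this sketch the second call to startAttempt models an AM re-attempt that 
starts after the previous attempt recorded its status and then crashed.  It 
finds the marker and goes to client notification rather than rerunning the job.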

                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
