[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542278#comment-13542278
 ] 

Bikas Saha commented on MAPREDUCE-4819:
---------------------------------------

It would really help if you could elaborate on the solution a bit more. I think 
I get the gist (i.e., try to lock the commit using atomic file operations), but 
I am not clear beyond that part. We can quickly discuss the utility of both 
approaches after that. Perhaps you have already done that in your mind :)
The only thing I would like to guard against is linking the job commit 
operation with job completion where they can be independent. I agree that job 
commit is strictly needed before job completion. But making job commit the same 
as job completion may not be correct, e.g., there could be other 
post-completion operations that are unsafe to repeat (maybe none exist now), or 
perhaps multiple outputs to commit.
The patch posted earlier made sure that if a job has completed then it will be 
a no-op to run it again. It's a safe change. Also, it notifies the client about 
job success after making sure that the success state is persisted. I agree it 
does not handle errors in commit, which is perhaps what your patch is 
addressing. So it could be that both changes are needed.
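
To make the discussion concrete, here is a minimal sketch of how the two ideas 
could fit together, as I understand them: an atomically created marker guards 
the commit, and a separate success marker makes a rerun a no-op. This is only 
an illustration, not either patch. It uses java.nio on a local directory 
instead of the Hadoop FileSystem API, and the class name, the marker file names 
(COMMIT_STARTED, COMMIT_SUCCESS) and the Outcome values are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration only; not the MAPREDUCE-4819 patch or a Hadoop API. */
public class CommitGuardSketch {

    public enum Outcome { COMMITTED, ALREADY_COMMITTED, COMMIT_IN_DOUBT }

    // Marker names are assumptions made for this sketch.
    static final String STARTED = "COMMIT_STARTED";
    static final String DONE = "COMMIT_SUCCESS";

    /** Runs the commit at most once per staging directory. */
    public static Outcome commitOnce(Path stagingDir, Runnable commit) {
        try {
            // A previous attempt finished the commit: rerunning is a no-op.
            if (Files.exists(stagingDir.resolve(DONE))) {
                return Outcome.ALREADY_COMMITTED;
            }
            try {
                // createFile is an atomic check-and-create, so only one
                // attempt can ever pass this point.
                Files.createFile(stagingDir.resolve(STARTED));
            } catch (FileAlreadyExistsException e) {
                // Another attempt started a commit but never recorded
                // success; repeating the commit blindly would be unsafe.
                return Outcome.COMMIT_IN_DOUBT;
            }
            commit.run();
            // Persist success before anyone is told the job succeeded.
            Files.createFile(stagingDir.resolve(DONE));
            return Outcome.COMMITTED;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("staging");
        System.out.println(commitOnce(dir, () -> { }));  // first AM attempt
        System.out.println(commitOnce(dir, () -> { }));  // re-attempt
    }
}
```

Note that the sketch records only the commit. Job completion, and anything else 
that happens after the commit, would need its own persisted state so that the 
two stay independent.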
                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
