[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14997949#comment-14997949
 ] 

Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

Separate jira sounds good. We may need to add a new exception, thrown by the 
committer, that clearly signals that the failure can be retried (e.g. NN in 
safe mode) vs a fatal error (e.g. missing task files).
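Roughly what I have in mind (the exception name and its placement are only an 
illustration, not something in the patch):
{code}
import java.io.IOException;

// Illustration only: a marker exception the committer could throw when the
// commit failure is transient (e.g. NN in safe mode) so the AM knows a retry
// is safe, as opposed to a fatal error such as missing task files.
public class RetriableCommitException extends IOException {
  public RetriableCommitException(String message, Throwable cause) {
    super(message, cause);
  }
}
{code}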

About the patch.

{code}
+      // cleanup previous half done commits if committer supports repeatable job
+      // commit.
+      if (isCommitJobRepeatable()) {
+        cleanupInterruptedCommit(conf);
+      }
{code}
Is the above check too early in the code? E.g. IIRC, at this point we have not 
yet checked whether the previous job commit succeeded or failed - in which case 
we cannot recover and there is nothing to do. We should be doing this check and 
cleaning up files only when we detect an in-progress commit - i.e. the 
startCommit file is present and the finish/failedCommit file is not. Otherwise, 
with this patch we may end up redoing the commit for an already successful commit.
Is this the code we need to change?
{code}//The commit is still pending, commit error
          shutDownMessage =
              "Job commit from a prior MRAppMaster attempt is " +
              "potentially in progress. Preventing multiple commit executions";
          forcedState = JobStateInternal.ERROR;{code}
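Roughly, the decision should look something like the sketch below (marker file 
names follow the existing startCommitFile/endCommitSuccessFile/endCommitFailureFile 
convention in MRAppMaster; this is not the exact patch code):
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch only: decide whether the previous attempt's job commit was genuinely
// interrupted, based on the start/end commit marker files. Only in that case
// should cleanupInterruptedCommit() run and the commit be redone; a commit
// that already succeeded (or cleanly failed) must be left alone.
public class CommitRecoveryCheck {
  public static boolean isCommitInterrupted(FileSystem fs, Path startCommitFile,
      Path endCommitSuccessFile, Path endCommitFailureFile) throws IOException {
    boolean started = fs.exists(startCommitFile);
    boolean success = fs.exists(endCommitSuccessFile);
    boolean failed = fs.exists(endCommitFailureFile);
    return started && !success && !failed;
  }
}
{code}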

Also, we have already changed the startCommit operation to be repeatable via 
the overwrite flag. After that, is there still a need to delete the files 
upfront? Delete may be an expensive operation on some cloud stores.
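I.e. something like the following should make an upfront delete unnecessary 
(sketch only, assuming the marker is written through FileSystem#create and that 
fs/startCommitFile are in scope):
{code}
// create(path, true) overwrites any existing marker, so re-running startCommit
// needs no separate delete() first - which matters on object stores where
// delete can be slow.
FSDataOutputStream out = fs.create(startCommitFile, true /* overwrite */);
out.close();
{code}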

Mapred javadoc fixes are missing. There are also some typos in there, e.g.
{code}
+   * If repeatable job commit is supported, job restart can tolerant previous   <<< tolerate ?
+   * failures during job is committing.                                          <<< during job commit ?
{code}

This part of the code change could use some tests.

Tests for the repeatable success marker file, and for FileExistsException on 
repeatable deletes, would be good to have.
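For example, something along these lines (names and setup are made up, just to 
illustrate the kind of test):
{code}
// Illustration only: committing the same job twice simulates an AM retry after
// an interrupted commit. With repeatable commit support this must not fail on
// the already-existing _SUCCESS marker or on deletes of files already gone.
@Test
public void testRepeatedCommitJob() throws Exception {
  FileOutputCommitter committer = new FileOutputCommitter(outDir, jobContext);
  committer.setupJob(jobContext);
  committer.commitJob(jobContext);
  committer.commitJob(jobContext);  // second commit must succeed, not throw
}
{code}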


> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch, MAPREDUCE-5485-v2.patch, MAPREDUCE-5485-v3.1.patch, 
> MAPREDUCE-5485-v3.patch
>
>
> There is a chance that the MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit due to container expiry. 
> In these cases the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
