[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14991486#comment-14991486
 ] 

Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

I dont think AM should reboot to retry after commit failure. Commit repeatable 
implies that a commit can be retried if it was stopped while in progress. If 
the commit actually failed then there does not seem any reason to assume that 
retrying it will succeed. IMO if the commit reports a failure then AM should 
fail. Similarly, if a commit failure file exists (from a previous AM version) 
then the new version of the AM should respect that and fail since the commit 
has been reported to be failed.

So, IMO, we are looking to change the code here. Thoughts?
{code}          //The commit is still pending, commit error
          shutDownMessage =
              "Job commit from a prior MRAppMaster attempt is " +
              "potentially in progress. Preventing multiple commit executions";
          forcedState = JobStateInternal.ERROR;
{code}

Javadoc could be improved. Inline
{code}
   /**
+   * Is repeatable job commit supported so that job commit can be repeated 
again
+   * from previous failures. <<<< Returns true if an in-progress job commit 
can be retried. If the MR AM is re-run then it will check this value to 
determine if it can retry an in-progress commit that was started by a previous 
version. Note that in rare scenarios, the previous AM version might still be 
running at that time, due to system anomalies. Hence if this method returns 
true then the retry commit operation should be able to run concurrently with 
the previous operation.
+   *
+   * If repeatable job commit is supported, job restart can tolerant previous 
+   * failures during job is committing.
+   *
+   * By default, it is not supported. Extended classes (like: 
+   * FileOutputCommitter) should explicitly override it if provide support. << 
if they support it
+   *
+   * @param jobContext
+   *          Context of the job whose output is being written.
+   * @return <code>true</code> repeatable job commit is supported,
+   *         <code>false</code> otherwise
+   */
{code}

num-retries instead of retries? Also, if its num-retries then default should be 
0. If its num-attempts then default should be 1.
{code}+  public static final String FILEOUTPUTCOMMITTER_FAILURE_RETRIES =
+      "mapreduce.fileoutputcommitter.failures.retry";{code}

Retry count checking code in the catch block subsumes the check retry count 
check in the while block?
{code}+    while (jobCommitNotFinished && (retries++ < retriesOnFailure)) {
...
+      } catch (Exception e) {
+        if (retries >= retriesOnFailure) {{code}

The previous operation could delete the path after the if check has succeeded. 
So we probably also need to catch FileNotFoundException exception class here 
and ignore it if repeatableCommit is true.
{code}+      // if job allow repeatable commit and pendingJobAttemptsPath could 
be
+      // deleted by previous AM, we do nothing here.
+      if (isCommitJobRepeatable(context) && 
!fs.exists(pendingJobAttemptsPath)) {
+        return;
+      }{code}

Do testcases need an @Test annotation?
{code}+  public void testCommitterWithFailureV1() throws Exception {
+    testCommitterWithFailureInternal(1);{code}

firstTimeFail is probably a more clear name for what its doing - failing on the 
first attempt.
{code}+    boolean firstTimeCommit = true;{code}

Would be good to have a test that version 2 and retry = 1 will fail also.

Testcases missing for specific changes in FileOutputCommitter for create/delete 
operation changes?


> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> There are chances MRAppMaster crush during job committing,or NodeManager 
> restart cause the committing AM exit due to container expire.In these cases 
> ,the job will fail.
> However,some jobs can redo commit so failing the job becomes unnecessary.
> Let clients tell AM to allow redo commit or not is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to