[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14993559#comment-14993559
 ] 

Bikas Saha commented on MAPREDUCE-5485:
---------------------------------------

bq. I don't think so. Can you take a look at it again?
Inline
{code}
+    while (jobCommitNotFinished && (retries++ < retriesOnFailure)) {
+      try {
+        commitJobInternal(context);
+        jobCommitNotFinished = false;
+      } catch (Exception e) {
+        if (retries >= retriesOnFailure) { // <<<<< doing ++retries here can remove code duplication for the < check in the while?
+          throw e;
+        } else {
+          LOG.warn("Exception get thrown in job commit, retry (" + retries +
+              ") time.", e);
+        }
+      }
+    }
{code}
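Folding the increment into the catch, so the retriesOnFailure bound is checked in only one place, might look roughly like this; a minimal self-contained sketch, not the actual patch, with commitJobInternal replaced by a stand-in that fails twice before succeeding (context, LOG, and the real committer are omitted):

```java
// Sketch of the retry loop with ++retries moved into the catch block,
// removing the duplicated "< retriesOnFailure" check from the while.
// commitJobInternal here is a stand-in that fails twice, then succeeds.
public class CommitRetry {
    static int attempts = 0;

    static void commitJobInternal() throws Exception {
        attempts++;
        if (attempts < 3) {
            throw new Exception("transient commit failure");
        }
    }

    public static void main(String[] args) throws Exception {
        final int retriesOnFailure = 3;
        int retries = 0;
        boolean committed = false;
        while (!committed) {
            try {
                commitJobInternal();
                committed = true;
            } catch (Exception e) {
                if (++retries >= retriesOnFailure) {
                    throw e;  // out of retries: rethrow and fail the job
                }
                System.out.println("commit failed, retry " + retries);
            }
        }
        System.out.println("committed after " + attempts + " attempts");
    }
}
```

This keeps the same bound as the original (at most retriesOnFailure attempts) while testing the limit in a single place.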

bq. There are still reasons related to AM specifics, e.g. the previous AM cannot connect to the FS (HDFS or another cloud FS), committer misbehavior because the wrong class got loaded (due to classpath or another defect), etc. I think it makes sense to make a best effort to retry on commit failure (like other causes of AM failure), given that the commit is repeatable and all tasks are done successfully.
Sure. But then for such cases commitIsRepeatable may not be strictly needed. 
Even for a non-repeatable committer, if there is a classpath issue (which could 
be fixed by retrying the AM) then the AM should retry, right? The scope of 
that change seems related to this one, but is perhaps large enough to deserve 
its own jira as a follow-up. E.g. if the committer has already written a failed 
file, then the commit has failed for good. Maybe we need an extension or an API 
exception that lets us know whether the committer error was fatal or non-fatal, 
and write a retry/failed file based on that?
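One way the fatal/non-fatal distinction floated above could be expressed is a dedicated exception type that the AM checks before deciding which marker to write. This is purely a hypothetical sketch: FatalCommitException and markerFor are made-up names, not part of the OutputCommitter API:

```java
// Hypothetical sketch: a committer throws FatalCommitException when the
// failure is unrecoverable (e.g. it already wrote a failed file), and the
// AM picks a retry vs. failed marker based on the exception type.
public class CommitOutcome {
    static class FatalCommitException extends Exception {
        FatalCommitException(String msg) { super(msg); }
    }

    // Which marker file would the AM write for this commit error?
    static String markerFor(Exception e) {
        return (e instanceof FatalCommitException) ? "COMMIT_FAIL" : "COMMIT_RETRY";
    }

    public static void main(String[] args) {
        System.out.println(markerFor(new FatalCommitException("output already corrupted")));
        System.out.println(markerFor(new Exception("transient FS connection error")));
    }
}
```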

> Allow repeating job commit by extending OutputCommitter API
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-5485
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5485
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>    Affects Versions: 2.1.0-beta
>            Reporter: Nemon Lou
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: MAPREDUCE-5485-demo-2.patch, MAPREDUCE-5485-demo.patch, 
> MAPREDUCE-5485-v1.patch
>
>
> There are chances that the MRAppMaster crashes during job commit, or that a 
> NodeManager restart causes the committing AM to exit due to container expiry. 
> In these cases, the job will fail.
> However, some jobs can redo the commit, so failing the job is unnecessary.
> Letting clients tell the AM whether to allow redoing the commit is a better choice.
> This idea comes from Jason Lowe's comments in MAPREDUCE-4819 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
