[ https://issues.apache.org/jira/browse/MAPREDUCE-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13542748#comment-13542748 ]

Bikas Saha commented on MAPREDUCE-4819:
---------------------------------------

Looks like after the recent changes in JobImpl and the current alternative 
approach, my original fix for not rerunning the job no longer really applies. I 
think you would want to take the change from my patch that adds the jobId to the 
history staging dir. Since the staging dir is not deleted during job history 
flushing, I had observed that if I made my AM crash (by putting an exit(1) in 
shutdownJob()), the history files would get orphaned and never cleaned up. To 
fix that I had to add the jobId to the path. Snippet from my patch:
{code}
+++ 
hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-common/src/main/java/org/apache/hadoop/mapreduce/v2/jobhistory/JobHistoryUtils.java
@@ -186,10 +186,11 @@ public static PathFilter getHistoryFileFilter() {
    * @return A string representation of the prefix.
    */
   public static String
-      getConfiguredHistoryStagingDirPrefix(Configuration conf)
+      getConfiguredHistoryStagingDirPrefix(Configuration conf, String jobId)
           throws IOException {
     String user = UserGroupInformation.getCurrentUser().getShortUserName();
-    Path path = MRApps.getStagingAreaDir(conf, user);
+    Path stagingPath = MRApps.getStagingAreaDir(conf, user);
+    Path path = new Path(stagingPath, jobId);
     String logDir = path.toString();
     return logDir;
   }
{code}
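
For what it's worth, here is a hypothetical caller-side sketch (not part of 
either patch; the class name, job id, and file name are made up) of what the 
per-job prefix buys us: with the jobId folded into the prefix, a job's 
in-progress history files live under a per-job subdirectory of the staging 
area, so cleaning up that directory cannot leave history files behind after a 
crashed AM attempt.
{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.v2.jobhistory.JobHistoryUtils;

public class HistoryStagingPathExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Illustrative job id; in the AM this comes from the running job.
    String jobId = "job_1357000000000_0001";
    // Per-job staging prefix produced by the patched method above.
    String prefix =
        JobHistoryUtils.getConfiguredHistoryStagingDirPrefix(conf, jobId);
    // An in-progress history file now lands under the per-job directory,
    // so removing that directory removes the history file with it.
    Path inProgressHistory = new Path(prefix, jobId + "_1.jhist");
    System.out.println(inProgressHistory);
  }
}
{code}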


For the patch itself, I have a few comments:

Why not end in success if the staging dir was cleaned up by the last attempt? I 
am guessing this code won't be necessary once we move unregistering with the RM 
to before the staging dir cleanup in MAPREDUCE-4841, right?
{code}
+      if(!stagingExists) {
+        copyHistory = false;
+        isLastAMRetry = true;
+        justShutDown = true;
+        shouldNotify = false;
+        forcedState = JobStateInternal.ERROR;
+        shutDownMessage = "Staging dir does not exist " + stagingDir;
+        LOG.fatal(shutDownMessage);
{code}
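
To make that suggestion concrete, a rough sketch (illustrative only, reusing 
the same fields as the snippet above) of ending in success rather than error 
when the previous attempt already cleaned up:
{code}
// Sketch of the suggestion above, not the patch: if the last attempt already
// removed the staging dir, it almost certainly committed and reported the
// final status, so this attempt could finish as SUCCEEDED instead of ERROR.
if (!stagingExists) {
  copyHistory = false;
  isLastAMRetry = true;
  justShutDown = true;
  shouldNotify = false;
  forcedState = JobStateInternal.SUCCEEDED;  // instead of ERROR
  shutDownMessage = "Staging dir was cleaned up by a previous attempt: "
      + stagingDir;
  LOG.info(shutDownMessage);
}
{code}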

Why are we only eating/ignoring the JobEvents in the dispatcher? So that the 
JobImpl state machine is not triggered?
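
For reference, a hypothetical sketch (the class name is not from the patch) of 
what swallowing JobEvents in the dispatcher amounts to:
{code}
import org.apache.hadoop.mapreduce.v2.app.job.event.JobEvent;
import org.apache.hadoop.yarn.event.EventHandler;

// Hypothetical handler, not from the patch: registered for JobEvents so that
// they are dropped and JobImpl's state machine never fires a transition once
// the final state has already been reported.
class SwallowingJobEventHandler implements EventHandler<JobEvent> {
  @Override
  public void handle(JobEvent event) {
    // Intentionally ignore the event.
  }
}
{code}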

This might be a question of personal preference, but I think an explicit 
transition from the INIT state to the final state is cleaner than overriding 
the state in the getter.
{code}
   public JobStateInternal getInternalState() {
     readLock.lock();
     try {
+      if(forcedState != null) {
+        return forcedState;
+      }
{code}
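
Roughly what I have in mind (the names below are hypothetical, not from the 
patch): register an explicit arc on the JobImpl state machine so the forced 
terminal state is reached through a normal transition, and getInternalState() 
keeps returning the state machine's current state unmodified.
{code}
import org.apache.hadoop.mapreduce.v2.app.job.event.JobEvent;
import org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl;
import org.apache.hadoop.yarn.state.SingleArcTransition;

// Hypothetical transition, not from the patch: the arc itself (e.g. an
// INIT-time state -> ERROR on an internal-error event) would be registered on
// the existing stateMachineFactory, so the terminal state is recorded by the
// state machine rather than masked in the getter.
class ForcedShutdownTransition
    implements SingleArcTransition<JobImpl, JobEvent> {
  @Override
  public void transition(JobImpl job, JobEvent event) {
    // Nothing extra to do; the state change is declared on the arc itself.
  }
}
{code}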

I didn't quite get this part in HistoryFileManager.java. It looks like it is 
related to a recent change in that code.
{code}
+      } else if (old != null && !old.isMovePending()) {
+        //This is a duplicate so just delete it
+        fileInfo.delete();
       }
{code}

Typo: "regitered" should be "registered", and the message has an extra "for".
{code}
+        throw new Exception("No handler for regitered for " + type);
+      }
{code}


                
> AM can rerun job after reporting final job status to the client
> ---------------------------------------------------------------
>
>                 Key: MAPREDUCE-4819
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4819
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Bikas Saha
>            Priority: Critical
>         Attachments: MAPREDUCE-4819.1.patch, MAPREDUCE-4819.2.patch, 
> MAPREDUCE-4819.3.patch, MR-4819-bobby-trunk.txt, MR-4819-bobby-trunk.txt
>
>
> If the AM reports final job status to the client but then crashes before 
> unregistering with the RM then the RM can run another AM attempt.  Currently 
> AM re-attempts assume that the previous attempts did not reach a final job 
> state, and that causes the job to rerun (from scratch, if the output format 
> doesn't support recovery).
> Re-running the job when we've already told the client the final status of the 
> job is bad for a number of reasons.  If the job failed, it's confusing at 
> best since the client was already told the job failed but the subsequent 
> attempt could succeed.  If the job succeeded there could be data loss, as a 
> subsequent job launched by the client tries to consume the job's output as 
> input just as the re-attempt starts removing output files in preparation for 
> the output commit.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
