Recovering running/scheduled jobs after JobTracker failure
----------------------------------------------------------

                 Key: HADOOP-1121
                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
         Environment: all
            Reporter: Alejandro Abdelnur


Currently all running/scheduled jobs are kept in memory in the JobTracker. If 
the JobTracker goes down, all running/scheduled jobs have to be resubmitted.

Proposal:

(1) On job submission the JobTracker would save the job configuration (job.xml) 
in a jobs DFS directory using the jobID as name.
(2) On job completion (success, failure, kill) it would delete the job 
configuration from the jobs DFS directory.
(3) On JobTracker failure the jobs DFS directory will have all 
running/scheduled jobs at failure time.
(4) On startup the JobTracker would check the jobs DFS directory for job config 
files. If there are none, no failure happened on the last stop and there is 
nothing to be done. If there are job config files in the jobs DFS directory, 
continue with the following recovery steps.

(A) rename all job config files to $JOB_CONFIG_FILE.recover.
(B) for each $JOB_CONFIG_FILE.recover: delete the output directory if it 
exists, schedule the job using the original job ID, and delete the 
$JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will have been written when 
the job was scheduled, per step (1)).
(C) when B is completed start accepting new job submissions.
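
The steps above can be sketched as follows. This is only an illustration of 
the proposal, not JobTracker code: a local directory stands in for the jobs DFS 
directory, and the names JobStore, on_submit, on_complete, recover, schedule 
and delete_output are all hypothetical.

```python
import os
import tempfile

RECOVER_SUFFIX = ".recover"


class JobStore:
    """Hypothetical helper; jobs_dir stands in for the jobs DFS directory."""

    def __init__(self, jobs_dir):
        self.jobs_dir = jobs_dir
        os.makedirs(jobs_dir, exist_ok=True)

    def _path(self, name):
        return os.path.join(self.jobs_dir, name)

    def on_submit(self, job_id, job_xml):
        # Step (1): save the job configuration, using the job ID as name.
        with open(self._path(job_id), "w") as f:
            f.write(job_xml)

    def on_complete(self, job_id):
        # Step (2): success, failure and kill all delete the configuration.
        os.remove(self._path(job_id))

    def recover(self, schedule, delete_output):
        # Step (4): an empty directory means the last stop was clean,
        # so both loops below do nothing and an empty list is returned.
        # Step (A): rename every job config file to <name>.recover.
        for name in sorted(os.listdir(self.jobs_dir)):
            if not name.endswith(RECOVER_SUFFIX):
                os.replace(self._path(name), self._path(name + RECOVER_SUFFIX))

        # Step (B): clean output, reschedule under the original job ID,
        # then delete the .recover file.
        recovered = []
        for name in sorted(os.listdir(self.jobs_dir)):
            if not name.endswith(RECOVER_SUFFIX):
                continue
            job_id = name[:-len(RECOVER_SUFFIX)]
            with open(self._path(name)) as f:
                job_xml = f.read()
            delete_output(job_id)
            schedule(job_id, job_xml)  # expected to write a fresh config, as in step (1)
            os.remove(self._path(name))
            recovered.append(job_id)

        # Step (C): the caller can start accepting new submissions now.
        return recovered


if __name__ == "__main__":
    store = JobStore(tempfile.mkdtemp())
    store.on_submit("job_0001", "<configuration/>")
    store.on_submit("job_0002", "<configuration/>")
    store.on_complete("job_0001")
    # Simulate a restart after a failure: only job_0002 is still on disk.
    print(store.recover(store.on_submit, lambda job_id: None))
```

Renaming everything before rescheduling anything means a file without the 
.recover suffix always belongs to a job scheduled by the current JobTracker 
run, so a failure in the middle of recovery does not lose jobs.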

Other details:

A configuration flag would enable/disable the above behavior. If it is 
switched off (the default), none of the above happens.
A startup flag could switch off job recovery on systems that have recovery set 
to ON.
Job ID generation should be changed to avoid collisions with job IDs from 
previous failed runs; for example, appending the JT startup timestamp to the 
job IDs would do.
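
A minimal sketch of such an ID scheme is shown below. The class name 
JobIdGenerator and the job_<timestamp>_<counter> layout are assumptions made 
for illustration; the proposal only asks that the JT startup timestamp be part 
of the ID.

```python
import itertools
import time


class JobIdGenerator:
    """Hypothetical generator: IDs embed the JobTracker startup time."""

    def __init__(self, startup_time=None):
        # startup_time is seconds since the epoch; None means "now".
        self.startup_stamp = time.strftime(
            "%Y%m%d%H%M%S", time.gmtime(startup_time))
        self._counter = itertools.count(1)

    def next_id(self):
        # The counter restarts at 1 on every startup; the timestamp is what
        # keeps IDs from different JobTracker runs apart.
        return "job_%s_%04d" % (self.startup_stamp, next(self._counter))


if __name__ == "__main__":
    ids = JobIdGenerator()
    print(ids.next_id())
    print(ids.next_id())
```

With second resolution, two restarts within the same second would still 
collide, so a real implementation might need a finer timestamp or an extra 
check against the jobs DFS directory.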

Further improvements on top of this one:

This mechanism would allow a standby JobTracker node to be started when the 
main JobTracker fails. Going a step further, the backup JobTrackers could run 
in warm mode, and heartbeats and ping calls among them would promote a warm 
standby JobTracker to be the new main JobTracker. Together with an enhancement 
in the JobClient (keeping a list of backup JobTracker URLs), this would enable 
client fallback to backup JobTrackers.
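
The client-side fallback could look roughly like the sketch below. It is not 
JobClient code: submit_with_fallback, the submit callable, and the tracker 
addresses are hypothetical, and a failed connection is modelled as a Python 
ConnectionError.

```python
def submit_with_fallback(tracker_urls, submit):
    """Try each JobTracker in order and return (url, result) of the first success.

    tracker_urls: main JobTracker first, then the backups.
    submit: hypothetical callable that submits the job to one URL and raises
            ConnectionError if that JobTracker cannot be reached.
    """
    unreachable = []
    for url in tracker_urls:
        try:
            return url, submit(url)
        except ConnectionError:
            # This JobTracker is down; fall back to the next one in the list.
            unreachable.append(url)
    raise ConnectionError("no JobTracker reachable, tried: %s"
                          % ", ".join(unreachable))


if __name__ == "__main__":
    def fake_submit(url):
        # Stand-in for a real submission call; the main tracker is "down".
        if url == "jt-main:9001":
            raise ConnectionError(url)
        return "submitted"

    print(submit_with_fallback(["jt-main:9001", "jt-backup:9001"], fake_submit))
```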

State about partially run jobs (tasks completed/in-progress/pending) could 
also be kept. This would make it possible to recover jobs halfway through 
instead of restarting them.

