[ https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495094 ]

Owen O'Malley commented on HADOOP-1121:
---------------------------------------

This looks like a feature, not a bug fix. I haven't gone through it, but I 
don't see a justification for it to go into 0.13.

> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
>                 Key: HADOOP-1121
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>             Fix For: 0.13.0
>
>         Attachments: patch1121.txt
>
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If 
> the JobTracker goes down, all the running/scheduled jobs have to be 
> resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration 
> (job.xml) in a jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job 
> configuration from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will have all 
> running/scheduled jobs at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job 
> config files. If there are none, no failure happened on the last stop and 
> there is nothing to be done. If there are job config files in the jobs DFS 
> directory, continue with the following recovery steps:
> (A) Rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) For each $JOB_CONFIG_FILE.recover: delete the output directory if it 
> exists, schedule the job using the original job ID, and delete the 
> $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will be created on 
> scheduling, per step #1).
> (C) When (B) is completed, start accepting new job submissions (a rough 
> sketch of this flow follows).
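> Rough illustrative sketch of steps (1)-(4)/(A)-(C) in Java, assuming the 
> current public FileSystem API; submitJob() is a hypothetical placeholder for 
> the JobTracker's internal submission path, not an existing method:
>   import java.io.IOException;
>   import org.apache.hadoop.fs.FileStatus;
>   import org.apache.hadoop.fs.FileSystem;
>   import org.apache.hadoop.fs.Path;
>   import org.apache.hadoop.mapred.FileOutputFormat;
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class JobRecoverySketch {
>     private final FileSystem fs;
>     private final Path jobsDir;   // DFS directory holding job.xml files (step 1)
>
>     public JobRecoverySketch(FileSystem fs, Path jobsDir) {
>       this.fs = fs;
>       this.jobsDir = jobsDir;
>     }
>
>     // Called on JobTracker startup, before accepting new submissions.
>     public void recover() throws IOException {
>       FileStatus[] configs = fs.listStatus(jobsDir);
>       if (configs == null || configs.length == 0) {
>         return;   // clean shutdown, nothing to recover (step 4)
>       }
>       // (A) rename every job config file to *.recover
>       for (FileStatus stat : configs) {
>         fs.rename(stat.getPath(), stat.getPath().suffix(".recover"));
>       }
>       // (B) resubmit each job under its original ID, then drop the .recover file
>       for (FileStatus stat : fs.listStatus(jobsDir)) {
>         Path recoverFile = stat.getPath();
>         if (!recoverFile.getName().endsWith(".recover")) {
>           continue;
>         }
>         JobConf conf = new JobConf();
>         conf.addResource(recoverFile);
>         Path output = FileOutputFormat.getOutputPath(conf);
>         if (output != null && fs.exists(output)) {
>           fs.delete(output, true);   // drop stale output of the interrupted run
>         }
>         submitJob(conf);             // re-scheduling writes a fresh job.xml (step 1)
>         fs.delete(recoverFile, true);
>       }
>       // (C) the caller can now open the JobTracker for new submissions
>     }
>
>     private void submitJob(JobConf conf) {
>       // placeholder for the JobTracker's internal submission path
>     }
>   }
> The rename in (A) keeps resubmitted jobs (which write a fresh job.xml) apart 
> from the configs still pending recovery.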
> Other details:
> A configuration flag would enable/disable the above behavior; if switched 
> off (the default behavior) nothing of the above happens.
> A startup flag could switch off job recovery for systems that have recovery 
> set to ON (see the snippet below).
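> For illustration, the switch could be read from the configuration roughly 
> like this; the property name and the noRecoverOnStartup boolean are 
> hypothetical stand-ins for the proposed configuration and startup flags:
>   boolean recoverJobs = conf.getBoolean("mapred.jobtracker.recover.jobs", false);
>   if (recoverJobs && !noRecoverOnStartup) {
>     recover();   // run the recovery steps sketched above
>   }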
> Changes to the job ID generation should be put in place to avoid job ID 
> collisions with job IDs from previous failed runs; for example, appending a 
> JT startup timestamp to the job IDs would do (as in the example below).
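> As an illustration only (the exact format is open); jtStartTime and 
> nextJobNumber below are assumed JobTracker fields:
>   // prefix + JT startup timestamp + per-run counter
>   String jobId = "job_" + jtStartTime + "_" + String.format("%04d", nextJobNumber++);
>   // e.g. job_200704261200_0001 -- cannot collide with IDs from an earlier JT run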
> Further improvements on top of this one:
> This mechanism would allow keeping a standby JobTracker node that is started 
> on main JobTracker failure. Making things a little more comprehensive, the 
> backup JobTrackers could be running in warm mode, and heartbeats and ping 
> calls among them would promote a warm standby JobTracker to new main 
> JobTracker. Together with an enhancement in the JobClient (keeping a list of 
> backup JobTracker URLs), this would enable client fallback to backup 
> JobTrackers.
> State about partially run jobs (tasks completed/in-progress/pending) could 
> be kept. This would make it possible to resume jobs halfway through instead 
> of restarting them.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
