[ https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482189 ]

Doug Cutting commented on HADOOP-1121:
--------------------------------------

> delete the output directory if it exists [ ... ]

This is OutputFormat-specific.  The kernel should not assume that there's an 
output directory.

Also, it might be difficult for OutputFormatBase to tell whether an existing 
output directory is from a failed run that's being restarted or is 
pre-existing and should stop the job from starting.  A fix might be to have 
OutputFormatBase always write things to a temporary directory, then rename 
this on completion.  The temporary directory could always be safely deleted 
on restart.  Perhaps we should add a new OutputFormat method to perform such 
cleanups?
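The write-to-temporary-then-rename idea could look roughly like the sketch below. This is only an illustration: it uses local `java.nio.file` paths rather than Hadoop's FileSystem API, and the class and method names (`TempOutputCommit`, `tempDir`, `cleanupOnRestart`, `commit`) are hypothetical, not anything in OutputFormatBase today.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the rename-on-completion idea: tasks write under a temporary
// directory, and a single rename at commit time publishes the final output.
// Anything left in the temporary directory after a crash is safe to delete.
public class TempOutputCommit {

    // Hypothetical naming scheme for the temporary directory,
    // placed next to the final output path.
    static Path tempDir(Path output) {
        return output.resolveSibling("_temporary_" + output.getFileName());
    }

    // On restart, leftovers from a failed run can always be removed safely,
    // since nothing under the temporary directory was ever "published".
    static void cleanupOnRestart(Path output) throws IOException {
        Path tmp = tempDir(output);
        if (Files.exists(tmp)) {
            try (var entries = Files.walk(tmp)) {
                entries.sorted(java.util.Comparator.reverseOrder())
                       .forEach(p -> p.toFile().delete());
            }
        }
    }

    // On successful completion, make the output visible in one rename.
    static void commit(Path output) throws IOException {
        Files.move(tempDir(output), output, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("demo");
        Path out = base.resolve("output");
        Path tmp = tempDir(out);
        Files.createDirectories(tmp);
        Files.writeString(tmp.resolve("part-00000"), "key\tvalue\n");
        commit(out);
        System.out.println(Files.exists(out) && !Files.exists(tmp));
    }
}
```

A `cleanupOnRestart`-style hook is one shape the proposed new OutputFormat cleanup method could take.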

> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
>                 Key: HADOOP-1121
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1121
>             Project: Hadoop
>          Issue Type: New Feature
>          Components: mapred
>         Environment: all
>            Reporter: Alejandro Abdelnur
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If 
> the JobTracker goes down all the running/scheduled jobs have to be 
> resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration 
> (job.xml) in a jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job 
> configuration from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will have all 
> running/scheduled jobs at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job 
> config files. If there are none, no failure happened on the last stop and 
> there is nothing to be done. If there are job config files in the jobs DFS 
> directory, it would continue with the following recovery steps.
> (A) rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) for each $JOB_CONFIG_FILE.recover: delete the output directory if it 
> exists, schedule the job using the original job ID, and delete the 
> $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will be created on 
> scheduling, per step #1).
> (C) when (B) is completed, start accepting new job submissions.
> Other details:
> A configuration flag would enable/disable the above behavior; if switched 
> off (the default) nothing of the above happens.
> A startup flag could switch off job recovery on systems that have recovery 
> set to ON.
> Changes to job ID generation should be put in place to avoid job ID 
> collisions with IDs from previous failed runs; for example, appending a JT 
> startup timestamp to the job IDs would do.
> Further improvements on top of this one:
> This mechanism would allow having a standby JobTracker node to be started 
> on main JobTracker failure. Taking this further, backup JobTrackers could 
> run in warm mode, and heartbeats and ping calls among them would activate a 
> warm standby JobTracker as the new main JobTracker. Together with an 
> enhancement in the JobClient (keeping a list of backup JobTracker URLs), 
> this would enable client fallback to backup JobTrackers.
> State about partially run jobs could also be kept (tasks 
> completed/in-progress/pending). This would enable recovering jobs halfway 
> through instead of restarting them. 
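The recovery scan in steps (1)-(4) and (A)-(C) above can be sketched roughly as follows. This is not JobTracker code: it uses a local directory to stand in for the jobs DFS directory, and the class and method names (`JobRecoveryScan`, `markForRecovery`, `recover`) are illustrative only.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Sketch of the proposed startup recovery scan over the jobs directory.
public class JobRecoveryScan {

    // Step (A): rename every job config file to $JOB_CONFIG_FILE.recover.
    static List<Path> markForRecovery(Path jobsDir) throws IOException {
        List<Path> marked = new ArrayList<>();
        try (DirectoryStream<Path> ds =
                 Files.newDirectoryStream(jobsDir, "*.xml")) {
            for (Path cfg : ds) {
                Path recover = cfg.resolveSibling(cfg.getFileName() + ".recover");
                Files.move(cfg, recover);
                marked.add(recover);
            }
        }
        return marked;
    }

    // Step (B): resubmit each marked job under its original job ID,
    // then delete the .recover file (a real submission would re-create
    // the config file in jobsDir, per step (1)).
    static List<String> recover(Path jobsDir) throws IOException {
        List<String> resubmitted = new ArrayList<>();
        for (Path recover : markForRecovery(jobsDir)) {
            String name = recover.getFileName().toString();
            String jobId =
                name.substring(0, name.length() - ".xml.recover".length());
            // A real JobTracker would parse job.xml and schedule here.
            resubmitted.add(jobId);
            Files.delete(recover);
        }
        return resubmitted; // step (C): now accept new submissions
    }

    public static void main(String[] args) throws IOException {
        Path jobsDir = Files.createTempDirectory("jobs");
        Files.writeString(jobsDir.resolve("job_0001.xml"), "<conf/>");
        Files.writeString(jobsDir.resolve("job_0002.xml"), "<conf/>");
        List<String> ids = recover(jobsDir);
        Collections.sort(ids);
        System.out.println(ids);
    }
}
```

The `.recover` rename makes the scan idempotent: if the JobTracker crashes again mid-recovery, already-marked files are simply picked up on the next pass without colliding with freshly submitted configs.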

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
