[
https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494950
]
Hadoop QA commented on HADOOP-1121:
-----------------------------------
+1
http://issues.apache.org/jira/secure/attachment/12357074/patch1121.txt applied
and successfully tested against trunk revision r536583.
Test results:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/130/testReport/
Console output:
http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/130/console
> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
> Key: HADOOP-1121
> URL: https://issues.apache.org/jira/browse/HADOOP-1121
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Environment: all
> Reporter: Alejandro Abdelnur
> Fix For: 0.13.0
>
> Attachments: patch1121.txt
>
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If
> the JobTracker goes down all the running/scheduled jobs have to be
> resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration
> (job.xml) in a jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job
> configuration from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will have all
> running/scheduled jobs at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job
> config files. If there are none, no failure happened on the last stop and
> there is nothing to do. If there are job config files in the jobs DFS
> directory, continue with the following recovery steps:
> (A) rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) for each $JOB_CONFIG_FILE.recover: delete the output directory if it
> exists, schedule the job using the original job ID, then delete the
> $JOB_CONFIG_FILE.recover (a fresh $JOB_CONFIG_FILE will have been written
> on scheduling, per step #1).
> (C) when (B) is completed, start accepting new job submissions.
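The recovery flow in steps (A)-(C) could be sketched roughly as follows. This is a minimal, self-contained illustration using a local directory in place of the jobs DFS directory; a real JobTracker would go through Hadoop's FileSystem API, and `scheduleJob()`/`deleteOutputDir()` are placeholders for JobTracker internals, not actual Hadoop methods.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class JobRecoverySketch {
    private final File jobsDir;  // stand-in for the jobs DFS directory
    final List<String> recovered = new ArrayList<>();  // job IDs re-scheduled

    public JobRecoverySketch(File jobsDir) { this.jobsDir = jobsDir; }

    /** Run at JobTracker startup, before accepting new submissions. */
    public void recover() {
        File[] configs = jobsDir.listFiles((d, n) -> !n.endsWith(".recover"));
        if (configs == null || configs.length == 0) {
            return;  // no leftover configs: clean shutdown last time
        }
        // (A) rename every leftover job config to *.recover
        for (File f : configs) {
            f.renameTo(new File(jobsDir, f.getName() + ".recover"));
        }
        // (B) for each *.recover: clean the output dir, re-schedule under
        // the original job ID, then drop the .recover copy (re-submission
        // writes a fresh config per step #1)
        File[] toRecover = jobsDir.listFiles((d, n) -> n.endsWith(".recover"));
        for (File f : toRecover) {
            String jobId = f.getName().replace(".recover", "");
            deleteOutputDir(jobId);
            scheduleJob(jobId);
            f.delete();
        }
        // (C) the caller may now start accepting new job submissions
    }

    void deleteOutputDir(String jobId) { /* placeholder for DFS cleanup */ }
    void scheduleJob(String jobId) { recovered.add(jobId); }
}
```

The two-pass structure (rename all, then re-schedule) matters: it distinguishes configs left over from the crashed run from the fresh configs written as recovered jobs are re-submitted.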
> Other details:
> A configuration flag would enable/disable the above behavior; if switched
> off (the default) none of the above happens.
> A startup flag could switch off job recovery for one run on systems where
> recovery is set to ON.
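The two switches could look something like the sketch below, using plain `Properties` rather than Hadoop's `Configuration`; the property name and command-line flag are invented for illustration and are not from the patch.

```java
import java.util.Properties;

public class RecoveryFlags {
    /**
     * Recovery runs only if the (hypothetical) config property enables it
     * and no (hypothetical) -noRecover startup flag overrides it.
     */
    public static boolean recoveryEnabled(Properties conf, String[] args) {
        boolean enabled = Boolean.parseBoolean(
            conf.getProperty("mapred.jobtracker.recover", "false"));  // off by default
        for (String a : args) {
            if (a.equals("-noRecover")) {
                enabled = false;  // one-run override at startup
            }
        }
        return enabled;
    }
}
```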
> Changes to job ID generation should be put in place to avoid job ID
> collisions with job IDs from previous failed runs; for example, appending a
> JT startup timestamp to the job IDs would do.
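The timestamp idea above amounts to embedding the JobTracker's startup time in every job ID, so IDs from different JT runs can never collide. A minimal sketch (class and method names are illustrative, not the actual Hadoop API):

```java
public class JobIdGenerator {
    private final long jtStartTime;  // captured once at JobTracker startup
    private int nextJobSeq = 0;      // per-run sequence counter

    public JobIdGenerator(long jtStartTime) {
        this.jtStartTime = jtStartTime;
    }

    /** e.g. "job_<startTime>_0001"; unique across JT restarts. */
    public synchronized String nextJobId() {
        return String.format("job_%d_%04d", jtStartTime, ++nextJobSeq);
    }
}
```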
> Further improvements on top of this one:
> This mechanism would allow a standby JobTracker node to be started in case
> of main JobTracker failure. The standby JobTracker would be started on main
> JobTracker failure. Going further, backup JobTrackers could run in warm
> mode, with heartbeats and ping calls among them promoting a warm standby
> JobTracker to the new main JobTracker. Together with an enhancement to the
> JobClient (keeping a list of backup JobTracker URLs), this would enable
> client fallback to backup JobTrackers.
> State about partially run jobs could also be kept (tasks
> completed/in-progress/pending). This would make it possible to resume jobs
> partway through instead of restarting them.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.