[
https://issues.apache.org/jira/browse/HADOOP-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508160
]
Alejandro Abdelnur commented on HADOOP-1121:
--------------------------------------------
Reviving this issue (hoping it will make it for 0.14).
I'd like to hear opinions on my last comments.
On #3 (OutputFormat, existing dirs and deleting), I had to do some homework on
my side.
Building on Doug's idea in his first comment, the temporary output directory
approach would do the trick. Something like the following (a rough Java sketch
follows the list):
* OutputFormat changes:
  * Change the getRecordWriter() contract - it should return a non-existing temp
    directory and keep track of it.
  * Add a done() method - it should move/rename the temp directory to the
    output directory.
* JobTracker changes:
  * On startup it should clean up the temp directories area.
  * On job completion it should call done() on the OutputFormat instance.
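
To make this concrete, here is a minimal sketch of the temp-directory handling
(the class and method names are placeholders, not the existing OutputFormat API;
getRecordWriter() would hand out writers rooted at the temp dir):

// Sketch only: illustrates the proposed temp-directory contract, not the
// current OutputFormat interface. TempOutputSketch, allocateTempDir() and
// done() are placeholder names.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

import java.io.IOException;

public class TempOutputSketch {

  private Path tempDir;    // non-existing temp dir handed out to the record writers
  private Path outputDir;  // final output dir configured for the job

  // getRecordWriter() would root its writers here instead of in the output dir.
  public Path allocateTempDir(JobConf job, String jobId) throws IOException {
    FileSystem fs = FileSystem.get(job);
    outputDir = new Path(job.get("mapred.output.dir"));
    tempDir = new Path(outputDir.getParent(), "_temp_" + jobId);
    if (fs.exists(tempDir)) {
      throw new IOException("temp dir already exists: " + tempDir);
    }
    fs.mkdirs(tempDir);
    return tempDir;
  }

  // done() would be called by the JobTracker on successful job completion to
  // promote the temp dir to the real output dir in a single rename (assumes
  // the final output dir does not already exist).
  public void done(JobConf job) throws IOException {
    FileSystem fs = FileSystem.get(job);
    if (!fs.rename(tempDir, outputDir)) {
      throw new IOException("could not rename " + tempDir + " to " + outputDir);
    }
  }
}

The point is that output only shows up under the real output directory after the
single rename in done(), so a job that dies halfway leaves at most a temp
directory for the JobTracker to clean up at startup.
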
If this seems alright I can create a separate issue for this.
> Recovering running/scheduled jobs after JobTracker failure
> ----------------------------------------------------------
>
> Key: HADOOP-1121
> URL: https://issues.apache.org/jira/browse/HADOOP-1121
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Environment: all
> Reporter: Alejandro Abdelnur
> Fix For: 0.14.0
>
> Attachments: patch1121.txt
>
>
> Currently all running/scheduled jobs are kept in memory in the JobTracker. If
> the JobTracker goes down all the running/scheduled jobs have to be
> resubmitted.
> Proposal:
> (1) On job submission the JobTracker would save the job configuration
> (job.xml) in a jobs DFS directory using the jobID as name.
> (2) On job completion (success, failure, kill) it would delete the job
> configuration from the jobs DFS directory.
> (3) On JobTracker failure the jobs DFS directory will contain the
> configurations of all jobs running/scheduled at failure time.
> (4) On startup the JobTracker would check the jobs DFS directory for job
> config files. If there are none, no failure happened on the last stop and
> there is nothing to be done. If there are job config files in the jobs DFS
> directory, continue with the following recovery steps.
> (A) rename all job config files to $JOB_CONFIG_FILE.recover.
> (B) for each $JOB_CONFIG_FILE.recover: delete the output directory if it
> exists, schedule the job using the original job ID, then delete the
> $JOB_CONFIG_FILE.recover (a new $JOB_CONFIG_FILE will be there per
> scheduling, per step #1).
> (C) when (B) is completed, start accepting new job submissions.
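> A rough sketch of that startup pass (the directory layout and the
> loadJobConf/resubmit/acceptNewJobs hooks below are illustrative placeholders,
> not existing JobTracker code):
>
> // Sketch only: walks steps (A)-(C) above.
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.mapred.JobConf;
>
> import java.io.IOException;
>
> public class JobRecoverySketch {
>
>   public void recover(Configuration conf, Path jobsDir) throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     if (!fs.exists(jobsDir)) {
>       return;                                 // no saved jobs, nothing to recover
>     }
>     // (A) rename every saved job config to <jobId>.xml.recover
>     for (FileStatus stat : fs.listStatus(jobsDir)) {
>       Path src = stat.getPath();
>       if (src.getName().endsWith(".xml")) {
>         fs.rename(src, new Path(jobsDir, src.getName() + ".recover"));
>       }
>     }
>     // (B) resubmit each job under its original job ID, then drop the .recover file
>     for (FileStatus stat : fs.listStatus(jobsDir)) {
>       Path recover = stat.getPath();
>       if (!recover.getName().endsWith(".recover")) {
>         continue;
>       }
>       JobConf job = loadJobConf(fs, recover); // reload the saved job.xml
>       Path out = new Path(job.get("mapred.output.dir"));
>       if (fs.exists(out)) {
>         fs.delete(out, true);                 // drop partial output from the failed run
>       }
>       resubmit(job);                          // reschedule; step (1) writes a fresh job.xml
>       fs.delete(recover, true);
>     }
>     // (C) only now start accepting new job submissions
>     acceptNewJobs();
>   }
>
>   // Placeholders for the pieces the JobTracker would actually provide.
>   private JobConf loadJobConf(FileSystem fs, Path jobXml) { return new JobConf(); }
>   private void resubmit(JobConf job) {}
>   private void acceptNewJobs() {}
> }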
> Other details:
> A configuration flag would enable/disable the above behavior; if switched off
> (the default) nothing of the above happens.
> A startup flag could switch off job recovery for systems with recovery set to
> ON.
> Changes to the job ID generation should be put in place to avoid job ID
> collisions with job IDs from previous failed runs; for example, appending a JT
> startup timestamp to the job IDs would do.
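> For example (the ID format below is only illustrative):
>
> // Sketch: embed the JobTracker start time in new job IDs so they can never
> // collide with IDs handed out before a restart.
> static String newJobId(long jtStartTime, int jobNumber) {
>   return String.format("job_%d_%04d", jtStartTime, jobNumber); // e.g. job_1184500000000_0007
> }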
> Further improvements on top of this one:
> This mechanism would allow having a standby JobTracker node that is started on
> main JobTracker failure. Making things a little more comprehensive, the backup
> JobTrackers could run in warm mode, and heartbeats and ping calls among them
> would promote a warm standby JobTracker to new main JobTracker.
> Together with an enhancement in the JobClient (keeping a list of backup
> JobTracker URLs), this would enable client fallback to backup JobTrackers.
> State about partially run jobs (tasks completed/in-progress/pending) could
> also be kept. This would make it possible to resume jobs halfway through
> instead of restarting them.