[ 
https://issues.apache.org/jira/browse/MAPREDUCE-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sharad Agarwal updated MAPREDUCE-873:
-------------------------------------

    Attachment: 873_v2.patch

Patch for review. It does the following:
- Recovery no longer depends on job history. The logic to replay history 
  events is removed.
- Jobs are recovered based on the job files present in the mapred system dir.
- The job info file containing the job tracker restart count is retained, as 
  it is required to avoid task attempt id clashes for recovered jobs.
- When the job tracker comes up, the job history files from the last run are 
  moved to "mapred.job.tracker.history.completed.location" with the suffix 
  "." + jtIdentifier + ".old" appended. This is done to avoid overwriting the 
  history files of recovered jobs.
- TestJobTrackerSafeMode, TestJobTrackerRestart and 
  TestJobTrackerRestartWithLostTracker are removed.
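
To make the two file-based steps concrete, here is a minimal sketch. It is not the code in 873_v2.patch. It works on a local directory through java.nio.file instead of Hadoop's FileSystem API, and the class RecoverySketch, its method names, and the "job.xml" marker used to recognize a job directory are assumptions made for illustration. Only the suffix format "." + jtIdentifier + ".old" is taken from the description above.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; none of these names come from the patch.
class RecoverySketch {

    // Archived name: original file name + "." + jtIdentifier + ".old".
    static String archivedName(String fileName, String jtIdentifier) {
        return fileName + "." + jtIdentifier + ".old";
    }

    // Moves every regular file in historyDir into completedDir under its
    // archived name, so a recovered job can write fresh history files
    // without overwriting the ones from the previous run.
    static List<String> archiveOldHistory(Path historyDir, Path completedDir,
                                          String jtIdentifier) throws IOException {
        Files.createDirectories(completedDir);
        // Collect first, then move, so the listing is not modified mid-scan.
        List<Path> oldFiles = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(historyDir)) {
            for (Path entry : entries) {
                if (Files.isRegularFile(entry)) {
                    oldFiles.add(entry);
                }
            }
        }
        List<String> moved = new ArrayList<>();
        for (Path file : oldFiles) {
            String target = archivedName(file.getFileName().toString(), jtIdentifier);
            Files.move(file, completedDir.resolve(target));
            moved.add(target);
        }
        Collections.sort(moved);
        return moved;
    }

    // Lists the jobs to resubmit: sub-directories of systemDir that contain
    // a job file. Using "job.xml" as the marker is an assumption.
    static List<String> jobsToRecover(Path systemDir) throws IOException {
        List<String> jobIds = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(systemDir)) {
            for (Path entry : entries) {
                if (Files.isDirectory(entry) && Files.exists(entry.resolve("job.xml"))) {
                    jobIds.add(entry.getFileName().toString());
                }
            }
        }
        Collections.sort(jobIds);
        return jobIds;
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway directory; the file name and identifier are made up.
        Path history = Files.createTempDirectory("history");
        Files.createFile(history.resolve("job_0001_history"));
        Path completed = history.resolve("done");
        System.out.println(archiveOldHistory(history, completed, "jt1"));
    }
}
```

Moving the old files aside before any job is resubmitted keeps the previous run's history available for inspection, while the recovered job writes under its usual file names.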

> Simplify Job Recovery
> ---------------------
>
>                 Key: MAPREDUCE-873
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.1
>            Reporter: Devaraj Das
>            Assignee: Sharad Agarwal
>             Fix For: 0.21.0
>
>         Attachments: 873_v1.patch, 873_v2.patch
>
>
> On a couple of occasions we have seen the JobTracker not being able to handle 
> job recovery well, and leading to cluster downtime after a restart. The 
> current design for handling job recovery is complex and prone to corner cases 
> not being handled well enough. In retrospect, it seems like the transaction 
> log based approach as was proposed on HADOOP-3245 
> (http://tinyurl.com/luh9hb), would have been a better/simpler model. However, 
> that is a big project, and it seems for the medium term, just handling job 
> re-submissions after a restart is a good tradeoff. That is, the JobTracker 
> after getting restarted, will resubmit all jobs that were running in its past 
> life. They will all start from the beginning (downside is completed tasks 
> will reexecute). In the long term, the transaction log model or some variant 
> of that should be pursued.
> Thoughts/comments welcome.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
