[ https://issues.apache.org/jira/browse/MAPREDUCE-873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749805#action_12749805 ]
Devaraj Das commented on MAPREDUCE-873:
---------------------------------------

+1

> Simplify Job Recovery
> ---------------------
>
> Key: MAPREDUCE-873
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: jobtracker
> Affects Versions: 0.20.1
> Reporter: Devaraj Das
> Assignee: Sharad Agarwal
> Fix For: 0.21.0
>
> Attachments: 873_v1.patch, 873_v2.patch, 873_v3.patch
>
>
> On a couple of occasions we have seen the JobTracker not being able to handle
> job recovery well, and leading to cluster downtime after a restart. The
> current design for handling job recovery is complex and prone to corner cases
> not being handled well enough. In retrospect, it seems like the transaction
> log based approach as was proposed on HADOOP-3245
> (http://tinyurl.com/luh9hb), would have been a better/simpler model. However,
> that is a big project, and it seems for the medium term, just handling job
> re-submissions after a restart is a good tradeoff. That is, the JobTracker
> after getting restarted, will resubmit all jobs that were running in its past
> life. They will all start from the beginning (downside is completed tasks
> will reexecute). In the long term, the transaction log model or some variant
> of that should be pursued.
> Thoughts/comments welcome.
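
For readers skimming the thread, the resubmit-on-restart idea in the description can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not Hadoop code and does not reflect the attached patches. The names `Job`, `ToyJobTracker`, `persisted`, and `recover` are hypothetical, and a plain dict stands in for whatever job information survives a JobTracker restart.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    # Hypothetical job record; not a Hadoop class.
    job_id: str
    total_tasks: int
    completed_tasks: int = 0  # in-memory progress, lost on restart


class ToyJobTracker:
    """Toy model of resubmission-based recovery (assumption, not Hadoop's API)."""

    def __init__(self, persisted: Dict[str, int]):
        # `persisted` maps job_id -> total_tasks for jobs that have not finished.
        # It stands in for state that outlives the tracker process.
        self.persisted = persisted
        self.running: Dict[str, Job] = {}

    def submit(self, job_id: str, total_tasks: int) -> Job:
        # Record the job durably, then track it in memory.
        self.persisted[job_id] = total_tasks
        job = Job(job_id, total_tasks)
        self.running[job_id] = job
        return job

    def complete(self, job_id: str) -> None:
        # Finished jobs are dropped so they are not resubmitted later.
        del self.running[job_id]
        del self.persisted[job_id]

    def recover(self) -> List[str]:
        # Resubmit every job that was running before the restart.
        # Each one starts from the beginning: completed tasks are not reused.
        resubmitted = []
        for job_id, total_tasks in sorted(self.persisted.items()):
            self.running[job_id] = Job(job_id, total_tasks)
            resubmitted.append(job_id)
        return resubmitted


if __name__ == "__main__":
    store: Dict[str, int] = {}
    tracker = ToyJobTracker(store)
    tracker.submit("job_1", 10).completed_tasks = 7
    tracker.submit("job_2", 4)
    tracker.complete("job_2")

    # Simulate a restart: a new tracker sees only the persisted store.
    restarted = ToyJobTracker(store)
    print(restarted.recover())
    print(restarted.running["job_1"].completed_tasks)
```

The sketch makes the tradeoff visible: recovery needs no per-task history, only the list of unfinished jobs, at the cost of re-running tasks that had already completed.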