[ https://issues.apache.org/jira/browse/MAPREDUCE-873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sharad Agarwal updated MAPREDUCE-873:
-------------------------------------

    Hadoop Flags: [Incompatible change]
          Status: Patch Available  (was: Open)

> Simplify Job Recovery
> ---------------------
>
>                 Key: MAPREDUCE-873
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-873
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.1
>            Reporter: Devaraj Das
>            Assignee: Sharad Agarwal
>             Fix For: 0.21.0
>
>         Attachments: 873_v1.patch, 873_v2.patch
>
>
> On a couple of occasions we have seen the JobTracker fail to handle job
> recovery well, leading to cluster downtime after a restart. The current
> design for job recovery is complex and prone to unhandled corner cases.
> In retrospect, the transaction-log-based approach proposed in HADOOP-3245
> (http://tinyurl.com/luh9hb) would have been a better, simpler model.
> However, that is a big project, and for the medium term, just handling job
> resubmission after a restart seems a good tradeoff. That is, after a
> restart the JobTracker will resubmit all jobs that were running in its
> past life. They will all start from the beginning (the downside is that
> completed tasks will re-execute). In the long term, the transaction log
> model or some variant of it should be pursued.
> Thoughts/comments welcome.
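
To make the resubmit-on-restart idea above concrete, here is a minimal sketch. It is not taken from the attached patches and does not use any Hadoop API: JobStore, Tracker, and all method names are hypothetical, and the in-memory map merely stands in for whatever durable storage a real JobTracker would use. It only illustrates the tradeoff described: unfinished jobs are resubmitted after a restart, and the progress of their completed tasks is discarded.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ResubmitOnRestartSketch {

    /** Hypothetical durable record of submitted-but-unfinished jobs. */
    static class JobStore {
        private final Map<String, String> pending = new LinkedHashMap<>();

        void recordSubmission(String jobId, String jobConf) { pending.put(jobId, jobConf); }

        void recordCompletion(String jobId) { pending.remove(jobId); }

        /** Returns a copy so callers can resubmit while iterating. */
        Map<String, String> unfinishedJobs() { return new LinkedHashMap<>(pending); }
    }

    /** Hypothetical tracker; its in-memory task progress does not survive a restart. */
    static class Tracker {
        private final JobStore store;
        private final Map<String, Integer> completedTasks = new LinkedHashMap<>();

        Tracker(JobStore store) { this.store = store; }

        void submit(String jobId, String jobConf) {
            store.recordSubmission(jobId, jobConf);
            completedTasks.put(jobId, 0); // every submission starts from scratch
        }

        void taskCompleted(String jobId) { completedTasks.merge(jobId, 1, Integer::sum); }

        void jobCompleted(String jobId) {
            store.recordCompletion(jobId);
            completedTasks.remove(jobId);
        }

        int completedTaskCount(String jobId) { return completedTasks.getOrDefault(jobId, 0); }

        /** After a restart: resubmit every job that had not finished. */
        List<String> recover() {
            List<String> resubmitted = new ArrayList<>();
            for (Map.Entry<String, String> e : store.unfinishedJobs().entrySet()) {
                submit(e.getKey(), e.getValue());
                resubmitted.add(e.getKey());
            }
            return resubmitted;
        }
    }

    public static void main(String[] args) {
        JobStore store = new JobStore();
        Tracker before = new Tracker(store);
        before.submit("job_1", "conf-1");
        before.submit("job_2", "conf-2");
        before.taskCompleted("job_1");
        before.taskCompleted("job_1");
        before.jobCompleted("job_2");

        // Simulated restart: a fresh tracker sees only the durable store.
        Tracker after = new Tracker(store);
        System.out.println("resubmitted: " + after.recover());
        System.out.println("job_1 completed tasks: " + after.completedTaskCount("job_1"));
    }
}
```

In this sketch job_2 is not resubmitted because it finished before the restart, while job_1 is resubmitted and its two completed tasks have to run again. A transaction-log model would instead replay recorded task events to avoid that re-execution, at the cost of the complexity the description mentions.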

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
