[ 
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amar Kamat updated HADOOP-3245:
-------------------------------

    Attachment: HADOOP-3245-v2.6.5.patch

 Attaching a patch for review. The changes are as follows:
1) The bug discussed 
[here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12604620#action_12604620]
 is fixed. On reset, the reducer no longer skips any task completion 
events. Duplicate events for a TIP that come from different attempts will also 
be added; the reduce task appears to handle these.

2) The job directories of restored jobs are checked for completeness before 
the jobs are added to the queue. A job directory is considered complete when 
it contains _job.xml, job.jar and job.split_.

3) There was one corner case where the jobtracker dies after marking a job as 
completed (job dir missing) but before communicating this to the tasktrackers, 
i.e. the tasktrackers still hold the task statuses for the completed job. This 
is handled as follows: on receiving an update request for a missing job, the 
jobtracker asks all the TTs to clear that job's details.

4) Restart mode turned off: restart mode is turned off after some time, since 
we don't want the JT to entertain latecomers. The JT leaves restart mode once 
the following condition holds:
{{current-time > last-time-when-a-tt-synced + lost-task-tracker-interval}}
This should ensure that we don't close the registration too early. 

5) The web UI now shows the restart information: whether the JT is still 
recovering and how long the recovery has taken.
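
Items 2 and 4 above can be sketched roughly as below. This is only an illustration, not code from the patch: the class and method names are hypothetical, and the directory check uses the local filesystem via {{java.io.File}} for simplicity, whereas the jobtracker would presumably go through Hadoop's own filesystem layer. Only the three file names and the restart-mode condition come from the description above.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch; none of these names are taken from the patch.
public class RecoverySketch {

    // Files that item 2 lists as required for a complete job directory.
    static final String[] REQUIRED_FILES = {"job.xml", "job.jar", "job.split"};

    // Item 2: a restored job is queued only if its directory has all required files.
    public static boolean isJobDirComplete(File jobDir) {
        if (!jobDir.isDirectory()) {
            return false;
        }
        for (String name : REQUIRED_FILES) {
            if (!new File(jobDir, name).isFile()) {
                return false;
            }
        }
        return true;
    }

    // Item 4: current-time > last-time-when-a-tt-synced + lost-task-tracker-interval.
    // All arguments are in milliseconds (an assumption of this sketch).
    public static boolean canLeaveRestartMode(long nowMillis,
                                              long lastTrackerSyncMillis,
                                              long lostTrackerIntervalMillis) {
        return nowMillis > lastTrackerSyncMillis + lostTrackerIntervalMillis;
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway job directory with only two of the three files.
        Path dir = Files.createTempDirectory("job-sketch");
        Files.createFile(dir.resolve("job.xml"));
        Files.createFile(dir.resolve("job.jar"));
        System.out.println("complete without job.split: " + isJobDirComplete(dir.toFile()));

        // Add the missing file and check again.
        Files.createFile(dir.resolve("job.split"));
        System.out.println("complete with all files: " + isJobDirComplete(dir.toFile()));

        // The last tracker synced at t=0 and the lost-tracker interval is 1000 ms.
        System.out.println("leave restart mode at t=500: " + canLeaveRestartMode(500L, 0L, 1000L));
        System.out.println("leave restart mode at t=1500: " + canLeaveRestartMode(1500L, 0L, 1000L));
    }
}
```

Note that every tracker sync pushes {{last-time-when-a-tt-synced}} forward, so the JT stays in restart mode as long as trackers keep checking in within the lost-tracker interval of each other.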
----
Issues taken care of :
1) Consider the following case:
    Reducers belonging to the old JT are still shuffling the output of a map m 
when the JT gets restarted. m gets re-executed on a different host as, say, m'. 
Suppose m' checks in before m. Since m checks in later, it gets killed. The 
reducers fetching from m now start failing. Here the fetch failure 
notifications will have no effect on the JT, and hence there are no false 
notifications.
2) Blacklisting of a tracker per job is based on the task failures on that 
host. Failed statuses are not cleared from the running jobs on the tracker and 
hence will be replayed as per the design.
3) If a TIP has failed earlier, it will fail again since all the failed task 
statuses will be replayed.
----
Known issues : 
1) I have seen jobs getting stuck. I tried hard to reproduce it but I couldn't. 
Will keep testing the patch.
2) The job runtime will change, as the runtime is calculated from the time the 
job is created at the jobtracker. With a restarted jobtracker, the old start 
time is lost.
3) The task attempt id has changed. It now requires the jobtracker's start 
time and hence might affect the task output filters. Also, applications outside 
the framework will not be able to _guess_ the attempt id, which they should not 
be able to do anyway.


> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
>                 Key: HADOOP-3245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3245
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Amar Kamat
>         Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be 
> applied for things like jobs being able to survive jobtracker restarts.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
