[
https://issues.apache.org/jira/browse/HADOOP-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Amar Kamat updated HADOOP-3245:
-------------------------------
Attachment: HADOOP-3245-v2.6.5.patch
Attaching a patch for review. The changes are as follows:
1) The bug discussed
[here|https://issues.apache.org/jira/browse/HADOOP-3245?focusedCommentId=12604620#action_12604620]
is fixed. On a reset, the reducer no longer skips any task completion
events. Duplicate events for a TIP that come from different attempts will also
be added; the reduce task appears to handle them.
2) The job directories for restored jobs are checked for completeness before
adding to the queue. A job directory is considered complete when it has
_job.xml, job.jar and job.split_.
3) There was one corner case: the jobtracker dies after marking a job as
completed (so the job dir is missing) but before communicating this to the
tasktrackers, i.e. the tasktrackers still have the task statuses for the
completed job. This is handled as follows: on receiving an update request for a
missing job, the jobtracker asks all the TTs to clear this job's details.
4) Restart mode turned off : The restart mode is turned off after some time.
This is useful as we don't want the JT to entertain latecomers. The JT comes out
of restart mode once the following condition holds
{{current-time > last-time-when-a-tt-synced + lost-task-tracker-interval}}
This helps ensure that we don't close the registration too early.
5) The web ui now shows the restart information. It shows whether the JT is
still recovering and the time it has taken to recover.
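As a rough illustration of changes 2) and 4), the two checks could be sketched
as below. This is not code from the patch: the class and method names are
hypothetical, the times are assumed to be in milliseconds, and the sketch uses
java.io.File on a local path where the jobtracker would go through Hadoop's
FileSystem API.

```java
import java.io.File;

/** Hypothetical sketch of two recovery checks; not taken from the patch. */
class RecoverySketch {

    // File names as listed in change 2) above.
    static final String[] REQUIRED_FILES = {"job.xml", "job.jar", "job.split"};

    /**
     * A job directory counts as complete only if it exists and contains
     * every required file.
     */
    static boolean isJobDirComplete(File jobDir) {
        if (!jobDir.isDirectory()) {
            return false;
        }
        for (String name : REQUIRED_FILES) {
            if (!new File(jobDir, name).isFile()) {
                return false;
            }
        }
        return true;
    }

    /**
     * Condition from change 4):
     * current-time > last-time-when-a-tt-synced + lost-task-tracker-interval
     */
    static boolean shouldLeaveRestartMode(long nowMs,
                                          long lastTrackerSyncMs,
                                          long lostTrackerIntervalMs) {
        return nowMs > lastTrackerSyncMs + lostTrackerIntervalMs;
    }
}
```

For example, with the last TT sync at 5000 ms and a lost-task-tracker interval
of 10000 ms, the JT would stay in restart mode at 15000 ms and leave it at any
later time, since the comparison is strict.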
----
Issues taken care of :
1) Consider the following case :
Reducers belonging to the old JT are still shuffling a map m while the JT
gets restarted. m gets re-executed on a different host, say as m'. Consider m'
checking in before m. Since m checks in later, it gets killed. The reducers
fetching from m now start failing. Here the fetch failure notifications
will have no effect on the JT and hence there are no false notifications.
2) Blacklisting of a tracker per job is based on the task failures on that host.
Failed statuses are not cleared from the running jobs on the tracker and hence
will be replayed as per the design.
3) If a TIP has failed earlier, it will fail again since all the failed task
statuses will be replayed.
----
Known issues :
1) I have seen jobs getting stuck. I tried hard to reproduce it but I couldn't.
Will keep testing the patch.
2) The job runtime will change, as the runtime is calculated based on the time
the job is created at the jobtracker. With a restarted jobtracker, the old start
time is lost.
3) The task attempt id is now changed. It requires the jobtracker's start time
and hence it might affect the task output filters. Also, applications outside
the framework will not be able to _guess_ the attempt id, which they should not
be able to do anyway.
> Provide ability to persist running jobs (extend HADOOP-1876)
> ------------------------------------------------------------
>
> Key: HADOOP-3245
> URL: https://issues.apache.org/jira/browse/HADOOP-3245
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Amar Kamat
> Attachments: HADOOP-3245-v2.5.patch, HADOOP-3245-v2.6.5.patch
>
>
> This could probably extend the work done in HADOOP-1876. This feature can be
> applied for things like jobs being able to survive jobtracker restarts.