[
https://issues.apache.org/jira/browse/HADOOP-4716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651357#action_12651357
]
Amar Kamat commented on HADOOP-4716:
------------------------------------
When the JobTracker restarts, it rebuilds the _task-completion-event_ list.
This list still contains events from the tracker that was lost across the
restart. When a task-tracker (re)connects, it re-sizes its own
_task-completion-event_ list. Hence the tracker retains the events of the
missing maps. After some time the JobTracker finds out that the tracker is
lost, kills all the maps that ran on the lost tracker, and re-executes them.
The tracker will then have a _task-completion-event_ list like
{code}
1. SUC m1-t1
2. SUC m2-t2
3. SUC m3-t1
4. SUC m4-t2
5. KIL m1-t1
6. KIL m3-t1
7. SUC m1-t2
8. SUC m3-t2
{code}
The reducer takes _m1-t1_ and starts pulling map output from _t1_. Note that
when the reducer fails on _m1_ it finds that _m1_ is _OBSOLETE_ and then
ignores it. The test case times out because it takes a fair amount of time
(~3 mins) for a fetch to fail once. So this doesn't look like a bug but a
limitation. This issue is not commonly seen because the reducer usually starts
late, so the tracker already has the latest updates, which prevents the reducer
from taking up maps from the lost tracker. I could easily reproduce this
problem when the reducer was scheduled early.
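
To make the timing dependence concrete, here is a minimal sketch that replays the event list above and reports which tracker a reducer would fetch each map's output from, depending on how many events it has seen. This is not the actual Hadoop code: the class {{CompletionEventReplay}}, its {{Event}} type and the method {{fetchableOutputs}} are hypothetical names used only for illustration, and the rule that a KIL event only invalidates the matching attempt is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not part of Hadoop.
final class CompletionEventReplay {
    enum Status { SUC, KIL }

    static final class Event {
        final Status status;
        final String mapId;
        final String tracker;

        Event(Status status, String mapId, String tracker) {
            this.status = status;
            this.mapId = mapId;
            this.tracker = tracker;
        }
    }

    // The eight events from the comment, in order.
    static List<Event> exampleEvents() {
        List<Event> events = new ArrayList<>();
        events.add(new Event(Status.SUC, "m1", "t1"));
        events.add(new Event(Status.SUC, "m2", "t2"));
        events.add(new Event(Status.SUC, "m3", "t1"));
        events.add(new Event(Status.SUC, "m4", "t2"));
        events.add(new Event(Status.KIL, "m1", "t1"));
        events.add(new Event(Status.KIL, "m3", "t1"));
        events.add(new Event(Status.SUC, "m1", "t2"));
        events.add(new Event(Status.SUC, "m3", "t2"));
        return events;
    }

    // Replays the first `count` events and returns mapId -> tracker for
    // every map whose output the reducer would currently try to fetch.
    static Map<String, String> fetchableOutputs(List<Event> events, int count) {
        Map<String, String> locations = new TreeMap<>();
        int limit = Math.min(count, events.size());
        for (int i = 0; i < limit; i++) {
            Event e = events.get(i);
            if (e.status == Status.SUC) {
                locations.put(e.mapId, e.tracker);
            } else if (e.tracker.equals(locations.get(e.mapId))) {
                // Assumption: a KIL event only obsoletes the matching attempt.
                locations.remove(e.mapId);
            }
        }
        return locations;
    }

    public static void main(String[] args) {
        List<Event> events = exampleEvents();
        // A reducer scheduled early has seen only the first four events.
        System.out.println(fetchableOutputs(events, 4));
        // A reducer scheduled late has seen all eight.
        System.out.println(fetchableOutputs(events, events.size()));
    }
}
```

With only the first four events replayed, _m1_ and _m3_ still point at the lost tracker _t1_, which is the case where the reducer has to wait for the fetch to fail. With all eight replayed, every map points at _t2_.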
----
One thing that can be done here is to set _num-reducers=0_, as the test case
doesn't actually require reducers. But it's actually better to have reducers,
as that makes the test case stricter and hence better. So if we decide to keep
reducers, then there should be some way to control the timeout (~3 mins -->
~5 secs).
Thoughts?
> testRestartWithLostTracker frequently times out
> -----------------------------------------------
>
> Key: HADOOP-4716
> URL: https://issues.apache.org/jira/browse/HADOOP-4716
> Project: Hadoop Core
> Issue Type: Bug
> Components: mapred
> Reporter: Johan Oskarsson
> Assignee: Amar Kamat
> Priority: Minor
> Fix For: 0.20.0
>
> Attachments: log.txt
>
>
> This test frequently times out:
> org.apache.hadoop.mapred.TestJobTrackerRestartWithLostTracker.testRestartWithLostTracker
> Example:
> http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3637/testReport/org.apache.hadoop.mapred/TestJobTrackerRestartWithLostTracker/testRestartWithLostTracker/