[
https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533863
]
Arun C Murthy commented on HADOOP-2016:
---------------------------------------
Here are relevant logs:
{noformat}
1. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: Received
KillTaskAction for task: task_200710090910_0003_r_001792_1
2. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: About to
purge task: task_200710090910_0003_r_001792_1
3. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskTracker:
task_200710090910_0003_r_001792_1 0.67524564% reduce > reduce
4. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskRunner:
task_200710090910_0003_r_001792_1 done; removing files.
5. 2007-10-09 12:19:15,491 WARN org.apache.hadoop.mapred.TaskTracker: Unknown
child task finshed: task_200710090910_0003_r_001792_1. Ignored.
6. 2007-10-09 12:19:18,059 WARN org.apache.hadoop.mapred.TaskTracker: Progress
from unknown child task: task_200710090910_0003_r_001792_1
{noformat}
Note line #3 above in particular: it looks like this can happen because a
task's progress update (from the child VM) got interleaved with the methods
called while purging the task, i.e.
{{TaskTracker#purgeTask}} -> {{TaskTracker#TaskInProgress#jobHasFinished}}
which then calls {{TaskTracker#TaskInProgress#kill}} and
{{TaskTracker#TaskInProgress#cleanup}}.
Unfortunately, several issues combine to produce this scenario:
a) {{TaskTracker#TaskInProgress#jobHasFinished}} isn't a synchronized method,
so the calls it makes, i.e. {{TaskTracker#TaskInProgress#kill}} and
{{TaskTracker#TaskInProgress#cleanup}}, do not execute atomically as a unit.
b) Thus the calls to kill and cleanup can be interleaved with a call to
{{TaskTracker#TaskInProgress#reportProgress}} (as seen in the logs). This is
dangerous since it is the *{{TaskTracker#TaskInProgress#cleanup}}* call which
removes the taskid from {{TaskTracker#tasks}}.
c) {{TaskTracker#TaskInProgress#reportProgress}} unconditionally marks the
task's run-state as {{RUNNING}}, which is clearly wrong since it overwrites the
task's {{KILLED}} status set in {{TaskTracker#TaskInProgress#kill}}.
Overall, a combination of the above leads to the task never being removed from
{{TaskTracker#runningTasks}}, which is the bug in question.
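To make the ordering concrete, here is a minimal single-threaded sketch that replays the interleaving from the logs (kill, then a late progress report, then cleanup). It is a simplified model and not the actual TaskTracker code: the class {{KillRaceSketch}}, the {{replay}} helper, the shortened task id and the plain {{HashMap}} standing in for {{TaskTracker#tasks}} are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the interleaving; NOT the real TaskTracker code.
class KillRaceSketch {
    enum State { RUNNING, KILLED }

    static class TaskInProgress {
        final String taskId;
        final Map<String, TaskInProgress> tasks; // stands in for TaskTracker#tasks
        State runState = State.RUNNING;

        TaskInProgress(String taskId, Map<String, TaskInProgress> tasks) {
            this.taskId = taskId;
            this.tasks = tasks;
            tasks.put(taskId, this);
        }

        // Marks the task as KILLED.
        void kill() { runState = State.KILLED; }

        // Issue (c): unconditionally resets the run-state to RUNNING.
        void reportProgress() { runState = State.RUNNING; }

        // Issue (b): this is the call that drops the taskid from the map.
        void cleanup() { tasks.remove(taskId); }
    }

    // Replays, on one thread, the order suggested by the logs:
    // kill -> late progress report from the child -> cleanup.
    static TaskInProgress replay(Map<String, TaskInProgress> tasks) {
        TaskInProgress tip = new TaskInProgress("task_r_001792_1", tasks);
        tip.kill();
        tip.reportProgress();
        tip.cleanup();
        return tip;
    }

    public static void main(String[] args) {
        Map<String, TaskInProgress> tasks = new HashMap<>();
        TaskInProgress tip = replay(tasks);
        // The task object claims RUNNING although it was killed and
        // is no longer reachable through the tasks map.
        System.out.println("state=" + tip.runState
            + " inTasks=" + tasks.containsKey(tip.taskId));
    }
}
```

In this model the task ends up with run-state {{RUNNING}} while being absent from the map, which matches the "unknown child task" warnings on lines #5 and #6 of the logs.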
There are two ways to get around this:
a) Call {{tasks.remove(taskid)}} from {{TaskTracker#TaskInProgress#kill}} to
ensure the interleaved call to {{TaskTracker#TaskInProgress#reportProgress}}
fails instead of wrongly updating the task status to {{RUNNING}}
or
b) Check that the task's state is actually {{RUNNING}} before updating
its status when the child reports in.
I'd go with (b).
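A rough sketch of what (b) could look like, under the same simplified model (this is not the actual patch). The class {{GuardedProgressSketch}}, the boolean return value and the use of {{synchronized}} on these methods are assumptions for illustration only:

```java
// Hypothetical sketch of option (b); NOT the actual TaskTracker patch.
class GuardedProgressSketch {
    enum State { RUNNING, KILLED }

    static class TaskInProgress {
        private State runState = State.RUNNING;
        private float progress = 0.0f;

        // Marks the task as KILLED.
        synchronized void kill() { runState = State.KILLED; }

        // Applies the update only while the task is still RUNNING.
        // Returns true if the update was applied, false if it was ignored.
        synchronized boolean reportProgress(float newProgress) {
            if (runState != State.RUNNING) {
                return false; // late report from a killed task: ignore it
            }
            progress = newProgress;
            return true;
        }

        synchronized State getRunState() { return runState; }

        synchronized float getProgress() { return progress; }
    }

    public static void main(String[] args) {
        TaskInProgress tip = new TaskInProgress();
        boolean before = tip.reportProgress(0.5f); // applied, task is RUNNING
        tip.kill();
        boolean after = tip.reportProgress(0.67f); // ignored, task is KILLED
        System.out.println("before=" + before + " after=" + after
            + " state=" + tip.getRunState());
    }
}
```

With the guard in place, a progress report that arrives after the kill no longer overwrites the {{KILLED}} status, regardless of where it lands relative to cleanup.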
> Race condition in removing a KILLED task from tasktracker
> ---------------------------------------------------------
>
> Key: HADOOP-2016
> URL: https://issues.apache.org/jira/browse/HADOOP-2016
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Reporter: Devaraj Das
> Assignee: Arun C Murthy
> Priority: Blocker
> Fix For: 0.15.0
>
>
> I ran into a situation where a speculative task was killed by the JobTracker
> and the relevant TaskTracker got the right KillTaskAction, but the
> tasktracker continued to hold a reference to that task (although the task jvm
> was killed). The task continued to be in RUNNING state in both the JobTracker
> and that TaskTracker forever. I suspect there is some race condition in
> reading/updating data structures inside the taskCleanupThread &
> transmitHeartBeat.