[ 
https://issues.apache.org/jira/browse/HADOOP-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533863
 ] 

Arun C Murthy commented on HADOOP-2016:
---------------------------------------

Here are the relevant logs:

{noformat}
1. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: Received KillTaskAction for task: task_200710090910_0003_r_001792_1
2. 2007-10-09 12:19:15,055 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: task_200710090910_0003_r_001792_1
3. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskTracker: task_200710090910_0003_r_001792_1 0.67524564% reduce > reduce
4. 2007-10-09 12:19:15,056 INFO org.apache.hadoop.mapred.TaskRunner: task_200710090910_0003_r_001792_1 done; removing files.
5. 2007-10-09 12:19:15,491 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: task_200710090910_0003_r_001792_1. Ignored.
6. 2007-10-09 12:19:18,059 WARN org.apache.hadoop.mapred.TaskTracker: Progress from unknown child task: task_200710090910_0003_r_001792_1
{noformat}

Line #3 above is the key: it looks like a progress update from the task's 
child VM was interleaved with the methods called while purging the task, i.e. 
{{TaskTracker#purgeTask}} -> {{TaskTracker#TaskInProgress#jobHasFinished}}, 
which then calls {{TaskTracker#TaskInProgress#kill}} and 
{{TaskTracker#TaskInProgress#cleanup}}.

Unfortunately, a few issues combine to produce this scenario:
a) {{TaskTracker#TaskInProgress#jobHasFinished}} isn't a synchronized method, 
so the calls it makes, i.e. {{TaskTracker#TaskInProgress#kill}} and 
{{TaskTracker#TaskInProgress#cleanup}}, do not execute atomically.
b) Thus the calls to kill and cleanup can be interleaved with a call to 
{{TaskTracker#TaskInProgress#reportProgress}} (as seen in the logs). This is 
dangerous since it is the *{{TaskTracker#TaskInProgress#cleanup}}* call which 
removes the taskid from {{TaskTracker#tasks}}.
c) {{TaskTracker#TaskInProgress#reportProgress}} unconditionally marks the 
task's run-state as {{RUNNING}}, which is clearly wrong since it overwrites the 
task's {{KILLED}} status set in {{TaskTracker#TaskInProgress#kill}}.

Overall, a combination of the above leads to the task never being removed from 
{{TaskTracker#runningTasks}}, which is the bug in question.
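
To make the interleaving concrete, here is a minimal single-threaded replay of the order seen in the logs. It is only a sketch: the class {{ProgressRaceSketch}} and its nested types are hypothetical stand-ins, not the actual Hadoop code, and they model only the run-state handling described in (c).

```java
// Simplified, hypothetical model of the interleaving; NOT the Hadoop source.
class ProgressRaceSketch {
    enum State { RUNNING, KILLED }

    static class TaskInProgress {
        State runState = State.RUNNING;

        // Models kill(): marks the task as KILLED.
        void kill() {
            runState = State.KILLED;
        }

        // Models the buggy reportProgress(): sets RUNNING without
        // looking at the current run-state.
        void reportProgress() {
            runState = State.RUNNING;
        }
    }

    // Replays the order from the logs: the kill arrives first, then a
    // late progress report from the child lands before cleanup.
    static State replay() {
        TaskInProgress tip = new TaskInProgress();
        tip.kill();
        tip.reportProgress();
        return tip.runState;
    }

    public static void main(String[] args) {
        // The KILLED status has been overwritten by the late report.
        System.out.println(replay()); // prints RUNNING
    }
}
```

In the real TaskTracker the two calls come from different threads, so the ordering is timing-dependent; the replay above just fixes the unlucky order.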

There are two ways to fix this:
a) Call {{tasks.remove(taskid)}} from {{TaskTracker#TaskInProgress#kill}} so 
that the interleaved call to {{TaskTracker#TaskInProgress#reportProgress}} 
cannot wrongly set the task's status to {{RUNNING}}
or
b) Check that the task's state is actually {{RUNNING}} before updating 
its status when the child reports in.

I'd go with (b).
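
Option (b) could look roughly like the sketch below. Again, {{GuardedProgressSketch}} and its members are hypothetical names for illustration, not the actual patch; making the methods synchronized and returning a boolean from reportProgress are assumptions of this sketch.

```java
// Hypothetical sketch of option (b); NOT the actual Hadoop patch.
class GuardedProgressSketch {
    enum State { RUNNING, KILLED }

    static class TaskInProgress {
        private State runState = State.RUNNING;
        private float progress = 0.0f;

        // Marks the task as KILLED.
        synchronized void kill() {
            runState = State.KILLED;
        }

        // Accepts a progress update only while the task is still RUNNING.
        // Returns false (and changes nothing) for a late report.
        synchronized boolean reportProgress(float p) {
            if (runState != State.RUNNING) {
                return false;
            }
            progress = p;
            return true;
        }

        synchronized State getRunState() {
            return runState;
        }

        synchronized float getProgress() {
            return progress;
        }
    }

    public static void main(String[] args) {
        TaskInProgress tip = new TaskInProgress();
        tip.reportProgress(0.5f);   // accepted: task is RUNNING
        tip.kill();
        tip.reportProgress(0.67f);  // ignored: task is KILLED
        System.out.println(tip.getRunState()); // prints KILLED
    }
}
```

With the guard in place, a progress report that arrives after the kill no longer resurrects the task's {{RUNNING}} state, regardless of how the two calls interleave.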


> Race condition in removing a KILLED task from tasktracker
> ---------------------------------------------------------
>
>                 Key: HADOOP-2016
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2016
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>            Reporter: Devaraj Das
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
>
>
> I ran into a situation where a speculative task was killed by the JobTracker 
> and the relevant TaskTracker got the right KillTaskAction, but the 
> tasktracker continued to hold a reference to that task (although the task jvm 
> was killed). The task continued to be in RUNNING state in both the JobTracker 
> and that TaskTracker forever. I suspect there is some race condition in 
> reading/updating data structures inside the taskCleanupThread and 
> transmitHeartBeat.

