Could you try killing the tasktracker (TT) that hosts the task the next time
this happens? I just want to isolate the problem: whether it lies in the
TT-JT (jobtracker) communication or in the Task-TT communication. From your
description, it looks like the problem is in the JT-TT communication. But
please run the experiment when it happens again and let us know the result.

Thanks,
Devaraj


On 7/24/08 1:42 PM, "Andreas Kostyrka" <[EMAIL PROTECTED]> wrote:

> Hi!
> 
> I'm experiencing hung reducers, with the following symptoms:
> 
>> Task Logs: 'task_200807230647_0008_r_000009_1'
>> 
>> 
>> stdout logs
>> 
>> 
>> 
>> stderr logs
>> 
>> 
>> 
>> syslog logs
>> 
>> red.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:11,064 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:16,073 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:16,074 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:21,083 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:21,084 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:26,093 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:26,094 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:31,103 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:31,104 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:36,113 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:36,114 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
>> 2008-07-24 07:56:41,123 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Need 6 map output(s)
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Got 0 known map output location(s); scheduling...
>> 2008-07-24 07:56:41,126 INFO org.apache.hadoop.mapred.ReduceTask: task_200807230647_0008_r_000009_1 Scheduled 0 of 0 known outputs (0 slow hosts and 0 dup hosts)
> 
> 
> Notice how it needs 6 map outputs, all map tasks have finished, and it still
> just hangs there.
> 
> The second speculative copy of that reducer task needs 14 map outputs with the
> same messages :(
> 
> Other observations:
> 
> Killing the reduce task via job -killtask just restarts it on the same node,
> and curiously the new attempt gets jammed at the same position (6/14 maps
> needed).
> 
> The only remedy to this problem seems to be a complete restart of the cluster
> and reprocessing. That gets really boring with jobs that took a day to
> process first :(
> 
> Andreas
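
The stall in the quoted syslog has a recognizable shape: on every poll the
reducer reports the same "Need N map output(s)" count together with "Got 0 new
map-outputs". To catch the next occurrence without watching the task page by
hand, something like the following Python sketch might help. It assumes the
log messages look exactly like the ones quoted above; the function name
find_stalled_reducers and the min_polls threshold are made up for
illustration and are not part of Hadoop.

```python
import re

# Patterns follow the ReduceTask messages quoted in this thread. Whitespace is
# matched loosely because mail clients tend to re-wrap the log lines.
NEED_RE = re.compile(r"(task_\w+)\s+Need\s+(\d+)\s+map\s+output\(s\)")
NEW_RE = re.compile(r"(task_\w+):\s+Got\s+(\d+)\s+new\s+map-outputs")


def find_stalled_reducers(log_text, min_polls=3):
    """Return {task_id: outputs_still_needed} for reducers that look stuck.

    A task counts as stuck when its last `min_polls` "Need N" messages all
    show the same non-zero N and its last `min_polls` "Got M new map-outputs"
    messages all show M == 0. `min_polls` is an arbitrary, hypothetical
    threshold; tune it to the polling interval you observe.
    """
    # Drop leading mail quote markers (">", ">>") and collapse all whitespace
    # so that messages split across lines become contiguous again.
    text = re.sub(r"^[ \t]*>+[ \t]?", "", log_text, flags=re.MULTILINE)
    text = " ".join(text.split())

    needs = {}
    for task, count in NEED_RE.findall(text):
        needs.setdefault(task, []).append(int(count))

    new_outputs = {}
    for task, count in NEW_RE.findall(text):
        new_outputs.setdefault(task, []).append(int(count))

    stalled = {}
    for task, counts in needs.items():
        recent_needs = counts[-min_polls:]
        recent_new = new_outputs.get(task, [])[-min_polls:]
        if len(recent_needs) < min_polls or len(recent_new) < min_polls:
            continue  # not enough polls seen yet to judge
        constant_need = len(set(recent_needs)) == 1 and recent_needs[0] > 0
        if constant_need and not any(recent_new):
            stalled[task] = recent_needs[0]
    return stalled
```

Passing the syslog text of a reduce attempt to find_stalled_reducers() returns
the attempts that appear stuck and how many map outputs each still needs. It
only flags the symptom; it cannot tell whether the cause is on the JT-TT or
the Task-TT side, which is what the kill-the-tasktracker experiment is for.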

