RE: Reduce hangs

Yunhong Gu1 Mon, 21 Jan 2008 14:34:36 -0800


Hi, all

Just to keep this topic updated :) I am still trying to figure out what happened.

I found that in my 2-node configuration (namenode and jobtracker on node-1, while both nodes are datanodes and tasktrackers), the reduce task may sometimes (but rarely) complete for programs that need a small amount of CPU time (e.g., mrbench), but for programs with large computation it never finishes. When the reduce blocks, it is always at 16%.


Eventually I get this error output:
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPxxxxxxxxx
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPxxxxxxxxx

08/01/18 15:13:38 INFO mapred.JobClient: Task Id :task_200801181145_0005_r_000000_1, Status : FAILED

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!

I found that "IPxxxxxxxx" is not the correct network address that Hadoop should read the result from. The servers I use have 2 network interfaces and I am using the other one. I explicitly filled in the 10.0.0.x IP addresses in all the configuration files.
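
One way to see which address a node reports for its own hostname is a small standalone Java program, sketched below. It is not part of Hadoop: the class name HostAddressCheck and the default prefix "10.0.0." are assumptions made for illustration, and it only uses the standard java.net API. If the hostname resolves to an address on the other interface, that would be consistent with the symptom described above, though it does not prove it is the cause.

```java
import java.net.InetAddress;
import java.net.NetworkInterface;
import java.net.SocketException;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;

// Hypothetical diagnostic helper, not part of Hadoop.
final class HostAddressCheck {

    // True when the textual address begins with the expected prefix, e.g. "10.0.0.".
    static boolean onExpectedNetwork(String address, String prefix) {
        return address != null && address.startsWith(prefix);
    }

    // Lists "interface address" pairs for every address bound on this machine.
    static List<String> localAddresses() throws SocketException {
        List<String> result = new ArrayList<String>();
        Enumeration<NetworkInterface> nics = NetworkInterface.getNetworkInterfaces();
        if (nics == null) {
            return result; // no interfaces reported
        }
        for (NetworkInterface nic : Collections.list(nics)) {
            for (InetAddress addr : Collections.list(nic.getInetAddresses())) {
                result.add(nic.getName() + " " + addr.getHostAddress());
            }
        }
        return result;
    }

    public static void main(String[] args) throws SocketException {
        // Assumed cluster subnet prefix; pass a different one as the first argument.
        String prefix = args.length > 0 ? args[0] : "10.0.0.";
        try {
            InetAddress local = InetAddress.getLocalHost();
            String ip = local.getHostAddress();
            System.out.println("hostname " + local.getHostName() + " resolves to " + ip);
            System.out.println(onExpectedNetwork(ip, prefix)
                    ? "OK: address is on the expected network"
                    : "WARNING: address is not on the expected network " + prefix);
        } catch (UnknownHostException e) {
            System.out.println("local hostname does not resolve: " + e.getMessage());
        }
        // Show every bound address so the two interfaces can be compared.
        for (String line : localAddresses()) {
            System.out.println(line);
        }
    }
}
```

Saved as HostAddressCheck.java, it can be compiled with javac and run on each node with "java HostAddressCheck 10.0.0." to compare what the two machines report.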


Might this be the reason for the reduce failure? The map phase does work, though.

Thanks
Yunhong


On Sat, 19 Jan 2008, Yunhong Gu1 wrote:

Oh, so it is the task running on the other node (ncdm-15) that fails, and Hadoop re-runs the task on the local node (ncdm-8). (I only have two nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on ncdm-8. The program is also started on ncdm-8.)

08/01/18 19:08:27 INFO mapred.JobClient: Task Id :task_200801181852_0001_m_000001_0, Status : FAILED Too many fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15

Any ideas why the task would fail? And why does it take so long for Hadoop to detect the failure?


Thanks
Yunhong

On Sat, 19 Jan 2008, Devaraj Das wrote:

Hi Yunhong,

As per the output it seems the job ran to successful completion (albeit with some failures)...
Devaraj

-----Original Message-----
From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 19, 2008 8:56 AM
To: [EMAIL PROTECTED]
Subject: Re: Reduce hangs



Yes, it looks like HADOOP-1374

The program actually failed after a while:


[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-test.jar mrbench
MRBenchmark.0.0.2
08/01/18 18:53:08 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
08/01/18 18:53:08 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-450753747.txt
08/01/18 18:53:09 INFO mapred.MRBench: Running job 0: input=/benchmarks/MRBench/mr_input output=/benchmarks/MRBench/mr_output/output_1843693325
08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/18 18:53:09 INFO mapred.JobClient: Running job: job_200801181852_0001
08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
08/01/18 19:08:27 INFO mapred.JobClient: Task Id : task_200801181852_0001_m_000001_0, Status : FAILED Too many fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
08/01/18 19:08:35 INFO mapred.JobClient: Job complete: job_200801181852_0001
08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1
DataLines       Maps    Reduces AvgTime (milliseconds)
1               2       1       926333



On Fri, 18 Jan 2008, Konstantin Shvachko wrote:

Looks like we still have this unsolved mysterious problem:

http://issues.apache.org/jira/browse/HADOOP-1374

Could it be related to HADOOP-1246? Arun?

Thanks,
--Konstantin

Yunhong Gu1 wrote:


Hi,

If someone knows how to fix the problem described below, please help me out. Thanks!

I am testing Hadoop on a 2-node cluster and the "reduce" always hangs at some stage, even if I use different clusters. My OS is Debian Linux kernel 2.6 (AMD Opteron w/ 4GB Mem). Hadoop version is 0.15.2. Java version is 1.5.0_01-b08.

I simply tried "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench" and when the map stage finishes, the reduce stage will hang somewhere in the middle, sometimes at 0%. I also tried any other mapreduce program I can find in the example jar package but they all hang.

The log file simply prints
2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >
2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker: task_200801181424_0004_r_000000_0 0.0% reduce > copy >

forever.

The program does work if I start Hadoop on only a single node.

Below is my hadoop-site.xml configuration:

<configuration>

<property>
   <name>fs.default.name</name>
   <value>10.0.0.1:60000</value>
</property>

<property>
   <name>mapred.job.tracker</name>
   <value>10.0.0.1:60001</value>
</property>

<property>
   <name>dfs.data.dir</name>
   <value>/raid/hadoop/data</value>
</property>

<property>
   <name>mapred.local.dir</name>
   <value>/raid/hadoop/mapred</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/raid/hadoop/tmp</value>
</property>

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>

<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>4</value>
</property>

<!--
<property>
  <name>mapred.map.tasks</name>
  <value>7</value>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>
-->

<property>
  <name>fs.inmemory.size.mb</name>
  <value>200</value>
</property>

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
</property>

<property>
  <name>io.sort.factor</name>
  <value>100</value>
</property>

<property>
  <name>io.sort.mb</name>
  <value>200</value>
</property>

<property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
</property>

</configuration>
