hey list admins

what's this list and how is it different from the other one (hadoop-user)? i
still see mails on the other one - so curious ..

Joydeep


-----Original Message-----
From: Taeho Kang [mailto:[EMAIL PROTECTED]
Sent: Mon 1/21/2008 6:03 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce hangs
 
You said you had two network interfaces... and it might be the source of
your problem.

Try disabling one of your network interfaces, or set
"dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your config
file so that the datanode knows which network interface to pick.
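
For example, something like this in hadoop-site.xml (just a sketch; "eth1" is
a placeholder, use whichever interface actually carries the addresses you want
Hadoop to report):

<property>
  <name>dfs.datanode.dns.interface</name>
  <value>eth1</value>
</property>

<property>
  <name>dfs.datanode.dns.nameserver</name>
  <value>default</value>
</property>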

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:

>
> Hi, all
>
> Just to keep this topic updated :) I am still trying to figure out what
> happened.
>
> I found that in my 2-node configuration (namenode and jobtracker on
> node-1, while both nodes are datanodes and tasktrackers), the reduce task
> may sometimes (but rarely) complete for programs that need a small amount
> of CPU time (e.g., mrbench), but for programs with large computation it
> never finishes. When the reduce blocks, it always fails at 16%.
>
> Eventually I get the following errors:
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPxxxxxxxxx
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPxxxxxxxxx
> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
> task_200801181145_0005_r_000000_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>
> I found that "IPxxxxxxxx" is not the correct network address that Hadoop
> should read the result from. The servers I use have 2 network interfaces
> and I am using the other one: I explicitly filled in the 10.0.0.x IP
> addresses in all the configuration files.
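>
> (Concretely, conf/slaves on my setup lists the 10.0.0.x side of both
> machines, for example:
>
>     10.0.0.1
>     10.0.0.2
>
> where the second address is just illustrative, and fs.default.name /
> mapred.job.tracker in hadoop-site.xml point at 10.0.0.1, as in the
> configuration quoted further down this thread.)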
>
> Might this be the reason for the Reduce failure? But the Map phase does work.
>
> Thanks
> Yunhong
>
>
> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>
> >
> > Oh, so it is the task running on the other node (ncdm-15) that fails, and
> > Hadoop re-runs the task on the local node (ncdm-8). (I only have two
> > nodes, ncdm-8 and ncdm-15. Both the namenode and jobtracker are running
> > on ncdm-8. The program is also started on ncdm-8.)
> >
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
> >>> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >
> > Any ideas why the task would fail? And why does it take so long for
> > Hadoop to detect the failure?
> >
> > Thanks
> > Yunhong
> >
> > On Sat, 19 Jan 2008, Devaraj Das wrote:
> >
> >> Hi Yunhong,
> >> As per the output it seems the job ran to successful completion (albeit
> >> with some failures)...
> >> Devaraj
> >>
> >>> -----Original Message-----
> >>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
> >>> Sent: Saturday, January 19, 2008 8:56 AM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: Reduce hangs
> >>>
> >>>
> >>>
> >>> Yes, it looks like HADOOP-1374
> >>>
> >>> The program actually failed after a while:
> >>>
> >>>
> >>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
> >>> hadoop-0.15.2-test.jar mrbench
> >>> MRBenchmark.0.0.2
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
> >>> 1 numLines, ASCENDING sortOrder
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
> >>> /benchmarks/MRBench/mr_input/input_-450753747.txt
> >>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
> >>> input=/benchmarks/MRBench/mr_input
> >>> output=/benchmarks/MRBench/mr_output/output_1843693325
> >>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
> >>> paths to process : 1
> >>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
> >>> job_200801181852_0001
> >>> 08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
> >>> 08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
> >>> 08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_000001_0, Status : FAILED Too many
> >>> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >>> 08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
> >>> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
> >>> job_200801181852_0001
> >>> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
> >>> 08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1
> >>> DataLines       Maps    Reduces AvgTime (milliseconds)
> >>> 1               2       1       926333
> >>>
> >>>
> >>>
> >>> On Fri, 18 Jan 2008, Konstantin Shvachko wrote:
> >>>
> >>>> Looks like we still have this unsolved mysterious problem:
> >>>>
> >>>> http://issues.apache.org/jira/browse/HADOOP-1374
> >>>>
> >>>> Could it be related to HADOOP-1246? Arun?
> >>>>
> >>>> Thanks,
> >>>> --Konstantin
> >>>>
> >>>> Yunhong Gu1 wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> If someone knows how to fix the problem described below, please help
> >>>>> me out. Thanks!
> >>>>>
> >>>>> I am testing Hadoop on a 2-node cluster and the "reduce" always hangs
> >>>>> at some stage, even if I use different clusters. My OS is Debian
> >>>>> Linux kernel 2.6 (AMD Opteron w/ 4GB Mem). Hadoop version is 0.15.2.
> >>>>> Java version is 1.5.0_01-b08.
> >>>>>
> >>>>> I simply tried "./bin/hadoop jar hadoop-0.15.2-test.jar mrbench" and
> >>>>> when the map stage finishes, the reduce stage will hang somewhere in
> >>>>> the middle, sometimes at 0%. I also tried every other mapreduce
> >>>>> program I could find in the example jar package, but they all hang.
> >>>>>
> >>>>> The log file simply prints
> >>>>> 2008-01-18 15:15:50,831 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
> >>>>> 2008-01-18 15:15:56,841 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
> >>>>> 2008-01-18 15:16:02,850 INFO org.apache.hadoop.mapred.TaskTracker:
> >>>>> task_200801181424_0004_r_000000_0 0.0% reduce > copy >
> >>>>>
> >>>>> forever.
> >>>>>
> >>>>> The program does work if I start Hadoop only on a single node.
> >>>>>
> >>>>> Below is my hadoop-site.xml configuration:
> >>>>>
> >>>>> <configuration>
> >>>>>
> >>>>> <property>
> >>>>>    <name>fs.default.name</name>
> >>>>>    <value>10.0.0.1:60000</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>    <name>mapred.job.tracker</name>
> >>>>>    <value>10.0.0.1:60001</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>    <name>dfs.data.dir</name>
> >>>>>    <value>/raid/hadoop/data</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>    <name>mapred.local.dir</name>
> >>>>>    <value>/raid/hadoop/mapred</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>hadoop.tmp.dir</name>
> >>>>>   <value>/raid/hadoop/tmp</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>mapred.child.java.opts</name>
> >>>>>   <value>-Xmx1024m</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>mapred.tasktracker.tasks.maximum</name>
> >>>>>   <value>4</value>
> >>>>> </property>
> >>>>>
> >>>>> <!--
> >>>>> <property>
> >>>>>   <name>mapred.map.tasks</name>
> >>>>>   <value>7</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>mapred.reduce.tasks</name>
> >>>>>   <value>3</value>
> >>>>> </property>
> >>>>> -->
> >>>>>
> >>>>> <property>
> >>>>>   <name>fs.inmemory.size.mb</name>
> >>>>>   <value>200</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>dfs.block.size</name>
> >>>>>   <value>134217728</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>io.sort.factor</name>
> >>>>>   <value>100</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>io.sort.mb</name>
> >>>>>   <value>200</value>
> >>>>> </property>
> >>>>>
> >>>>> <property>
> >>>>>   <name>io.file.buffer.size</name>
> >>>>>   <value>131072</value>
> >>>>> </property>
> >>>>>
> >>>>> </configuration>
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >>
> >
>



-- 
Taeho Kang [tkang.blogspot.com]
Software Engineer, NHN Corporation, Korea
