Re: Reduce Hangs
All ports are listed in conf/hadoop-default.xml and can be overridden in conf/hadoop-site.xml. Also, if you are using HBase, you need to pay attention to hbase-default.xml and hbase-site.xml, located in the hbase directory.

2008/3/29 Natarajan, Senthil <[EMAIL PROTECTED]>:
> Hi,
> Thanks for your suggestions.
>
> It looks like the problem is with the firewall. I created a firewall rule to
> allow ports 5 to 50100 (I found Hadoop listening in this port range), but it
> looks like I am missing some ports and those get blocked by the firewall.
>
> Could anyone please let me know how to configure Hadoop to use only certain
> specified ports, so that those ports can be allowed in the firewall?
>
> Thanks,
> Senthil

--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
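A minimal sketch of pinning Hadoop's listen ports so the firewall rule list stays short. The property names below are examples from the 0.15-era hadoop-default.xml and may differ in your release, so grep your own copy before relying on them; the fragment is written to a scratch file here and must be pasted inside the `<configuration>` element of conf/hadoop-site.xml by hand.

```shell
# List the port-related properties your release actually supports
# (run from the Hadoop install directory):
#   grep -B1 'port' conf/hadoop-default.xml
#
# Generate an example override fragment. The property names below are
# assumptions from the 0.15 line -- verify them against your own
# conf/hadoop-default.xml before using them.
cat > /tmp/port-overrides.xml <<'EOF'
<property>
  <name>dfs.datanode.port</name>
  <value>50010</value>
</property>
<property>
  <name>tasktracker.http.port</name>
  <value>50060</value>
</property>
EOF
echo "paste the contents of /tmp/port-overrides.xml inside <configuration> in conf/hadoop-site.xml"
```

With every port fixed this way, the firewall only needs to allow the handful of values you chose, instead of a wide range.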
RE: Reduce Hangs
Hi,
Thanks for your suggestions.

It looks like the problem is with the firewall. I created a firewall rule to allow ports 5 to 50100 (I found Hadoop listening in this port range), but it looks like I am missing some ports and those get blocked by the firewall.

Could anyone please let me know how to configure Hadoop to use only certain specified ports, so that those ports can be allowed in the firewall?

Thanks,
Senthil

-----Original Message-----
From: 朱盛凯 [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 27, 2008 12:32 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Hangs

Hi,

I met this problem in my cluster before, so I can share some of my experience, though it may not apply in your case.

The job in my cluster always hung at 16% of reduce. It occurred because the reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this failure of communication between two task trackers. One was that the firewall blocked the trackers from communicating; I solved this by disabling the firewall. The other was that trackers referred to other nodes by host name only, not by IP address; I solved this by editing /etc/hosts with a mapping from hostname to IP address for all nodes in the cluster.

I hope my experience is helpful for you.

On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
> after a very long time it finishes. I am seeing this error:
>
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
>
> In the log I am seeing this:
>
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at 0.02 MB/s)
>
> Do you know what might be the problem?
> Thanks,
> Senthil
Re: Reduce Hangs
On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I met this problem in my cluster before, so I can share some of my
> experience, though it may not apply in your case.
>
> The job in my cluster always hung at 16% of reduce. It occurred because the
> reduce task could not fetch the map output from other nodes.
>
> In my case, two factors could cause this failure of communication between
> two task trackers.
>
> One was that the firewall blocked the trackers from communicating; I solved
> this by disabling the firewall.
> The other was that trackers referred to other nodes by host name only, not
> by IP address; I solved this by editing /etc/hosts with a mapping from
> hostname to IP address for all nodes in the cluster.

I ran into this problem for the same reason. Try adding the host names of all your nodes to every /etc/hosts file.

> I hope my experience is helpful for you.
>
> On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
>> Hi,
>> I have a small Hadoop cluster: one master and three slaves.
>> When I try the example wordcount on one of our log files (size ~350 MB),
>> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
>> after a very long time it finishes. I am seeing this error:
>>
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
>>
>> In the log I am seeing this:
>>
>> INFO org.apache.hadoop.mapred.TaskTracker:
>> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at 0.02 MB/s)
>>
>> Do you know what might be the problem?
>> Thanks,
>> Senthil

--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
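A quick way to check the hostname-resolution point made above is to confirm, on each node, that every other node's name resolves (via /etc/hosts or DNS, in nsswitch order). This is a sketch; the node names you pass in are hypothetical placeholders for your own cluster:

```shell
# Print the resolved address for each hostname given, or flag the ones
# that do not resolve (a common cause of reduce-side fetch failures,
# per the thread above).
check_resolution() {
  for h in "$@"; do
    if getent hosts "$h" > /dev/null; then
      echo "$h: resolves to $(getent hosts "$h" | awk '{print $1}')"
    else
      echo "$h: DOES NOT RESOLVE -- add it to /etc/hosts on every node"
    fi
  done
}

# e.g. check_resolution master slave1 slave2 slave3
check_resolution localhost
```

Run it on every node, master included; a name that resolves on one machine but not another is exactly the asymmetry that makes the shuffle hang.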
Re: Reduce Hangs
Hi,

I met this problem in my cluster before, so I can share some of my experience, though it may not apply in your case.

The job in my cluster always hung at 16% of reduce. It occurred because the reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this failure of communication between two task trackers.

One was that the firewall blocked the trackers from communicating; I solved this by disabling the firewall. The other was that trackers referred to other nodes by host name only, not by IP address; I solved this by editing /etc/hosts with a mapping from hostname to IP address for all nodes in the cluster.

I hope my experience is helpful for you.

On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
> after a very long time it finishes. I am seeing this error:
>
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
>
> In the log I am seeing this:
>
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at 0.02 MB/s)
>
> Do you know what might be the problem?
> Thanks,
> Senthil
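The /etc/hosts fix described above looks like the sketch below. The hostnames and 10.0.0.x addresses are hypothetical placeholders (the 10.0.0.x range matches what appears elsewhere in this thread); the same full set of mappings has to be present on every node, master included:

```shell
# Build an example hosts fragment mapping each cluster node's hostname
# to its cluster-facing IP. Names and addresses are placeholders --
# substitute your own, then append the equivalent lines to /etc/hosts
# on EVERY node.
cat > /tmp/cluster-hosts.example <<'EOF'
10.0.0.1   master
10.0.0.2   slave1
10.0.0.3   slave2
10.0.0.4   slave3
EOF
echo "example fragment:"
cat /tmp/cluster-hosts.example
```

The point is symmetry: a tracker that advertises itself by hostname is only reachable if every other node can turn that hostname into the right IP.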
Re: Reduce Hangs
On Thu, 27 Mar 2008, Natarajan, Senthil wrote:
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
> after a very long time it finishes. I am seeing this error:
>
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

This error occurs when the reducer fails to fetch map-task output from 5 unique map tasks. Before considering an attempt failed, the reducer tries to fetch the map output 7 times within 5 minutes (default configuration). In case of job failure, check the following:

1. Is this problem common to all the reducers?
2. Are the map tasks the same across all the reducers for which the failure is reported?
3. Is there at least one map task whose output is successfully fetched?

If the job eventually succeeds, then there might be some problem with the reducer.

Amar

> In the log I am seeing this:
>
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at 0.02 MB/s)
>
> Do you know what might be the problem?
> Thanks,
> Senthil
Reduce Hangs
Hi,
I have a small Hadoop cluster: one master and three slaves.

When I try the example wordcount on one of our log files (size ~350 MB), map runs fine but reduce always hangs (sometimes around 19%, 60%, ...); after a very long time it finishes.

I am seeing this error:

  Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out

In the log I am seeing this:

  INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_00_0
  0.1834% reduce > copy (11 of 20 at 0.02 MB/s) >

Do you know what might be the problem?
Thanks,
Senthil
Re: reduce > copy ?? (was Reduce hangs)
This is the Java stack when the problem happens. Not sure if it is a deadlock. (Hadoop version 0.15.2)

Thread-25[1] threads
Group system:
  (java.lang.ref.Reference$ReferenceHandler)0x77b  Reference Handler  cond. waiting
  (java.lang.ref.Finalizer$FinalizerThread)0x77a   Finalizer          cond. waiting
  (java.lang.Thread)0x779                          Signal Dispatcher  running
Group main:
  (java.lang.Thread)0x1    main         cond. waiting
  (java.lang.Thread)0x778  taskCleanup  cond. waiting
  (org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger)0x777  SessionScavenger  running
  (org.mortbay.util.ThreadedServer$Acceptor)0x776  Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]  running
  (org.mortbay.util.ThreadPool$PoolThread)0x775  SocketListener0-0  cond. waiting
  (org.apache.hadoop.ipc.Server$Listener)0x774  IPC Server listener on 34516  running
  (org.apache.hadoop.ipc.Server$Handler)0x773  IPC Server handler 0 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x772  IPC Server handler 1 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x771  IPC Server handler 2 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x770  IPC Server handler 3 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Client$ConnectionCuller)0x76f  org.apache.hadoop.io.ObjectWritable Connection Culler  running
  (org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread)0x76e  Map-events fetcher for all reduce tasks on tracker_ncdm15:/10.0.0.2:34516  running
  (org.apache.hadoop.util.Daemon)0x76d  [EMAIL PROTECTED]  running
  (org.apache.hadoop.mapred.ReduceTaskRunner)0x76c  Thread-25   cond. waiting
  (java.lang.UNIXProcess$1$1)0x76b  process reaper  running
  (org.apache.hadoop.mapred.ReduceTaskRunner)0x7d0  Thread-110  cond. waiting
  (java.lang.UNIXProcess$1$1)0x7d1  process reaper  running
  (org.apache.hadoop.ipc.Client$Connection)0x7f6  IPC Client connection to /10.0.0.1:60001  cond. waiting

Thread-25[1] thread 0x7d0
Thread-110[1] where
  [1] java.lang.Object.wait (native method)
  [2] java.lang.Object.wait (Object.java:485)
  [3] java.lang.UNIXProcess.waitFor (UNIXProcess.java:165)
  [4] org.apache.hadoop.mapred.TaskRunner.runChild (TaskRunner.java:477)
  [5] org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:343)

On Tue, 22 Jan 2008, Yunhong Gu1 wrote:
> Hi, All
>
> I tried many possible configurations and I think this is the deepest reason
> I can dig out so far. The whole "Reduce hangs" happens because one task
> tracker does not progress at all. It is doing some "reduce > copy" forever,
> as shown below. This is very easy to reproduce on my machines (AMD dual
> dual-core Opteron 3.0GHz, Debian Linux, kernel 2.6.16, running in 32-bit mode).
>
> [EMAIL PROTECTED]:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
> 2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
> 2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
> 2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
> 2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce >
reduce > copy ?? (was Reduce hangs)
Hi, All

I tried many possible configurations and I think this is the deepest reason I can dig out so far. The whole "Reduce hangs" happens because one task tracker does not progress at all. It is doing some "reduce > copy" forever, as shown below. This is very easy to reproduce on my machines (AMD dual dual-core Opteron 3.0GHz, Debian Linux, kernel 2.6.16, running in 32-bit mode).

[EMAIL PROTECTED]:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:03,627 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:09,644 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:15,661 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:18,670 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:24,687 INFO org.apache.hadoop.mapred.TaskTracker: task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >

On Tue, 22 Jan 2008, Yunhong Gu1 wrote:

Thanks, I tried but this is probably not the reason. I checked the network connection using "netstat" and the client is actually connected to the correct server address.
In addition, "mrbench" works sometimes; if it were a network problem, nothing should work at all.

I let the "sort" program run longer, and got some interesting output: the reduce progress can actually be decreasing and oscillating (see the bottom part of the output below). The error information:

  Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
  08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15

What can cause "Error reading task outputncdm15"?

[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-examples.jar sort rand randsort
Running on 2 nodes to sort from /user/gu/rand into /user/gu/randsort with 8 reduces.
Job started: Tue Jan 22 11:56:39 CST 2008
08/01/22 11:56:39 INFO mapred.FileInputFormat: Total input paths to process: 20
08/01/22 11:56:40 INFO mapred.JobClient: Running job: job_200801221154_0005
08/01/22 11:56:41 INFO mapred.JobClient: map 0% reduce 0%
08/01/22 11:56:54 INFO mapred.JobClient: map 1% reduce 0%
08/01/22 11:56:58 INFO mapred.JobClient: map 2% reduce 0%
08/01/22 11:57:04 INFO mapred.JobClient: map 3% reduce 0%
08/01/22 11:57:08 INFO mapred.JobClient: map 5% reduce 0%
08/01/22 11:57:15 INFO mapred.JobClient: map 7% reduce 0%
08/01/22 11:57:23 INFO mapred.JobClient: map 8% reduce 0%
08/01/22 11:57:25 INFO mapred.JobClient: map 9% reduce 0%
08/01/22 11:57:27 INFO mapred.JobClient: map 10% reduce 0%
08/01/22 11:57:34 INFO mapred.JobClient: map 11% reduce 0%
08/01/22 11:57:35 INFO mapred.JobClient: map 12% reduce 0%
08/01/22 11:57:42 INFO mapred.JobClient: map 13% reduce 0%
08/01/22 11:57:48 INFO mapred.JobClient: map 14% reduce 0%
08/01/22 11:57:49 INFO mapred.JobClient: map 15% reduce 0%
08/01/22 11:57:56 INFO mapred.JobClient: map 17% reduce 0%
08/01/22 11:58:05 INFO mapred.JobClient: map 19% reduce 0%
08/01/22 11:58:12 INFO mapred.JobClient: map 20% reduce 0%
08/01/22 11:58:13 INFO mapred.JobClient: map 20% reduce 1%
08/01/22 11:58:14 INFO mapred.JobClient: map 21% reduce 1%
08/01/22 11:58:19 INFO mapred.JobClient: map 22% reduce 1%
08/01/22 11:58:26 INFO mapred.JobClient: map 23% reduce 1%
08/01/22 11:58:30 INFO mapred.JobClient: map 24% reduce 1%
08/01/22 11:58:32 INFO mapred.JobClient: map 25% reduce 1%
08/01/22 11:58:37 INFO mapred.JobClient: map 26% reduce 1%
08/01/22 11:58:42 INFO mapred.JobClient: map 27% reduce 1%
08/01/22 11:58:44 INFO mapred.JobClient: map 28% reduce 1%
08/01/22 11:58:49 INFO mapred.JobClient: map 29% reduce 1%
08/01/22 11:58:55 INFO mapred.JobClient: map 30% reduce 1%
08/01/22 11:58:57 INFO mapred.JobClient: map 31% reduce 1%
08/01/22 11:58:59 INFO mapred.JobClient: map 32% reduce 1%
08/01/22 11:59:07 INFO mapred.JobClient: map 33% reduce 1%
08/01/22 11:59:08 INFO mapred.JobClient: map 34% reduce 1%
08/01/22 11:59:15 INFO mapred.JobClient: map 35% r
Re: Reduce hangs
Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/22 15:37:13 INFO mapred.JobClient: map 100% reduce 10%
08/01/22 15:37:33 INFO mapred.JobClient: map 100% reduce 11%
08/01/22 15:37:53 INFO mapred.JobClient: map 100% reduce 12%
08/01/22 15:38:23 INFO mapred.JobClient: map 100% reduce 13%
08/01/22 15:45:51 INFO mapred.JobClient: map 100% reduce 12%
08/01/22 15:45:51 INFO mapred.JobClient: Task Id : task_200801221154_0005_r_01_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/22 15:46:11 INFO mapred.JobClient: map 100% reduce 13%
08/01/22 15:46:31 INFO mapred.JobClient: map 100% reduce 14%
08/01/22 15:46:51 INFO mapred.JobClient: map 100% reduce 15%
08/01/22 15:47:24 INFO mapred.JobClient: map 100% reduce 16%

On Tue, 22 Jan 2008, Taeho Kang wrote:
> You said you had two network interfaces... and it might be the source of
> your problem. Try disabling one of your network interfaces, or set
> "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
> config file so that the datanode knows which network interface to pick up.
>
> /Taeho
>
> On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:
>> Hi, all
>>
>> Just to keep this topic updated :) I am still trying to figure out what
>> happened.
>>
>> I found that in my 2-node configuration (namenode and jobtracker on
>> node-1, while both nodes are datanodes and tasktrackers), the reduce task
>> may sometimes (but rarely) complete for programs that need a small amount
>> of CPU time (e.g., mrbench), but for programs with large computation it
>> never finishes. When reduce blocks, it always fails at 16%.
>>
>> Eventually I will get this error information:
>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
>> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
>> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id : task_200801181145_0005_r_00_1, Status : FAILED
>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection timed out
>> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection timed out
>> 08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
>> java.io.IOException: Job failed!
>>
>> I found that "IP" is not the correct network address that Hadoop should
>> read results from. The servers I use have 2 network interfaces and I am
>> using the other one. I explicitly filled in the IP addresses 10.0.0.x in
>> all the configuration files.
>>
>> Might this be the reason for the reduce failure? But the map phase does work.
>>
>> Thanks
>> Yunhong
>>
>> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>>> Oh, so it is the task running on the other node (ncdm-15) that fails, and
>>> Hadoop re-runs the task on the local node (ncdm-8). (I only have two
>>> nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on
>>> ncdm-8. The program is also started on ncdm-8.)
>>>
>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
>>> task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures
>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
>>>
>>> Any ideas why the task would fail? And why does it take so long for
>>> Hadoop to detect the failure?
>>>
>>> Thanks
>>> Yunhong
>>>
>>> On Sat, 19 Jan 2008, Devaraj Das wrote:
>>>> Hi Yunhong,
>>>> As per the output it seems the job ran to successful completion (albeit
>>>> with some failures)...
>>>> Devaraj
>>>>
>>>>> -----Original Message-----
>>>>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
>>>>> Sent: Saturday, January 19, 2008 8:56 AM
>>>>> To: [EMAIL PROTECTED]
>>>>> Subject: Re: Reduce hangs
>>>>>
>>>>> Yes, it looks like HADOOP-1374
>>>>>
>>>>> The program actually failed after a while:
>>>>>
>>>>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-test.jar mrbench
>>>>> MRBenchmark.0.0.2
>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
>>>>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-450753747.txt
>>>>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0: input=/benchmarks/MRBench/mr_input output=/benchmarks/MRBench/mr_output/output_1843693325
>>>>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input paths to process : 1
>>>>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job: job_200801181852_0001
>>>>> 08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
>>>>> 08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
>>>>> 08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
>>>>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id : task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures
>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
>>>>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
>>>>> 08/01/18 19:08:34 INFO mapred.JobClient: map 100% reduce 100%
>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Job complete: job_200801181852_0001
>>>>> 08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
>>>>> 08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
>>>>> 08/01/18 19:08:35 INFO mapred
Re: Reduce hangs 2
Hi,

Not sure if this is the same source of problem, but I also ran into problems with a hanging reduce. It is reproducible for me, though I have not found the source of the problem yet. I run a series of jobs, and in my last job the last reduce task hangs for about 15 to 20 minutes doing nothing, but then resumes. I am running Hadoop 15.1.

Below are the log entries during the hang. So I think it is not the copy problem mentioned before. I also checked that our DFS is healthy.

2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Need 2 map output(s)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Got 2 known map output location(s); scheduling...
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Scheduled 2 of 2 known outputs (0 slow hosts and 0 dup hosts)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:09,328 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,243 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,610 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:11,611 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying of all map outputs complete. Initiating the last merge on the remaining files in ramfs://mapoutput169937755
2008-01-22 21:22:11,635 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Merge of the 1 files in InMemoryFileSystem complete. Local file is /home/hadoop/data/hadoop-hadoop/mapred/local/task_200801221313_0003_r_46_1/map_34.out

Any ideas? Thanks!

Stefan
Re: Reduce hangs
On Jan 21, 2008, at 11:22 PM, ma qiang wrote:
> Do we need to update our mailing list from hadoop-user to core-user?

No, everyone should have been moved automatically.

-- Owen
Re: Reduce hangs
----- Original Message -----
From: ma qiang <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Mon Jan 21 23:22:34 2008
Subject: Re: Reduce hangs

Do we need to update our mailing list from hadoop-user to core-user?

On Jan 22, 2008 2:56 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:
>
> > hey list admins
> >
> > what's this list and how is it different from the other one?
> > (hadoop-user). i still see mails on the other one - so curious ..
>
> This is part of Hadoop's move to a top-level project at Apache. The
> code previously known as Hadoop is now Hadoop Core. Therefore, we have
> gone from:
>
> hadoop-{user,dev,[EMAIL PROTECTED]
>
> to:
>
> core-{user,dev,[EMAIL PROTECTED]
>
> -- Owen
Re: Reduce hangs
Do we need to update our mailing list from hadoop-user to core-user?

On Jan 22, 2008 2:56 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:
>
> > hey list admins
> >
> > what's this list and how is it different from the other one?
> > (hadoop-user). i still see mails on the other one - so curious ..
>
> This is part of Hadoop's move to a top-level project at Apache. The
> code previously known as Hadoop is now Hadoop Core. Therefore, we have
> gone from:
>
> hadoop-{user,dev,[EMAIL PROTECTED]
>
> to:
>
> core-{user,dev,[EMAIL PROTECTED]
>
> -- Owen
Re: Reduce hangs
On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:
> hey list admins
>
> what's this list and how is it different from the other one?
> (hadoop-user). i still see mails on the other one - so curious ..

This is part of Hadoop's move to a top-level project at Apache. The code previously known as Hadoop is now Hadoop Core. Therefore, we have gone from:

hadoop-{user,dev,[EMAIL PROTECTED]

to:

core-{user,dev,[EMAIL PROTECTED]

-- Owen
RE: Reduce hangs
hey list admins

what's this list and how is it different from the other one? (hadoop-user). i still see mails on the other one - so curious ..

Joydeep

-----Original Message-----
From: Taeho Kang [mailto:[EMAIL PROTECTED]
Sent: Mon 1/21/2008 6:03 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce hangs

You said you had two network interfaces... and it might be the source of your problem. Try disabling one of your network interfaces, or set "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your config file so that the datanode knows which network interface to pick up.

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:
> Hi, all
>
> Just to keep this topic updated :) I am still trying to figure out what
> happened.
>
> I found that in my 2-node configuration (namenode and jobtracker on
> node-1, while both nodes are datanodes and tasktrackers), the reduce task
> may sometimes (but rarely) complete for programs that need a small amount
> of CPU time (e.g., mrbench), but for programs with large computation it
> never finishes. When reduce blocks, it always fails at 16%.
>
> Eventually I will get this error information:
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id : task_200801181145_0005_r_00_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection timed out
> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection timed out
> 08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
> java.io.IOException: Job failed!
>
> I found that "IP" is not the correct network address that Hadoop should
> read results from. The servers I use have 2 network interfaces and I am
> using the other one. I explicitly filled in the IP addresses 10.0.0.x in
> all the configuration files.
>
> Might this be the reason for the reduce failure? But the map phase does work.
>
> Thanks
> Yunhong
>
> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
> >
> > Oh, so it is the task running on the other node (ncdm-15) that fails, and
> > Hadoop re-runs the task on the local node (ncdm-8). (I only have two
> > nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on
> > ncdm-8. The program is also started on ncdm-8.)
> >
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
> >
> > Any ideas why the task would fail? And why does it take so long for
> > Hadoop to detect the failure?
> >
> > Thanks
> > Yunhong
> >
> > On Sat, 19 Jan 2008, Devaraj Das wrote:
> >
> >> Hi Yunhong,
> >> As per the output it seems the job ran to successful completion (albeit
> >> with some failures)...
> >> Devaraj
> >>
> >>> -----Original Message-----
> >>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
> >>> Sent: Saturday, January 19, 2008 8:56 AM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: Reduce hangs
> >>>
> >>> Yes, it looks like HADOOP-1374
> >>>
> >>> The program actually failed after a while:
> >>>
> >>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-test.jar mrbench
> >>> MRBenchmark.0.0.2
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-450753747.txt
> >>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0: input=/benchmarks/MRBench/mr_input output=/benchmarks/MRBench/mr_output/output_1843693325
> >>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input paths to process : 1
> >>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job: job_200801181852_0001
> >>> 08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
> >>> 08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
> >>> 08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
> >>> 08/01/18 19:08:27
Re: Reduce hangs
You said you had two network interfaces... and that might be the source of your problem. Try disabling one of your network interfaces, or set "dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your config file so that the datanode knows which network interface to pick.

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:
>
> Hi, all
>
> Just to keep this topic updated :) I am still trying to figure out what
> happened.
>
> In my 2-node configuration (namenode and jobtracker on node-1, with both
> nodes running datanodes and tasktrackers), the reduce task may sometimes
> (but rarely) complete for programs that need a small amount of CPU time
> (e.g., mrbench), but for programs with heavy computation it never
> finishes. When the reduce blocks, it always fails at 16%.
>
> Eventually I get this error information:
>
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
> task_200801181145_0005_r_00_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
> java.io.IOException: Job failed!
>
> I found that "IP" is not the correct network address that Hadoop should
> read results from. The servers I use have 2 network interfaces and I am
> using the other one. I explicitly filled in the IP addresses 10.0.0.x in
> all the configuration files.
>
> Might this be the reason for the reduce failure? The map phase does work.
>
> Thanks
> Yunhong
>
> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>
> > Oh, so it is the task running on the other node (ncdm-15) that fails
> > and Hadoop re-runs the task on the local node (ncdm-8). (I only have
> > two nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are
> > running on ncdm-8. The program is also started on ncdm-8.)
> >
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many
> >>> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >
> > Any ideas why the task would fail? And why does it take so long for
> > Hadoop to detect the failure?
> >
> > Thanks
> > Yunhong
> >
> > On Sat, 19 Jan 2008, Devaraj Das wrote:
> >
> >> Hi Yunhong,
> >> As per the output it seems the job ran to successful completion
> >> (albeit with some failures)...
> >> Devaraj
> >>
> >>> -Original Message-
> >>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
> >>> Sent: Saturday, January 19, 2008 8:56 AM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: Reduce hangs
> >>>
> >>> Yes, it looks like HADOOP-1374
> >>>
> >>> The program actually failed after a while:
> >>>
> >>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
> >>> hadoop-0.15.2-test.jar mrbench
> >>> MRBenchmark.0.0.2
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
> >>> 1 numLines, ASCENDING sortOrder
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
> >>> /benchmarks/MRBench/mr_input/input_-450753747.txt
> >>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
> >>> input=/benchmarks/MRBench/mr_input
> >>> output=/benchmarks/MRBench/mr_output/output_1843693325
> >>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
> >>> paths to process : 1
> >>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
> >>> job_200801181852_0001
> >>> 08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
> >>> 08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
> >>> 08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many
> >>> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
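Taeho's suggested settings above would go in conf/hadoop-site.xml on each datanode. A minimal sketch — the interface name "eth1" and the nameserver address are placeholders for whatever the 10.0.0.x network in this thread actually uses, not values from the thread itself:

```xml
<!-- conf/hadoop-site.xml (fragment): pin the datanode to one interface
     on a multi-homed host. "eth1" and "10.0.0.1" are example values. -->
<property>
  <name>dfs.datanode.dns.interface</name>
  <value>eth1</value>
</property>
<property>
  <name>dfs.datanode.dns.nameserver</name>
  <value>10.0.0.1</value>
</property>
```

With these set, the datanode reports the address of the named interface (as resolved via the given nameserver) instead of whatever the default hostname lookup returns.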
RE: Reduce hangs
Hi, all

Just to keep this topic updated :) I am still trying to figure out what happened.

In my 2-node configuration (namenode and jobtracker on node-1, with both nodes running datanodes and tasktrackers), the reduce task may sometimes (but rarely) complete for programs that need a small amount of CPU time (e.g., mrbench), but for programs with heavy computation it never finishes. When the reduce blocks, it always fails at 16%.

Eventually I get this error information:

08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task outputncdm-IPx
08/01/18 15:13:38 INFO mapred.JobClient: Task Id : task_200801181145_0005_r_00_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection timed out
08/01/18 20:09:40 INFO mapred.JobClient: map 100% reduce 100%
java.io.IOException: Job failed!

I found that "IP" is not the correct network address that Hadoop should read results from. The servers I use have 2 network interfaces and I am using the other one. I explicitly filled in the IP addresses 10.0.0.x in all the configuration files.

Might this be the reason for the reduce failure? The map phase does work.

Thanks
Yunhong

On Sat, 19 Jan 2008, Yunhong Gu1 wrote:

Oh, so it is the task running on the other node (ncdm-15) that fails and Hadoop re-runs the task on the local node (ncdm-8). (I only have two nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on ncdm-8. The program is also started on ncdm-8.)

08/01/18 19:08:27 INFO mapred.JobClient: Task Id : task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15

Any ideas why the task would fail? And why does it take so long for Hadoop to detect the failure?
Thanks
Yunhong

On Sat, 19 Jan 2008, Devaraj Das wrote:

Hi Yunhong,
As per the output it seems the job ran to successful completion (albeit with some failures)...
Devaraj

-Original Message-
From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 19, 2008 8:56 AM
To: [EMAIL PROTECTED]
Subject: Re: Reduce hangs

Yes, it looks like HADOOP-1374

The program actually failed after a while:

[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-test.jar mrbench
MRBenchmark.0.0.2
08/01/18 18:53:08 INFO mapred.MRBench: creating control file: 1 numLines, ASCENDING sortOrder
08/01/18 18:53:08 INFO mapred.MRBench: created control file: /benchmarks/MRBench/mr_input/input_-450753747.txt
08/01/18 18:53:09 INFO mapred.MRBench: Running job 0: input=/benchmarks/MRBench/mr_input output=/benchmarks/MRBench/mr_output/output_1843693325
08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input paths to process : 1
08/01/18 18:53:09 INFO mapred.JobClient: Running job: job_200801181852_0001
08/01/18 18:53:10 INFO mapred.JobClient: map 0% reduce 0%
08/01/18 18:53:17 INFO mapred.JobClient: map 100% reduce 0%
08/01/18 18:53:25 INFO mapred.JobClient: map 100% reduce 16%
08/01/18 19:08:27 INFO mapred.JobClient: Task Id : task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15
08/01/18 19:08:34 INFO mapred.JobClient: map 100% reduce 100%
08/01/18 19:08:35 INFO mapred.JobClient: Job complete: job_200801181852_0001
08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
08/01/18 19:08:35 INFO mapred.JobClient:     Launched map tasks=3
08/01/18 19:08:35 INFO mapred.JobClient:     Launched reduce tasks=1
08/01/18 19:08:35 INFO mapred.JobClient:     Data-local map tasks=2
08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
08/01/18 19:08:35 INFO mapred.JobClient:     Map input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map output records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Map input bytes=2
08/01/18 19:08:35 INFO mapred.JobClient:     Map output bytes=5
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input groups=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce input records=1
08/01/18 19:08:35 INFO mapred.JobClient:     Reduce output records=1

DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        926333

On Fri, 18 Jan 2008, Konstantin Shvachko wrote:

Looks like we still have this unsolved mysterious problem:
http://issues.apache.org/jira/browse/HADOOP-1374
Could it be related to HADOOP-1246? Arun?

Thanks,
--Konstantin

Yunhong Gu1 wrote:

Hi,
If someone knows how to fix the problem described below, please help me out. Thanks!
I am testing Hadoop on 2-node cluster and the "reduce
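A recurring diagnosis in this thread is that trackers advertise a hostname that resolves to the wrong interface, so reducers cannot fetch map output ("Too many fetch-failures", MAX_FAILED_UNIQUE_FETCHES). A small script can check whether each cluster hostname resolves to an address on the intended network before blaming Hadoop. This is a sketch: the node names and the 10.0.0.x prefix echo Yunhong's setup but are illustrative, not a tool from the thread.

```python
# Check that each cluster hostname resolves to an IP on the intended
# network (e.g., the 10.0.0.x interface the Hadoop config uses).
# A mismatch here is the /etc/hosts / multi-interface problem the
# thread describes. Hostnames and prefix below are examples.
import socket

EXPECTED_PREFIX = "10.0.0."          # network Hadoop should use
HOSTNAMES = ["ncdm-8", "ncdm-15"]    # the two nodes in the thread

def resolves_to_expected(host, prefix=EXPECTED_PREFIX):
    """Return (ip, ok): ok is True if `host` resolves to an IPv4
    address starting with `prefix`; (None, False) if unresolvable."""
    try:
        ip = socket.gethostbyname(host)
    except socket.gaierror:
        return None, False
    return ip, ip.startswith(prefix)

if __name__ == "__main__":
    for host in HOSTNAMES:
        ip, ok = resolves_to_expected(host)
        status = "OK" if ok else "WRONG INTERFACE OR UNRESOLVED"
        print(f"{host} -> {ip} [{status}]")
```

Run it on every node: if any tracker's hostname resolves to the other interface (or not at all) on some node, fixing /etc/hosts there, as suggested earlier in the thread, is the usual cure.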