Re: Reduce Hangs

2008-03-30 Thread Mafish Liu
All ports are listed in conf/hadoop-default.xml and conf/hadoop-site.xml.
Also, if you are using HBase, you need to check hbase-default.xml and
hbase-site.xml, located in the hbase directory.
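
For example, the namenode and jobtracker RPC ports are pinned by
fs.default.name and mapred.job.tracker in conf/hadoop-site.xml. A minimal
sketch (the host name "master" and the port values are illustrative, and
the names of the remaining per-daemon port properties vary between Hadoop
versions, so verify them against your hadoop-default.xml):

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>master:9000</value>
    <description>NameNode RPC host:port</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
    <description>JobTracker RPC host:port</description>
  </property>
</configuration>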

2008/3/29 Natarajan, Senthil <[EMAIL PROTECTED]>:

> Hi,
> Thanks for your suggestions.
>
> It looks like the problem is with the firewall. I created a firewall rule
> to allow ports 5 to 50100 (I found Hadoop listening in this port range).
>
> It looks like I am missing some ports, and those get blocked by the
> firewall.
>
> Could anyone please let me know how to configure Hadoop to use only
> certain specified ports, so that those ports can be allowed through the
> firewall?
>
> Thanks,
> Senthil
>
>


-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


RE: Reduce Hangs

2008-03-28 Thread Natarajan, Senthil
Hi,
Thanks for your suggestions.

It looks like the problem is with the firewall. I created a firewall rule to
allow ports 5 to 50100 (I found Hadoop listening in this port range).

It looks like I am missing some ports, and those get blocked by the firewall.

Could anyone please let me know how to configure Hadoop to use only certain
specified ports, so that those ports can be allowed through the firewall?
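
For reference, an iptables rule that opens a TCP port range looks like the
following (the range shown and the lack of interface/source restrictions
are illustrative):

iptables -A INPUT -p tcp --dport 50000:50100 -j ACCEPT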

Thanks,
Senthil

-Original Message-
From: 朱盛凯 [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 27, 2008 12:32 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce Hangs

Hi,

I ran into this problem in my cluster before, so I can share some of my
experience, though it may not apply in your case.

The job in my cluster always hung at 16% of reduce. It occurred because the
reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this communication failure between
two task trackers.

One is that the firewall blocks the trackers from communicating. I solved
this by disabling the firewall.
The other is that trackers refer to other nodes by host name only, not by
IP address. I solved this by editing /etc/hosts on every node in the
cluster to map each host name to its IP address.

I hope my experience will be helpful for you.

On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
>
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I run the wordcount example on one of our log files (size ~350 MB),
>
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...); after
> a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at
> 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>


Re: Reduce Hangs

2008-03-27 Thread Mafish Liu
On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I ran into this problem in my cluster before, so I can share some of my
> experience, though it may not apply in your case.
>
> The job in my cluster always hung at 16% of reduce. It occurred because the
> reduce task could not fetch the map output from other nodes.
>
> In my case, two factors could cause this communication failure between
> two task trackers.
>
> One is that the firewall blocks the trackers from communicating. I solved
> this by disabling the firewall.
> The other is that trackers refer to other nodes by host name only, not by
> IP address. I solved this by editing /etc/hosts on every node in the
> cluster to map each host name to its IP address.


I ran into this problem for the same reason.
Try adding the host names of all nodes to the /etc/hosts file on every
machine.

>
>
> I hope my experience will be helpful for you.
>
> On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > I have a small Hadoop cluster: one master and three slaves.
> > When I run the wordcount example on one of our log files (size ~350 MB),
> >
> > map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
> > after a very long time it finishes.
> > I am seeing this error:
> > Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> > In the log I am seeing this:
> > INFO org.apache.hadoop.mapred.TaskTracker:
> > task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at
> > 0.02 MB/s) >
> >
> > Do you know what might be the problem?
> > Thanks,
> > Senthil
> >
> >
>



-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: Reduce Hangs

2008-03-27 Thread 朱盛凯
Hi,

I ran into this problem in my cluster before, so I can share some of my
experience, though it may not apply in your case.

The job in my cluster always hung at 16% of reduce. It occurred because the
reduce task could not fetch the map output from other nodes.

In my case, two factors could cause this communication failure between
two task trackers.

One is that the firewall blocks the trackers from communicating. I solved
this by disabling the firewall.
The other is that trackers refer to other nodes by host name only, not by
IP address. I solved this by editing /etc/hosts on every node in the
cluster to map each host name to its IP address.
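
For example, a minimal /etc/hosts, with illustrative host names and
addresses, would look like this on every node:

10.0.0.1    master
10.0.0.2    slave1
10.0.0.3    slave2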

I hope my experience will be helpful for you.

On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
>
> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I run the wordcount example on one of our log files (size ~350 MB),
>
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...); after
> a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_00_0 0.1834% reduce > copy (11 of 20 at
> 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>


Re: Reduce Hangs

2008-03-27 Thread Amar Kamat
On Thu, 27 Mar 2008, Natarajan, Senthil wrote:

> Hi,
> I have a small Hadoop cluster: one master and three slaves.
> When I run the wordcount example on one of our log files (size ~350 MB),
>
> map runs fine but reduce always hangs (sometimes around 19%, 60%, ...); after
> a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
This error occurs when the reducer fails to fetch map-task output from 5
unique map tasks. Before considering an attempt failed, the reducer
tries to fetch a map output 7 times within 5 minutes (default config).
In case of job failure, check the following:
1. Is the problem common to all the reducers?
2. Are the map tasks the same across all the reducers for which the failure
is reported?
3. Is there at least one map task whose output is fetched successfully?
If the job eventually succeeds, then there might be some problem with the
reducer.
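
A simplified sketch of that bail-out rule, in Java (illustrative only, not
the actual Hadoop source; the constant name comes from the error message
and the thresholds are the defaults described above):

import java.util.HashSet;
import java.util.Set;

class ShuffleBailOutSketch {
    // 5 unique failed fetches triggers the bail-out, per the default config
    static final int MAX_FAILED_UNIQUE_FETCHES = 5;

    private final Set<String> failedUniqueMaps = new HashSet<String>();

    // called once fetching mapTaskId's output has exhausted its retries
    // (7 attempts within 5 minutes by default)
    void reportFetchFailure(String mapTaskId) {
        failedUniqueMaps.add(mapTaskId);
        if (failedUniqueMaps.size() >= MAX_FAILED_UNIQUE_FETCHES) {
            throw new RuntimeException(
                "Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out");
        }
    }
}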
Amar
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_00_0 
> 0.1834% reduce > copy (11 of 20 at 0.02 MB/s) >
>
> Do you know what might be the problem?
> Thanks,
> Senthil
>
>


Reduce Hangs

2008-03-27 Thread Natarajan, Senthil
Hi,
I have a small Hadoop cluster: one master and three slaves.
When I run the wordcount example on one of our log files (size ~350 MB),

map runs fine but reduce always hangs (sometimes around 19%, 60%, ...); after a
very long time it finishes.
I am seeing this error:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
In the log I am seeing this:
INFO org.apache.hadoop.mapred.TaskTracker: task_200803261535_0001_r_00_0 
0.1834% reduce > copy (11 of 20 at 0.02 MB/s) >

Do you know what might be the problem?
Thanks,
Senthil



Re: reduce > copy ?? (was Reduce hangs)

2008-01-22 Thread Yunhong Gu1



This is the Java thread dump when the problem happens. Not sure if it is a
deadlock. (Hadoop version 0.15.2.)


Thread-25[1] threads
Group system:
  (java.lang.ref.Reference$ReferenceHandler)0x77b  Reference Handler  cond. waiting
  (java.lang.ref.Finalizer$FinalizerThread)0x77a   Finalizer          cond. waiting
  (java.lang.Thread)0x779                          Signal Dispatcher  running
Group main:
  (java.lang.Thread)0x1                            main               cond. waiting
  (java.lang.Thread)0x778                          taskCleanup        cond. waiting
  (org.mortbay.jetty.servlet.AbstractSessionManager$SessionScavenger)0x777  SessionScavenger  running
  (org.mortbay.util.ThreadedServer$Acceptor)0x776  Acceptor ServerSocket[addr=0.0.0.0/0.0.0.0,port=0,localport=50060]  running
  (org.mortbay.util.ThreadPool$PoolThread)0x775    SocketListener0-0  cond. waiting
  (org.apache.hadoop.ipc.Server$Listener)0x774     IPC Server listener on 34516   running
  (org.apache.hadoop.ipc.Server$Handler)0x773      IPC Server handler 0 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x772      IPC Server handler 1 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x771      IPC Server handler 2 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Server$Handler)0x770      IPC Server handler 3 on 34516  cond. waiting
  (org.apache.hadoop.ipc.Client$ConnectionCuller)0x76f  org.apache.hadoop.io.ObjectWritable Connection Culler  running
  (org.apache.hadoop.mapred.TaskTracker$MapEventsFetcherThread)0x76e  Map-events fetcher for all reduce tasks on tracker_ncdm15:/10.0.0.2:34516  running
  (org.apache.hadoop.util.Daemon)0x76d             [EMAIL PROTECTED]  running
  (org.apache.hadoop.mapred.ReduceTaskRunner)0x76c Thread-25          cond. waiting
  (java.lang.UNIXProcess$1$1)0x76b                 process reaper     running
  (org.apache.hadoop.mapred.ReduceTaskRunner)0x7d0 Thread-110         cond. waiting
  (java.lang.UNIXProcess$1$1)0x7d1                 process reaper     running
  (org.apache.hadoop.ipc.Client$Connection)0x7f6   IPC Client connection to /10.0.0.1:60001  cond. waiting
Thread-25[1] thread 0x7d0
Thread-110[1] where
  [1] java.lang.Object.wait (native method)
  [2] java.lang.Object.wait (Object.java:485)
  [3] java.lang.UNIXProcess.waitFor (UNIXProcess.java:165)
  [4] org.apache.hadoop.mapred.TaskRunner.runChild (TaskRunner.java:477)
  [5] org.apache.hadoop.mapred.TaskRunner.run (TaskRunner.java:343)

On Tue, 22 Jan 2008, Yunhong Gu1 wrote:



Hi, All

I have tried many possible configurations, and this is the deepest cause I
can dig out so far. The whole "Reduce hangs" happens because one task
tracker does not progress at all. It keeps doing "reduce > copy" forever,
as shown below.


This is very easy to reproduce on my machines (dual dual-core AMD Opteron
3.0 GHz, Debian Linux, kernel 2.6.16, running in 32-bit mode).


[EMAIL PROTECTED]:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 
MB/s) >
2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 
MB/s) >
2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 
MB/s) >

2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_00_0 0.1667% reduce >

reduce > copy ?? (was Reduce hangs)

2008-01-22 Thread Yunhong Gu1


Hi, All

I have tried many possible configurations, and this is the deepest cause I
can dig out so far. The whole "Reduce hangs" happens because one task
tracker does not progress at all. It keeps doing "reduce > copy" forever,
as shown below.


This is very easy to reproduce on my machines (dual dual-core AMD Opteron
3.0 GHz, Debian Linux, kernel 2.6.16, running in 32-bit mode).


[EMAIL PROTECTED]:~/hadoop-0.15.2/logs$ tail hadoop-gu-tasktracker-ncdm-8.log
2008-01-22 18:34:45,591 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:48,596 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:34:54,605 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >

2008-01-22 18:35:00,619 INFO org.apache.hadoop.mapred.TaskTracker:
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) 
>
2008-01-22 18:35:03,627 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:09,644 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:15,661 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:18,670 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >
2008-01-22 18:35:24,687 INFO org.apache.hadoop.mapred.TaskTracker: 
task_200801221827_0001_r_00_0 0.1667% reduce > copy (1 of 2 at 0.00 MB/s) >



On Tue, 22 Jan 2008, Yunhong Gu1 wrote:



Thanks, I tried that, but it is probably not the reason. I checked the
network connections using "netstat", and the client is actually connected
to the correct server address. In addition, "mrbench" works sometimes; if
it were a network problem, nothing would work at all.
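
For reference, the check was along these lines (flags are illustrative;
-p needs root to show other users' processes):

netstat -tnp | grep java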


I let the "sort" program ran longer, and get some interesting output, the
reduce progress can actually be descreasing and oscillating (see the bottom 
part of the output below).


The error information:
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/22 13:56:21 WARN mapred.JobClient: Error reading task outputncdm15

What can cause "Error reading task outputncdm15"?

[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar hadoop-0.15.2-examples.jar sort 
rand randsort
Running on 2 nodes to sort from /user/gu/rand into /user/gu/randsort with 8 
reduces.

Job started: Tue Jan 22 11:56:39 CST 2008
08/01/22 11:56:39 INFO mapred.FileInputFormat: Total input paths to process: 20

08/01/22 11:56:40 INFO mapred.JobClient: Running job: job_200801221154_0005
08/01/22 11:56:41 INFO mapred.JobClient:  map 0% reduce 0%
08/01/22 11:56:54 INFO mapred.JobClient:  map 1% reduce 0%
08/01/22 11:56:58 INFO mapred.JobClient:  map 2% reduce 0%
08/01/22 11:57:04 INFO mapred.JobClient:  map 3% reduce 0%
08/01/22 11:57:08 INFO mapred.JobClient:  map 5% reduce 0%
08/01/22 11:57:15 INFO mapred.JobClient:  map 7% reduce 0%
08/01/22 11:57:23 INFO mapred.JobClient:  map 8% reduce 0%
08/01/22 11:57:25 INFO mapred.JobClient:  map 9% reduce 0%
08/01/22 11:57:27 INFO mapred.JobClient:  map 10% reduce 0%
08/01/22 11:57:34 INFO mapred.JobClient:  map 11% reduce 0%
08/01/22 11:57:35 INFO mapred.JobClient:  map 12% reduce 0%
08/01/22 11:57:42 INFO mapred.JobClient:  map 13% reduce 0%
08/01/22 11:57:48 INFO mapred.JobClient:  map 14% reduce 0%
08/01/22 11:57:49 INFO mapred.JobClient:  map 15% reduce 0%
08/01/22 11:57:56 INFO mapred.JobClient:  map 17% reduce 0%
08/01/22 11:58:05 INFO mapred.JobClient:  map 19% reduce 0%
08/01/22 11:58:12 INFO mapred.JobClient:  map 20% reduce 0%
08/01/22 11:58:13 INFO mapred.JobClient:  map 20% reduce 1%
08/01/22 11:58:14 INFO mapred.JobClient:  map 21% reduce 1%
08/01/22 11:58:19 INFO mapred.JobClient:  map 22% reduce 1%
08/01/22 11:58:26 INFO mapred.JobClient:  map 23% reduce 1%
08/01/22 11:58:30 INFO mapred.JobClient:  map 24% reduce 1%
08/01/22 11:58:32 INFO mapred.JobClient:  map 25% reduce 1%
08/01/22 11:58:37 INFO mapred.JobClient:  map 26% reduce 1%
08/01/22 11:58:42 INFO mapred.JobClient:  map 27% reduce 1%
08/01/22 11:58:44 INFO mapred.JobClient:  map 28% reduce 1%
08/01/22 11:58:49 INFO mapred.JobClient:  map 29% reduce 1%
08/01/22 11:58:55 INFO mapred.JobClient:  map 30% reduce 1%
08/01/22 11:58:57 INFO mapred.JobClient:  map 31% reduce 1%
08/01/22 11:58:59 INFO mapred.JobClient:  map 32% reduce 1%
08/01/22 11:59:07 INFO mapred.JobClient:  map 33% reduce 1%
08/01/22 11:59:08 INFO mapred.JobClient:  map 34% reduce 1%
08/01/22 11:59:15 INFO mapred.JobClient:  map 35% r

Re: Reduce hangs

2008-01-22 Thread Yunhong Gu1
 Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/22 15:37:13 INFO mapred.JobClient:  map 100% reduce 10%
08/01/22 15:37:33 INFO mapred.JobClient:  map 100% reduce 11%
08/01/22 15:37:53 INFO mapred.JobClient:  map 100% reduce 12%
08/01/22 15:38:23 INFO mapred.JobClient:  map 100% reduce 13%
08/01/22 15:45:51 INFO mapred.JobClient:  map 100% reduce 12%
08/01/22 15:45:51 INFO mapred.JobClient: Task Id :
task_200801221154_0005_r_01_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/22 15:46:11 INFO mapred.JobClient:  map 100% reduce 13%
08/01/22 15:46:31 INFO mapred.JobClient:  map 100% reduce 14%
08/01/22 15:46:51 INFO mapred.JobClient:  map 100% reduce 15%
08/01/22 15:47:24 INFO mapred.JobClient:  map 100% reduce 16%


On Tue, 22 Jan 2008, Taeho Kang wrote:


You said you had two network interfaces... and it might be the source of
your problem.

Try disabling one of your network interfaces, or set
"dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
config file so that the datanode knows which network interface to pick up.

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:



Hi, all

Just to keep this topic updated :) I am still trying to figure out what
happened.

I found that in my 2-node configuration (namenode and jobtracker on
node-1, while both nodes are datanodes and tasktrackers), the reduce task
may sometimes (but rarely) complete for programs that need a small amount
of CPU time (e.g., mrbench), but for programs with heavy computation it
never finishes. When reduce blocks, it always fails at 16%.

Eventually I will get this error information:
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
outputncdm-IPx
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
outputncdm-IPx
08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
task_200801181145_0005_r_00_1, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
outputConnection timed out
08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
outputConnection timed out
08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!

I found that "IP" is not the correct network address
that Hadoop should read result from. The servers I use have 2 network
interfaces and I am using another one. I explicitly fill the IP addresses
10.0.0.x in all the configuration files.

Might this be the reason of Reduce failure? But the Map phase does work.

Thanks
Yunhong


On Sat, 19 Jan 2008, Yunhong Gu1 wrote:



Oh, so it is the task running on the other node (ncdm-15) that fails, and
Hadoop re-runs the task on the local node (ncdm-8). (I only have two nodes,
ncdm-8 and ncdm-15. Both namenode and jobtracker are running on ncdm-8. The
program is also started on ncdm-8.)


08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
task_200801181852_0001_m_01_0, Status : FAILED Too many

fetch-failures

08/01/18 19:08:27 WARN mapred.JobClient: Error reading task

outputncdm15


Any ideas why the task would fail? And why does it take so long for Hadoop
to detect the failure?

Thanks
Yunhong

On Sat, 19 Jan 2008, Devaraj Das wrote:


Hi Yunhong,
As per the output, it seems the job ran to successful completion (albeit
with some failures)...
Devaraj


-Original Message-
From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 19, 2008 8:56 AM
To: [EMAIL PROTECTED]
Subject: Re: Reduce hangs



Yes, it looks like HADOOP-1374

The program actually failed after a while:


[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
hadoop-0.15.2-test.jar mrbench
MRBenchmark.0.0.2
08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
1 numLines, ASCENDING sortOrder
08/01/18 18:53:08 INFO mapred.MRBench: created control file:
/benchmarks/MRBench/mr_input/input_-450753747.txt
08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
input=/benchmarks/MRBench/mr_input
output=/benchmarks/MRBench/mr_output/output_1843693325
08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
paths to process : 1
08/01/18 18:53:09 INFO mapred.JobClient: Running job:
job_200801181852_0001
08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
task_200801181852_0001_m_01_0, Status : FAILED Too many
fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
outputncdm15
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
outputncdm15
08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
job_200801181852_0001
08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
08/01/18 19:08:35 INFO mapred

Re: Reduce hangs 2

2008-01-22 Thread Stefan Groschupf

Hi,
not sure if this is the same source of the problem, but I also ran into
problems with a hanging reduce.
It is reproducible for me, though I have not found the source of the
problem yet.
I run a series of jobs, and in my last job the last reduce task hangs for
about 15 to 20 minutes doing nothing, but then resumes. I am running
Hadoop 0.15.1.


Below are the log entries during the hang. So I think it is not the copy
problem mentioned before. I also checked that our DFS is healthy.



2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Need 2 map output(s)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Got 2 known map output location(s); scheduling...
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Scheduled 2 of 2 known outputs (0 slow hosts and 0 dup hosts)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:09,328 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,243 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,610 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:11,611 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying of all map outputs complete. Initiating the last merge on the remaining files in ramfs://mapoutput169937755
2008-01-22 21:22:11,635 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Merge of the 1 files in InMemoryFileSystem complete. Local file is /home/hadoop/data/hadoop-hadoop/mapred/local/task_200801221313_0003_r_46_1/map_34.out


Any ideas? Thanks!
Stefan 


Re: Reduce hangs

2008-01-22 Thread Owen O'Malley


On Jan 21, 2008, at 11:22 PM, ma qiang wrote:


Do we need to update our mailing list from hadoop-user to core-user?


No, everyone should have been moved automatically.

-- Owen


Re: Reduce hangs

2008-01-22 Thread Ajay Anand


- Original Message -
From: ma qiang <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org 
Sent: Mon Jan 21 23:22:34 2008
Subject: Re: Reduce hangs

Do we need to update our mailing list from hadoop-user to core-user?

On Jan 22, 2008 2:56 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:
>
> > hey list admins
> >
> > what's this list and how is it different from the other one?
> > (hadoop-user). i still see mails on the other one - so curious ..
>
> This is part of Hadoop's move to a top-level project at Apache. The
> code previously known as Hadoop is now Hadoop Core. Therefore, we have
> gone from:
>
> hadoop-{user,dev,[EMAIL PROTECTED]
>
> to:
>
>   core-{user,dev,[EMAIL PROTECTED]
>
> -- Owen
>


Re: Reduce hangs

2008-01-21 Thread ma qiang
Do we need to update our mailing list from hadoop-user to core-user?

On Jan 22, 2008 2:56 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
>
> On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:
>
> > hey list admins
> >
> > what's this list and how is it different from the other one?
> > (hadoop-user). i still see mails on the other one - so curious ..
>
> This is part of Hadoop's move to a top-level project at Apache. The
> code previously known as Hadoop is now Hadoop Core. Therefore, we have
> gone from:
>
> hadoop-{user,dev,[EMAIL PROTECTED]
>
> to:
>
>   core-{user,dev,[EMAIL PROTECTED]
>
> -- Owen
>


Re: Reduce hangs

2008-01-21 Thread Owen O'Malley


On Jan 21, 2008, at 10:08 PM, Joydeep Sen Sarma wrote:


hey list admins

what's this list and how is it different from the other one?  
(hadoop-user). i still see mails on the other one - so curious ..


This is part of Hadoop's move to a top-level project at Apache. The
code previously known as Hadoop is now Hadoop Core. Therefore, we have
gone from:


hadoop-{user,dev,[EMAIL PROTECTED]

to:

 core-{user,dev,[EMAIL PROTECTED]

-- Owen


RE: Reduce hangs

2008-01-21 Thread Joydeep Sen Sarma
hey list admins

what's this list and how is it different from the other one? (hadoop-user). i 
still see mails on the other one - so curious ..

Joydeep


-Original Message-
From: Taeho Kang [mailto:[EMAIL PROTECTED]
Sent: Mon 1/21/2008 6:03 PM
To: core-user@hadoop.apache.org
Subject: Re: Reduce hangs
 
You said you had two network interfaces... and it might be the source of
your problem.

Try disabling one of your network interfaces, or set
"dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
config file so that the datanode knows which network interface to pick up.

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:

>
> Hi, all
>
> Just to keep this topic updated :) I am still trying to figure out what
> happened.
>
> I found that in my 2-node configuration (namenode and jobtracker on
> node-1, while both nodes are datanodes and tasktrackers), the reduce task
> may sometimes (but rarely) complete for programs that need a small amount
> of CPU time (e.g., mrbench), but for programs with heavy computation it
> never finishes. When reduce blocks, it always fails at 16%.
>
> Eventually I will get this error information:
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
> task_200801181145_0005_r_00_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>
> I found that "IP" is not the correct network address
> that Hadoop should read result from. The servers I use have 2 network
> interfaces and I am using another one. I explicitly fill the IP addresses
> 10.0.0.x in all the configuration files.
>
> Might this be the reason of Reduce failure? But the Map phase does work.
>
> Thanks
> Yunhong
>
>
> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>
> >
> > Oh, so it is the task running on the other node (ncdm-15) that fails,
> > and Hadoop re-runs the task on the local node (ncdm-8). (I only have two
> > nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on
> > ncdm-8. The program is also started on ncdm-8.)
> >
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many
> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> outputncdm15
> >
> > Any ideas why the task would fail? And why does it take so long for
> > Hadoop to detect the failure?
> >
> > Thanks
> > Yunhong
> >
> > On Sat, 19 Jan 2008, Devaraj Das wrote:
> >
> >> Hi Yunhong,
> >> As per the output, it seems the job ran to successful completion
> >> (albeit with some failures)...
> >> Devaraj
> >>
> >>> -Original Message-
> >>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
> >>> Sent: Saturday, January 19, 2008 8:56 AM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: Reduce hangs
> >>>
> >>>
> >>>
> >>> Yes, it looks like HADOOP-1374
> >>>
> >>> The program actually failed after a while:
> >>>
> >>>
> >>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
> >>> hadoop-0.15.2-test.jar mrbench
> >>> MRBenchmark.0.0.2
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
> >>> 1 numLines, ASCENDING sortOrder
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
> >>> /benchmarks/MRBench/mr_input/input_-450753747.txt
> >>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
> >>> input=/benchmarks/MRBench/mr_input
> >>> output=/benchmarks/MRBench/mr_output/output_1843693325
> >>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
> >>> paths to process : 1
> >>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
> >>> job_200801181852_0001
> >>> 08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
> >>> 08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
> >>> 08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
> >>> 08/01/18 19:08:27 

Re: Reduce hangs

2008-01-21 Thread Taeho Kang
You said you had two network interfaces... and it might be the source of
your problem.

Try disabling one of your network interfaces, or set
"dfs.datanode.dns.interface" and "dfs.datanode.dns.nameserver" in your
config file so that the datanode knows which network interface to pick up.
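
A sketch of the corresponding hadoop-site.xml entries (the interface and
nameserver values are illustrative):

<property>
  <name>dfs.datanode.dns.interface</name>
  <value>eth0</value>
</property>
<property>
  <name>dfs.datanode.dns.nameserver</name>
  <value>default</value>
</property>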

/Taeho

On Jan 22, 2008 7:34 AM, Yunhong Gu1 <[EMAIL PROTECTED]> wrote:

>
> Hi, all
>
> Just to keep this topic updated :) I am still trying to figure out what
> happened.
>
> I found that in my 2-node configuration (namenode and jobtracker on
> node-1, while both nodes are datanodes and tasktrackers), the reduce task
> may sometimes (but rarely) complete for programs that need a small amount
> of CPU time (e.g., mrbench), but for programs with heavy computation it
> never finishes. When reduce blocks, it always fails at 16%.
>
> Eventually I will get this error information:
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:01:27 WARN mapred.JobClient: Error reading task
> outputncdm-IPx
> 08/01/18 15:13:38 INFO mapred.JobClient: Task Id :
> task_200801181145_0005_r_00_1, Status : FAILED
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
> 08/01/18 19:56:38 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 19:59:47 WARN mapred.JobClient: Error reading task
> outputConnection timed out
> 08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
> java.io.IOException: Job failed!
>
> I found that "IP" is not the correct network address
> that Hadoop should read result from. The servers I use have 2 network
> interfaces and I am using another one. I explicitly fill the IP addresses
> 10.0.0.x in all the configuration files.
>
> Might this be the reason of Reduce failure? But the Map phase does work.
>
> Thanks
> Yunhong
>
>
> On Sat, 19 Jan 2008, Yunhong Gu1 wrote:
>
> >
> > Oh, so it is the task running on the other node (ncdm-15) that fails,
> > and Hadoop re-runs the task on the local node (ncdm-8). (I only have two
> > nodes, ncdm-8 and ncdm-15. Both namenode and jobtracker are running on
> > ncdm-8. The program is also started on ncdm-8.)
> >
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many
> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> outputncdm15
> >
> > Any ideas why the task would fail? And why does it take so long for
> > Hadoop to detect the failure?
> >
> > Thanks
> > Yunhong
> >
> > On Sat, 19 Jan 2008, Devaraj Das wrote:
> >
> >> Hi Yunhong,
> >> As per the output, it seems the job ran to successful completion
> >> (albeit with some failures)...
> >> Devaraj
> >>
> >>> -Original Message-
> >>> From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
> >>> Sent: Saturday, January 19, 2008 8:56 AM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: Reduce hangs
> >>>
> >>>
> >>>
> >>> Yes, it looks like HADOOP-1374
> >>>
> >>> The program actually failed after a while:
> >>>
> >>>
> >>> [EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
> >>> hadoop-0.15.2-test.jar mrbench
> >>> MRBenchmark.0.0.2
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
> >>> 1 numLines, ASCENDING sortOrder
> >>> 08/01/18 18:53:08 INFO mapred.MRBench: created control file:
> >>> /benchmarks/MRBench/mr_input/input_-450753747.txt
> >>> 08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
> >>> input=/benchmarks/MRBench/mr_input
> >>> output=/benchmarks/MRBench/mr_output/output_1843693325
> >>> 08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
> >>> paths to process : 1
> >>> 08/01/18 18:53:09 INFO mapred.JobClient: Running job:
> >>> job_200801181852_0001
> >>> 08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
> >>> 08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
> >>> 08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
> >>> 08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
> >>> task_200801181852_0001_m_01_0, Status : FAILED Too many
> >>> fetch-failures
> >>> 08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
> >>> outputncdm15
> >>> 08/01/18 19:08:27 WARN mapred.Jo

RE: Reduce hangs

2008-01-21 Thread Yunhong Gu1


Hi, all

Just to keep this topic updated :) I am still trying to figure out what
happened.


I found that in my 2-node configuration (namenode and jobtracker on
node-1, while both nodes are datanodes and tasktrackers), the reduce task
may sometimes (but rarely) complete for programs that need a small amount
of CPU time (e.g., mrbench), but for programs with heavy computation it
never finishes. When reduce blocks, it always fails at 16%.


Eventually I will get this error information:
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task 
outputncdm-IPx
08/01/18 15:01:27 WARN mapred.JobClient: Error reading task 
outputncdm-IPx
08/01/18 15:13:38 INFO mapred.JobClient: Task Id : 
task_200801181145_0005_r_00_1, Status : FAILED

Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
08/01/18 19:56:38 WARN mapred.JobClient: Error reading task outputConnection 
timed out
08/01/18 19:59:47 WARN mapred.JobClient: Error reading task outputConnection 
timed out
08/01/18 20:09:40 INFO mapred.JobClient:  map 100% reduce 100%
java.io.IOException: Job failed!

I found that "IP" is not the correct network address 
that Hadoop should read result from. The servers I use have 2 network 
interfaces and I am using another one. I explicitly fill the IP addresses 
10.0.0.x in all the configuration files.


Might this be the reason of Reduce failure? But the Map phase does work.

Thanks
Yunhong


On Sat, 19 Jan 2008, Yunhong Gu1 wrote:



Oh, so it is the task running on the other node (ncdm-15) that fails, and
Hadoop re-runs the task on the local node (ncdm-8). (I only have two nodes,
ncdm-8 and ncdm-15. Both namenode and jobtracker are running on ncdm-8. The
program is also started on ncdm-8.)


08/01/18 19:08:27 INFO mapred.JobClient: Task Id : 
task_200801181852_0001_m_01_0, Status : FAILED Too many fetch-failures

08/01/18 19:08:27 WARN mapred.JobClient: Error reading task outputncdm15


Any ideas why the task would fail? And why does it take so long for Hadoop
to detect the failure?


Thanks
Yunhong

On Sat, 19 Jan 2008, Devaraj Das wrote:


Hi Yunhong,
As per the output, it seems the job ran to successful completion (albeit
with some failures)...
Devaraj


-Original Message-
From: Yunhong Gu1 [mailto:[EMAIL PROTECTED]
Sent: Saturday, January 19, 2008 8:56 AM
To: [EMAIL PROTECTED]
Subject: Re: Reduce hangs



Yes, it looks like HADOOP-1374

The program actually failed after a while:


[EMAIL PROTECTED]:~/hadoop-0.15.2$ ./bin/hadoop jar
hadoop-0.15.2-test.jar mrbench
MRBenchmark.0.0.2
08/01/18 18:53:08 INFO mapred.MRBench: creating control file:
1 numLines, ASCENDING sortOrder
08/01/18 18:53:08 INFO mapred.MRBench: created control file:
/benchmarks/MRBench/mr_input/input_-450753747.txt
08/01/18 18:53:09 INFO mapred.MRBench: Running job 0:
input=/benchmarks/MRBench/mr_input
output=/benchmarks/MRBench/mr_output/output_1843693325
08/01/18 18:53:09 INFO mapred.FileInputFormat: Total input
paths to process : 1
08/01/18 18:53:09 INFO mapred.JobClient: Running job:
job_200801181852_0001
08/01/18 18:53:10 INFO mapred.JobClient:  map 0% reduce 0%
08/01/18 18:53:17 INFO mapred.JobClient:  map 100% reduce 0%
08/01/18 18:53:25 INFO mapred.JobClient:  map 100% reduce 16%
08/01/18 19:08:27 INFO mapred.JobClient: Task Id :
task_200801181852_0001_m_01_0, Status : FAILED Too many
fetch-failures
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
outputncdm15
08/01/18 19:08:27 WARN mapred.JobClient: Error reading task
outputncdm15
08/01/18 19:08:34 INFO mapred.JobClient:  map 100% reduce 100%
08/01/18 19:08:35 INFO mapred.JobClient: Job complete:
job_200801181852_0001
08/01/18 19:08:35 INFO mapred.JobClient: Counters: 10
08/01/18 19:08:35 INFO mapred.JobClient:   Job Counters
08/01/18 19:08:35 INFO mapred.JobClient: Launched map tasks=3
08/01/18 19:08:35 INFO mapred.JobClient: Launched reduce tasks=1
08/01/18 19:08:35 INFO mapred.JobClient: Data-local map tasks=2
08/01/18 19:08:35 INFO mapred.JobClient:   Map-Reduce Framework
08/01/18 19:08:35 INFO mapred.JobClient: Map input records=1
08/01/18 19:08:35 INFO mapred.JobClient: Map output records=1
08/01/18 19:08:35 INFO mapred.JobClient: Map input bytes=2
08/01/18 19:08:35 INFO mapred.JobClient: Map output bytes=5
08/01/18 19:08:35 INFO mapred.JobClient: Reduce input groups=1
08/01/18 19:08:35 INFO mapred.JobClient: Reduce input records=1
08/01/18 19:08:35 INFO mapred.JobClient: Reduce output records=1
DataLines  Maps  Reduces  AvgTime (milliseconds)
1          2     1        926333



On Fri, 18 Jan 2008, Konstantin Shvachko wrote:


Looks like we still have this unsolved mysterious problem:

http://issues.apache.org/jira/browse/HADOOP-1374

Could it be related to HADOOP-1246? Arun?

Thanks,
--Konstantin

Yunhong Gu1 wrote:


Hi,

If someone knows how to fix the problem described below, please help me
out. Thanks!

I am testing Hadoop on a 2-node cluster and the "reduce