On Fri, Mar 28, 2008 at 12:31 AM, 朱盛凯 <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I met this problem in my cluster before, so I can share some of my
> experience, though it may not apply in your case.
>
> The job in my cluster always hung at 16% of reduce. It occurred because
> the reduce task could not fetch the map output from other nodes.
>
> In my case, two factors caused this failure of communication between
> two task trackers.
>
> One is that the firewall blocked the trackers from communicating. I
> solved this by disabling the firewall.
> The other is that the trackers referred to other nodes by host name
> only, not by IP address. I solved this by editing the file /etc/hosts
> with a mapping from hostname to IP address for all nodes in the cluster.

I ran into this problem for the same reason. Try adding the host names of
all your nodes to every node's /etc/hosts file.

> I hope my experience will be helpful for you.
>
> On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
> >
> > Hi,
> > I have a small Hadoop cluster: one master and three slaves.
> > When I try the example wordcount on one of our log files (size ~350 MB),
> > map runs fine but reduce always hangs (sometimes around 19%, 60%, ...);
> > after a very long time it finishes.
> > I am seeing this error:
> > Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> > In the log I am seeing this:
> > INFO org.apache.hadoop.mapred.TaskTracker:
> > task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at
> > 0.02 MB/s)
> >
> > Do you know what might be the problem?
> > Thanks,
> > Senthil

--
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
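For reference, the /etc/hosts fix described above amounts to giving every node in the cluster an identical set of hostname-to-IP mappings, so that a reduce task on one node can resolve the other task trackers by name. A minimal sketch (the hostnames and addresses below are made up for illustration; substitute your own and replicate the file on every node):

```shell
# /etc/hosts -- illustrative entries only; use your cluster's real
# hostnames and IP addresses, and keep this file identical on all nodes.
127.0.0.1     localhost
192.168.1.10  hadoop-master   # hypothetical master node
192.168.1.11  hadoop-slave1   # hypothetical slave nodes
192.168.1.12  hadoop-slave2
192.168.1.13  hadoop-slave3
```

To test the firewall theory, you can also try connecting from one node to another's task tracker HTTP port (reducers fetch map output over HTTP; 50060 was the default task tracker HTTP port in Hadoop of this era), e.g. `telnet hadoop-slave1 50060`. If the connection is refused or times out while the daemon is running, a firewall rule is a likely culprit.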