Hi, I ran into this problem in my cluster before, so I can share some of my experience, though it may not apply in your case.
The job in my cluster always hung at 16% of the reduce phase. It occurred because the reduce tasks could not fetch the map output from other nodes. In my case, two factors caused this communication failure between task trackers. One was the firewall blocking the trackers from communicating with each other; I solved this by disabling the firewall. The other was that the trackers referred to other nodes by hostname only, not by IP address; I solved this by editing /etc/hosts to map the hostname of every node in the cluster to its IP address. I hope my experience is helpful for you.

On 3/27/08, Natarajan, Senthil <[EMAIL PROTECTED]> wrote:
>
> Hi,
> I have a small Hadoop cluster, one master and three slaves.
> When I try the example wordcount on one of our log files (size ~350 MB),
> Map runs fine but reduce always hangs (sometimes around 19%, 60% ...); after
> a very long time it finishes.
> I am seeing this error:
> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
> In the log I am seeing this:
> INFO org.apache.hadoop.mapred.TaskTracker:
> task_200803261535_0001_r_000000_0 0.18333334% reduce > copy (11 of 20 at
> 0.02 MB/s)
>
> Do you know what might be the problem?
> Thanks,
> Senthil
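For reference, the two fixes I described above looked roughly like the sketch below. The hostnames, IP addresses, and firewall commands are only examples (I'm assuming a RHEL/CentOS-style system with iptables); substitute your own cluster's addresses and your distro's firewall tooling.

```shell
# Run as root on EVERY node in the cluster.

# 1. Map each node's hostname to its IP address in /etc/hosts, so the
#    task trackers can resolve one another by name.
#    (Example hostnames/IPs -- replace with your own.)
cat >> /etc/hosts <<'EOF'
192.168.1.10  master
192.168.1.11  slave1
192.168.1.12  slave2
192.168.1.13  slave3
EOF

# 2. Disable the firewall so reduce tasks can reach the map-output
#    server port on the other task trackers (RHEL/CentOS-era commands):
service iptables stop       # stop the firewall now
chkconfig iptables off      # keep it off across reboots
```

Disabling the firewall entirely is the blunt fix; on a cluster exposed beyond a trusted network you would instead open only the Hadoop ports between the nodes.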