Too many fetch-failures - reduce task problem

Nachiket Vaidya Wed, 27 Jan 2010 04:17:16 -0800

Hi all,
My problem is the same as the one reported in
http://issues.apache.org/jira/browse/HADOOP-3362, and no solution is given
there :(


1. I am using Hadoop 0.20.1. My setup is very simple. I have two machines
(both running Ubuntu):
machine1 = namenode, jobtracker, and also datanode and tasktracker (we will
call this the master)
machine2 = datanode, tasktracker (we will call this the slave)
This is the same setup as described in
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
with one difference: I have not changed my /etc/hosts file, because I am using
IP addresses in the conf files. *Is that OK?*
2. The program runs fine in standalone mode, but in multi-node mode it stalls
in the reduce phase for a long time before eventually completing successfully.
I am running just the word count example.
/**************************/
10/01/27 12:08:21 INFO input.FileInputFormat: Total input paths to process : 17
10/01/27 12:08:21 INFO mapred.JobClient: Running job: job_201001271157_0002
10/01/27 12:08:22 INFO mapred.JobClient:  map 0% reduce 0%
10/01/27 12:08:39 INFO mapred.JobClient:  map 11% reduce 0%
10/01/27 12:08:46 INFO mapred.JobClient:  map 23% reduce 0%
10/01/27 12:08:53 INFO mapred.JobClient:  map 35% reduce 0%
10/01/27 12:08:56 INFO mapred.JobClient:  map 47% reduce 3%
10/01/27 12:09:02 INFO mapred.JobClient:  map 58% reduce 7%
10/01/27 12:09:05 INFO mapred.JobClient:  map 70% reduce 7%
10/01/27 12:09:08 INFO mapred.JobClient:  map 82% reduce 11%
10/01/27 12:09:11 INFO mapred.JobClient:  map 88% reduce 11%
10/01/27 12:09:14 INFO mapred.JobClient:  map 100% reduce 11%
10/01/27 12:09:23 INFO mapred.JobClient:  map 100% reduce 17%
10/01/27 12:16:39 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000002_0, Status : FAILED
Too many fetch-failures
10/01/27 12:16:54 INFO mapred.JobClient:  map 100% reduce 19%
10/01/27 12:26:52 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000003_0, Status : FAILED
Too many fetch-failures
10/01/27 12:27:08 INFO mapred.JobClient:  map 100% reduce 21%
10/01/27 12:37:08 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000006_0, Status : FAILED
Too many fetch-failures
10/01/27 12:37:24 INFO mapred.JobClient:  map 100% reduce 23%
10/01/27 12:47:24 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000007_0, Status : FAILED
Too many fetch-failures
10/01/27 12:47:28 INFO mapred.JobClient:  map 94% reduce 23%
10/01/27 12:47:31 INFO mapred.JobClient:  map 100% reduce 23%
10/01/27 12:47:40 INFO mapred.JobClient:  map 100% reduce 25%
10/01/27 12:57:38 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000010_0, Status : FAILED
Too many fetch-failures
10/01/27 12:57:54 INFO mapred.JobClient:  map 100% reduce 27%
10/01/27 13:07:55 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000011_0, Status : FAILED
Too many fetch-failures
10/01/27 13:08:11 INFO mapred.JobClient:  map 100% reduce 29%
10/01/27 13:18:11 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000014_0, Status : FAILED
Too many fetch-failures
10/01/27 13:18:27 INFO mapred.JobClient:  map 100% reduce 31%
10/01/27 13:28:24 INFO mapred.JobClient: Task Id : attempt_201001271157_0002_m_000015_0, Status : FAILED
Too many fetch-failures
10/01/27 13:28:40 INFO mapred.JobClient:  map 100% reduce 100%
10/01/27 13:28:42 INFO mapred.JobClient: Job complete: job_201001271157_0002
10/01/27 13:28:42 INFO mapred.JobClient: Counters: 17
10/01/27 13:28:42 INFO mapred.JobClient:   Job Counters
10/01/27 13:28:42 INFO mapred.JobClient:     Launched reduce tasks=1
10/01/27 13:28:42 INFO mapred.JobClient:     Launched map tasks=25
10/01/27 13:28:42 INFO mapred.JobClient:     Data-local map tasks=25
10/01/27 13:28:42 INFO mapred.JobClient:   FileSystemCounters
10/01/27 13:28:42 INFO mapred.JobClient:     FILE_BYTES_READ=16584
10/01/27 13:28:42 INFO mapred.JobClient:     HDFS_BYTES_READ=18805
10/01/27 13:28:42 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=33808
10/01/27 13:28:42 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=10731
10/01/27 13:28:42 INFO mapred.JobClient:   Map-Reduce Framework
10/01/27 13:28:42 INFO mapred.JobClient:     Reduce input groups=0
10/01/27 13:28:42 INFO mapred.JobClient:     Combine output records=821
10/01/27 13:28:42 INFO mapred.JobClient:     Map input records=580
10/01/27 13:28:42 INFO mapred.JobClient:     Reduce shuffle bytes=16680
10/01/27 13:28:42 INFO mapred.JobClient:     Reduce output records=0
10/01/27 13:28:42 INFO mapred.JobClient:     Spilled Records=1642
10/01/27 13:28:42 INFO mapred.JobClient:     Map output bytes=25180
10/01/27 13:28:42 INFO mapred.JobClient:     Combine input records=1818
10/01/27 13:28:42 INFO mapred.JobClient:     Map output records=1818
10/01/27 13:28:42 INFO mapred.JobClient:     Reduce input records=821
/**************************/
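
Regarding the /etc/hosts question in point 1, one way to look into it is to check what each node's own hostname resolves to. The following is only a minimal sketch, not a confirmed diagnosis: it assumes Python is available on both machines, and the helper names `is_loopback`, `resolve_host` and `check_local_hostname` are made up for illustration. If a node's hostname resolves only to a loopback address such as 127.0.0.1 or 127.0.1.1, other machines may not be able to reach services that the node advertises under that name.

```python
import ipaddress
import socket


def is_loopback(addr):
    """Return True if addr (an IP address string) is in a loopback range."""
    return ipaddress.ip_address(addr).is_loopback


def resolve_host(name):
    """Resolve name to an IPv4 address string, or None if resolution fails."""
    try:
        return socket.gethostbyname(name)
    except OSError:
        # socket.gaierror (name not found) is a subclass of OSError.
        return None


def check_local_hostname():
    """Describe how this machine's own hostname resolves."""
    name = socket.gethostname()
    addr = resolve_host(name)
    if addr is None:
        return f"{name}: does not resolve"
    if is_loopback(addr):
        return f"{name}: resolves to loopback address {addr}"
    return f"{name}: resolves to {addr}"
```

Running `print(check_local_hostname())` on the master and on the slave, and calling `resolve_host` on each machine with the other machine's hostname, would show whether the two nodes can find each other by name without /etc/hosts entries.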

I checked the logs for the namenode, jobtracker, datanodes and tasktrackers
(attached herewith). There are no exceptions in the files, just failure
messages in the jobtracker log such as:
/*-------------------------*/
2010-01-27 12:26:51,554 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201001271157_0002_m_000003_1' to tip task_201001271157_0002_m_000003, for tracker 'tracker_hadoop-desktop2:localhost/127.0.0.1:55734'
2010-01-27 12:26:51,554 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201001271157_0002_m_000003
2010-01-27 12:26:54,350 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201001271157_0002_m_000003_0' from 'tracker_hadoop-desktop1:localhost/127.0.0.1:36778'
2010-01-27 12:26:54,626 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201001271157_0002_m_000003_1' has completed task_201001271157_0002_m_000003 successfully.
2010-01-27 12:26:54,627 INFO org.apache.hadoop.mapred.ResourceEstimator: completedMapsUpdates:19  completedMapsInputSize:23876  completedMapsOutputSize:22641
2010-01-27 12:29:30,987 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task attempt_201001271157_0002_m_000006_0
2010-01-27 12:32:07,410 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task attempt_201001271157_0002_m_000006_0
2010-01-27 12:37:08,075 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task attempt_201001271157_0002_m_000006_0
2010-01-27 12:37:08,075 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: attempt_201001271157_0002_m_000006_0 ... killing it
2010-01-27 12:37:08,075 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201001271157_0002_m_000006_0: Too many fetch-failures
2010-01-27 12:37:08,076 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201001271157_0002_m_000006_1' to tip task_201001271157_0002_m_000006, for tracker 'tracker_hadoop-desktop2:localhost/127.0.0.1:55734'
2010-01-27 12:37:08,076 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201001271157_0002_m_000006
2010-01-27 12:37:10,613 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_201001271157_0002_m_000006_0' from 'tracker_hadoop-desktop1:localhost/127.0.0.1:36778'
2010-01-27 12:37:11,084 INFO org.apache.hadoop.mapred.JobInProgress: Task 'attempt_201001271157_0002_m_000006_1' has completed task_201001271157_0002_m_000006 successfully.
2010-01-27 12:37:11,084 INFO org.apache.hadoop.mapred.ResourceEstimator: completedMapsUpdates:20  completedMapsInputSize:25072  completedMapsOutputSize:23508
2010-01-27 12:39:47,424 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task attempt_201001271157_0002_m_000007_0
2010-01-27 12:42:23,822 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task attempt_201001271157_0002_m_000007_0
2010-01-27 12:47:24,576 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task attempt_201001271157_0002_m_000007_0
2010-01-27 12:47:24,578 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: attempt_201001271157_0002_m_000007_0 ... killing it
2010-01-27 12:47:24,578 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_201001271157_0002_m_000007_0: Too many fetch-failures
2010-01-27 12:47:24,578 INFO org.apache.hadoop.mapred.JobInProgress: TaskTracker at 'hadoop-desktop1' turned 'flaky'
2010-01-27 12:47:24,579 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_201001271157_0002_m_000007_1' to tip task_201001271157_0002_m_000007, for tracker 'tracker_hadoop-desktop2:localhost/127.0.0.1:55734'
2010-01-27 12:47:24,579 INFO org.apache.hadoop.mapred.JobInProgress: Choosing data-local task task_201001271157_0002_m_000007
/*--------------------------*/
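
To see at a glance which map attempts the reducer could not fetch, the "Failed fetch notification" lines in a jobtracker log like the excerpt above can be tallied. This is a minimal sketch: the function name `count_fetch_failures` is made up for illustration, and the pattern is based only on the message format visible in this excerpt, so other Hadoop versions may log it differently.

```python
import re

# Pattern taken from the log excerpt; an assumption for other versions.
FETCH_RE = re.compile(
    r"Failed fetch notification #(\d+) for task (attempt_\w+)"
)


def count_fetch_failures(log_text):
    """Map each attempt id to the highest fetch-failure number seen for it."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    highest = {}
    for number, attempt in FETCH_RE.findall(flat):
        highest[attempt] = max(highest.get(attempt, 0), int(number))
    return highest
```

Feeding the whole jobtracker log to `count_fetch_failures` and comparing the attempt ids with the "Removed completed task ... from 'tracker_...'" lines would show which tasktracker served the outputs that could not be fetched.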

More info:
1. The tasks that failed with "Too many fetch-failures" all ran on the *master
machine only*. The slave was able to finish those tasks when they were rerun.
2. The web UI cannot be reached, from either the master or the slave, when the
master's IP address is used. It can be reached only on the master machine
itself, using localhost instead of the master's IP address.
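
The web UI observation can be turned into a repeatable check by trying a TCP connection to the relevant ports from each machine. This is a rough sketch under assumptions: `can_connect` is a made-up helper, the address 192.168.0.1 in the usage note is a placeholder and not the actual master IP, and ports 50030 (jobtracker web UI) and 50060 (tasktracker HTTP server, which reducers also use to fetch map output) are the stock defaults, assumed here not to have been changed in the conf files.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False
```

For example, comparing `can_connect("192.168.0.1", 50060)` run on the slave with `can_connect("localhost", 50060)` run on the master would show whether the master's tasktracker port is reachable only through the loopback interface.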
- Nachiket


On Fri, Jan 22, 2010 at 7:13 PM, Sayali <sayali.kulka...@gmail.com> wrote:

> Hey Nachiket!
> So nice to hear from you! I recently rejoined PSL and am currently working
> hard on adjusting to the new environment :) I guess you can understand
> what I mean -- 2 years at IIT, it's tough to get back :)
>
> Well... your news server needs to be tested! It should not give out such
> false info! :P (but anyway, reporters have a habit of spicing up whatever
> they report... so read into it what you will :) )
>
> Jokes apart... I have worked a little bit on Hadoop, so let me know what
> help you need. I will try to help as much as my little memory allows.
>
> :)
> --s
>
>
>
> On Fri, Jan 22, 2010 at 9:39 PM, Nachiket Vaidya <vaidy...@gmail.com> wrote:
>
>> Hey Sayali,
>> How are you? Where are you now?
>>
>> I am using Hadoop. From the news server I got the info that you are the
>> boss in Hadoop. I want some help with it.
>> Can you help me?
>>
>>  - Nachiket
>>
>
>
