Hi,

Did you format the HDFS? A couple of concrete things to check are inline below and at the bottom of this mail.

On Tue, Feb 21, 2012 at 7:40 PM, Shi Yu <sh...@uchicago.edu> wrote:
> Hi Hadoopers,
>
> We are experiencing a strange problem on Hadoop 0.20.203.
>
> Our cluster has 58 nodes, and everything is started from a fresh HDFS (we deleted all local folders on the datanodes and reformatted the namenode). After running some small jobs, HDFS starts behaving abnormally and the jobs become very slow. The namenode log is crushed by gigabytes of entries like these:
>
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_4524177823306792294 is added to invalidSet of 10.105.19.31:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_4524177823306792294 is added to invalidSet of 10.105.19.18:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_4524177823306792294 is added to invalidSet of 10.105.19.32:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_2884522252507300332 is added to invalidSet of 10.105.19.35:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_2884522252507300332 is added to invalidSet of 10.105.19.27:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_2884522252507300332 is added to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.21:50010 is added to blk_-6843171124277753504_2279882 size 124490
> 2012-02-21 00:00:38,632 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000313_0/result_stem-m-00313. blk_-6379064588594672168_2279890
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.26:50010 is added to blk_5338983375361999760_2279887 size 1476
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.29:50010 is added to blk_-977828927900581074_2279887 size 13818
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000364_0/result_stem-m-00364 is closed by DFSClient_attempt_201202202043_0013_m_000364_0
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.23:50010 is added to blk_5338983375361999760_2279887 size 1476
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.20:50010 is added to blk_5338983375361999760_2279887 size 1476
> 2012-02-21 00:00:38,633 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000364_0/result_suffix-m-00364. blk_1921685366929756336_2279890
> 2012-02-21 00:00:38,634 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000279_0/result_suffix-m-00279 is closed by DFSClient_attempt_201202202043_0013_m_000279_0
> 2012-02-21 00:00:38,635 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_495061820035691700 is added to invalidSet of 10.105.19.20:50010
> 2012-02-21 00:00:38,635 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_495061820035691700 is added to invalidSet of 10.105.19.25:50010
> 2012-02-21 00:00:38,635 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_495061820035691700 is added to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,635 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000284_0/result_stem-m-00284. blk_8796188324642771330_2279891
> 2012-02-21 00:00:38,638 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.34:50010 is added to blk_-977828927900581074_2279887 size 13818
> 2012-02-21 00:00:38,638 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202043_0013_m_000296_0/result_stem-m-00296. blk_-6800409224007034579_2279891
> 2012-02-21 00:00:38,638 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.29:50010 is added to blk_1921685366929756336_2279890 size 1511
> 2012-02-21 00:00:38,638 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.105.19.25:50010 is added to blk_-2982099629304436976_2279752 size 569
>
> In the Map/Reduce jobs, tasks get killed with the timeout error:
>
> Task attempt_201202202043_0014_m_000019_0 failed to report status for 600 seconds. Killing!
>
> and eventually every job submitted after a certain point in time fails. All HDFS commands stall. Then we have to reformat the namenode and delete all data from HDFS again.
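That 600-second kill is the task timeout (mapred.task.timeout, which defaults to 600000 ms in 0.20.x): a task that neither writes output nor reports status for that long gets killed. If your maps can legitimately be silent that long, you can raise it per job. A minimal sketch, assuming your driver goes through ToolRunner so it accepts -D options (the jar and class names here are made up):

  # raise the task timeout to 30 minutes for this job only
  hadoop jar yourjob.jar YourDriver -Dmapred.task.timeout=1800000 <input> <output>

The cleaner long-term fix is to call reporter.progress() (old API) or context.progress() (new API) inside any long-running loop so the tasktracker knows the task is alive. But note this only treats the symptom if the tasks are in fact stuck on hung DFS writes, which the rest of your mail suggests.
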
> In the jobtracker-namenode.log file, the entries look like this:
>
> 08403,firstReduceTaskLaunchTime=1329808089070,firstJobSetupTaskLaunchTime=1329806700863,firstJobCleanupTaskLaunchTime=1329814093350,finishTime=1329814099359,numMaps=467,numSlotsPerMap=1,numReduces=89,numSlotsPerReduce=1,user=syu,queue=default,status=FAILED,mapSlotSeconds=522658,reduceSlotsSeconds=324155,clusterMapCapacity=160,clusterReduceCapacity=60
> 2012-02-21 02:48:19,361 INFO org.apache.hadoop.mapred.JobHistory: Moving file:/opt/hadoop/logs/history/job_201202202043_0015_1329806700027_syu_EM%5Fnaive%5Fbayes%5Fstep3 to file:/opt/hadoop/logs/history/done/version-1/namenode_1329792199074_/2012/02/21/000000
> 2012-02-21 02:48:19,361 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000018_0'
> 2012-02-21 02:48:19,361 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000038_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000058_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000078_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000098_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000118_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000138_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000158_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000230_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000266_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000294_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000299_0'
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobTracker: Removing task 'attempt_201202202043_0015_m_000467_0'
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 INFO org.apache.hadoop.mapred.JobHistory: Logging failed for job job_201202202043_0015removing PrintWriter from FileManager
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running list for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Failed cache for reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running cache for maps missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running cache for maps missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running cache for maps missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN org.apache.hadoop.mapred.JobInProgress: Running cache for maps missing!! Job details are missing.
> 2012-02-21 02:48:19,363 WARN org.apache.hadoop.mapred.JobInProgress: Running cache for maps missing!! Job details are missing.
>
> We believe it is some kind of HDFS corruption; we have successfully run the same program with ZERO output. Even after all the jobs have been stopped for 2 hours, HDFS is still generating entries like these in the namenode log:
>
> 2012-02-21 06:26:07,478 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_attempt_201202202043_0016_m_000080_1, pendingcreates: 1] has expired hard limit
> 2012-02-21 06:26:07,478 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_attempt_201202202043_0016_m_000080_1, pendingcreates: 1], src=/syu/output/naive/iter5_partout4/_temporary/_attempt_201202202043_0016_m_000080_1/result_stem-m-00080
> 2012-02-21 06:26:07,478 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_attempt_201202202043_0016_m_000094_1, pendingcreates: 1] has expired hard limit
> 2012-02-21 06:26:07,478 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_attempt_201202202043_0016_m_000094_1, pendingcreates: 1], src=/syu/output/naive/iter5_partout4/_temporary/_attempt_201202202043_0016_m_000094_1/result_stem-m-00094
> 2012-02-21 07:26:07,667 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_attempt_201202202043_0016_m_000027_1, pendingcreates: 1] has expired hard limit
> 2012-02-21 07:26:07,667 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_attempt_201202202043_0016_m_000027_1, pendingcreates: 1], src=/syu/output/naive/iter5_partout4/_temporary/_attempt_201202202043_0016_m_000027_1/result_suffix-m-00027
> 2012-02-21 07:26:07,667 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_attempt_201202202043_0016_m_000080_1, pendingcreates: 1] has expired hard limit
> 2012-02-21 07:26:07,667 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering lease=[Lease. Holder: DFSClient_attempt_201202202043_0016_m_000080_1, pendingcreates: 1], src=/syu/output/naive/iter5_partout4/_temporary/_attempt_201202202043_0016_m_000080_1/result_stem-m-00080
> 2012-02-21 07:26:07,667 INFO org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease [Lease. Holder: DFSClient_attempt_201202202043_0016_m_000094_1, pendingcreates: 1] has expired hard limit
>
> We don't know the exact reason for this HDFS corruption. Could it be some misconfiguration in our cluster? We have also tried a smaller number of datanodes (20), and the problem still occurs. Thanks for any suggestions.
>
> Shi
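Those LeaseManager/FSNamesystem messages mean files under _temporary were left open by killed task attempts and the namenode keeps trying to recover their leases. Some of that is expected right after tasks are killed, but it should die down, not continue for hours. Before reformatting yet again, I would ask HDFS itself what state it is in:

  # filesystem health, plus files still stuck open for write
  # (drop -openforwrite if your build doesn't have that flag)
  hadoop fsck / -openforwrite
  # per-datanode capacity, free space, and liveness
  hadoop dfsadmin -report

In particular, check whether any datanodes are nearly full or dropping out of the -report list; either can make writes hang, tasks miss the 600-second deadline, and DFS commands stall, which matches your symptoms.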
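And to make my question at the top concrete: "formatting HDFS" only gives you a clean slate if the datanode directories are cleared as well, otherwise the old blocks carry the previous namespaceID and the datanodes won't join the freshly formatted namenode. A minimal sketch of the full sequence on 0.20.x (the dfs.name.dir/dfs.data.dir paths are examples; substitute yours):

  stop-all.sh                      # stop MapReduce and HDFS everywhere
  rm -rf /opt/hadoop/dfs/name/*    # on the namenode (example path)
  rm -rf /opt/hadoop/dfs/data/*    # on EVERY datanode (example path)
  hadoop namenode -format          # answer Y at the prompt
  start-all.sh
  hadoop dfsadmin -report          # verify all 58 datanodes registered

If you already did exactly this and the problem still comes back after a few jobs, the format isn't the issue, and I would look hard at the datanode disks and the fsck/dfsadmin output above.

--
Join me at http://hadoopworkshop.eventbrite.com/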