Hi,
Did you format the HDFS?
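
If not, a clean re-format plus a quick health check looks roughly like
this (just a sketch; the start/stop scripts and the dfs.name.dir /
dfs.data.dir locations depend on your installation):

  # stop MapReduce and HDFS daemons first
  stop-mapred.sh
  stop-dfs.sh

  # clear the old dfs.name.dir / dfs.data.dir contents on every node, then:
  hadoop namenode -format

  # bring HDFS back up and confirm all datanodes have registered
  start-dfs.sh
  hadoop dfsadmin -report

  # once data has been written, check block health
  hadoop fsck / -blocks -locations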

On Tue, Feb 21, 2012 at 7:40 PM, Shi Yu <sh...@uchicago.edu> wrote:

> Hi Hadoopers,
>
> We are experiencing a strange problem on Hadoop 0.20.203
>
> Our cluster has 58 nodes, and everything was started from a
> fresh HDFS (we deleted all local folders on the datanodes and
> reformatted the namenode).  After running some small jobs,
> HDFS starts behaving abnormally and the jobs become very
> slow.  The namenode log is flooded with gigabytes of entries
> like these:
>
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.31:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.18:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_4524177823306792294 is added
> to invalidSet of 10.105.19.32:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.35:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.27:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_2884522252507300332 is added
> to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.21:50010 is added to blk_-
> 6843171124277753504_2279882 size 124490
> 2012-02-21 00:00:38,632 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000313_0/result_stem-m-00313. blk_-
> 6379064588594672168_2279890
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.26:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.29:50010 is added to blk_-977828927900581074_2279887
> size 13818
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000364_0/result_stem-m-00364 is closed by
> DFSClient_attempt_201202202043_0013_m_000364_0
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.23:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.20:50010 is added to blk_5338983375361999760_2279887
> size 1476
> 2012-02-21 00:00:38,633 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000364_0/result_suffix-m-00364.
> blk_1921685366929756336_2279890
> 2012-02-21 00:00:38,634 INFO
> org.apache.hadoop.hdfs.StateChange: DIR*
> NameSystem.completeFile: file
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000279_0/result_suffix-m-00279 is closed by
> DFSClient_attempt_201202202043_0013_m_000279_0
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.20:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.25:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addToInvalidates: blk_495061820035691700 is added
> to invalidSet of 10.105.19.33:50010
> 2012-02-21 00:00:38,635 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000284_0/result_stem-m-00284.
> blk_8796188324642771330_2279891
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.34:50010 is added to blk_-977828927900581074_2279887
> size 13818
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
> /syu/output/naive/iter5_partout1/_temporary/_attempt_201202202
> 043_0013_m_000296_0/result_stem-m-00296. blk_-
> 6800409224007034579_2279891
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.29:50010 is added to blk_1921685366929756336_2279890
> size 1511
> 2012-02-21 00:00:38,638 INFO
> org.apache.hadoop.hdfs.StateChange: BLOCK*
> NameSystem.addStoredBlock: blockMap updated:
> 10.105.19.25:50010 is added to blk_-
> 2982099629304436976_2279752 size 569
>
> In the Map/Reduce jobs, tasks get killed because of the
> timeout error
> "
>
> Task attempt_201202202043_0014_m_000019_0 failed to report
> status for 600 seconds. Killing!
>
> "
> and eventually every job launched after a certain point in
> time fails.  All HDFS commands stall.  We then have to delete
> the data from HDFS and reformat it all over again.
>
> In the jobtracker-namenode.log file, the entries look like this:
>
> 08403,firstReduceTaskLaunchTime=1329808089070,firstJobSetupTas
> kLaunchTime=1329806700863,firstJobCleanupTaskLaunchTime=132981
> 4093350,finishTime=1329814099359,numMaps=467,numSlotsPerMap=1,
> numReduces=89,numSlotsPerReduce=1,user=syu,queue=default,statu
> s=FAILED,mapSlotSeconds=522658,reduceSlotsSeconds=324155,clust
> erMapCapacity=160,clusterReduceCapacity=60
> 2012-02-21 02:48:19,361 INFO
> org.apache.hadoop.mapred.JobHistory: Moving
> file:/opt/hadoop/logs/history/job_201202202043_0015_1329806700
> 027_syu_EM%5Fnaive%5Fbayes%5Fstep3 to
> file:/opt/hadoop/logs/history/done/version-
> 1/namenode_1329792199074_/2012/02/21/000000
> 2012-02-21 02:48:19,361 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000018_0'
> 2012-02-21 02:48:19,361 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000038_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000058_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000078_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000098_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000118_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000138_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000158_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000230_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000266_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000294_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000299_0'
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobTracker: Removing task
> 'attempt_201202202043_0015_m_000467_0'
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running list for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Failed cache for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 INFO
> org.apache.hadoop.mapred.JobHistory: Logging failed for job
> job_201202202043_0015removing PrintWriter from FileManager
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running list for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Failed cache for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running list for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Failed cache for
> reducers missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running cache for maps
> missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running cache for maps
> missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running cache for maps
> missing!! Job details are missing.
> 2012-02-21 02:48:19,362 WARN
> org.apache.hadoop.mapred.JobInProgress: Running cache for maps
> missing!! Job details are missing.
> 2012-02-21 02:48:19,363 WARN
> org.apache.hadoop.mapred.JobInProgress: Running cache for maps
> missing!! Job details are missing.
>
>
> We believe this is some kind of HDFS corruption; the same
> program runs successfully when it produces ZERO output. Even
> after all the jobs have been stopped for 2 hours, the
> namenode is still writing entries to its log file like these:
>
> 2012-02-21 06:26:07,478 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease
> [Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000080_1,
> pendingcreates: 1] has expired hard limit
> 2012-02-21 06:26:07,478 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Recovering lease=[Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000080_1,
> pendingcreates: 1],
> src=/syu/output/naive/iter5_partout4/_temporary/_attempt_20120
> 2202043_0016_m_000080_1/result_stem-m-00080
> 2012-02-21 06:26:07,478 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease
> [Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000094_1,
> pendingcreates: 1] has expired hard limit
> 2012-02-21 06:26:07,478 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Recovering lease=[Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000094_1,
> pendingcreates: 1],
> src=/syu/output/naive/iter5_partout4/_temporary/_attempt_20120
> 2202043_0016_m_000094_1/result_stem-m-00094
> 2012-02-21 07:26:07,667 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease
> [Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000027_1,
> pendingcreates: 1] has expired hard limit
> 2012-02-21 07:26:07,667 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Recovering lease=[Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000027_1,
> pendingcreates: 1],
> src=/syu/output/naive/iter5_partout4/_temporary/_attempt_20120
> 2202043_0016_m_000027_1/result_suffix-m-00027
> 2012-02-21 07:26:07,667 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease
> [Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000080_1,
> pendingcreates: 1] has expired hard limit
> 2012-02-21 07:26:07,667 INFO
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
> Recovering lease=[Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000080_1,
> pendingcreates: 1],
> src=/syu/output/naive/iter5_partout4/_temporary/_attempt_20120
> 2202043_0016_m_000080_1/result_stem-m-00080
> 2012-02-21 07:26:07,667 INFO
> org.apache.hadoop.hdfs.server.namenode.LeaseManager: Lease
> [Lease.  Holder:
> DFSClient_attempt_201202202043_0016_m_000094_1,
> pendingcreates: 1] has expired hard limit
>
>
> We don't know the exact reason for this HDFS corruption.
> Could it be some misconfiguration in our cluster?  We have
> tried using a smaller number of datanodes (20), but the
> problem still occurs.  Thanks for any suggestions.
>
>
>
> Shi


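Also, the "failed to report status for 600 seconds. Killing!" kills
come from the mapred.task.timeout setting (600000 ms by default). If
your tasks legitimately stay silent that long, you can raise it per
job; for example, assuming the job's main class goes through ToolRunner
so that -D options are picked up (your-job.jar and YourMainClass are
only placeholders here):

  hadoop jar your-job.jar YourMainClass -Dmapred.task.timeout=1800000 <other args>

That only masks the symptom, though: if HDFS itself is stalling, the
tasks will keep missing the timeout no matter how high it is set.
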
-- 
Join me at http://hadoopworkshop.eventbrite.com/
