Hey all, I have a 2-node cluster which is now running in Safe Mode. It has been 15-16 hrs now and it has yet to come out of Safe Mode. Does it normally take that long?
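For context, here is my rough understanding of the condition the NameNode is waiting on, sketched in Python. This is only my paraphrase of the reported-blocks threshold check, not Hadoop's actual code; the names are mine, and 0.999 is what I believe the default for dfs.safemode.threshold.pct is.

```python
# Sketch (my own paraphrase, not Hadoop code) of the Safe Mode exit
# condition: the NameNode stays in Safe Mode until the fraction of
# blocks reported by live DataNodes reaches the configured threshold.

def safe_mode_should_exit(blocks_reported, blocks_total, threshold_pct=0.999):
    """Return True once enough block reports have arrived."""
    if blocks_total == 0:
        return True  # empty namespace: nothing to wait for
    ratio = blocks_reported / float(blocks_total)
    return ratio >= threshold_pct

# My logs say "ratio of reported blocks 0.0000", i.e. the NameNode has
# counted no block reports at all, so this check can never pass:
print(safe_mode_should_exit(0, 1601927))  # False -> stays in Safe Mode
```

If that understanding is right, the NameNode will sit in Safe Mode forever until the DataNodes actually report their blocks.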
The DataNode log on the node running the NameNode shows the following, with similar output on the slave node (which runs only a DataNode):

2010-04-07 10:03:10,687 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-310922324774702076_996024
2010-04-07 10:03:10,705 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3302288729849061244_813694
2010-04-07 10:03:10,730 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-7252548330326272479_1259723
2010-04-07 10:03:10,745 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-5909954202848831867_1075933
2010-04-07 10:03:10,886 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-3213723859645738103_1075939
2010-04-07 10:03:10,910 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-2209269106581706132_676390
2010-04-07 10:03:10,923 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-6007998488187910667_676379
2010-04-07 10:03:11,086 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-1024215056075897357_676383
2010-04-07 10:03:11,127 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3780597313184168671_1270304
2010-04-07 10:03:11,160 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_8891623760013835158_676336

One thing I wanted to point out: some time back I had to run setrep on the entire cluster. Are these verification messages related to that? Also, while going through the NameNode logs I encountered the following:
2010-04-05 21:01:31,383 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
2010-04-05 21:01:49,240 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
2010-04-05 21:01:49,243 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
2010-04-05 21:02:01,791 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010

Then again at:

2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010

I had to restart the cluster, after which I got both nodes back:

2010-04-06 10:11:24,325 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.21:50010 storage DS-455083797-192.168.100.21-50010-1268220157729
2010-04-06 10:11:24,328 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.21:50010
2010-04-06 10:11:25,245 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/listing/image/5/84025/35924c87e664a43893904effbd2be601_list.jpg. blk_-1845977707636580795_1665561
2010-04-06 10:11:25,342 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.100.21:50010 is added to blk_-1845977707636580795_1665561 size 72753
2010-04-06 10:11:44,257 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 64 Total time for transactions(ms): 4 Number of syncs: 45 SyncTimes(ms): 387
2010-04-06 10:11:51,485 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.2:50010 storage DS-1237294752-192.168.100.2-50010-1252010614375
2010-04-06 10:11:51,488 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.2:50010

Then they were subsequently removed again; I have no clue why this happened. Ever since, I have been seeing the following in the logs:

2010-04-06 10:00:49,052 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310, call create(/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg, rwxr-xr-x, DFSClient_1226879860, true, 2, 67108864) from 192.168.100.5:40437: error: org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.

I also ran fsck on the entire cluster; its summary was:

 Total size:    540525108291 B
 Total dirs:    53298
 Total files:   1617706
 Total blocks (validated):      1601927 (avg. block size 337421 B)
  ********************************
  CORRUPT FILES:        1601525
  MISSING BLOCKS:       1601927
  MISSING SIZE:         540525108291 B
  CORRUPT BLOCKS:       1601927
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     0.0
 Corrupt blocks:                1601927

The filesystem under path '/data' is CORRUPT

I'm using hadoop-0.18.3 on Ubuntu. I'm completely clueless as to why it is taking this long to come out of Safe Mode. Suggestions / comments appreciated.

- Manish
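PS: To double-check that I'm reading the fsck summary right, the numbers are at least internally consistent: every single block is missing, which matches the 0.0000 reported-blocks ratio in the SafeModeException. A quick Python sanity check, using the figures copied from the fsck output above:

```python
# Figures copied from the fsck summary above.
total_size_bytes = 540525108291
total_blocks = 1601927
missing_blocks = 1601927

# fsck reports an average block size of 337421 B.
avg_block_size = total_size_bytes // total_blocks
print(avg_block_size)  # 337421

# All blocks are missing, so the fraction of blocks the NameNode has
# had reported is 0 -- exactly the 0.0000 ratio in the exception.
reported_ratio = (total_blocks - missing_blocks) / float(total_blocks)
print(reported_ratio)  # 0.0
```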