Hey all, I have a 2-node cluster which is now running in Safe Mode. It has been 15-16 hrs now and it has yet to come out of Safe Mode. Does it normally take that long?
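For context, here is my rough understanding of the condition the NameNode is waiting on, sketched in Python. This is only my paraphrase of the reported-blocks threshold check, not Hadoop's actual code; the names are mine, and 0.999 is what I believe the default for dfs.safemode.threshold.pct is.

```python
# Sketch (my own paraphrase, not Hadoop code) of the Safe Mode exit
# condition: the NameNode stays in Safe Mode until the fraction of
# blocks reported by live DataNodes reaches the configured threshold.

def safe_mode_should_exit(blocks_reported, blocks_total, threshold_pct=0.999):
    """Return True once enough block reports have arrived."""
    if blocks_total == 0:
        return True  # empty namespace: nothing to wait for
    ratio = blocks_reported / float(blocks_total)
    return ratio >= threshold_pct

# My logs say "ratio of reported blocks 0.0000", i.e. the NameNode has
# counted no block reports at all, so this check can never pass:
print(safe_mode_should_exit(0, 1601927))  # False -> stays in Safe Mode
```

If that understanding is right, the NameNode will sit in Safe Mode forever until the DataNodes actually report their blocks.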
The DataNode log on the node running the NameNode shows the following, with similar output on the slave node (which runs only a DataNode):

2010-04-07 10:03:10,687 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-310922324774702076_996024
2010-04-07 10:03:10,705 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3302288729849061244_813694
2010-04-07 10:03:10,730 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-7252548330326272479_1259723
2010-04-07 10:03:10,745 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-5909954202848831867_1075933
2010-04-07 10:03:10,886 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-3213723859645738103_1075939
2010-04-07 10:03:10,910 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-2209269106581706132_676390
2010-04-07 10:03:10,923 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-6007998488187910667_676379
2010-04-07 10:03:11,086 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-1024215056075897357_676383
2010-04-07 10:03:11,127 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3780597313184168671_1270304
2010-04-07 10:03:11,160 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_8891623760013835158_676336

One thing I wanted to point out: some time back I had to run setrep on the entire cluster. Are these verification messages related to that? Also, while going through the NameNode logs I encountered the following:
2010-04-05 21:01:31,383 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
2010-04-05 21:01:49,240 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
2010-04-05 21:01:49,243 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
2010-04-05 21:02:01,791 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010

Then again at:

2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010

I had to restart the cluster, after which I got both nodes back:

2010-04-06 10:11:24,325 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.21:50010 storage DS-455083797-192.168.100.21-50010-1268220157729
2010-04-06 10:11:24,328 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.21:50010
2010-04-06 10:11:25,245 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/listing/image/5/84025/35924c87e664a43893904effbd2be601_list.jpg. blk_-1845977707636580795_1665561
2010-04-06 10:11:25,342 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.100.21:50010 is added to blk_-1845977707636580795_1665561 size 72753
2010-04-06 10:11:44,257 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 64 Total time for transactions(ms): 4 Number of syncs: 45 SyncTimes(ms): 387
2010-04-06 10:11:51,485 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.2:50010 storage DS-1237294752-192.168.100.2-50010-1252010614375
2010-04-06 10:11:51,488 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.2:50010

Then they were subsequently removed again; I have no clue why this happened. Ever since, I have been seeing the following in the logs:

2010-04-06 10:00:49,052 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310, call create(/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg, rwxr-xr-x, DFSClient_1226879860, true, 2, 67108864) from 192.168.100.5:40437: error: org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode. The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.

I also ran fsck on the entire cluster; its summary was:

 Total size:    540525108291 B
 Total dirs:    53298
 Total files:   1617706
 Total blocks (validated):      1601927 (avg. block size 337421 B)
  ********************************
  CORRUPT FILES:        1601525
  MISSING BLOCKS:       1601927
  MISSING SIZE:         540525108291 B
  CORRUPT BLOCKS:       1601927
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     0.0
 Corrupt blocks:                1601927

The filesystem under path '/data' is CORRUPT

I'm using hadoop-0.18.3 on Ubuntu. I'm completely clueless as to why it is taking this long to come out of Safe Mode. Suggestions / comments appreciated.

- Manish
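PS: To double-check that I'm reading the fsck summary right, the numbers are at least internally consistent: every single block is missing, which matches the 0.0000 reported-blocks ratio in the SafeModeException. A quick Python sanity check, using the figures copied from the fsck output above:

```python
# Figures copied from the fsck summary above.
total_size_bytes = 540525108291
total_blocks = 1601927
missing_blocks = 1601927

# fsck reports an average block size of 337421 B.
avg_block_size = total_size_bytes // total_blocks
print(avg_block_size)  # 337421

# All blocks are missing, so the fraction of blocks the NameNode has
# had reported is 0 -- exactly the 0.0000 ratio in the exception.
reported_ratio = (total_blocks - missing_blocks) / float(total_blocks)
print(reported_ratio)  # 0.0
```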