On Wed, Apr 7, 2010 at 7:27 PM, Edson Ramiro <erlfi...@gmail.com> wrote:
> To solve the safemode problem, you may first start the DFS, leave
> safemode and do an fsck:
>
> ./bin/start-dfs.sh
> ./bin/hadoop dfsadmin -safemode leave
> ./bin/hadoop fsck /
>
> After this, restart the DFS.
>
> You can configure HADOOP_OPTS in conf/hadoop-env.sh to give more
> memory to Java. Also configure HADOOP_HEAPSIZE.

Yes, that's exactly what I did; the DataNodes are back. I've taken them
out of safe mode. I'm planning to upgrade the Hadoop instance to the
latest stable release. How wise would this be?

> # export HADOOP_OPTS="-server -XX:+HeapDumpOnOutOfMemoryError
> # -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParallelGC
> # -XX:ParallelGCThreads=4 -XX:NewSize=1G -XX:MaxNewSize=1G"
>
> Edson Ramiro
>
> On 7 April 2010 06:04, Manish N <m1n...@gmail.com> wrote:
>
> > On Wed, Apr 7, 2010 at 10:59 AM, Sagar Shukla
> > <sagar_shu...@persistent.co.in> wrote:
> >
> > > Hi Manish,
> > > Do you see any errors in the DataNode log files? It is quite
> > > likely that after the namenode starts, the processes on the
> > > datanodes are failing to start, causing the namenode to wait in
> > > safe mode for the datanode services to start.
> >
> > I do see the following in the DataNode .out file whenever I start a
> > DataNode, on both of my DataNodes; after some time they are marked
> > as dead, as expected.
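For readers following along, the hadoop-env.sh tuning described above might look roughly like the sketch below; the heap size and GC flags are illustrative values, not settings taken from this thread:

```shell
# conf/hadoop-env.sh -- illustrative sketch of the tuning described above.
# HADOOP_HEAPSIZE and the GC flags below are example values only.

# Maximum heap for each Hadoop daemon, in MB (the 0.18-era default is 1000).
export HADOOP_HEAPSIZE=2000

# Extra JVM options passed to the daemons.
export HADOOP_OPTS="-server -XX:+HeapDumpOnOutOfMemoryError \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

Both variables are read by bin/hadoop when it launches each daemon, so a restart of the affected daemons is needed for them to take effect.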
> > Exception in thread "DataNode: [/root/Datadir/hadoop/dfs/data]"
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2786)
> >         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
> >         at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
> >         at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:274)
> >         at org.apache.hadoop.io.UTF8.writeString(UTF8.java:246)
> >         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:120)
> >         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:126)
> >         at org.apache.hadoop.ipc.RPC$Invocation.write(RPC.java:109)
> >         at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:474)
> >         at org.apache.hadoop.ipc.Client.call(Client.java:706)
> >         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> >         at org.apache.hadoop.dfs.$Proxy4.blockReport(Unknown Source)
> >         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:744)
> >         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2967)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > > Thanks,
> > > Sagar
> > >
> > > -----Original Message-----
> > > From: Manish N [mailto:m1n...@gmail.com]
> > > Sent: Wednesday, April 07, 2010 10:47 AM
> > > To: common-user@hadoop.apache.org
> > > Subject: Cluster in Safe Mode
> > >
> > > Hey all,
> > >
> > > I have a 2-node cluster which is now running in safe mode. It's
> > > been 15-16 hrs now and it has yet to come out of safe mode. Does
> > > it normally take that long?
> > >
> > > The DataNode logs on the node running the NameNode indicate the
> > > following, and similar output appears on the slave node (running
> > > only a DataNode) as well.
> > > 2010-04-07 10:03:10,687 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-310922324774702076_996024
> > > 2010-04-07 10:03:10,705 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3302288729849061244_813694
> > > 2010-04-07 10:03:10,730 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-7252548330326272479_1259723
> > > 2010-04-07 10:03:10,745 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-5909954202848831867_1075933
> > > 2010-04-07 10:03:10,886 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-3213723859645738103_1075939
> > > 2010-04-07 10:03:10,910 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-2209269106581706132_676390
> > > 2010-04-07 10:03:10,923 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-6007998488187910667_676379
> > > 2010-04-07 10:03:11,086 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-1024215056075897357_676383
> > > 2010-04-07 10:03:11,127 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3780597313184168671_1270304
> > > 2010-04-07 10:03:11,160 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_8891623760013835158_676336
> > >
> > > One thing I wanted to point out: some time back I had to do a
> > > setrep on the entire cluster. Are these verification messages
> > > related to that?
> > >
> > > Also, while going through the NameNode logs I encountered the
> > > following.
> > > 2010-04-05 21:01:31,383 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
> > > 2010-04-05 21:01:49,240 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
> > > 2010-04-05 21:01:49,243 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
> > > 2010-04-05 21:02:01,791 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010
> > >
> > > and then again at:
> > >
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010
> > >
> > > I had to restart the cluster, after which I got both nodes back.
> > >
> > > 2010-04-06 10:11:24,325 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.21:50010 storage DS-455083797-192.168.100.21-50010-1268220157729
> > > 2010-04-06 10:11:24,328 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.21:50010
> > > 2010-04-06 10:11:25,245 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/listing/image/5/84025/35924c87e664a43893904effbd2be601_list.jpg. blk_-1845977707636580795_1665561
> > > 2010-04-06 10:11:25,342 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.100.21:50010 is added to blk_-1845977707636580795_1665561 size 72753
> > > 2010-04-06 10:11:44,257 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 64 Total time for transactions(ms): 4 Number of syncs: 45 SyncTimes(ms): 387
> > > 2010-04-06 10:11:51,485 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.2:50010 storage DS-1237294752-192.168.100.2-50010-1252010614375
> > > 2010-04-06 10:11:51,488 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.2:50010
> > >
> > > Then subsequently they were removed again. No clue why this
> > > happened.
> > >
> > > Ever since, I've been seeing the following in the logs:
> > >
> > > 2010-04-06 10:00:49,052 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310, call create(/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg, rwxr-xr-x, DFSClient_1226879860, true, 2, 67108864) from 192.168.100.5:40437: error: org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode.
> > > The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
> > > org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode.
> > > The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
> > >
> > > I also ran fsck on the entire cluster; it gave the following
> > > output in the summary.
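The "ratio of reported blocks 0.0000 has not reached the threshold 0.9990" message describes a simple condition: the namenode leaves safe mode only once the fraction of blocks reported by live datanodes reaches the configured threshold (dfs.safemode.threshold.pct, 0.999 by default). A minimal sketch of that arithmetic, using the block total from the fsck summary in this thread and integer math to avoid floats in sh:

```shell
# Sketch of the namenode's safe-mode exit condition:
# stay in safe mode while reported / total < dfs.safemode.threshold.pct.
total=1601927    # total blocks (from the fsck summary in this thread)
reported=0       # blocks reported by live datanodes so far
threshold=9990   # 0.9990, expressed in 1/10000ths to avoid floats
ratio=$(( reported * 10000 / total ))
if [ "$ratio" -ge "$threshold" ]; then
  echo "safe mode: OFF"
else
  echo "safe mode: ON"
fi
```

With reported=0, as in the logs above, the check prints "safe mode: ON", which matches the 0.0000 ratio the namenode keeps reporting: no datanode has delivered a block report, so the cluster can never cross the threshold on its own.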
> > > Total size:    540525108291 B
> > > Total dirs:    53298
> > > Total files:   1617706
> > > Total blocks (validated):      1601927 (avg. block size 337421 B)
> > >   ********************************
> > >   CORRUPT FILES:        1601525
> > >   MISSING BLOCKS:       1601927
> > >   MISSING SIZE:         540525108291 B
> > >   CORRUPT BLOCKS:       1601927
> > >   ********************************
> > > Minimally replicated blocks:   0 (0.0 %)
> > > Over-replicated blocks:        0 (0.0 %)
> > > Under-replicated blocks:       0 (0.0 %)
> > > Mis-replicated blocks:         0 (0.0 %)
> > > Default replication factor:    2
> > > Average block replication:     0.0
> > > Corrupt blocks:                1601927
> > >
> > > The filesystem under path '/data' is CORRUPT
> > >
> > > I'm using hadoop-0.18.3 on Ubuntu.
> > >
> > > I'm completely clueless as to why it's taking so long to come out
> > > of safe mode. Suggestions / comments appreciated.
> > >
> > > - Manish
> > >
> > > DISCLAIMER
> > > ==========
> > > This e-mail may contain privileged and confidential information
> > > which is the property of Persistent Systems Ltd. It is intended
> > > only for the use of the individual or entity to which it is
> > > addressed. If you are not the intended recipient, you are not
> > > authorized to read, retain, copy, print, distribute or use this
> > > message. If you have received this communication in error, please
> > > notify the sender and delete all copies of this message.
> > > Persistent Systems Ltd. does not accept any liability for virus
> > > infected mails.
> >
> > - Manish

- Manish.
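An fsck summary in which every block is missing, combined with the 0.0000 reported-block ratio, usually means the namenode simply has not received any block reports yet, rather than that the data on disk is gone. A few commands one might run while investigating; these reflect 0.18-era CLI usage and the /data path from this thread:

```shell
# Check whether the namenode is still in safe mode.
./bin/hadoop dfsadmin -safemode get

# Re-run fsck with more detail once the datanodes have reported in;
# -files, -blocks and -locations show which replicas fsck can see.
./bin/hadoop fsck /data -files -blocks -locations

# Only as a last resort, and only after the datanodes are healthy
# and reporting, force the namenode out of safe mode:
# ./bin/hadoop dfsadmin -safemode leave
```

Forcing safe mode off while fsck still reports every block missing would make the "corrupt" files permanently unreadable to clients, so it is worth fixing the datanode OutOfMemoryError first.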