On Wed, Apr 7, 2010 at 7:27 PM, Edson Ramiro <erlfi...@gmail.com> wrote:
> To solve the safemode problem, you may first start the DFS, leave
> safemode and do an fsck:
>
> ./bin/start-dfs.sh
> ./bin/hadoop dfsadmin -safemode leave
> ./bin/hadoop fsck /
>
> After this, restart the DFS.
>
> You can configure HADOOP_OPTS in conf/hadoop-env.sh to give more
> memory to Java. Also configure HADOOP_HEAPSIZE.

Yes, that's exactly what I did; the DataNodes are back. I've taken them
out of safe mode. I'm planning to upgrade the Hadoop instance to the
latest stable release. How wise would this be?

> # export HADOOP_OPTS="-server -XX:+HeapDumpOnOutOfMemoryError
> # -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseParallelGC
> # -XX:ParallelGCThreads=4 -XX:NewSize=1G -XX:MaxNewSize=1G"
>
> Edson Ramiro
>
> On 7 April 2010 06:04, Manish N <m1n...@gmail.com> wrote:
>
> > On Wed, Apr 7, 2010 at 10:59 AM, Sagar Shukla
> > <sagar_shu...@persistent.co.in> wrote:
> >
> > > Hi Manish,
> > > Do you see any errors in the DataNode log files? It is quite
> > > likely that after the namenode starts, the processes on the
> > > datanodes are failing to start, causing the namenode to wait in
> > > safe mode for the datanode services to start.
> >
> > I do see the following in the DataNode .out file whenever I start a
> > DataNode, on both of my DataNodes; after some time they are marked
> > as dead, as expected.
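For readers following along, the hadoop-env.sh tuning described above might look roughly like the sketch below; the heap size and GC flags are illustrative values, not settings taken from this thread:

```shell
# conf/hadoop-env.sh -- illustrative sketch of the tuning described above.
# HADOOP_HEAPSIZE and the GC flags below are example values only.

# Maximum heap for each Hadoop daemon, in MB (the 0.18-era default is 1000).
export HADOOP_HEAPSIZE=2000

# Extra JVM options passed to the daemons.
export HADOOP_OPTS="-server -XX:+HeapDumpOnOutOfMemoryError \
  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
```

Both variables are read by bin/hadoop when it launches each daemon, so a restart of the affected daemons is needed for them to take effect.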
> > Exception in thread "DataNode: [/root/Datadir/hadoop/dfs/data]"
> > java.lang.OutOfMemoryError: Java heap space
> >         at java.util.Arrays.copyOf(Arrays.java:2786)
> >         at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:71)
> >         at java.io.DataOutputStream.writeByte(DataOutputStream.java:136)
> >         at org.apache.hadoop.io.UTF8.writeChars(UTF8.java:274)
> >         at org.apache.hadoop.io.UTF8.writeString(UTF8.java:246)
> >         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:120)
> >         at org.apache.hadoop.io.ObjectWritable.writeObject(ObjectWritable.java:126)
> >         at org.apache.hadoop.ipc.RPC$Invocation.write(RPC.java:109)
> >         at org.apache.hadoop.ipc.Client$Connection.sendParam(Client.java:474)
> >         at org.apache.hadoop.ipc.Client.call(Client.java:706)
> >         at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> >         at org.apache.hadoop.dfs.$Proxy4.blockReport(Unknown Source)
> >         at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:744)
> >         at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2967)
> >         at java.lang.Thread.run(Thread.java:619)
> >
> > > Thanks,
> > > Sagar
> > >
> > > -----Original Message-----
> > > From: Manish N [mailto:m1n...@gmail.com]
> > > Sent: Wednesday, April 07, 2010 10:47 AM
> > > To: common-user@hadoop.apache.org
> > > Subject: Cluster in Safe Mode
> > >
> > > Hey all,
> > >
> > > I have a 2-node cluster which is now running in safe mode. It's
> > > been 15-16 hrs now and it has yet to come out of safe mode. Does
> > > it normally take that long?
> > >
> > > The DataNode logs on the node running the NameNode indicate the
> > > following, and similar output appears on the slave node (running
> > > only a DataNode) as well.
> > > 2010-04-07 10:03:10,687 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-310922324774702076_996024
> > > 2010-04-07 10:03:10,705 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3302288729849061244_813694
> > > 2010-04-07 10:03:10,730 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-7252548330326272479_1259723
> > > 2010-04-07 10:03:10,745 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-5909954202848831867_1075933
> > > 2010-04-07 10:03:10,886 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-3213723859645738103_1075939
> > > 2010-04-07 10:03:10,910 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-2209269106581706132_676390
> > > 2010-04-07 10:03:10,923 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-6007998488187910667_676379
> > > 2010-04-07 10:03:11,086 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_-1024215056075897357_676383
> > > 2010-04-07 10:03:11,127 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_3780597313184168671_1270304
> > > 2010-04-07 10:03:11,160 INFO org.apache.hadoop.dfs.DataBlockScanner: Verification succeeded for blk_8891623760013835158_676336
> > >
> > > One thing I wanted to point out: some time back I had to do a
> > > setrep on the entire cluster. Are these verification messages
> > > related to that?
> > >
> > > Also, while going through the NameNode logs I encountered the
> > > following.
> > > 2010-04-05 21:01:31,383 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
> > > 2010-04-05 21:01:49,240 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
> > > 2010-04-05 21:01:49,243 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
> > > 2010-04-05 21:02:01,791 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010
> > >
> > > and then again at:
> > >
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.21:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.21:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.heartbeatCheck: lost heartbeat from 192.168.100.2:50010
> > > 2010-04-06 06:41:56,290 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/192.168.100.2:50010
> > >
> > > I had to restart the cluster, after which I got both nodes back.
> > >
> > > 2010-04-06 10:11:24,325 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.21:50010 storage DS-455083797-192.168.100.21-50010-1268220157729
> > > 2010-04-06 10:11:24,328 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.21:50010
> > > 2010-04-06 10:11:25,245 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.allocateBlock: /data/listing/image/5/84025/35924c87e664a43893904effbd2be601_list.jpg. blk_-1845977707636580795_1665561
> > > 2010-04-06 10:11:25,342 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.100.21:50010 is added to blk_-1845977707636580795_1665561 size 72753
> > > 2010-04-06 10:11:44,257 INFO org.apache.hadoop.fs.FSNamesystem: Number of transactions: 64 Total time for transactions(ms): 4 Number of syncs: 45 SyncTimes(ms): 387
> > > 2010-04-06 10:11:51,485 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 192.168.100.2:50010 storage DS-1237294752-192.168.100.2-50010-1252010614375
> > > 2010-04-06 10:11:51,488 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/192.168.100.2:50010
> > >
> > > Then subsequently they were removed again. No clue why this
> > > happened.
> > >
> > > Ever since, I've been seeing the following in the logs:
> > >
> > > 2010-04-06 10:00:49,052 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 54310, call create(/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg, rwxr-xr-x, DFSClient_1226879860, true, 2, 67108864) from 192.168.100.5:40437: error: org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode.
> > > The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
> > > org.apache.hadoop.dfs.SafeModeException: Cannot create file/data/listing/image/4/43734/5af88437f6c6a88d62c5f900b06ab8dd_high.jpg. Name node is in safe mode.
> > > The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
> > >
> > > I also ran fsck on the entire cluster; it gave the following
> > > output in the summary.
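The "ratio of reported blocks 0.0000 has not reached the threshold 0.9990" message describes a simple condition: the namenode leaves safe mode only once the fraction of blocks reported by live datanodes reaches the configured threshold (dfs.safemode.threshold.pct, 0.999 by default). A minimal sketch of that arithmetic, using the block total from the fsck summary in this thread and integer math to avoid floats in sh:

```shell
# Sketch of the namenode's safe-mode exit condition:
# stay in safe mode while reported / total < dfs.safemode.threshold.pct.
total=1601927    # total blocks (from the fsck summary in this thread)
reported=0       # blocks reported by live datanodes so far
threshold=9990   # 0.9990, expressed in 1/10000ths to avoid floats
ratio=$(( reported * 10000 / total ))
if [ "$ratio" -ge "$threshold" ]; then
  echo "safe mode: OFF"
else
  echo "safe mode: ON"
fi
```

With reported=0, as in the logs above, the check prints "safe mode: ON", which matches the 0.0000 ratio the namenode keeps reporting: no datanode has delivered a block report, so the cluster can never cross the threshold on its own.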
> > > Total size:    540525108291 B
> > > Total dirs:    53298
> > > Total files:   1617706
> > > Total blocks (validated):      1601927 (avg. block size 337421 B)
> > >   ********************************
> > >   CORRUPT FILES:        1601525
> > >   MISSING BLOCKS:       1601927
> > >   MISSING SIZE:         540525108291 B
> > >   CORRUPT BLOCKS:       1601927
> > >   ********************************
> > > Minimally replicated blocks:   0 (0.0 %)
> > > Over-replicated blocks:        0 (0.0 %)
> > > Under-replicated blocks:       0 (0.0 %)
> > > Mis-replicated blocks:         0 (0.0 %)
> > > Default replication factor:    2
> > > Average block replication:     0.0
> > > Corrupt blocks:                1601927
> > >
> > > The filesystem under path '/data' is CORRUPT
> > >
> > > I'm using hadoop-0.18.3 on Ubuntu.
> > >
> > > I'm completely clueless as to why it's taking so long to come out
> > > of safe mode. Suggestions / comments appreciated.
> > >
> > > - Manish
> > >
> > > DISCLAIMER
> > > ==========
> > > This e-mail may contain privileged and confidential information
> > > which is the property of Persistent Systems Ltd. It is intended
> > > only for the use of the individual or entity to which it is
> > > addressed. If you are not the intended recipient, you are not
> > > authorized to read, retain, copy, print, distribute or use this
> > > message. If you have received this communication in error, please
> > > notify the sender and delete all copies of this message.
> > > Persistent Systems Ltd. does not accept any liability for virus
> > > infected mails.
> >
> > - Manish

- Manish.
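An fsck summary in which every block is missing, combined with the 0.0000 reported-block ratio, usually means the namenode simply has not received any block reports yet, rather than that the data on disk is gone. A few commands one might run while investigating; these reflect 0.18-era CLI usage and the /data path from this thread:

```shell
# Check whether the namenode is still in safe mode.
./bin/hadoop dfsadmin -safemode get

# Re-run fsck with more detail once the datanodes have reported in;
# -files, -blocks and -locations show which replicas fsck can see.
./bin/hadoop fsck /data -files -blocks -locations

# Only as a last resort, and only after the datanodes are healthy
# and reporting, force the namenode out of safe mode:
# ./bin/hadoop dfsadmin -safemode leave
```

Forcing safe mode off while fsck still reports every block missing would make the "corrupt" files permanently unreadable to clients, so it is worth fixing the datanode OutOfMemoryError first.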