James,

You may be running into this old issue involving system entropy:
http://search-hadoop.com/m/7Giae6vLWR1
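A minimal sketch of that entropy check, assuming a standard Linux host; the `HADOOP_OPTS` placement and the "a few hundred bits" rule of thumb are assumptions, not values from the linked thread:

```shell
# Read the kernel's available entropy. If this number is very low
# (a few hundred bits or less), reads from /dev/random - which Java's
# SecureRandom can block on - may stall the DN for a long time.
cat /proc/sys/kernel/random/entropy_avail

# A common workaround is to point the JVM at the non-blocking device
# instead (hypothetical placement - e.g. in hadoop-env.sh):
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"
```

Installing an entropy daemon such as rngd is the more permanent fix on entropy-starved (often virtualized) hosts.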
You'll know for sure if you jstack your DNs and run the output by the
above thread's comments.

On Tue, Jul 3, 2012 at 4:00 AM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
> Thanks for helping.
>
> I do have a secondary namenode.
>
> And I'm starting one datanode at a time now. Previously I was using
> start-dfs.sh to start the NN and slaves. I observed some pretty strange
> behavior:
>
> 1. If I used start-dfs.sh, almost none of the DNs joined. All seemed
> stuck.
>
> 2. If I started one DN at a time, SOME joined instantly, but some looked
> like they were hanging... The funny thing is: if I started over again,
> the behavior changed for each node...
>
> Any ideas?
>
> Thanks,
> James
>
>
> On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <dar...@darose.net> wrote:
>> A couple of thoughts:
>>
>> 1) It shouldn't take more than a couple of minutes for the data nodes
>> to re-register after the name node comes back up. If the data nodes
>> aren't registering, you might try restarting the data node daemons on
>> each machine. They should all register soon after. (If not, then you
>> have other problems.)
>>
>> 2) Re: the original symptom you reported about the namenode "hanging"
>> at startup: you should read up on the "secondary namenode" component
>> in Hadoop (you might start at
>> http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode
>> and
>> http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/).
>> If you don't have a secondary namenode running in your cluster, that
>> can have a significantly negative impact on namenode startup time.
>> (The namenode needs to replay the edit log at startup, which can take
>> a long time if the log has grown large.)
>>
>> HTH,
>>
>> DR
>>
>>
>> On 07/02/2012 03:32 PM, Jianhui Zhang wrote:
>>>
>>> Hi Harsh,
>>>
>>> The web portal of the NN shows 0 nodes.
>>>
>>> Looking into each node's log, all nodes but one have been stuck at:
>>>
>>> /************************************************************
>>> STARTUP_MSG: Starting DataNode
>>> STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109
>>> STARTUP_MSG: args = []
>>> STARTUP_MSG: version = 0.20.205.0
>>> STARTUP_MSG: build =
>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
>>> ************************************************************/
>>> 2012-07-02 11:00:38,891 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>> hadoop-metrics2.properties
>>> 2012-07-02 11:00:38,906 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>> MetricsSystem,sub=Stats registered.
>>> 2012-07-02 11:00:38,908 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>> period at 10 second(s).
>>> 2012-07-02 11:00:38,908 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
>>> system started
>>> 2012-07-02 11:00:39,091 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>> ugi registered.
>>> 2012-07-02 11:00:39,094 WARN
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>> already exists!
>>> =============================
>>>
>>> The one node passed that point and reached:
>>>
>>> 2012-07-02 12:24:58,450 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
>>> succeeded for blk_-2932980158620691384_129517
>>>
>>> Does that mean progress? It's been 1.5 hours since the start. And the
>>> file system side shows: 74515 files and directories, 28439 blocks =
>>> 102954 total
>>>
>>> Thanks for helping.
>>> James
>>>
>>> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>> Jianhui,
>>>>
>>>> It is merely waiting for the DNs to start and to report their blocks
>>>> in.
>>>> This does not take long once the DNs are up and running. Do you
>>>> see any Live Nodes yet?
>>>>
>>>> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> Thanks for helping, especially at such early hours.
>>>>>
>>>>> After leaving it overnight, during which period nothing happened in
>>>>> the log, I restarted this morning. This time, it passed the
>>>>> previously stuck point and reached all the way to "IPC Server
>>>>> handler.. starting", in Safe Mode. So it looks more promising now.
>>>>>
>>>>> But it's in a state of:
>>>>>
>>>>> "The ratio of reported blocks 0.0000 has not reached the threshold
>>>>> 0.9990. Safe mode will be turned off automatically."
>>>>>
>>>>> Does that mean the NN is waiting for the DNs' reports/updates? How
>>>>> can I tell whether it's stuck or just slow?
>>>>>
>>>>> The NN log is at: http://pastebin.com/5fvRfRSD
>>>>>
>>>>> The jstack output is at: http://pastebin.com/RnDXWrtc
>>>>>
>>>>> The configuration is really basic:
>>>>>
>>>>> core-site.xml:
>>>>>
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>fs.default.name</name>
>>>>>     <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
>>>>>     <final>true</final>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>io.file.buffer.size</name>
>>>>>     <value>65536</value>
>>>>>   </property>
>>>>> </configuration>
>>>>>
>>>>> It's the same for all nodes.
>>>>>
>>>>> Again, I appreciate your help.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>>>
>>>>>> Jianhui,
>>>>>>
>>>>>> Can you pastebin.com the output of your "jstack <NN PID>" command
>>>>>> after it hangs, and pass us the paste link please? It looks to me
>>>>>> like it may have just been merging/saving the image, which can be
>>>>>> slow - but that depends on how long you had to wait to see the NN
>>>>>> resume and start up properly.
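[Editor's note: one way to tell "stuck" from "slow" in safe mode is to watch whether the reported-block ratio moves. A minimal sketch, assuming the 0.20-era `hadoop` CLI is on the PATH; the NN log path is hypothetical.]

```shell
# Ask the NN directly whether it is still in safe mode.
hadoop dfsadmin -safemode get

# Summarize datanodes and blocks; if "Datanodes available" stays at 0
# while the DN daemons are running, the DNs are failing to register.
hadoop dfsadmin -report

# The same ratio appears in the NN log; if it climbs over repeated
# checks, the NN is slow but alive. (Log path is hypothetical -
# adjust to your installation.)
grep 'ratio of reported blocks' /var/log/hadoop/hadoop-namenode.log | tail -5
```

If the ratio sits at 0.0000 indefinitely, the problem is on the DN side (registration or block reports), not in the NN itself.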
>>>>>>
>>>>>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Apache Hadoop 0.20.205.
>>>>>>>
>>>>>>> I'm trying to restart the NN and it always hangs at the very
>>>>>>> beginning. The only logs I've got are:
>>>>>>>
>>>>>>> /************************************************************
>>>>>>> STARTUP_MSG: Starting NameNode
>>>>>>> STARTUP_MSG: host = host/ip
>>>>>>> STARTUP_MSG: args = []
>>>>>>> STARTUP_MSG: version = 0.20.205.0
>>>>>>> STARTUP_MSG: build =
>>>>>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>>>>>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
>>>>>>> ************************************************************/
>>>>>>> 2012-07-02 01:33:01,281 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>>>>>> hadoop-metrics2.properties
>>>>>>> 2012-07-02 01:33:01,290 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> MetricsSystem,sub=Stats registered.
>>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>>>>>> period at 10 second(s).
>>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
>>>>>>> system started
>>>>>>> 2012-07-02 01:33:01,434 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> ugi registered.
>>>>>>> 2012-07-02 01:33:01,436 WARN
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>>>>>> already exists!
>>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> jvm registered.
>>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> NameNode registered.
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type
>>>>>>> = 64-bit
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
>>>>>>> memory = 314.0275 MB
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>>> capacity = 2^25 = 33554432 entries
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>>> recommended=33554432, actual=33554432
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> supergroup=supergroup
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> isPermissionEnabled=true
>>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> dfs.block.invalidate.limit=100
>>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
>>>>>>> accessTokenLifetime=0 min(s)
>>>>>>> 2012-07-02 01:33:01,787 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>>>>>>> FSNamesystemStateMBean and NameNodeMXBean
>>>>>>> 2012-07-02 01:33:01,802 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
>>>>>>> occuring more than 10 times
>>>>>>> 2012-07-02 01:33:01,811 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
>>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under
>>>>>>> construction = 0
>>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
>>>>>>> 2553316 loaded in 0 seconds.
>>>>>>> 2012-07-02 01:33:02,410 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file
>>>>>>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0
>>>>>>> seconds.
>>>>>>>
>>>>>>> ====================================
>>>>>>>
>>>>>>> It hangs thereafter... I wonder if anybody has seen this before?
>>>>>>>
>>>>>>> Some background: I shut down DFS and MR while there were still jobs
>>>>>>> running. Some MR jobs were hanging, so I manually killed the child
>>>>>>> JVMs after the shutdown. I'm not sure how such actions would affect
>>>>>>> NN startup.
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harsh J
>>>>
>>>>
>>>> --
>>>> Harsh J
>>
>>

--
Harsh J
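[Editor's note: the jstack suggestion at the top of the thread can be sketched as below. `jps` and `jstack` ship with the JDK; the output path and the SecureRandom grep pattern are illustrative assumptions.]

```shell
# Dump thread stacks for every DataNode JVM on this host, so the traces
# can be compared against the entropy thread linked at the top.
# (Hypothetical output path; adjust as needed.)
for pid in $(jps | awk '/DataNode/ {print $1}'); do
    jstack "$pid" > "/tmp/datanode-$pid.jstack"
done

# DNs blocked on entropy typically show a thread stuck inside
# SecureRandom reading from /dev/random:
grep -l 'SecureRandom' /tmp/datanode-*.jstack
```

Running the same `jstack <NN PID>` against the hung NameNode, as requested earlier in the thread, shows whether it is blocked the same way or busy replaying edits.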