Hi Harsh, The web portal of the NN shows 0 nodes.
Looking into each node's log, all nodes but one have been staying at: /************************************************************ STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.205.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 ************************************************************/ 2012-07-02 11:00:38,891 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2012-07-02 11:00:38,906 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started 2012-07-02 11:00:39,091 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered. 2012-07-02 11:00:39,094 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists! ============================= The one node passed that and reached: 2012-07-02 12:24:58,450 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-2932980158620691384_129517 Does that mean progress? It's been 1.5 hour since the start. And the file system side is: 74515 files and directories, 28439 blocks = 102954 total Thanks for helping. James On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote: > Juanhui, > > It is merely waiting for the DNs to start, and to report its blocks > in. This does not take long once the DNs are up and running. Do you > see any Live Nodes yet? > > On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com> > wrote: >> Hi folks, >> >> Thanks for helping, especially at such earlier hours. >> >> After leaving it overnight, during which period nothing happened in >> the log, I restarted this morning. This time, it passed the previously >> stuck point, and reached all the way to "IPC Server handler.. >> starting", in Safe Mode. So it looks more promising now. >> >> But it's in a state of: >> >> "The ratio of reported blocks 0.0000 has not reached the threshold >> 0.9990. Safe mode will be turned off automatically." >> >> Does that mean the NN is waiting for DNs's communications/updates? How >> can I tell whether it's stuck or just slow? >> >> The NN log is at: http://pastebin.com/5fvRfRSD >> >> The jstack output is at: http://pastebin.com/RnDXWrtc >> >> The configurations are really basic: >> >> core-site.xml: >> >> <configuration> >> <property> >> <name>fs.default.name</name> >> <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value> >> <final>true</final> >> </property> >> <property> >> <name>io.file.buffer.size</name> >> <value>65536</value> >> </property> >> </configuration> >> >> It's the same for all nodes. >> >> Again, appreciate your help. >> >> Thanks, >> James >> >> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote: >>> Jianhui, >>> >>> Can you pastebin.com the output of your "jstack <NN PID>" command >>> after its hung, and pass us the paste link please? It looks to me like >>> it may have just been merging/saving the image, and that may be slow >>> but it depends on how long did you have to wait around to see NN >>> resume and begin properly. >>> >>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com> >>> wrote: >>>> Hi, >>>> >>>> Apache Hadoop 0.20.205. >>>> >>>> I'm trying to restart NN and it always hangs at the very beginning. >>>> The only logs I've got are: >>>> >>>> /************************************************************ >>>> STARTUP_MSG: Starting NameNode >>>> STARTUP_MSG: host = host/ip >>>> STARTUP_MSG: args = [] >>>> STARTUP_MSG: version = 0.20.205.0 >>>> STARTUP_MSG: build = >>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 >>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >>>> ************************************************************/ >>>> 2012-07-02 01:33:01,281 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >>>> hadoop-metrics2.properties >>>> 2012-07-02 01:33:01,290 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>> MetricsSystem,sub=Stats registered. >>>> 2012-07-02 01:33:01,292 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >>>> period at 10 second(s). >>>> 2012-07-02 01:33:01,292 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics >>>> system started >>>> 2012-07-02 01:33:01,434 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>> ugi registered. >>>> 2012-07-02 01:33:01,436 WARN >>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >>>> already exists! >>>> 2012-07-02 01:33:01,441 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>> jvm registered. >>>> 2012-07-02 01:33:01,441 INFO >>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>>> NameNode registered. >>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type >>>> = 64-bit >>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max >>>> memory = 314.0275 MB >>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>>> capacity = 2^25 = 33554432 entries >>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>>> recommended=33554432, actual=33554432 >>>> 2012-07-02 01:33:01,546 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner >>>> 2012-07-02 01:33:01,546 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>> supergroup=supergroup >>>> 2012-07-02 01:33:01,546 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>> isPermissionEnabled=true >>>> 2012-07-02 01:33:01,550 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>> dfs.block.invalidate.limit=100 >>>> 2012-07-02 01:33:01,550 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >>>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), >>>> accessTokenLifetime=0 min(s) >>>> 2012-07-02 01:33:01,787 INFO >>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered >>>> FSNamesystemStateMBean and NameNodeMXBean >>>> 2012-07-02 01:33:01,802 INFO >>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names >>>> occuring more than 10 times >>>> 2012-07-02 01:33:01,811 INFO >>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032 >>>> 2012-07-02 01:33:02,406 INFO >>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under >>>> construction = 0 >>>> 2012-07-02 01:33:02,406 INFO >>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size >>>> 2553316 loaded in 0 seconds. >>>> 2012-07-02 01:33:02,410 INFO >>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file >>>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 >>>> seconds. >>>> >>>> ==================================== >>>> >>>> It hangs thereafter.... I wonder if anybody has seen this before? >>>> >>>> Some background: I shut down DFS and MR while there were still jobs >>>> running. Some MR jobs were hanging, so I manually killed the children >>>> JVMs after the shutdown. Not sure how such actions would affect NN >>>> startup. >>>> >>>> Any help would be appreciated. >>>> >>>> Thanks, >>>> James >>> >>> >>> >>> -- >>> Harsh J > > > > -- > Harsh J