Hi Harsh,

The web portal of the NN shows 0 nodes.

Looking at each node's log, all nodes but one have been stuck at:

/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pipeline09.x.y.z/10.2.20.109
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.205.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
************************************************************/
2012-07-02 11:00:38,891 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-07-02 11:00:38,906 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2012-07-02 11:00:39,091 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-07-02 11:00:39,094 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
=============================

The remaining node got past that point and reached:

2012-07-02 12:24:58,450 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-2932980158620691384_129517

Does that mean progress? It's been 1.5 hours since startup. The file
system summary reads: 74515 files and directories, 28439 blocks =
102954 total.
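To estimate how far the NN is from leaving safe mode, here is a simplified worked calculation in Python. It assumes safe mode can end once (reported blocks / total blocks) reaches the threshold shown in the NN's safe-mode message (0.999, presumably the default `dfs.safemode.threshold.pct`), and it ignores the extra wait the NN applies after the threshold is met (`dfs.safemode.extension`). The function names are illustrative, not Hadoop APIs.

```python
import math


def min_blocks_to_leave_safe_mode(total_blocks, threshold=0.999):
    """Smallest reported-block count whose ratio reaches the threshold.

    Simplified model (an assumption): safe mode may end once
    reported / total >= threshold. The NameNode's extension wait is ignored.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0 and 1")
    if total_blocks == 0:
        return 0
    needed = math.ceil(threshold * total_blocks)
    # Nudge the estimate if floating-point rounding made ceil() overshoot...
    while needed > 0 and (needed - 1) / total_blocks >= threshold:
        needed -= 1
    # ...or undershoot.
    while needed / total_blocks < threshold:
        needed += 1
    return needed


def reported_ratio(reported_blocks, total_blocks):
    """Ratio as in the log message; an empty namespace counts as fully reported."""
    return 1.0 if total_blocks == 0 else reported_blocks / total_blocks


total = 28439  # block count from the file system summary in this thread
need = min_blocks_to_leave_safe_mode(total)
print(need, f"{reported_ratio(need, total):.4f}")  # prints: 28411 0.9990
```

Under this model, at least 28411 of the 28439 blocks must be reported; a ratio that stays at 0.0000 means no DataNode has reported any blocks yet.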

Thanks for helping.
James

On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote:
> Jianhui,
>
> It is merely waiting for the DNs to start and report their blocks
> in. This does not take long once the DNs are up and running. Do you
> see any Live Nodes yet?
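The Live Nodes question can also be answered from a shell with `hadoop dfsadmin -report`, which still works while the NN is in safe mode. The Python sketch below runs that command and extracts the live/dead DataNode counts. It assumes the `hadoop` client is on the PATH and that the report contains a summary line shaped like `Datanodes available: 1 (1 total, 0 dead)`, which is what I believe 0.20-era releases print; the function names are hypothetical, so check the regex against your own output first.

```python
import re
import subprocess

# Assumed shape of the summary line in `hadoop dfsadmin -report` output,
# e.g. "Datanodes available: 3 (3 total, 0 dead)". Verify on your version.
SUMMARY = re.compile(r"Datanodes available:\s*(\d+)\s*\((\d+) total,\s*(\d+) dead\)")


def parse_live_dead(report_text):
    """Return (live, dead) DataNode counts, or None if no summary line is found."""
    match = SUMMARY.search(report_text)
    if match is None:
        return None
    live, _total, dead = (int(g) for g in match.groups())
    return live, dead


def live_dead_from_cluster():
    """Run dfsadmin on a machine that has the Hadoop client configured."""
    out = subprocess.run(
        ["hadoop", "dfsadmin", "-report"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_live_dead(out)


if __name__ == "__main__":
    sample = "Datanodes available: 1 (1 total, 0 dead)\n"
    print(parse_live_dead(sample))  # prints: (1, 0)
```

Zero live nodes with a non-zero dead count would point at DataNodes that registered once and then stopped heartbeating, while zero total suggests none have registered at all.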
>
> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
>> Hi folks,
>>
>> Thanks for helping, especially at such early hours.
>>
>> After leaving it overnight, during which nothing new appeared in the
>> log, I restarted it this morning. This time it got past the point
>> where it was previously stuck and reached "IPC Server handler..
>> starting", in safe mode. So it looks more promising now.
>>
>> But it remains in this state:
>>
>> "The ratio of reported blocks 0.0000 has not reached the threshold
>> 0.9990. Safe mode will be turned off automatically."
>>
>> Does that mean the NN is waiting for communications/updates from the
>> DNs? How can I tell whether it's stuck or just slow?
>>
>> The NN log is at: http://pastebin.com/5fvRfRSD
>>
>> The jstack output is at: http://pastebin.com/RnDXWrtc
>>
>> The configurations are really basic:
>>
>> core-site.xml:
>>
>> <configuration>
>>   <property>
>>     <name>fs.default.name</name>
>>     <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
>>     <final>true</final>
>>   </property>
>>   <property>
>>     <name>io.file.buffer.size</name>
>>     <value>65536</value>
>>   </property>
>> </configuration>
>>
>> It's the same for all nodes.
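Since every node should share the same core-site.xml, one way to confirm that is to collect each node's copy and compare the parsed properties rather than eyeballing the XML. A minimal standard-library Python sketch follows; how the files are gathered (scp, configuration management, etc.) and the host labels are assumptions, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Map each <property>'s <name> to its <value> in a Hadoop *-site.xml file."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def differing_keys(configs):
    """Given {host: xml_text}, return property names whose values differ across hosts.

    A property missing on some hosts counts as differing (its value is None there).
    """
    parsed = {host: load_properties(text) for host, text in configs.items()}
    keys = set().union(*parsed.values())
    return sorted(
        key for key in keys
        if len({props.get(key) for props in parsed.values()}) > 1
    )


if __name__ == "__main__":
    core_site = """<configuration>
      <property><name>fs.default.name</name>
        <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value><final>true</final></property>
      <property><name>io.file.buffer.size</name><value>65536</value></property>
    </configuration>"""
    print(differing_keys({"node-a": core_site, "node-b": core_site}))  # prints: []
```

The same comparison can be pointed at hdfs-site.xml, which is where DataNode-specific settings usually live.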
>>
>> Again, I appreciate your help.
>>
>> Thanks,
>> James
>>
>> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
>>> Jianhui,
>>>
>>> Can you put the output of "jstack <NN PID>" on pastebin.com after
>>> it has hung, and send us the link please? It looks to me like it may
>>> just have been merging/saving the image, which can be slow, but that
>>> depends on how long you had to wait before the NN resumed and
>>> started properly.
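A small helper for the jstack request: `jps` lists local JVMs as `<pid> <main class>`, so the NameNode PID can be picked out and passed to `jstack`. This Python sketch assumes both JDK tools are on the PATH and that it runs as the same user as the NN process; the function names and the output file name are hypothetical.

```python
import subprocess


def namenode_pids(jps_output):
    """Pick NameNode PIDs out of `jps` output ("<pid> <main class>" per line).

    Exact match on the class name, so SecondaryNameNode is not included.
    """
    pids = []
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1] == "NameNode":
            pids.append(int(parts[0]))
    return pids


def dump_namenode_stacks(path="nn-jstack.txt"):
    """Run jstack against each local NameNode JVM and save the output to a file."""
    jps = subprocess.run(["jps"], capture_output=True, text=True, check=True).stdout
    with open(path, "w") as out:
        for pid in namenode_pids(jps):
            dump = subprocess.run(
                ["jstack", str(pid)], capture_output=True, text=True, check=True
            ).stdout
            out.write(dump)
    return path
```

Taking two or three dumps a minute apart makes it easier to see whether the main thread is stuck in one place or still making progress.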
>>>
>>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> Apache Hadoop 0.20.205.
>>>>
>>>> I'm trying to restart the NN, and it always hangs at the very beginning.
>>>> The only logs I've got are:
>>>>
>>>> /************************************************************
>>>> STARTUP_MSG: Starting NameNode
>>>> STARTUP_MSG:   host = host/ip
>>>> STARTUP_MSG:   args = []
>>>> STARTUP_MSG:   version = 0.20.205.0
>>>> STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205 -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
>>>> ************************************************************/
>>>> 2012-07-02 01:33:01,281 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
>>>> 2012-07-02 01:33:01,290 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
>>>> 2012-07-02 01:33:01,292 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
>>>> 2012-07-02 01:33:01,292 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started
>>>> 2012-07-02 01:33:01,434 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
>>>> 2012-07-02 01:33:01,436 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
>>>> 2012-07-02 01:33:01,441 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered.
>>>> 2012-07-02 01:33:01,441 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered.
>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type       = 64-bit
>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 314.0275 MB
>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: capacity      = 2^25 = 33554432 entries
>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: recommended=33554432, actual=33554432
>>>> 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
>>>> 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
>>>> 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
>>>> 2012-07-02 01:33:01,550 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100
>>>> 2012-07-02 01:33:01,550 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s)
>>>> 2012-07-02 01:33:01,787 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean
>>>> 2012-07-02 01:33:01,802 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times
>>>> 2012-07-02 01:33:01,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
>>>> 2012-07-02 01:33:02,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0
>>>> 2012-07-02 01:33:02,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2553316 loaded in 0 seconds.
>>>> 2012-07-02 01:33:02,410 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 seconds.
>>>>
>>>> ====================================
>>>>
>>>> It hangs after that. Has anybody seen this before?
>>>>
>>>> Some background: I shut down DFS and MR while jobs were still
>>>> running. Some MR jobs were hanging, so I manually killed the child
>>>> JVMs after the shutdown. I'm not sure how that would affect NN
>>>> startup.
>>>>
>>>> Any help would be appreciated.
>>>>
>>>> Thanks,
>>>> James
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>
> --
> Harsh J
