James,

You may be running into this old issue involving system entropy:
http://search-hadoop.com/m/7Giae6vLWR1
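A minimal sketch of that entropy check, assuming a standard Linux host; the `HADOOP_OPTS` placement and the "a few hundred bits" rule of thumb are assumptions, not values from the linked thread:

```shell
# Read the kernel's available entropy. If this number is very low
# (a few hundred bits or less), reads from /dev/random - which Java's
# SecureRandom can block on - may stall the DN for a long time.
cat /proc/sys/kernel/random/entropy_avail

# A common workaround is to point the JVM at the non-blocking device
# instead (hypothetical placement - e.g. in hadoop-env.sh):
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"
```

Installing an entropy daemon such as rngd is the more permanent fix on entropy-starved (often virtualized) hosts.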
You'll know for sure if you jstack your DNs and run the output by the
above thread's comments.

On Tue, Jul 3, 2012 at 4:00 AM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
> Thanks for helping.
>
> I do have a secondary namenode.
>
> And I'm starting one datanode at a time now. Previously I was using
> start-dfs.sh to start the NN and slaves. I observed some pretty strange
> behavior:
>
> 1. If I used start-dfs.sh, almost none of the DNs joined. All seemed
> stuck.
>
> 2. If I started one DN at a time, SOME joined instantly, but some looked
> like they were hanging... The funny thing is: if I started over again,
> the behavior changed for each node...
>
> Any ideas?
>
> Thanks,
> James
>
>
> On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <dar...@darose.net> wrote:
>> A couple of thoughts:
>>
>> 1) It shouldn't take more than a couple of minutes for the data nodes
>> to re-register after the name node comes back up. If the data nodes
>> aren't registering, you might try restarting the data node daemons on
>> each machine. They should all register soon after. (If not, then you
>> have other problems.)
>>
>> 2) Re: the original symptom you reported about the namenode "hanging"
>> at startup: you should read up on the "secondary namenode" component
>> in Hadoop (you might start at
>> http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode
>> and
>> http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/).
>> If you don't have a secondary namenode running in your cluster, that
>> can have a significantly negative impact on namenode startup time.
>> (The namenode needs to replay the edit log at startup, which can take
>> a long time if the log has grown large.)
>>
>> HTH,
>>
>> DR
>>
>>
>> On 07/02/2012 03:32 PM, Jianhui Zhang wrote:
>>>
>>> Hi Harsh,
>>>
>>> The web portal of the NN shows 0 nodes.
>>>
>>> Looking into each node's log, all nodes but one have been stuck at:
>>>
>>> /************************************************************
>>> STARTUP_MSG: Starting DataNode
>>> STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109
>>> STARTUP_MSG: args = []
>>> STARTUP_MSG: version = 0.20.205.0
>>> STARTUP_MSG: build =
>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
>>> ************************************************************/
>>> 2012-07-02 11:00:38,891 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>> hadoop-metrics2.properties
>>> 2012-07-02 11:00:38,906 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>> MetricsSystem,sub=Stats registered.
>>> 2012-07-02 11:00:38,908 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>> period at 10 second(s).
>>> 2012-07-02 11:00:38,908 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics
>>> system started
>>> 2012-07-02 11:00:39,091 INFO
>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>> ugi registered.
>>> 2012-07-02 11:00:39,094 WARN
>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>> already exists!
>>> =============================
>>>
>>> The one node passed that point and reached:
>>>
>>> 2012-07-02 12:24:58,450 INFO
>>> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification
>>> succeeded for blk_-2932980158620691384_129517
>>>
>>> Does that mean progress? It's been 1.5 hours since the start. And the
>>> file system side shows: 74515 files and directories, 28439 blocks =
>>> 102954 total
>>>
>>> Thanks for helping.
>>> James
>>>
>>> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <ha...@cloudera.com> wrote:
>>>>
>>>> Jianhui,
>>>>
>>>> It is merely waiting for the DNs to start and to report their blocks
>>>> in.
>>>> This does not take long once the DNs are up and running. Do you
>>>> see any Live Nodes yet?
>>>>
>>>> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>>> wrote:
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> Thanks for helping, especially at such early hours.
>>>>>
>>>>> After leaving it overnight, during which period nothing happened in
>>>>> the log, I restarted this morning. This time, it passed the
>>>>> previously stuck point and reached all the way to "IPC Server
>>>>> handler.. starting", in Safe Mode. So it looks more promising now.
>>>>>
>>>>> But it's in a state of:
>>>>>
>>>>> "The ratio of reported blocks 0.0000 has not reached the threshold
>>>>> 0.9990. Safe mode will be turned off automatically."
>>>>>
>>>>> Does that mean the NN is waiting for the DNs' reports/updates? How
>>>>> can I tell whether it's stuck or just slow?
>>>>>
>>>>> The NN log is at: http://pastebin.com/5fvRfRSD
>>>>>
>>>>> The jstack output is at: http://pastebin.com/RnDXWrtc
>>>>>
>>>>> The configuration is really basic:
>>>>>
>>>>> core-site.xml:
>>>>>
>>>>> <configuration>
>>>>>   <property>
>>>>>     <name>fs.default.name</name>
>>>>>     <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
>>>>>     <final>true</final>
>>>>>   </property>
>>>>>   <property>
>>>>>     <name>io.file.buffer.size</name>
>>>>>     <value>65536</value>
>>>>>   </property>
>>>>> </configuration>
>>>>>
>>>>> It's the same for all nodes.
>>>>>
>>>>> Again, I appreciate your help.
>>>>>
>>>>> Thanks,
>>>>> James
>>>>>
>>>>> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
>>>>>>
>>>>>> Jianhui,
>>>>>>
>>>>>> Can you pastebin.com the output of your "jstack <NN PID>" command
>>>>>> after it hangs, and pass us the paste link please? It looks to me
>>>>>> like it may have just been merging/saving the image, which can be
>>>>>> slow - but that depends on how long you had to wait to see the NN
>>>>>> resume and start up properly.
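[Editor's note: one way to tell "stuck" from "slow" in safe mode is to watch whether the reported-block ratio moves. A minimal sketch, assuming the 0.20-era `hadoop` CLI is on the PATH; the NN log path is hypothetical.]

```shell
# Ask the NN directly whether it is still in safe mode.
hadoop dfsadmin -safemode get

# Summarize datanodes and blocks; if "Datanodes available" stays at 0
# while the DN daemons are running, the DNs are failing to register.
hadoop dfsadmin -report

# The same ratio appears in the NN log; if it climbs over repeated
# checks, the NN is slow but alive. (Log path is hypothetical -
# adjust to your installation.)
grep 'ratio of reported blocks' /var/log/hadoop/hadoop-namenode.log | tail -5
```

If the ratio sits at 0.0000 indefinitely, the problem is on the DN side (registration or block reports), not in the NN itself.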
>>>>>>
>>>>>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Apache Hadoop 0.20.205.
>>>>>>>
>>>>>>> I'm trying to restart the NN and it always hangs at the very
>>>>>>> beginning. The only logs I've got are:
>>>>>>>
>>>>>>> /************************************************************
>>>>>>> STARTUP_MSG: Starting NameNode
>>>>>>> STARTUP_MSG: host = host/ip
>>>>>>> STARTUP_MSG: args = []
>>>>>>> STARTUP_MSG: version = 0.20.205.0
>>>>>>> STARTUP_MSG: build =
>>>>>>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>>>>>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011
>>>>>>> ************************************************************/
>>>>>>> 2012-07-02 01:33:01,281 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>>>>>>> hadoop-metrics2.properties
>>>>>>> 2012-07-02 01:33:01,290 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> MetricsSystem,sub=Stats registered.
>>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>>>>>>> period at 10 second(s).
>>>>>>> 2012-07-02 01:33:01,292 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
>>>>>>> system started
>>>>>>> 2012-07-02 01:33:01,434 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> ugi registered.
>>>>>>> 2012-07-02 01:33:01,436 WARN
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>>>>>>> already exists!
>>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> jvm registered.
>>>>>>> 2012-07-02 01:33:01,441 INFO
>>>>>>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>>>>>>> NameNode registered.
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type
>>>>>>> = 64-bit
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
>>>>>>> memory = 314.0275 MB
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>>> capacity = 2^25 = 33554432 entries
>>>>>>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>>>>>>> recommended=33554432, actual=33554432
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> supergroup=supergroup
>>>>>>> 2012-07-02 01:33:01,546 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> isPermissionEnabled=true
>>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> dfs.block.invalidate.limit=100
>>>>>>> 2012-07-02 01:33:01,550 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>>>>>>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
>>>>>>> accessTokenLifetime=0 min(s)
>>>>>>> 2012-07-02 01:33:01,787 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>>>>>>> FSNamesystemStateMBean and NameNodeMXBean
>>>>>>> 2012-07-02 01:33:01,802 INFO
>>>>>>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
>>>>>>> occuring more than 10 times
>>>>>>> 2012-07-02 01:33:01,811 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
>>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under
>>>>>>> construction = 0
>>>>>>> 2012-07-02 01:33:02,406 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
>>>>>>> 2553316 loaded in 0 seconds.
>>>>>>> 2012-07-02 01:33:02,410 INFO
>>>>>>> org.apache.hadoop.hdfs.server.common.Storage: Edits file
>>>>>>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0
>>>>>>> seconds.
>>>>>>>
>>>>>>> ====================================
>>>>>>>
>>>>>>> It hangs thereafter... I wonder if anybody has seen this before?
>>>>>>>
>>>>>>> Some background: I shut down DFS and MR while there were still jobs
>>>>>>> running. Some MR jobs were hanging, so I manually killed the child
>>>>>>> JVMs after the shutdown. I'm not sure how such actions would affect
>>>>>>> NN startup.
>>>>>>>
>>>>>>> Any help would be appreciated.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> James
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Harsh J
>>>>
>>>>
>>>> --
>>>> Harsh J
>>
>>

--
Harsh J
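[Editor's note: the jstack suggestion at the top of the thread can be sketched as below. `jps` and `jstack` ship with the JDK; the output path and the SecureRandom grep pattern are illustrative assumptions.]

```shell
# Dump thread stacks for every DataNode JVM on this host, so the traces
# can be compared against the entropy thread linked at the top.
# (Hypothetical output path; adjust as needed.)
for pid in $(jps | awk '/DataNode/ {print $1}'); do
    jstack "$pid" > "/tmp/datanode-$pid.jstack"
done

# DNs blocked on entropy typically show a thread stuck inside
# SecureRandom reading from /dev/random:
grep -l 'SecureRandom' /tmp/datanode-*.jstack
```

Running the same `jstack <NN PID>` against the hung NameNode, as requested earlier in the thread, shows whether it is blocked the same way or busy replaying edits.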