Hi folks,

Thanks for helping, especially at such early hours.

I left it overnight, during which nothing new appeared in the log, and
restarted the NN this morning. This time it got past the point where it
was previously stuck and went all the way to "IPC Server handler..
starting", in Safe Mode. So it looks more promising now.

But it is now reporting this state:

"The ratio of reported blocks 0.0000 has not reached the threshold
0.9990. Safe mode will be turned off automatically."

Does that mean the NN is waiting for the DNs' communications/updates? How
can I tell whether it's stuck or just slow?
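
To make the "stuck or just slow" question concrete, here is a rough sketch
of what I have in mind: pull the reported-block ratio out of successive
copies of the safe-mode message and see whether it ever moves. The regex is
based only on the message quoted above; the function names are made up for
illustration, and how the message gets collected over time is left open.

```python
import re

# Matches the safe-mode status text quoted above, e.g.
# "The ratio of reported blocks 0.0000 has not reached the threshold 0.9990."
SAFEMODE_RE = re.compile(
    r"ratio of reported blocks\s+(\d+\.\d+)\s+"
    r"has not reached the threshold\s+(\d+\.\d+)"
)


def parse_safemode(text):
    """Return (ratio, threshold) as floats, or None if no status is found."""
    # The message may be wrapped across lines, so collapse whitespace first.
    match = SAFEMODE_RE.search(" ".join(text.split()))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def is_progressing(samples):
    """samples: ratios observed over time, oldest first.

    True only if there are at least two samples and the latest ratio is
    higher than the earliest one.
    """
    return len(samples) >= 2 and samples[-1] > samples[0]
```

If the ratio stays at 0.0000 across several samples taken a minute or so
apart, that would suggest no DN has reported any blocks yet, rather than a
slow start. "hadoop dfsadmin -safemode get" should also say whether safe
mode is still on.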

The NN log is at: http://pastebin.com/5fvRfRSD

The jstack output is at: http://pastebin.com/RnDXWrtc

The configurations are really basic:

core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value>
    <final>true</final>
  </property>
  <property>
    <name>io.file.buffer.size</name>
    <value>65536</value>
  </property>
</configuration>

It's the same for all nodes.
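
In case it helps to double-check that claim, below is a small sketch that
parses two copies of core-site.xml and lists any properties that differ.
It assumes the file contents have already been fetched from each node as
text; load_properties and diff_properties are hypothetical helper names,
not part of Hadoop.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Parse a Hadoop-style <configuration> document into {name: value}."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # A missing <value> element is treated as an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(a, b):
    """Return {name: (value_in_a, value_in_b)} for properties that differ."""
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(set(a) | set(b))
        if a.get(key) != b.get(key)
    }
```

An empty result from diff_properties means the two files define the same
properties with the same values. Note that this sketch ignores the <final>
flag.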

Again, I appreciate your help.

Thanks,
James

On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <ha...@cloudera.com> wrote:
> Jianhui,
>
> Can you pastebin.com the output of your "jstack <NN PID>" command
> after its hung, and pass us the paste link please? It looks to me like
> it may have just been merging/saving the image, and that may be slow
> but it depends on how long did you have to wait around to see NN
> resume and begin properly.
>
> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <jhzhang.em...@gmail.com> wrote:
>> Hi,
>>
>> Apache Hadoop 0.20.205.
>>
>> I'm trying to restart NN and it always hangs at the very beginning.
>> The only logs I've got are:
>>
>> /************************************************************
>> STARTUP_MSG: Starting NameNode
>> STARTUP_MSG:   host = host/ip
>> STARTUP_MSG:   args = []
>> STARTUP_MSG:   version = 0.20.205.0
>> STARTUP_MSG:   build =
>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205
>> -r 1179940; compiled by 'hortonfo' on Fri Oct  7 06:20:32 UTC 2011
>> ************************************************************/
>> 2012-07-02 01:33:01,281 INFO
>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from
>> hadoop-metrics2.properties
>> 2012-07-02 01:33:01,290 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> MetricsSystem,sub=Stats registered.
>> 2012-07-02 01:33:01,292 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot
>> period at 10 second(s).
>> 2012-07-02 01:33:01,292 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics
>> system started
>> 2012-07-02 01:33:01,434 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> ugi registered.
>> 2012-07-02 01:33:01,436 WARN
>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi
>> already exists!
>> 2012-07-02 01:33:01,441 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> jvm registered.
>> 2012-07-02 01:33:01,441 INFO
>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source
>> NameNode registered.
>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type
>>       = 64-bit
>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max
>> memory = 314.0275 MB
>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>> capacity      = 2^25 = 33554432 entries
>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet:
>> recommended=33554432, actual=33554432
>> 2012-07-02 01:33:01,546 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner
>> 2012-07-02 01:33:01,546 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> supergroup=supergroup
>> 2012-07-02 01:33:01,546 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> isPermissionEnabled=true
>> 2012-07-02 01:33:01,550 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> dfs.block.invalidate.limit=100
>> 2012-07-02 01:33:01,550 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
>> isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s),
>> accessTokenLifetime=0 min(s)
>> 2012-07-02 01:33:01,787 INFO
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
>> FSNamesystemStateMBean and NameNodeMXBean
>> 2012-07-02 01:33:01,802 INFO
>> org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names
>> occuring more than 10 times
>> 2012-07-02 01:33:01,811 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032
>> 2012-07-02 01:33:02,406 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Number of files under
>> construction = 0
>> 2012-07-02 01:33:02,406 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Image file of size
>> 2553316 loaded in 0 seconds.
>> 2012-07-02 01:33:02,410 INFO
>> org.apache.hadoop.hdfs.server.common.Storage: Edits file
>> /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0
>> seconds.
>>
>> ====================================
>>
>> It hangs thereafter.... I wonder if anybody has seen this before?
>>
>> Some background: I shut down DFS and MR while there were still jobs
>> running. Some MR jobs were hanging, so I manually killed the children
>> JVMs after the shutdown. Not sure how such actions would affect NN
>> startup.
>>
>> Any help would be appreciated.
>>
>> Thanks,
>> James
>
>
>
> --
> Harsh J
