[ 
https://issues.apache.org/jira/browse/HDFS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14551234#comment-14551234
 ] 

Jing Zhao commented on HDFS-7991:
---------------------------------

We recently saw several customer clusters where the NameNodes were stopped 
without checking for, or doing, a checkpoint. This led to hours of downtime 
while loading large amounts of editlog (some clusters also hit the issue 
reported by HDFS-7609, which made things worse).

I had an offline discussion with [~cnauroth] and [~jnp] about this 
functionality. Here is a summary of the options we came up with:
# The solution developed in the current patch: the script sends a 
saveNamespace request to the NameNode before stopping it, and the NameNode 
does an extra checkpoint if necessary, based on the time of the latest 
checkpoint and the total number of transactions outside of the checkpoint. 
The drawback of this method is that if a checkpoint is necessary, the admin 
will see the stop command blocked for 10 minutes or more. The admin can also 
get confused if the saveNamespace command fails.
# Another way is that, instead of issuing the saveNamespace command directly, 
the script first checks the time of the latest checkpoint and the total number 
of transactions (maybe through the jmxget command). If a checkpoint is 
necessary, the script aborts and prints a warning message asking the admin to 
run "dfsadmin -saveNamespace". This avoids the long wait of solution #1. Also, 
if the jmxget command fails, the admin can use a command argument to force 
stopping the NameNode if he/she can confirm the checkpoint is not necessary.
# The third option is to move the checkpoint logic into the shutdown hook of 
the NameNode. The biggest challenge here is the sync between the server and the 
script, i.e., deciding when and whether to kill the NN in the script. The 
script may have to poll the current state of the NameNode and guess whether 
the NameNode is still doing a checkpoint or is hanging somewhere else. 
Currently I do not see an easy way to achieve this.
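
To make option #2 concrete, here is a minimal sketch of the decision the stop 
script would make. It is not taken from the patch. The function names, the 
StopDecision type and the force flag are hypothetical, and the default 
thresholds are only assumed to mirror the usual defaults of 
dfs.namenode.checkpoint.period (3600 s) and dfs.namenode.checkpoint.txns 
(1000000). How the two inputs are obtained (e.g. via jmxget) is left out.

```python
from enum import Enum
from typing import Optional, Tuple


class StopDecision(Enum):
    STOP = "stop"    # safe to go ahead and stop the NameNode
    ABORT = "abort"  # refuse to stop; the admin has to act first


def checkpoint_needed(secs_since_checkpoint: int,
                      txns_since_checkpoint: int,
                      period_secs: int = 3600,
                      txn_threshold: int = 1_000_000) -> bool:
    """True if the last checkpoint is too old or too many txns are outside it.

    The thresholds are assumptions for illustration, not values from the patch.
    """
    return (secs_since_checkpoint >= period_secs
            or txns_since_checkpoint >= txn_threshold)


def decide_stop(secs_since_checkpoint: Optional[int],
                txns_since_checkpoint: Optional[int],
                force: bool = False) -> Tuple[StopDecision, str]:
    """Decide whether the stop script should proceed.

    None for either metric means the lookup (e.g. jmxget) failed.
    """
    if force:
        # The admin has confirmed that no checkpoint is necessary.
        return StopDecision.STOP, "forced stop, checkpoint check skipped"
    if secs_since_checkpoint is None or txns_since_checkpoint is None:
        return (StopDecision.ABORT,
                "could not read checkpoint metrics; use the force option "
                "if you are sure no checkpoint is needed")
    if checkpoint_needed(secs_since_checkpoint, txns_since_checkpoint):
        return (StopDecision.ABORT,
                'checkpoint needed; run "dfsadmin -saveNamespace" first')
    return StopDecision.STOP, "recent checkpoint found, stopping"
```

The script never blocks on a checkpoint itself. It either stops right away or 
hands control back to the admin with a message, which is the main difference 
from option #1.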

For now we think #2 may be the best solution. I will update the patch 
accordingly. [~aw], could you please also share your thoughts here? Thanks.


> Allow users to skip checkpoint when stopping NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-7991
>                 URL: https://issues.apache.org/jira/browse/HDFS-7991
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>    Affects Versions: 3.0.0
>            Reporter: Jing Zhao
>            Assignee: Jing Zhao
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-7991.000.patch, HDFS-7991.001.patch, 
> HDFS-7991.002.patch, HDFS-7991.003.patch, HDFS-7991.004.patch
>
>
> This is a follow-up jira of HDFS-6353. HDFS-6353 adds the functionality to 
> check if saving namespace is necessary before stopping namenode. As [~kihwal] 
> pointed out in this 
> [comment|https://issues.apache.org/jira/browse/HDFS-6353?focusedCommentId=14380898&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14380898],
>  in a secured cluster this new functionality requires the user to be kinit'ed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
