[ 
https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562214#comment-14562214
 ] 

Ming Ma commented on HDFS-7609:
-------------------------------

Thanks [~jingzhao]. Good point about saveNamespace.

Regarding moving {{checkOperation(OperationCategory.WRITE)}} from FSNamesystem 
to NameNodeRpcServer, I considered that before. There are two minor issues.

* The period during which both NNs are in standby should be short, but I am
not sure whether some failure scenario, such as a ZK issue, could make it
long. In addition, given that the old ANN still keeps its retry cache after it
becomes standby, the application might get a cached result from the old ANN
if we allow the retry cache check while the NN is in standby.
* If we want to move the check, we might also want to move other things, like
checking whether the system supports symlinks, so that
UnsupportedOperationException can be thrown before StandbyException. This
order might not be important, as UnsupportedOperationException will
eventually be thrown to the application from the active NN.

Otherwise, I completely agree that checking standby before the retry cache
check is simpler. If these issues aren't important, I can update the patch
accordingly.
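
To make the ordering concrete, here is a minimal sketch of "standby check first, retry cache lookup second". It is a toy model, not the actual FSNamesystem or NameNodeRpcServer code: ToyNameNode, ToyStandbyException, handleWrite and the string call IDs are hypothetical names made up for illustration, and the exception is unchecked only to keep the example short.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these types are the real Hadoop classes.
class StandbyCheckOrderSketch {
    enum HAState { ACTIVE, STANDBY }

    // Hypothetical stand-in for StandbyException; unchecked here for brevity.
    static class ToyStandbyException extends RuntimeException {
        ToyStandbyException(String msg) { super(msg); }
    }

    static class ToyNameNode {
        HAState state = HAState.ACTIVE;
        // Retry cache keyed by a client call id; it survives a transition to standby.
        final Map<String, String> retryCache = new HashMap<>();

        // Standby check first, retry cache lookup second.
        String handleWrite(String callId, String op) {
            if (state != HAState.ACTIVE) {
                throw new ToyStandbyException("write not allowed in state " + state);
            }
            String cached = retryCache.get(callId);
            if (cached != null) {
                return cached; // retried call on the active node: replay the earlier result
            }
            String result = "done:" + op;
            retryCache.put(callId, result);
            return result;
        }
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        System.out.println(nn.handleWrite("call-1", "mkdir /a"));
        nn.state = HAState.STANDBY; // failover: the old active becomes standby
        try {
            nn.handleWrite("call-1", "mkdir /a"); // client retry reaches the old active
        } catch (ToyStandbyException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With this ordering, a retry that reaches the old ANN after failover is rejected instead of being answered from its stale retry cache, which is the first concern listed above.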

> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>            Assignee: Ming Ma
>              Labels: BB2015-05-RFC
>         Attachments: HDFS-7609-2.patch, 
> HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609.patch, 
> recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same
> time under very high load, leaving behind about 100 million transactions in
> the edits log. (I still have no idea why they were not rolled into the
> fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be
> needed to finish, and it was loading fsedits most of the time. I also tried
> to restart the namenode in recovery mode, but the loading speed was no
> different.
> I looked into the stack trace and judged that it was caused by the retry
> cache. So I set dfs.namenode.enable.retrycache to false, and the restart
> process finished in half an hour.
> I think the retry cache is useless during startup, at least during the
> recovery process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
