[ https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562214#comment-14562214 ]
Ming Ma commented on HDFS-7609:
-------------------------------

Thanks [~jingzhao]. Good point about saveNamespace.

Regarding moving {{checkOperation(OperationCategory.WRITE)}} from FSNamesystem to NameNodeRpcServer, I considered that before. There are two minor issues.

* The period during which both NNs are in standby should be short, but I am not sure whether some failure scenario, such as a ZK issue, could make it long. In addition, given that the old ANN still keeps its retry cache after it becomes standby, the application might get the cached result from the old ANN if we allow the cache check while the NN is in standby.
* If we want to move the check, we might also want to move other things, such as checking whether the system supports symlinks, so that UnsupportedOperationException can be thrown before StandbyException. This order might not be important, as UnsupportedOperationException will eventually be thrown to the application by the active NN.

Otherwise, I completely agree that checking standby before the retry cache check is simpler. If these issues aren't important, I can update the patch accordingly.

> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>            Assignee: Ming Ma
>              Labels: BB2015-05-RFC
>         Attachments: HDFS-7609-2.patch, HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609.patch, recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same time under very high load, leaving behind about 100 million transactions in the edits log. (I still have no idea why they were not rolled into the fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be needed to finish, and it was loading fsedits most of the time. I also tried to restart the namenode in recovery mode; the loading speed was no different.
> I looked into the stack trace and judged that the slowness was caused by the retry cache. So I set dfs.namenode.enable.retrycache to false, and the restart process finished in half an hour.
> I think the retry cache is useless during startup, or at least during the recovery process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
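
The ordering question raised in the comment (standby check before or after the retry cache lookup) can be illustrated with a small sketch. This is not the FSNamesystem or NameNodeRpcServer code: {{ToyNameNode}}, its methods, and the nested {{StandbyException}} are hypothetical stand-ins, and the retry cache is reduced to a map keyed by call ID. It only shows why a standby node that still holds its old retry cache could answer a retried call if the cache is consulted first.

```java
import java.util.HashMap;
import java.util.Map;

public class StandbyCheckOrderSketch {

    /** Hypothetical stand-in for HDFS's StandbyException (assumption, not the real class). */
    static class StandbyException extends Exception {
        StandbyException(String msg) {
            super(msg);
        }
    }

    /** Toy NameNode: an HA state flag plus a retry cache keyed by RPC call ID. */
    static class ToyNameNode {
        private boolean active;
        private final Map<Long, String> retryCache = new HashMap<>();

        ToyNameNode(boolean active) {
            this.active = active;
        }

        /** Becomes standby but keeps the retry cache, as described in the comment. */
        void transitionToStandby() {
            active = false;
        }

        private void checkWriteAllowed() throws StandbyException {
            if (!active) {
                throw new StandbyException("WRITE is not supported in standby state");
            }
        }

        /** Ordering A: reject on standby before looking at the retry cache. */
        String renameCheckStandbyFirst(long callId, String src, String dst)
                throws StandbyException {
            checkWriteAllowed();
            String cached = retryCache.get(callId);
            if (cached != null) {
                return cached;
            }
            return applyRename(callId, src, dst);
        }

        /** Ordering B: consult the retry cache before the standby check. */
        String renameCheckCacheFirst(long callId, String src, String dst)
                throws StandbyException {
            String cached = retryCache.get(callId);
            if (cached != null) {
                return cached; // answered from the cache even if this node is now standby
            }
            checkWriteAllowed();
            return applyRename(callId, src, dst);
        }

        private String applyRename(long callId, String src, String dst) {
            String result = "renamed " + src + " -> " + dst;
            retryCache.put(callId, result);
            return result;
        }
    }

    public static void main(String[] args) throws StandbyException {
        ToyNameNode nn = new ToyNameNode(true);
        nn.renameCheckStandbyFirst(1L, "/a", "/b"); // first attempt while active
        nn.transitionToStandby();                   // failover; cache is retained

        // Retry of the same call against the old active node.
        try {
            nn.renameCheckStandbyFirst(1L, "/a", "/b");
            System.out.println("ordering A: answered");
        } catch (StandbyException e) {
            System.out.println("ordering A: rejected, client fails over");
        }
        System.out.println("ordering B: " + nn.renameCheckCacheFirst(1L, "/a", "/b"));
    }
}
```

With ordering A the retried call is always rejected on a standby node, so the client is redirected to the active NN; with ordering B the old active node can still return its cached result, which is the behavior the first bullet in the comment is concerned about.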