[jira] [Commented] (HDFS-7609) startup used too much time to load edits

Jing Zhao (JIRA) Wed, 27 May 2015 10:48:53 -0700

    [ https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14561354#comment-14561354 ]


Jing Zhao commented on HDFS-7609:
---------------------------------

Thanks Ming! The new patch looks good to me. One minor point: we do not need 
to throw StandbyException for {{saveNamespace}}, since it can also be processed 
by the standby NN. Also, since we do not have an editlog entry for 
{{saveNamespace}}, I guess we do not need to apply this fix to it?

Maybe another, simpler way to fix the issue is to move the 
{{checkOperation(OperationCategory.WRITE)}} check to the very beginning (i.e., 
before the retry cache lookup). The cost is that we miss the chance to get the 
response directly from the standby NN's retry cache, so the client has to fail 
over one more time. But this chance looks very small: it can only happen when 
the request has been handled by the active NN, the client misses the response, 
an NN failover then happens, and the client is redirected to the other NN, 
which has loaded the edits but has not yet transitioned to the active state.
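
To make the ordering concrete, here is a minimal, self-contained sketch of the idea. It is not the actual FSNamesystem code: {{ToyNameNode}}, its {{delete}} method, the call-id-keyed map, and the local {{StandbyException}} stand-in are all hypothetical names used only for illustration. It shows the state check running before the retry cache lookup, so a standby that has already loaded the edits still rejects the retried call.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; none of these types are the real Hadoop classes.
class CheckOrderSketch {
    enum HAState { ACTIVE, STANDBY }

    /** Hypothetical stand-in for Hadoop's StandbyException. */
    static class StandbyException extends Exception {
        StandbyException(String msg) { super(msg); }
    }

    /** Toy NameNode: an HA state plus a retry cache keyed by an assumed RPC call id. */
    static class ToyNameNode {
        HAState state;
        final Map<Long, String> retryCache = new HashMap<>();
        int cacheLookups = 0; // counts how often the retry cache is consulted

        ToyNameNode(HAState state) { this.state = state; }

        /** Plays the role of checkOperation(OperationCategory.WRITE). */
        void checkWriteOperation() throws StandbyException {
            if (state != HAState.ACTIVE) {
                throw new StandbyException("WRITE is not supported in state " + state);
            }
        }

        /** Proposed ordering: state check first, retry cache lookup second. */
        String delete(long callId, String path) throws StandbyException {
            checkWriteOperation();            // rejects the call on a standby NN
            cacheLookups++;
            String cached = retryCache.get(callId);
            if (cached != null) {
                return cached;                // retried call: replay the earlier response
            }
            String response = "deleted " + path;
            retryCache.put(callId, response); // remember the response for retries
            return response;
        }
    }

    public static void main(String[] args) throws Exception {
        // The other NN has loaded the edits (its retry cache has the entry)
        // but has not transitioned to active yet.
        ToyNameNode nn = new ToyNameNode(HAState.STANDBY);
        nn.retryCache.put(7L, "deleted /a");
        try {
            nn.delete(7L, "/a");
        } catch (StandbyException e) {
            System.out.println("rejected before cache lookup: " + e.getMessage());
        }
        // After the transition, the retried call is answered from the retry cache.
        nn.state = HAState.ACTIVE;
        System.out.println(nn.delete(7L, "/a"));
    }
}
```

In this sketch the retried call is only answered from the cache once the node is active, which is the extra failover round trip described above.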

> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>            Assignee: Ming Ma
>              Labels: BB2015-05-RFC
>         Attachments: HDFS-7609-2.patch, 
> HDFS-7609-CreateEditsLogWithRPCIDs.patch, HDFS-7609.patch, 
> recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same 
> time under very high load, leaving behind about 100 million transactions in 
> the edits log. (I still have no idea why they were not rolled into the fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be 
> needed to finish, and it was loading fsedits most of the time. I also 
> tried to restart the namenode in recovery mode, but the loading speed was no 
> different.
> I looked into the stack trace and judged that it was caused by the retry cache. 
> So I set dfs.namenode.enable.retrycache to false, and the restart process 
> finished in half an hour.
> I think the retry cache is useless during startup, at least during the 
> recovery process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
