[ 
https://issues.apache.org/jira/browse/HDFS-7609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14555402#comment-14555402
 ] 

Jing Zhao commented on HDFS-7609:
---------------------------------

Spent some more time digging into the issue. Besides the scenario that Ming 
described, retry cache collisions can also happen while recording the 
{{UpdateBlocksOp}} transaction. {{UpdateBlocksOp}} is recorded for multiple 
APIs: {{fsync}}, {{abandonBlock}}, {{updatePipeline}}, and 
{{commitBlockSynchronization}}. Before 2.3, {{UpdateBlocksOp}} was also 
recorded for {{addBlock}}. Among these APIs, only {{updatePipeline}} needs to 
record the callId and clientId in the editlog. However, all the other calls 
failed to reset the callId and clientId to the dummy values, and thus recorded 
the same callId and clientId in the journal. Since {{addBlock}} is called 
heavily, this can cause a large number of collisions.
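
To make the collision concrete, here is a minimal, self-contained sketch of the 
effect described above. It is not the actual NameNode code: the class, the 
{{EditOp}} type, the dummy values, and the string cache key are all hypothetical 
stand-ins, and it assumes the ids left behind by an earlier {{updatePipeline}} 
are carried into later ops unless they are explicitly reset.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RetryCacheCollisionSketch {
    // Hypothetical placeholders meaning "this op carries no RPC ids".
    static final String DUMMY_CLIENT_ID = "";
    static final int DUMMY_CALL_ID = -1;

    /** Simplified edit-log entry: the API that logged it plus its RPC ids. */
    static final class EditOp {
        final String api;
        final String clientId;
        final int callId;

        EditOp(String api, String clientId, int callId) {
            this.api = api;
            this.clientId = clientId;
            this.callId = callId;
        }

        boolean hasRpcIds() {
            return !DUMMY_CLIENT_ID.equals(clientId) && callId != DUMMY_CALL_ID;
        }
    }

    /** Writes one updatePipeline op followed by many addBlock ops. */
    static List<EditOp> buildLog(int addBlockCalls, boolean resetIds) {
        List<EditOp> log = new ArrayList<>();
        String clientId = "client-A"; // ids legitimately set by updatePipeline
        int callId = 7;
        log.add(new EditOp("updatePipeline", clientId, callId));
        for (int i = 0; i < addBlockCalls; i++) {
            if (resetIds) {
                // The fix: ops that do not need retry handling use dummy ids.
                clientId = DUMMY_CLIENT_ID;
                callId = DUMMY_CALL_ID;
            }
            // Without the reset, the stale ids are written again.
            log.add(new EditOp("addBlock", clientId, callId));
        }
        return log;
    }

    /** Replays the log and counts ops whose key is already in the cache. */
    static int countCollisions(List<EditOp> log) {
        Map<String, String> retryCache = new HashMap<>();
        int collisions = 0;
        for (EditOp op : log) {
            if (!op.hasRpcIds()) {
                continue; // dummy ids are never added to the retry cache
            }
            String key = op.clientId + ":" + op.callId;
            if (retryCache.putIfAbsent(key, op.api) != null) {
                collisions++;
            }
        }
        return collisions;
    }

    public static void main(String[] args) {
        System.out.println("without reset: " + countCollisions(buildLog(1000, false)));
        System.out.println("with reset:    " + countCollisions(buildLog(1000, true)));
    }
}
```

In this sketch, 1000 {{addBlock}} ops written without the reset produce 1000 
collisions on replay, and none once the ids are reset to the dummy values.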

HDFS-7398 should have fixed this already.

> startup used too much time to load edits
> ----------------------------------------
>
>                 Key: HDFS-7609
>                 URL: https://issues.apache.org/jira/browse/HDFS-7609
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: namenode
>    Affects Versions: 2.2.0
>            Reporter: Carrey Zhan
>            Assignee: Ming Ma
>              Labels: BB2015-05-RFC
>         Attachments: HDFS-7609-CreateEditsLogWithRPCIDs.patch, 
> HDFS-7609.patch, recovery_do_not_use_retrycache.patch
>
>
> One day my namenode crashed because two journal nodes timed out at the same 
> time under very high load, leaving behind about 100 million transactions in 
> the edits log. (I still have no idea why they were not rolled into fsimage.)
> I tried to restart the namenode, but it showed that almost 20 hours would be 
> needed to finish, and it was loading fsedits most of the time. I also 
> tried to restart the namenode in recovery mode, but the loading speed was no 
> different.
> I looked into the stack trace and judged that it was caused by the retry 
> cache. So I set dfs.namenode.enable.retrycache to false, and the restart 
> process finished in half an hour.
> I think the retry cache is useless during startup, at least during the 
> recovery process.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
