[ https://issues.apache.org/jira/browse/HDFS-14576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887878#comment-16887878 ]
Chen Zhang commented on HDFS-14576:
-----------------------------------

Thanks [~hexiaoqiao] for your detailed description of your solution and insights.

{quote}1. Optimizing safemode leave mechanism, HDFS-14559, and [^HDFS-14186.001.patch],
{quote}
I have gone through all the discussion under HDFS-14186. Your point is that after the NameNode leaves SafeMode it still needs to process a very large number of FBRs (full block reports), which makes the NameNode load so high that it can't serve normal RPC requests (we can use lifeline to avoid the dead-node problem, so that doesn't count), so you propose leaving SafeMode later. I'm not sure I fully understand your proposal; if I have misunderstood anything, please correct me.

I think BlockReport Lease is also helpful in this case: it limits the number of concurrent block reports and will significantly reduce the load on the NameNode, so the NameNode can process normal RPCs at the same time.

{quote}2. Avoid useless block report retry to reduce load of NameNode when restart since in my own experience, there are almost 30% block report ops retry from datanode, but it has processed in NameNode view.
{quote}
Using BlockReport Lease will also reduce the chance of block report retries, because a DataNode only sends an FBR when the NameNode grants it a lease.

{quote}3. More fine-grained block than single disk of DataNode.
{quote}
I recently filed a JIRA (HDFS-14657) related to this work. The idea is quite simple and works very well in our production environment. Comments and discussion are welcome.

{quote}4. Improve efficiency of process block report RPC request when startup. (not discard timeout RPC, increase queue capacity, namenode trigger to failed report retry, etc.)
{quote}
Yep, I believe we can improve the efficiency of block report processing. Do you have any ideas on this?
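
To make the lease argument above concrete, here is a minimal sketch of lease-gated full block reports. It is an illustration only, not the HDFS implementation: the class name `BlockReportLeaseSketch`, its methods, and the `maxLeases` cap are all hypothetical, and lease expiry is left out for brevity. The point it shows is that a DataNode without a lease is told to wait instead of sending a report that the NameNode would have to queue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of lease-gated block reports; not the HDFS code. */
final class BlockReportLeaseSketch {
    private final int maxLeases;                       // assumed cap on concurrent reports
    private final Map<String, Long> leases = new HashMap<>();
    private long nextLeaseId = 1;

    BlockReportLeaseSketch(int maxLeases) {
        this.maxLeases = maxLeases;
    }

    /** Returns a nonzero lease ID, or 0 if the DataNode has to wait and ask again. */
    synchronized long requestLease(String datanodeId) {
        Long existing = leases.get(datanodeId);
        if (existing != null) {
            return existing;                           // already holds a lease
        }
        if (leases.size() >= maxLeases) {
            return 0L;                                 // too many reports in flight
        }
        long id = nextLeaseId++;
        leases.put(datanodeId, id);
        return id;
    }

    /** Accepts a report only under a valid lease, then releases that lease. */
    synchronized boolean acceptReport(String datanodeId, long leaseId) {
        Long held = leases.get(datanodeId);
        if (held == null || held != leaseId) {
            return false;                              // no lease or stale lease: reject
        }
        leases.remove(datanodeId);                     // frees a slot for the next DataNode
        return true;
    }
}
```

With a cap of 2, a third DataNode gets 0 from `requestLease` until one of the first two has had its report accepted, which is how the number of reports processed at once stays bounded.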
> Avoid block report retry and slow down namenode startup
> -------------------------------------------------------
>
>                 Key: HDFS-14576
>                 URL: https://issues.apache.org/jira/browse/HDFS-14576
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: namenode
>            Reporter: He Xiaoqiao
>            Assignee: He Xiaoqiao
>            Priority: Major
>
> During namenode startup, the load will be very high since it has to process
> every datanodes blockreport one by one. If there are hundreds datanodes block
> reports pending process, the issue will be more serious even
> #processFirstBlockReport is processed a lot more efficiently than ordinary
> block reports. Then some of datanode will retry blockreport and lengthens
> restart times. I think we should filter the block report request (via
> datanode blockreport retries) which has be processed and return directly then
> shorten down restart time. I want to state this proposal may be obvious only
> for large cluster.
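
The filtering idea in the quoted issue description (return directly when a retried block report has already been processed) could look roughly like the sketch below. Everything here is an assumption made for illustration: the class `ProcessedReportFilterSketch`, the idea of keying on a storage ID plus a per-report ID, and the method names do not come from the issue or from the HDFS code base.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of skipping retried block reports; not the HDFS code. */
final class ProcessedReportFilterSketch {
    // Assumed bookkeeping: last report ID fully processed for each storage.
    private final Map<String, Long> lastProcessed = new HashMap<>();
    private int expensiveRuns = 0;

    /**
     * Returns true if the report was processed now, false if it was a retry
     * of a report that was already processed and could be answered directly.
     */
    synchronized boolean handleReport(String storageId, long reportId) {
        Long last = lastProcessed.get(storageId);
        if (last != null && last == reportId) {
            return false;                  // duplicate retry: skip the expensive work
        }
        expensiveRuns++;                   // stands in for the real block processing
        lastProcessed.put(storageId, reportId);
        return true;
    }

    /** Number of times the expensive path actually ran. */
    synchronized int expensiveRuns() {
        return expensiveRuns;
    }
}
```

A DataNode that times out and resends the same report ID for the same storage then costs the NameNode a map lookup instead of a second full pass over the block list.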