[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264772#comment-15264772 ]
Colin Patrick McCabe commented on HDFS-10301:
---------------------------------------------

bq. You can think of it as a new operation SyncStorages, which does just that - updates NameNode's knowledge of DN's storages. I combined this operation with the first br-RPC. One can combine it with any other call, same as you propose to combine it with the heartbeat. Except it seems a poor idea, since we don't want to wait for removal of thousands of replicas on a heartbeat.

Thanks for explaining your proposal a little bit more. I agree that enumerating all the storages in the first block report RPC is a fairly simple way to handle this, and shouldn't add too much size to the FBR. It seems like a better idea than adding it to the heartbeat, as I proposed.

In the short term, however, I would prefer the current patch, since it involves no RPC changes and doesn't require all the DataNodes to be upgraded before it can work.

> BlockReport retransmissions may lead to storages falsely being declared zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch, HDFS-10301.01.patch, zombieStorageLogs.rtf
>
>
> When the NameNode is busy, a DataNode can time out sending a block report. It then
> sends the block report again. The NameNode, while processing these two reports
> at the same time, can interleave the processing of storages from different reports.
> This screws up the blockReportId field, which makes the NameNode think that some
> storages are zombie. Replicas from zombie storages are immediately removed,
> causing missing blocks.
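
The interleaving described in the issue summary can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions, not the NameNode code: the class `ZombieStorageSketch`, its methods, and the storage names `s1` and `s2` are hypothetical. It models the NameNode as a map from storage ID to the ID of the last block report that touched the storage, and treats a storage as zombie if that ID differs from the report that just finished. It also sketches the alternative discussed in the comment, in which the first block report RPC enumerates all storages, so the zombie check no longer depends on processing order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified, hypothetical model of zombie-storage detection; not HDFS code.
class ZombieStorageSketch {
    // Storage ID -> ID of the last block report that touched this storage.
    private final Map<String, Long> lastReportId = new LinkedHashMap<>();

    // Process one per-storage report that belongs to block report `reportId`.
    public void processStorageReport(String storageId, long reportId) {
        lastReportId.put(storageId, reportId);
    }

    // ID-based check, run after the final per-storage report of `reportId`:
    // any storage last touched by a different report is considered zombie.
    public List<String> zombiesById(long reportId) {
        List<String> zombies = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastReportId.entrySet()) {
            if (e.getValue() != reportId) {
                zombies.add(e.getKey());
            }
        }
        return zombies;
    }

    // Enumeration-based check (assumed shape of the proposal): the DataNode
    // lists all of its storages up front; only unlisted storages are zombie.
    public List<String> zombiesByEnumeration(Set<String> reportedStorages) {
        List<String> zombies = new ArrayList<>();
        for (String storageId : lastReportId.keySet()) {
            if (!reportedStorages.contains(storageId)) {
                zombies.add(storageId);
            }
        }
        return zombies;
    }

    // Report 1 times out and is retransmitted as report 2; the NameNode
    // processes the per-storage reports of both in an interleaved order.
    public static ZombieStorageSketch interleaved() {
        ZombieStorageSketch nn = new ZombieStorageSketch();
        nn.processStorageReport("s1", 2L); // retransmission, storage s1
        nn.processStorageReport("s1", 1L); // original, storage s1 (arrives late)
        nn.processStorageReport("s2", 1L); // original, storage s2
        nn.processStorageReport("s2", 2L); // retransmission, storage s2 (last)
        return nn;
    }

    // The same two reports processed strictly one after the other.
    public static ZombieStorageSketch inOrder() {
        ZombieStorageSketch nn = new ZombieStorageSketch();
        nn.processStorageReport("s1", 1L);
        nn.processStorageReport("s2", 1L);
        nn.processStorageReport("s1", 2L);
        nn.processStorageReport("s2", 2L);
        return nn;
    }

    public static void main(String[] args) {
        Set<String> all = new HashSet<>(Arrays.asList("s1", "s2"));
        System.out.println(interleaved().zombiesById(2L));           // [s1]
        System.out.println(inOrder().zombiesById(2L));               // []
        System.out.println(interleaved().zombiesByEnumeration(all)); // []
    }
}
```

In the interleaved run, storage `s1` is healthy but ends up tagged with the older report ID, so the ID-based check flags it. The enumeration-based check gives the same answer for both orderings because it compares against the set of storages the DataNode actually listed.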