[ https://issues.apache.org/jira/browse/HDFS-10301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427387#comment-15427387 ]
Konstantin Shvachko commented on HDFS-10301:
--------------------------------------------

Took some time to look into heartbeat processing and to consult with Vinitha. Heartbeats currently have logic to remove failed storages reported by DNs via {{VolumeFailureSummary}}. This happens in three steps (a rough sketch of the flow is appended at the end of this message):
# If a DN reports a failed volume in a heartbeat (HDFS-7604), the NN marks the corresponding {{DatanodeStorageInfo}} as FAILED. See {{DatanodeDescriptor.updateFailedStorage()}}.
# When the {{HeartbeatManager.Monitor}} kicks in, it checks the FAILED flag on the storage and calls {{removeBlocksAssociatedTo(failedStorage)}}. But it does not remove the storage itself (HDFS-7208).
# On the next heartbeat the DN will not report the storage that was previously reported as failed. This triggers the NN to prune the storage via {{DatanodeDescriptor.pruneStorageMap()}}, because it no longer contains replicas (HDFS-7596).

Essentially we already have a dual mechanism for deleting storages: one through heartbeats, the other via block reports. So we can remove the redundancy. [~daryn]'s idea simplifies a lot of code, does not require changes to any RPCs, is fully backward compatible, and eliminates the notion of zombie storage, which solves the interleaving report problem. I think we should go for it.

Initially I was concerned about removing storages in heartbeats, but:
# We already do it anyway.
# All heartbeats hold FSN.readLock, whether they carry failed storages or not. The scanning of the storages takes a lock on the corresponding {{DatanodeDescriptor.storageMap}}, which is fine-grained.
# Storages are not actually removed in a heartbeat, only flagged as FAILED. The replica removal is performed by a background Monitor.
# If we decide to implement lock-less heartbeats, we can move the storage reporting logic into a separate RPC periodically sent by DNs, independently of and less frequently than regular heartbeats.

> BlockReport retransmissions may lead to storages falsely being declared
> zombie if storage report processing happens out of order
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10301
>                 URL: https://issues.apache.org/jira/browse/HDFS-10301
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.6.1
>            Reporter: Konstantin Shvachko
>            Assignee: Vinitha Reddy Gankidi
>            Priority: Critical
>             Fix For: 2.7.4
>
>         Attachments: HDFS-10301.002.patch, HDFS-10301.003.patch,
> HDFS-10301.004.patch, HDFS-10301.005.patch, HDFS-10301.006.patch,
> HDFS-10301.007.patch, HDFS-10301.008.patch, HDFS-10301.009.patch,
> HDFS-10301.01.patch, HDFS-10301.010.patch, HDFS-10301.011.patch,
> HDFS-10301.012.patch, HDFS-10301.013.patch, HDFS-10301.branch-2.7.patch,
> HDFS-10301.branch-2.patch, HDFS-10301.sample.patch, zombieStorageLogs.rtf
>
> When the NameNode is busy, a DataNode can time out sending a block report, and
> then it sends the block report again. The NameNode, while processing these two
> reports at the same time, can interleave processing of storages from different
> reports. This screws up the blockReportId field, which makes the NameNode think
> that some storages are zombie. Replicas from zombie storages are immediately
> removed, causing missing blocks.
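For anyone who wants to trace the flow without opening the source, below is a minimal, self-contained sketch of the three steps above. It is not the actual Hadoop code: the fields, helper types, and the simulation in {{main}} are simplified assumptions, and only the method names ({{updateFailedStorage}}, {{removeBlocksAssociatedTo}}, {{pruneStorageMap}}) mirror the real ones referenced above.

{code:java}
import java.util.*;

/** Simplified model of the failed-storage removal flow described above.
 *  All types and fields are illustrative; only the method names mirror HDFS. */
public class FailedStorageFlowSketch {

  enum State { NORMAL, FAILED }

  static class DatanodeStorageInfo {
    final String storageId;
    State state = State.NORMAL;
    final Set<String> blocks = new HashSet<>();
    DatanodeStorageInfo(String id) { this.storageId = id; }
  }

  static class DatanodeDescriptor {
    // Keyed by storageID; guarded by its own monitor (the fine-grained lock).
    final Map<String, DatanodeStorageInfo> storageMap = new HashMap<>();

    // Step 1: heartbeat with a VolumeFailureSummary -> mark the storage FAILED.
    void updateFailedStorage(Set<String> failedStorageIds) {
      synchronized (storageMap) {
        for (String id : failedStorageIds) {
          DatanodeStorageInfo s = storageMap.get(id);
          if (s != null) s.state = State.FAILED;
        }
      }
    }

    // Step 3: prune storages that are no longer reported and hold no replicas.
    void pruneStorageMap(Set<String> reportedStorageIds) {
      synchronized (storageMap) {
        storageMap.values().removeIf(
            s -> !reportedStorageIds.contains(s.storageId) && s.blocks.isEmpty());
      }
    }
  }

  // Step 2: the background monitor removes replicas of FAILED storages,
  // but deliberately leaves the storage entry itself in place.
  static class HeartbeatMonitor {
    void heartbeatCheck(DatanodeDescriptor dn) {
      synchronized (dn.storageMap) {
        for (DatanodeStorageInfo s : dn.storageMap.values()) {
          if (s.state == State.FAILED) {
            removeBlocksAssociatedTo(s);
          }
        }
      }
    }
    void removeBlocksAssociatedTo(DatanodeStorageInfo failedStorage) {
      failedStorage.blocks.clear();   // replicas gone, storage entry stays
    }
  }

  public static void main(String[] args) {
    DatanodeDescriptor dn = new DatanodeDescriptor();
    DatanodeStorageInfo good = new DatanodeStorageInfo("DS-good");
    DatanodeStorageInfo bad  = new DatanodeStorageInfo("DS-bad");
    good.blocks.add("blk_1");
    bad.blocks.add("blk_2");
    dn.storageMap.put(good.storageId, good);
    dn.storageMap.put(bad.storageId, bad);

    // 1. Heartbeat reports DS-bad as failed.
    dn.updateFailedStorage(Collections.singleton("DS-bad"));
    // 2. Background monitor drops its replicas but keeps the storage.
    new HeartbeatMonitor().heartbeatCheck(dn);
    // 3. Next heartbeat no longer reports DS-bad, so it is pruned.
    dn.pruneStorageMap(Collections.singleton("DS-good"));

    System.out.println(dn.storageMap.keySet());   // prints [DS-good]
  }
}
{code}

Running the sketch prints {{[DS-good]}}: the failed storage loses its replicas in step 2 and is pruned in step 3, while the healthy storage survives, all without any storage removal happening inside the heartbeat RPC itself.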