[ https://issues.apache.org/jira/browse/HDFS-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375781#comment-14375781 ]
Yi Liu commented on HDFS-7960:
------------------------------

This is a good fix and improvement. Some comments:

*1.* In {{BlockManager}}, the logic for checking zombie datanode storages has an issue:
{code}
if (context != null) {
  storageInfo.setLastBlockReportId(context.getReportId());
  if (lastStorageInRpc) {
    int rpcsSeen = node.updateBlockReportContext(context);
    if (rpcsSeen >= context.getTotalRpcs()) {
      List<DatanodeStorageInfo> zombies = node.removeZombieStorages();
      if (zombies.isEmpty()) {
        ...
{code}
In the patch, *rpcsSeen* is calculated on the NN by counting all RPCs of the same block report, which is not safe when the report is split across RPCs. {{DatanodeProtocol#blockReport}} is {{@Idempotent}}, so if a retry happens, {{if (rpcsSeen >= context.getTotalRpcs())}} can become *true* even though some datanode storages have not yet sent their splits of the report. In that case those storages will be treated as zombies and wrongly removed from the NN. I suggest checking that all RPC ids of the block report have been received before checking for zombie storages (a rough sketch is at the end of this message).

*2.* Another comment is on {{removeZombieReplicas}}:
{code}
removeStoredBlock(block, zombie.getDatanodeDescriptor());
{code}
While removing the stored block, we'd better remove it from {{InvalidateBlocks}} too. How about calling {{removeBlocksAssociatedTo(final DatanodeDescriptor node)}}? That would also save some lines of code (see the second sketch below).

> The full block report should prune zombie storages even if they're not empty
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-7960
>                 URL: https://issues.apache.org/jira/browse/HDFS-7960
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Lei (Eddy) Xu
>            Assignee: Colin Patrick McCabe
>            Priority: Critical
>         Attachments: HDFS-7960.002.patch, HDFS-7960.003.patch, HDFS-7960.004.patch, HDFS-7960.005.patch, HDFS-7960.006.patch
>
>
> The full block report should prune zombie storages even if they're not empty. We have seen cases in production where zombie storages have not been pruned subsequent to HDFS-7575. This could arise any time the NameNode thinks there is a block in some old storage which is actually not there. In this case, the block will not show up in the "new" storage (once old is renamed to new) and the old storage will linger forever as a zombie, even with the HDFS-7596 fix applied. This also happens with datanode hotplug, when a drive is removed. In this case, an entire storage (volume) goes away but the blocks do not show up in another storage on the same datanode.
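To make point 1 concrete, here is a rough, untested sketch (not the actual patch) of how the NN side could remember which RPC ids of a block report it has already seen, so that an {{@Idempotent}} retry of the same RPC cannot inflate the count while other storages still have outstanding report splits. It assumes {{BlockReportContext}} exposes the index of the current RPC via {{getCurRpc()}}; the field names are placeholders for illustration only.
{code}
// Hypothetical members of DatanodeDescriptor, sketched for illustration only.
private long curBlockReportId = 0;
private java.util.BitSet curBlockReportRpcsSeen = null;

/**
 * Record one RPC of a (possibly split) block report and return how many
 * distinct RPCs of this report have been received so far.
 */
synchronized int updateBlockReportContext(BlockReportContext context) {
  if (curBlockReportId != context.getReportId()) {
    // A new block report has started; forget the RPCs of the previous one.
    curBlockReportId = context.getReportId();
    curBlockReportRpcsSeen = new java.util.BitSet(context.getTotalRpcs());
  }
  // A retry of the same RPC just sets the same bit again, so the count only
  // grows when a genuinely new split of the report arrives.
  curBlockReportRpcsSeen.set(context.getCurRpc());
  return curBlockReportRpcsSeen.cardinality();
}
{code}
With something like this, the caller's existing {{rpcsSeen >= context.getTotalRpcs()}} check only fires once every split of the report has actually been received at least once.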
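For point 2, one possible shape of the change, modeled on the existing per-datanode {{removeBlocksAssociatedTo}}, would be a storage-level variant like the sketch below. This is only an illustration of the suggestion, not a quote from any patch, and it assumes {{InvalidateBlocks}} offers a per-block {{remove(DatanodeInfo, Block)}}; adjust to whatever the class actually exposes.
{code}
/**
 * Sketch only: remove every replica recorded on a zombie storage and also drop
 * any pending invalidation for it, so the NN does not later ask the DN to
 * delete a block it no longer tracks.
 */
void removeBlocksAssociatedTo(final DatanodeStorageInfo storageInfo) {
  final Iterator<? extends Block> it = storageInfo.getBlockIterator();
  final DatanodeDescriptor node = storageInfo.getDatanodeDescriptor();
  while (it.hasNext()) {
    Block block = it.next();
    removeStoredBlock(block, node);
    // The extra step point 2 asks for: clear the queued invalidation as well.
    invalidateBlocks.remove(node, block);
  }
}
{code}
Calling something like this in place of the loop in {{removeZombieReplicas}} would also save the lines mentioned above.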