[ https://issues.apache.org/jira/browse/HDFS-15634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215676#comment-17215676 ]
Stephen O'Donnell commented on HDFS-15634:
------------------------------------------

The issue you have seen is a fairly extreme example - not too many clusters will have 200 nodes decommissioned at the same time, I suspect. The empty-node problem is a valid concern on smaller clusters, though, and in the decommission-recommission case the empty node may never catch up with the other nodes, as the default block placement policy picks nodes randomly.

The long lock hold is a problem even when a node (or perhaps a rack) goes dead unexpectedly, so I think it would be better to fix that generally rather than doing something special for decommission. I am also not really comfortable changing the current "non destructive" decommission flow to something that removes the blocks from the DN.

If you look at HeartBeatManager.heartbeatCheck(...), it seems to handle only 1 DN as dead on each check interval. The check interval is either 5 minutes by default, or 30 seconds if "dfs.namenode.avoid.write.stale.datanode" is true. It then ultimately calls BlockManager.removeBlocksAssociatedTo(...), which does the expensive work of removing the blocks. In that method, I wonder if we could drop and re-take the write lock periodically so it does not hold the lock for too long?

{code}
  /** Remove the blocks associated to the given datanode. */
  void removeBlocksAssociatedTo(final DatanodeDescriptor node) {
    providedStorageMap.removeDatanode(node);
    for (DatanodeStorageInfo storage : node.getStorageInfos()) {
      final Iterator<BlockInfo> it = storage.getBlockIterator();
      // Add the BlockInfos to a new collection, as the
      // returned iterator is not modifiable.
      Collection<BlockInfo> toRemove = new ArrayList<>();
      while (it.hasNext()) {
        toRemove.add(it.next());
      }

      // Could we drop and re-take the write lock in this loop every 1000 blocks?
      for (BlockInfo b : toRemove) {
        removeStoredBlock(b, node);
      }
    }
    // Remove all pending DN messages referencing this DN.
    pendingDNMessages.removeAllMessagesForDatanode(node);
    node.resetBlocks();
    invalidateBlocks.remove(node);
  }
{code}

We sometimes see nodes with 5M, 10M or even more blocks, so this would help in general, not just for decommission. I am not sure whether dropping and re-taking the write lock in this scenario would have any negative consequences, though. (A standalone sketch of the batched lock-yield idea follows the quoted description below.)

> Invalidate block on decommissioning DataNode after replication
> --------------------------------------------------------------
>
>                 Key: HDFS-15634
>                 URL: https://issues.apache.org/jira/browse/HDFS-15634
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>            Reporter: Fengnan Li
>            Assignee: Fengnan Li
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: write lock.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> Right now, when a DataNode starts decommissioning, the Namenode marks it as decommissioning, its blocks are replicated over to different DataNodes, and it is then marked as decommissioned. These blocks are not touched afterwards, since they are not counted as live replicas.
> Proposal: Invalidate these blocks once they are replicated and there are enough live replicas in the cluster.
> Reason: A recent shutdown of decommissioned datanodes to finish the flow caused a Namenode latency spike, since the Namenode needed to remove all of their blocks from its memory, and this step requires holding the write lock. If we had gradually invalidated these blocks, the deletion would have been much easier and faster.
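For illustration only, here is a minimal, self-contained sketch of the batched lock-yield pattern discussed in the comment above. It uses a plain java.util.concurrent ReentrantReadWriteLock and hypothetical names (BatchedRemovalSketch, removeOne, BATCH_SIZE), not the actual FSNamesystem lock or BlockManager code, so it shows the shape of the idea rather than a patch.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/**
 * Sketch of "drop and re-take the write lock every N items".
 * All names here are placeholders, not HDFS code.
 */
public class BatchedRemovalSketch {
  // Hypothetical yield interval, mirroring the "every 1000 blocks"
  // question in the comment above.
  private static final int BATCH_SIZE = 1000;

  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  void removeAll(List<Long> blockIds) {
    lock.writeLock().lock();
    try {
      int processed = 0;
      for (Long id : blockIds) {
        removeOne(id);
        if (++processed % BATCH_SIZE == 0) {
          // Briefly release the write lock so queued readers and
          // writers (e.g. RPC handlers) can make progress, then
          // re-acquire it and continue where we left off.
          lock.writeLock().unlock();
          lock.writeLock().lock();
        }
      }
    } finally {
      lock.writeLock().unlock();
    }
  }

  private void removeOne(Long blockId) {
    // Placeholder for the per-block removal work
    // (removeStoredBlock(...) in the real method).
  }

  public static void main(String[] args) {
    List<Long> ids = new ArrayList<>();
    for (long i = 0; i < 5000; i++) {
      ids.add(i);
    }
    new BatchedRemovalSketch().removeAll(ids);
    System.out.println("done");
  }
}
{code}

The main caveat, and probably the crux of the "negative consequences" question above, is that every yield point becomes a moment where other threads can observe the half-removed state, so the in-memory structures would need to be internally consistent at each batch boundary.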