[ https://issues.apache.org/jira/browse/HDFS-14648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16972358#comment-16972358 ]
Lisheng Sun commented on HDFS-14648:
------------------------------------

{quote}
2) The line newDeadNodes.retainAll(deadNodes.values()); should not be correct, it will let newDeadNodes be same with old deadnodes.
{code:java}
+  public synchronized Set<DatanodeInfo> getDeadNodesToDetect() {
+    // remove the dead nodes who doesn't have any inputstream first
+    Set<DatanodeInfo> newDeadNodes = new HashSet<DatanodeInfo>();
+    for (HashSet<DatanodeInfo> datanodeInfos : dfsInputStreamNodes.values()) {
+      newDeadNodes.addAll(datanodeInfos);
+    }
+
+    newDeadNodes.retainAll(deadNodes.values());
+
+    for (DatanodeInfo datanodeInfo : deadNodes.values()) {
+      if (!newDeadNodes.contains(datanodeInfo)) {
+        deadNodes.remove(datanodeInfo);
+      }
+    }
+    return newDeadNodes;
+  }
{code}
{quote}
Thanks [~linyiqun] for the thorough review comments. In the end, newDeadNodes should indeed be the same as the old deadNodes in DeadNodeDetector#clearAndGetDetectedDeadNodes. I have updated the patch accordingly and uploaded the v008 patch. Thank you a lot, [~linyiqun].

> DeadNodeDetector basic model
> ----------------------------
>
>                 Key: HDFS-14648
>                 URL: https://issues.apache.org/jira/browse/HDFS-14648
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>            Reporter: Lisheng Sun
>            Assignee: Lisheng Sun
>            Priority: Major
>        Attachments: HDFS-14648.001.patch, HDFS-14648.002.patch,
>                     HDFS-14648.003.patch, HDFS-14648.004.patch,
>                     HDFS-14648.005.patch, HDFS-14648.006.patch,
>                     HDFS-14648.007.patch
>
> This Jira constructs the DeadNodeDetector state machine model. It implements the following behavior:
> # When a DFSInputStream is opened, a BlockReader is opened. If a DataNode of the block is found to be inaccessible, that DataNode is put into DeadNodeDetector#deadnode. (HDFS-14649 will optimize this part: when a DataNode is not accessible, it is likely that the replica has been removed from it, so this needs to be confirmed by re-probing and requires higher-priority processing.)
> # DeadNodeDetector periodically probes the nodes in DeadNodeDetector#deadnode; if a probe succeeds, the node is removed from DeadNodeDetector#deadnode. Continuous detection of dead nodes is necessary because a DataNode may need to rejoin the cluster after a service restart or machine repair; without this re-probe mechanism, the DataNode could be excluded permanently.
> # DeadNodeDetector#dfsInputStreamNodes records the DataNodes used by each DFSInputStream. When a DFSInputStream is closed, its entry is removed from DeadNodeDetector#dfsInputStreamNodes.
> # Every time the global dead nodes are fetched, DeadNodeDetector#deadnode is updated: the new DeadNodeDetector#deadnode equals the intersection of the old DeadNodeDetector#deadnode and the DataNodes referenced by DeadNodeDetector#dfsInputStreamNodes.
> # DeadNodeDetector has a switch that is off by default. When it is off, each DFSInputStream still uses its own local dead node list.
> # This feature has been used in the XIAOMI production environment for a long time and has reduced HBase read stalls caused by hung nodes.
> # Just turn on the DeadNodeDetector switch and it can be used directly; there are no other restrictions. If you do not want to use DeadNodeDetector, simply turn it off.
> {code:java}
> if (sharedDeadNodesEnabled && deadNodeDetector == null) {
>   deadNodeDetector = new DeadNodeDetector(name);
>   deadNodeDetectorThr = new Daemon(deadNodeDetector);
>   deadNodeDetectorThr.start();
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
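The intersection semantics settled on in the review discussion can be sketched in isolation as follows. This is a hypothetical standalone sketch, not the actual patch code: String stands in for DatanodeInfo, the two maps stand in for DeadNodeDetector#deadnode and DeadNodeDetector#dfsInputStreamNodes, and the pruning loop uses an iterator so that removing from the map's value view while iterating does not throw ConcurrentModificationException.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the intersection logic discussed above.
public class DeadNodeIntersectionSketch {
  // Nodes currently considered dead, keyed by datanode id.
  final Map<String, String> deadNodes = new HashMap<>();
  // For each open input stream, the set of DataNodes it references.
  final Map<String, Set<String>> dfsInputStreamNodes = new HashMap<>();

  public synchronized Set<String> clearAndGetDetectedDeadNodes() {
    // Collect every DataNode still referenced by some open input stream.
    Set<String> newDeadNodes = new HashSet<>();
    for (Set<String> nodes : dfsInputStreamNodes.values()) {
      newDeadNodes.addAll(nodes);
    }
    // Intersect with the previously detected dead nodes: a node stays dead
    // only if it was already dead AND some stream still uses it.
    newDeadNodes.retainAll(deadNodes.values());
    // Prune stale entries through the iterator; Iterator.remove() is the
    // safe way to delete from a collection while iterating over it.
    Iterator<String> it = deadNodes.values().iterator();
    while (it.hasNext()) {
      if (!newDeadNodes.contains(it.next())) {
        it.remove();
      }
    }
    return newDeadNodes;
  }

  public static void main(String[] args) {
    DeadNodeIntersectionSketch d = new DeadNodeIntersectionSketch();
    d.deadNodes.put("dn1", "dn1");
    d.deadNodes.put("dn2", "dn2"); // no open stream references dn2 any more
    d.dfsInputStreamNodes.put("stream-1",
        new HashSet<>(Arrays.asList("dn1", "dn3")));
    Set<String> detected = d.clearAndGetDetectedDeadNodes();
    System.out.println(detected);             // [dn1]
    System.out.println(d.deadNodes.keySet()); // [dn1] -- dn2 was pruned
  }
}
```

In this sketch, dn3 is referenced by a stream but was never marked dead, and dn2 was dead but is no longer referenced by any stream; only dn1 survives the intersection, matching the description that the new dead node set equals the intersection of the old set and the nodes tracked by dfsInputStreamNodes.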