[ 
https://issues.apache.org/jira/browse/HDFS-15809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282879#comment-17282879
 ] 

Lisheng Sun commented on HDFS-15809:
------------------------------------

Hi [~LiJinglun], thanks for catching this case.
{quote}
DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead node 
set and adds 30 nodes to the deadNodesProbeQueue. But the order is the same as 
the last time, so the 30 nodes that have already been probed are added to the 
queue again.
{quote}
This may happen only when a large number of nodes hang up in a very large 
cluster. But the probability of this situation should be very small, because 
the background thread keeps probing the dead nodes continuously, not just when 
checkDeadNodes() is called.
The solution I thought of is:
1. First shuffle all nodes in the dead node set, then poll a fixed number of 
nodes into the deadNodesProbeQueue. This avoids picking the same nodes every 
time.
2. Adjust the queue size according to the cluster size.
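A minimal sketch of point 1, assuming a simplified dead node list and a bounded probe queue (the field names and types here are illustrative stand-ins, not the actual DeadNodeDetector internals):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ShuffledProbeSelection {
    // Hypothetical stand-ins for DeadNodeDetector's internal state.
    final List<String> deadNodes = new ArrayList<>();
    final BlockingQueue<String> deadNodesProbeQueue =
        new ArrayBlockingQueue<>(100);

    // Shuffle a snapshot of the dead node set before offering a fixed
    // number of nodes, so successive rounds do not always re-enqueue
    // the same prefix of the set.
    void checkDeadNodes(int maxToEnqueue) {
        List<String> candidates = new ArrayList<>(deadNodes);
        Collections.shuffle(candidates);
        int enqueued = 0;
        for (String node : candidates) {
            if (enqueued >= maxToEnqueue
                || !deadNodesProbeQueue.offer(node)) {
                break;
            }
            enqueued++;
        }
    }
}
```

With a shuffle, every node in the dead set eventually has a chance to be probed, even if the first entries in iteration order remain dead forever.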

I don't fully understand your solution for this issue. Could you describe your 
solution first? Thank you.

> DeadNodeDetector doesn't remove live nodes from dead node set.
> --------------------------------------------------------------
>
>                 Key: HDFS-15809
>                 URL: https://issues.apache.org/jira/browse/HDFS-15809
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Jinglun
>            Assignee: Jinglun
>            Priority: Major
>         Attachments: HDFS-15809.001.patch
>
>
> We found the dead node detector might never remove the alive nodes from the 
> dead node set in a big cluster. For example:
>  # 200 nodes are added to the dead node set by DeadNodeDetector.
>  # DeadNodeDetector#checkDeadNodes() adds 100 nodes to the 
> deadNodesProbeQueue because the queue's length limit is 100.
>  # The probe threads start working and probe 30 nodes.
>  # DeadNodeDetector#checkDeadNodes() is scheduled again. It iterates the dead 
> node set and adds 30 nodes to the deadNodesProbeQueue. But the order is the 
> same as the last time, so the 30 nodes that have already been probed are 
> added to the queue again.
>  # Repeat steps 3 and 4. But we always add the first 30 nodes from the dead 
> node set. If they are all dead, then the live nodes behind them can never be 
> recovered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
