[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14983986#comment-14983986 ]
Hudson commented on HDFS-4937:
------------------------------

SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #560 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/560/])
Revert "HDFS-4937. ReplicationMonitor can infinite-loop in (yliu: rev 7fd6416759cbb202ed21b47d28c1587e04a5cdc6)
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt


> ReplicationMonitor can infinite-loop in
> BlockPlacementPolicyDefault#chooseRandom()
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4937
>                 URL: https://issues.apache.org/jira/browse/HDFS-4937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>              Labels: BB2015-05-TBR
>             Fix For: 3.0.0, 2.7.2
>
>         Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the
> network topology is updated. If the refresh happens at the right moment, the
> replication monitor thread may get stuck in the while loop of
> {{chooseRandom()}}. This is because the cached cluster size is used in the
> termination check of the loop. This usually happens when a block with a high
> replication factor is being processed. Since replicas/rack is also calculated
> beforehand, no node choice may satisfy the goodness criteria if the refresh
> removed racks. All nodes will end up in the excluded list, but the size of
> that list will still be less than the cached cluster size, so the thread will
> loop infinitely. This was observed in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
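
The failure mode in the issue description can be illustrated with a small toy model. This is only a sketch, not the code in BlockPlacementPolicyDefault: the class ChooseRandomSketch, its method signature, the Outcome enum and the node names are all hypothetical, and the iteration cap exists only so that the demonstration terminates instead of hanging. The second call, which re-reads the current node count, is shown purely as a contrast and is not meant to represent the attached patches.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Toy model of the loop described in HDFS-4937; NOT the Hadoop implementation.
public class ChooseRandomSketch {

    // STUCK stands in for "would loop forever" (the iteration cap was hit).
    enum Outcome { CHOSEN, EXHAUSTED, STUCK }

    /**
     * Draws random nodes until one is acceptable or the exit check fires.
     * liveNodes must be non-empty. sizeForExitCheck is the cluster size the
     * loop compares the excluded set against (hypothetical parameter).
     */
    static Outcome chooseRandom(List<String> liveNodes, Set<String> acceptable,
                                int sizeForExitCheck, int maxIterations) {
        Set<String> excluded = new HashSet<>();
        Random random = new Random(42);
        int iterations = 0;
        // Exit check: give up once every node we believe exists is excluded.
        while (excluded.size() < sizeForExitCheck) {
            if (++iterations > maxIterations) {
                return Outcome.STUCK; // demo-only guard against spinning
            }
            String candidate = liveNodes.get(random.nextInt(liveNodes.size()));
            if (acceptable.contains(candidate)) {
                return Outcome.CHOSEN;
            }
            // A set cannot grow beyond the number of live nodes.
            excluded.add(candidate);
        }
        return Outcome.EXHAUSTED;
    }

    public static void main(String[] args) {
        // After the refresh only 3 nodes remain, and none passes the
        // "goodness" criteria that were computed before the refresh.
        List<String> liveNodes = Arrays.asList("n1", "n2", "n3");
        Set<String> acceptable = Collections.emptySet();
        int cachedSize = 10; // size captured before the refresh

        // Stale size: excluded tops out at 3, and 3 < 10 never becomes false.
        System.out.println(chooseRandom(liveNodes, acceptable, cachedSize, 10_000));
        // Current size: the loop exits once all 3 live nodes are excluded.
        System.out.println(chooseRandom(liveNodes, acceptable, liveNodes.size(), 10_000));
    }
}
```

Run as written, this prints STUCK for the stale size and EXHAUSTED for the current size. The point is only that the excluded set is bounded by the number of live nodes, so comparing it against a larger, cached count can never end the loop.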