[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14982973#comment-14982973 ]
Hudson commented on HDFS-4937:
------------------------------

FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #607 (See [https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/607/])
HDFS-4937. ReplicationMonitor can infinite-loop in (kihwal: rev 43539b5ff4ac0874a8a454dc93a2a782b0e0ea8f)
* hadoop-hdfs-project/hadoop-hdfs/CHANGES.txt
* hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java


> ReplicationMonitor can infinite-loop in
> BlockPlacementPolicyDefault#chooseRandom()
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4937
>                 URL: https://issues.apache.org/jira/browse/HDFS-4937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>              Labels: BB2015-05-TBR
>             Fix For: 3.0.0, 2.7.2
>
>         Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch
>
>
> When a large number of nodes are removed by refreshing the node lists, the
> network topology is updated. If the refresh happens at the right moment, the
> replication monitor thread may get stuck in the while loop of
> {{chooseRandom()}}. This is because the cached cluster size is used in the
> termination check of the loop. This usually happens when a block with a high
> replication factor is being processed. Since replicas/rack is also calculated
> beforehand, no node may satisfy the goodness criteria if the refresh removed
> racks. All remaining nodes will end up in the excluded list, but the size of
> that list will still be less than the cached cluster size, so the thread
> loops forever. This was observed in a production environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
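
The termination problem in the quoted description can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in BlockPlacementPolicyDefault: the class {{StaleClusterSizeSketch}}, the method {{runLoop}}, and the {{maxIterations}} cap are all hypothetical, and the goodness check is modeled as always rejecting the candidate, which corresponds to the case where replicas/rack was computed before racks were removed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Hypothetical model of the loop; not the actual HDFS implementation. */
class StaleClusterSizeSketch {

    /**
     * Simulates a random-selection loop in which every candidate is rejected.
     *
     * @param liveNodes         nodes still present after the topology refresh
     * @param cachedClusterSize cluster size captured before the refresh
     * @param recheckLiveSize   if true, compare against the live node count
     *                          instead of the cached one
     * @param maxIterations     safety cap so the demo itself always returns
     * @return number of iterations executed (equals maxIterations if the
     *         loop never reached its own exit condition)
     */
    static int runLoop(List<String> liveNodes, int cachedClusterSize,
                       boolean recheckLiveSize, int maxIterations) {
        if (liveNodes.isEmpty()) {
            return 0; // nothing to choose from
        }
        Set<String> excluded = new HashSet<>();
        Random rng = new Random(42); // fixed seed for repeatable runs
        int iterations = 0;
        while (iterations < maxIterations) {
            int limit = recheckLiveSize ? liveNodes.size() : cachedClusterSize;
            if (excluded.size() >= limit) {
                break; // every node has been tried; give up
            }
            String candidate = liveNodes.get(rng.nextInt(liveNodes.size()));
            // Assumption: the candidate fails the goodness check, so it is
            // only added to the excluded set and never chosen.
            excluded.add(candidate);
            iterations++;
        }
        return iterations;
    }
}
```

With three live nodes and a cached size of ten, the excluded set can never grow past three, so the cached-size variant only stops at the safety cap, while the variant that re-reads the live size exits once all three nodes have been excluded.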