[ https://issues.apache.org/jira/browse/HDFS-4937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14989859#comment-14989859 ]
Kihwal Lee commented on HDFS-4937:
----------------------------------

First of all, the precommit build ran 4,075 test cases, so I think it ran all of them this time. The test failures are not related to the patch. I've rerun the failed tests and only {{TestSeveralNameNodes}} was failing occasionally. It was timing out waiting for a thread to finish writing. This test has been failing in other precommit builds as well. When I increased the timeout, it passed 100% of the time. I will file a jira for this.

{panel}
-------------------------------------------------------
 T E S T S
-------------------------------------------------------
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 62.298 sec - in org.apache.hadoop.hdfs.server.datanode.TestDataNodeHotSwapVolumes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 12.295 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestEditLogTailer
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 157.484 sec - in org.apache.hadoop.hdfs.server.namenode.ha.TestSeveralNameNodes
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestLeaseRecovery2
Tests run: 7, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 73.445 sec - in org.apache.hadoop.hdfs.TestLeaseRecovery2
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Tests run: 10, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 98.315 sec - in org.apache.hadoop.hdfs.TestDFSStripedOutputStreamWithFailure160
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.TestCrcCorruption
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 30.387 sec - in org.apache.hadoop.hdfs.TestCrcCorruption
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=768m; support was removed in 8.0
Running org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.775 sec - in org.apache.hadoop.hdfs.security.TestDelegationTokenForProxyUser
{panel}

> ReplicationMonitor can infinite-loop in BlockPlacementPolicyDefault#chooseRandom()
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-4937
>                 URL: https://issues.apache.org/jira/browse/HDFS-4937
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 2.0.4-alpha, 0.23.8
>            Reporter: Kihwal Lee
>            Assignee: Kihwal Lee
>              Labels: BB2015-05-TBR
>         Attachments: HDFS-4937.patch, HDFS-4937.v1.patch, HDFS-4937.v1.patch, HDFS-4937.v2.patch, HDFS-4937.v3.patch
>
>
> When a large number of nodes are removed by refreshing node lists, the network topology is updated. If the refresh happens at the right moment, the replication monitor thread may get stuck in the while loop of {{chooseRandom()}}. This is because the cached cluster size is used in the termination condition check of the loop. This usually happens when a block with a high replication factor is being processed. Since replicas/rack is also calculated beforehand, no node choice may satisfy the goodness criteria if the refresh removed racks.
> All nodes will end up in the excluded list, but its size will still be less than the cached cluster size, so the loop never terminates. This was observed in a production environment.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
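
For readers who want to see the failure mode from the issue description in isolation, here is a minimal toy model in Java. It is not the Hadoop code: the class {{ChooseRandomSketch}}, the method {{iterationsUntilExit}}, the node names and the iteration cap are all hypothetical, and the loop only mimics the described shape (every candidate is rejected and excluded, and the exit check compares the excluded count against a cluster size). Whether the attached patches take the same approach is not claimed here.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of the loop described in this issue; not the Hadoop implementation. */
public class ChooseRandomSketch {

    /**
     * Runs a selection loop in which every candidate fails the goodness check.
     * Assumes liveNodes is non-empty.
     * Returns the number of iterations before the loop exits on its own,
     * or -1 if it is still running after maxIterations (a stand-in for "spins forever").
     */
    static int iterationsUntilExit(List<String> liveNodes, int sizeUsedInCheck, int maxIterations) {
        Set<String> excluded = new HashSet<>();
        int iterations = 0;
        // Exit check: keep going while some nodes are believed to be not yet excluded.
        while (sizeUsedInCheck - excluded.size() > 0) {
            if (iterations == maxIterations) {
                return -1; // hypothetical guard so the demo itself terminates
            }
            // Pick a node from the current topology; it is rejected and excluded.
            String candidate = liveNodes.get(iterations % liveNodes.size());
            excluded.add(candidate);
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        List<String> live = Arrays.asList("n1", "n2", "n3"); // topology after the refresh
        int cachedSize = 10;                                 // size read before the refresh

        // Stale size: excluded can reach at most 3, so 10 - 3 > 0 holds forever.
        System.out.println(iterationsUntilExit(live, cachedSize, 1000));
        // Current size: the loop exits once all three live nodes are excluded.
        System.out.println(iterationsUntilExit(live, live.size(), 1000));
    }
}
```

In this toy, the stale size makes the call hit the iteration cap, while the current size lets the loop exit after excluding each live node once.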