[ https://issues.apache.org/jira/browse/HDFS-6101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14986818#comment-14986818 ]
Walter Su commented on HDFS-6101:
---------------------------------

The test probably fails because the stopped DN is not removed from the cluster map, and {{sleepSeconds(5)}} does not guarantee that it has been removed.

1. Please don't remove this. It's intentional: after sleeping, we want some writers to have NOT yet started.
{code}
- // Some of them are too slow and will be not yet started.
- sleepSeconds(1);
{code}
2. Instead of hardcoding a 5s sleep, we can use {{GenericTestUtils.waitFor(..)}} to check the block replication. The wait/notify is unnecessary.
3. After
{code}
cluster.stopDataNode(AppendTestUtil.nextInt(REPLICATION));
{code}
we should call {{cluster.setDataNodeDead(..)}} to remove it from the cluster map.

> TestReplaceDatanodeOnFailure fails occasionally
> -----------------------------------------------
>
>                 Key: HDFS-6101
>                 URL: https://issues.apache.org/jira/browse/HDFS-6101
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Arpit Agarwal
>            Assignee: Wei-Chiu Chuang
>         Attachments: HDFS-6101.001.patch, HDFS-6101.002.patch,
> HDFS-6101.003.patch, TestReplaceDatanodeOnFailure.log
>
>
> Exception details in a comment below.
> The failure repros on both OS X and Linux if I run the test ~10 times in a
> loop.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
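
Point 2 of the comment suggests polling for the condition instead of sleeping for a fixed 5 seconds. A minimal, self-contained sketch of that pattern follows. The {{waitFor}} helper here is a hypothetical stand-in written for illustration; it is not Hadoop's {{GenericTestUtils.waitFor(..)}}, and the {{liveReplicas}} counter is an assumed placeholder for whatever replication check the real test would perform.

```java
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

public class WaitForSketch {

  /**
   * Hypothetical polling helper (an assumption, not Hadoop's code).
   * Re-evaluates check every checkEveryMillis until it returns true,
   * and throws TimeoutException once roughly waitForMillis have elapsed.
   */
  public static void waitFor(BooleanSupplier check, long checkEveryMillis,
      long waitForMillis) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + waitForMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "condition not met within " + waitForMillis + " ms");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    // Assumed placeholder for the live replica count of a block.
    AtomicInteger liveReplicas = new AtomicInteger(2);
    int expected = 3;

    // Simulates re-replication completing after an unknown delay.
    Thread replicator = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      liveReplicas.set(expected);
    });
    replicator.start();

    // Returns as soon as the condition holds, instead of always waiting 5s.
    waitFor(() -> liveReplicas.get() == expected, 50, 5000);
    System.out.println("replication reached " + liveReplicas.get());
    replicator.join();
  }
}
```

Compared with a fixed sleep, the polling version returns as soon as the condition is true and fails with an explicit timeout when it never becomes true, which makes an intermittent failure easier to diagnose.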