[ https://issues.apache.org/jira/browse/HDFS-9493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053953#comment-15053953 ]
Tony Wu commented on HDFS-9493:
-------------------------------

Hi [~liuml07],

I would like to work on fixing this test. I analyzed the failure by printing out the metasave content. It turns out the metasave output for the current test contains 2 Datanodes:
{code}
metasave out: 1 files and directories, 0 blocks = 1 total filesystem objects
metasave out: Live Datanodes: 1
metasave out: Dead Datanodes: 1
metasave out: Metasave: Blocks waiting for replication: 0
metasave out: Mis-replicated blocks that have been postponed:
metasave out: Metasave: Blocks being replicated: 0
metasave out: Metasave: Blocks 4 waiting deletion from 2 datanodes.
metasave out: 127.0.0.1:53465
metasave out: LightWeightHashSet(size=2, modification=2, entries.length=16)
metasave out: 127.0.0.1:53469
metasave out: LightWeightHashSet(size=2, modification=2, entries.length=16)
metasave out: Metasave: Number of datanodes: 2
metasave out: 127.0.0.1:53465 IN 998093619200(929.55 GB) 10270(10.03 KB) 0.00% 882663514112(822.04 GB) 0(0 B) 0(0 B) 100.00% 0(0 B) Fri Dec 11 17:48:41 PST 2015
metasave out: 127.0.0.1:53469 IN 998093619200(929.55 GB) 8192(8 KB) 0.00% 882663825408(822.04 GB) 0(0 B) 0(0 B) 100.00% 0(0 B) Fri Dec 11 17:48:26 PST 2015
{code}

This leads me to believe the following wait time was not long enough:
{code:java}
// wait for namenode to discover that a datanode is dead
Thread.sleep(15000);
{code}

After increasing the sleep time to 30 seconds, the test passed consistently.

The invalid block count shown in the {{Block x waiting deletion...}} statement is updated by {{blockManager.removeBlocksAssociatedTo()}}, which is called by {{DatanodeManager#removeDeadDatanode()}}. This only happens in {{HeartbeatManager#heartbeatCheck()}}, so the count stays stale until the next heartbeat check removes the dead Datanode.

Using sleep may not be the best way to ensure the Datanode has been removed by the Namenode. I will upload a patch with a more robust way of waiting for the Datanode to be removed, instead of relying on {{Thread.sleep()}}.
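For illustration only, the general idea of replacing a fixed {{Thread.sleep()}} with a bounded poll might look like the sketch below. This is not the patch referred to above. The class name {{WaitForCondition}}, the method {{waitFor}}, and its parameters are hypothetical, and the condition to poll (for example, the Namenode no longer tracking the dead Datanode) is an assumption left to the caller.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper: poll a condition instead of sleeping for a fixed time. */
public class WaitForCondition {

    /**
     * Re-evaluates {@code check} every {@code checkEveryMillis} until it
     * returns true, or throws once {@code timeoutMillis} has elapsed.
     */
    public static void waitFor(BooleanSupplier check,
                               long checkEveryMillis,
                               long timeoutMillis)
            throws TimeoutException, InterruptedException {
        // Use the monotonic clock so wall-clock adjustments cannot skew the timeout.
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (!check.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new TimeoutException(
                    "Condition not met within " + timeoutMillis + " ms");
            }
            Thread.sleep(checkEveryMillis);
        }
    }
}
```

A test could then call something like {{WaitForCondition.waitFor(() -> deadNodeRemoved(), 100, 30000)}}, where {{deadNodeRemoved()}} is a placeholder for whatever check the test has available. The test returns as soon as the condition holds and fails with a clear timeout otherwise, rather than depending on a sleep that happens to be long enough.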
> Test o.a.h.hdfs.server.namenode.TestMetaSave fails in trunk
> -----------------------------------------------------------
>
>                 Key: HDFS-9493
>                 URL: https://issues.apache.org/jira/browse/HDFS-9493
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: test
>            Reporter: Mingliang Liu
>
> Tested in both Gentoo Linux and Mac.
> {quote}
> -------------------------------------------------------
>  T E S T S
> -------------------------------------------------------
> Running org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 34.159 sec <<< FAILURE! - in org.apache.hadoop.hdfs.server.namenode.TestMetaSave
> testMetasaveAfterDelete(org.apache.hadoop.hdfs.server.namenode.TestMetaSave) Time elapsed: 15.318 sec <<< FAILURE!
> java.lang.AssertionError: null
>     at org.junit.Assert.fail(Assert.java:86)
>     at org.junit.Assert.assertTrue(Assert.java:41)
>     at org.junit.Assert.assertTrue(Assert.java:52)
>     at org.apache.hadoop.hdfs.server.namenode.TestMetaSave.testMetasaveAfterDelete(TestMetaSave.java:154)
> {quote}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)