[ https://issues.apache.org/jira/browse/HBASE-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083491#comment-16083491 ]
Mike Drob commented on HBASE-17922:
-----------------------------------

Chatted with [~appy] about this offline a bit... It looks like the problem here is that when TestUtil fails to start a region server, something in the JVM breaks. His concern was that even if it's a bug with TestUtil, we might still be uncovering a real issue with Hadoop 3 integration, and changing the test might go back to masking the problem.

This took me way too long to figure out because I had to wire up a bunch of reflection to start examining HDFS internals, but I think I finally caught the root cause here.

Here is the minimal test case that fails with the same error as we're seeing here:
{noformat}
@Test (timeout=15000)
public void testStartStopStart() throws Exception {
  TEST_UTIL.startMiniDFSCluster(1);
  TEST_UTIL.shutdownMiniDFSCluster();
  TEST_UTIL.startMiniCluster(1, 1);
}
{noformat}

What happens is that the first time we start up a DFS cluster, the file system caches get populated here (line numbers likely off because of the previously mentioned reflection hacks):
{noformat}
  at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:210)
  at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3318)
  at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3275)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
  at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
  at org.apache.hadoop.hbase.fs.HFileSystem.<init>(HFileSystem.java:88)
  at org.apache.hadoop.hbase.fs.HFileSystem.get(HFileSystem.java:472)
  at org.apache.hadoop.hbase.HBaseTestingUtility.getTestFileSystem(HBaseTestingUtility.java:3072)
  at org.apache.hadoop.hbase.HBaseTestingUtility.getNewDataTestDirOnTestFS(HBaseTestingUtility.java:576)
  at org.apache.hadoop.hbase.HBaseTestingUtility.setupDataTestDirOnTestFS(HBaseTestingUtility.java:565)
  at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:538)
  at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:552)
  at org.apache.hadoop.hbase.HBaseTestingUtility.createDirsAndSetProperties(HBaseTestingUtility.java:786)
  at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:655)
{noformat}

That is also where the client finalizer shutdown hook is added, which region servers attempt to suppress. In normal operation, only a single region server starts per JVM, so we can suppress that hook and everything is good. In our tests, we can start and stop multiple mini clusters, and we fix the suppression by checking to see if we have already suppressed it. If we have, then it's still registered in our own ShutdownHookManager and we don't need to suppress it again, but we can increment a refcount.

However, if we start and stop a DFS cluster, then that hook gets cleared on DFS cluster shutdown.
{noformat}
  at org.apache.hadoop.util.ShutdownHookManager.clearShutdownHooks(ShutdownHookManager.java:275)
  at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1975)
  at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1944)
  at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1937)
  at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniDFSCluster(HBaseTestingUtility.java:849)
{noformat}

The second time we start DFS, this hook doesn't get added. I haven't been able to figure out what exactly gets reused, but the effect is that the hook isn't there, and we don't have a copy of it that we've saved off, so the whole thing goes boom.

This particular test was triggering the failure because the aborting RegionServer would fail before the suppression could happen. The hook would get cleaned up by DFS instead of by us, and later attempts to start the mini cluster wouldn't have the hook available and their RegionServers would also fail.
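
To make the sequence above concrete, here is a toy model of it in plain Java. It is only a sketch under stated assumptions: none of these classes are the real Hadoop or HBase ones, every name in it (ShutdownHookSketch, registry, fsCached, suppressHook) is hypothetical, and modelling the reuse as a file system cache hit is a guess, since it has not been confirmed what actually gets reused on the second start.

```java
import java.util.HashSet;
import java.util.Set;

// Toy model of the failure sequence. Nothing here is real Hadoop or HBase code.
class ShutdownHookSketch {

  static final String CLIENT_FINALIZER = "clientFinalizer";

  // Stand-in for a JVM-wide hook registry (the role ShutdownHookManager plays).
  static final Set<String> registry = new HashSet<>();
  // ASSUMPTION: the hook is only registered on a file system cache miss.
  static boolean fsCached = false;
  // Stand-in for the region server bookkeeping: a saved hook plus a refcount.
  static String savedHook = null;
  static int refCount = 0;

  static void reset() {
    registry.clear();
    fsCached = false;
    savedHook = null;
    refCount = 0;
  }

  static void startDfs() {
    if (!fsCached) {
      // First start: cache miss, so the hook gets registered.
      fsCached = true;
      registry.add(CLIENT_FINALIZER);
    }
    // Later starts: cache hit, nothing is registered again.
  }

  static void shutdownDfs() {
    // Models the mini cluster clearing all shutdown hooks on shutdown.
    registry.clear();
  }

  // Models a region server suppressing the hook; returns false when it cannot.
  static boolean suppressHook() {
    if (savedHook != null) {
      // Already suppressed by an earlier region server: just bump the refcount.
      refCount++;
      return true;
    }
    if (registry.remove(CLIENT_FINALIZER)) {
      savedHook = CLIENT_FINALIZER;
      refCount = 1;
      return true;
    }
    // The hook is gone and no copy was ever saved.
    return false;
  }

  static boolean plainStart() {
    reset();
    startDfs();
    return suppressHook();
  }

  static boolean startStopStart() {
    reset();
    startDfs();
    shutdownDfs();
    startDfs();
    return suppressHook();
  }

  public static void main(String[] args) {
    System.out.println("plain start suppresses hook: " + plainStart());
    System.out.println("start/stop/start suppresses hook: " + startStopStart());
  }
}
```

In this model a plain start can suppress the hook, while the start/stop/start sequence cannot, because the hook was cleared before anything saved a copy of it. That mirrors the shape of the minimal test case, not the actual code paths.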

I assume that HDFS changed in version 3 to do shutdown hook cleanup in the mini cluster and wasn't doing this before, but I haven't verified that.

> TestRegionServerHostname always fails against hadoop 3.0.0-alpha2
> -----------------------------------------------------------------
>
>                 Key: HBASE-17922
>                 URL: https://issues.apache.org/jira/browse/HBASE-17922
>             Project: HBase
>          Issue Type: Sub-task
>          Components: hadoop3
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hsieh
>            Assignee: Mike Drob
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-17922.patch
>
>
> {code}
> Running org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 126.363 sec <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> testRegionServerHostname(org.apache.hadoop.hbase.regionserver.TestRegionServerHostname) Time elapsed: 120.029 sec <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
>   at java.lang.Thread.sleep(Native Method)
>   at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:221)
>   at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:405)
>   at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:225)
>   at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
>   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1123)
>   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1077)
>   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:948)
>   at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:942)
>   at org.apache.hadoop.hbase.regionserver.TestRegionServerHostname.testRegionServerHostname(TestRegionServerHostname.java:88)
> Results :
> Tests in error:
>   TestRegionServerHostname.testRegionServerHostname:88 » TestTimedOut test timed...
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> {code}

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)