[ https://issues.apache.org/jira/browse/HBASE-17922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083491#comment-16083491 ]

Mike Drob commented on HBASE-17922:
-----------------------------------

Chatted with [~appy] about this offline a bit...

It looks like the problem here is that when TestUtil fails to start a region 
server, something in the JVM breaks. His concern was that even if it's a bug 
with TestUtil, we might still be uncovering a real issue with Hadoop 3 
integration, and maybe changing the test will go back to masking the problem.

This took me way too long to figure out because I had to wire up a bunch of 
reflection to start examining HDFS internals, but I think I finally caught the 
root cause here.

Here is the minimal test case that fails with the same error as we're seeing 
here:

{noformat}
  @Test (timeout=15000)
  public void testStartStopStart() throws Exception {
    TEST_UTIL.startMiniDFSCluster(1);
    TEST_UTIL.shutdownMiniDFSCluster();
    TEST_UTIL.startMiniCluster(1, 1);
  }
{noformat}

What happens is that the first time we start up a DFS cluster, the file system 
caches get populated here (line numbers likely off because of the previously 
mentioned reflection hacks):
{noformat}
        at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:210)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3318)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3275)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:476)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:225)
        at org.apache.hadoop.hbase.fs.HFileSystem.<init>(HFileSystem.java:88)
        at org.apache.hadoop.hbase.fs.HFileSystem.get(HFileSystem.java:472)
        at org.apache.hadoop.hbase.HBaseTestingUtility.getTestFileSystem(HBaseTestingUtility.java:3072)
        at org.apache.hadoop.hbase.HBaseTestingUtility.getNewDataTestDirOnTestFS(HBaseTestingUtility.java:576)
        at org.apache.hadoop.hbase.HBaseTestingUtility.setupDataTestDirOnTestFS(HBaseTestingUtility.java:565)
        at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:538)
        at org.apache.hadoop.hbase.HBaseTestingUtility.getDataTestDirOnTestFS(HBaseTestingUtility.java:552)
        at org.apache.hadoop.hbase.HBaseTestingUtility.createDirsAndSetProperties(HBaseTestingUtility.java:786)
        at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniDFSCluster(HBaseTestingUtility.java:655)
{noformat}
That is also where the client finalizer shutdown hook is added, which region 
servers attempt to suppress.

In normal operation, only a single region server starts per JVM, so we can 
suppress that hook and everything is good. In our tests, we can start and stop 
multiple mini clusters, so we make the suppression repeatable by first checking 
whether we have already suppressed the hook. If we have, then it's still 
registered in our own ShutdownHookManager and we don't need to suppress it 
again; we just increment a refcount.
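
To make the refcounting idea concrete, here is a toy model in plain Java. It is only a sketch of the behavior described above: the class and method names (HookSuppressionSketch, register, suppress, refCount) are made up for illustration and are not the HBase or Hadoop APIs, and the real code may differ in its details.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of refcounted hook suppression; not HBase's or Hadoop's actual code. */
class HookSuppressionSketch {
  // Hypothetical stand-in for the registry that owns the client finalizer hook.
  private final Map<String, Runnable> registered = new HashMap<>();
  // Hooks we pulled out of that registry and now own ourselves.
  private final Map<String, Runnable> saved = new HashMap<>();
  // Number of region servers relying on each saved hook.
  private final Map<String, Integer> refCounts = new HashMap<>();

  void register(String name, Runnable hook) {
    registered.put(name, hook);
  }

  /** Suppresses the named hook; returns false if it is neither registered nor saved. */
  boolean suppress(String name) {
    if (saved.containsKey(name)) {
      // Already suppressed by an earlier server: just bump the count.
      refCounts.merge(name, 1, Integer::sum);
      return true;
    }
    Runnable hook = registered.remove(name);
    if (hook == null) {
      // Nothing to suppress and no saved copy: the caller cannot proceed.
      return false;
    }
    saved.put(name, hook);
    refCounts.put(name, 1);
    return true;
  }

  int refCount(String name) {
    return refCounts.getOrDefault(name, 0);
  }
}
```

In this model a second suppress call for the same hook succeeds without touching the registry, which is the case that lets several mini clusters run in one JVM. The false return is the case that matters for this issue.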

However, if we start and stop a DFS cluster, then that hook gets cleared on DFS 
cluster shutdown.

{noformat}
        at org.apache.hadoop.util.ShutdownHookManager.clearShutdownHooks(ShutdownHookManager.java:275)
        at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1975)
        at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1944)
        at org.apache.hadoop.hdfs.MiniDFSCluster.shutdown(MiniDFSCluster.java:1937)
        at org.apache.hadoop.hbase.HBaseTestingUtility.shutdownMiniDFSCluster(HBaseTestingUtility.java:849)
{noformat}
The second time we start DFS, this hook doesn't get added. I haven't been able 
to figure out what exactly gets reused, but the effect is that the hook isn't 
there, and we don't have a copy of it that we've saved off, so the whole thing 
goes boom.
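
The start/stop/start sequence can be replayed with a small stand-alone model. This is a sketch under a loud assumption: since I don't know what actually gets reused on the second start, the boolean flag below simply stands in for it. All names (StartStopStartSketch, startDfs, shutdownDfs, startRegionServer) are hypothetical, and no region server runs before the DFS shutdown, so there is no saved copy of the hook to model.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy replay of start/stop/start; every name here is hypothetical. */
class StartStopStartSketch {
  // Stand-in for the set of registered shutdown hooks.
  private final Set<String> hooks = new HashSet<>();
  // Assumption: represents whatever state survives DFS shutdown and gets reused.
  private boolean fsStateCached = false;

  void startDfs() {
    if (!fsStateCached) {
      // The hook is only added the first time the file system state is built.
      hooks.add("clientFinalizer");
      fsStateCached = true;
    }
  }

  void shutdownDfs() {
    // Models the clearShutdownHooks call seen in the shutdown stack trace.
    hooks.clear();
  }

  /** A region server can start only if the hook is there to suppress. */
  boolean startRegionServer() {
    return hooks.remove("clientFinalizer");
  }
}
```

Calling startDfs, shutdownDfs, startDfs and then startRegionServer returns false in this model, mirroring the failing test case, while startDfs followed directly by startRegionServer returns true.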

This particular test was triggering the failure because the aborting 
RegionServer would fail before the suppression could happen. The hook would get 
cleaned up by DFS instead of by us, and later attempts to start the mini 
cluster wouldn't have the hook available and their RegionServers would also 
fail.

I assume that HDFS changed in version 3 to do shutdown hook cleanup in the 
mini cluster and wasn't doing this before, but I haven't verified that.

> TestRegionServerHostname always fails against hadoop 3.0.0-alpha2
> -----------------------------------------------------------------
>
>                 Key: HBASE-17922
>                 URL: https://issues.apache.org/jira/browse/HBASE-17922
>             Project: HBase
>          Issue Type: Sub-task
>          Components: hadoop3
>    Affects Versions: 2.0.0
>            Reporter: Jonathan Hsieh
>            Assignee: Mike Drob
>             Fix For: 2.0.0-alpha-2
>
>         Attachments: HBASE-17922.patch
>
>
> {code}
> Running org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 126.363 sec <<< FAILURE! - in org.apache.hadoop.hbase.regionserver.TestRegionServerHostname
> testRegionServerHostname(org.apache.hadoop.hbase.regionserver.TestRegionServerHostname)  Time elapsed: 120.029 sec  <<< ERROR!
> org.junit.runners.model.TestTimedOutException: test timed out after 120000 milliseconds
>       at java.lang.Thread.sleep(Native Method)
>       at org.apache.hadoop.hbase.util.JVMClusterUtil.startup(JVMClusterUtil.java:221)
>       at org.apache.hadoop.hbase.LocalHBaseCluster.startup(LocalHBaseCluster.java:405)
>       at org.apache.hadoop.hbase.MiniHBaseCluster.init(MiniHBaseCluster.java:225)
>       at org.apache.hadoop.hbase.MiniHBaseCluster.<init>(MiniHBaseCluster.java:94)
>       at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniHBaseCluster(HBaseTestingUtility.java:1123)
>       at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:1077)
>       at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:948)
>       at org.apache.hadoop.hbase.HBaseTestingUtility.startMiniCluster(HBaseTestingUtility.java:942)
>       at org.apache.hadoop.hbase.regionserver.TestRegionServerHostname.testRegionServerHostname(TestRegionServerHostname.java:88)
> Results :
> Tests in error: 
>   TestRegionServerHostname.testRegionServerHostname:88 » TestTimedOut test timed...
> Tests run: 2, Failures: 0, Errors: 1, Skipped: 0
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
