[ https://issues.apache.org/jira/browse/HBASE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-17287:
---------------------------
    Attachment: 17287.master.v4.txt

Added testSafemodeBringsDownMaster in patch v4 - planning to create a separate test class once the new test passes.

Currently the wait for the master thread to exit times out:
{code}
testSafemodeBringsDownMaster(org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure)  Time elapsed: 61.538 sec  <<< ERROR!
org.junit.runners.model.TestTimedOutException: test timed out after 60000 milliseconds
	at java.lang.Thread.sleep(Native Method)
	at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:196)
	at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:143)
	at org.apache.hadoop.hbase.HBaseTestingUtility.waitFor(HBaseTestingUtility.java:3959)
	at org.apache.hadoop.hbase.master.procedure.TestCreateTableProcedure.testSafemodeBringsDownMaster(TestCreateTableProcedure.java:92)
{code}
Let me see what the cause could be.

> Master becomes a zombie if filesystem object closes
> ---------------------------------------------------
>
>                 Key: HBASE-17287
>                 URL: https://issues.apache.org/jira/browse/HBASE-17287
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Clay B.
>            Assignee: Ted Yu
>             Fix For: 1.4.0, 2.0
>
>         Attachments: 17287.branch-1.v3.txt, 17287.master.v2.txt,
> 17287.master.v3.txt, 17287.master.v4.txt, 17287.v2.txt
>
>
> We have seen an issue whereby if the HDFS is unstable and the HBase master's
> HDFS client is unable to stabilize before
> {{dfs.client.failover.max.attempts}} then the master's filesystem object
> closes. This seems to result in an HBase master which will continue to run
> (process and znode exist) but no meaningful work can be done (e.g. assigning
> meta).
> What we saw in our HBase master logs was:
> {code}
> 2016-12-01 19:19:08,192 ERROR org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler: Caught M_META_SERVER_SHUTDOWN, count=1
> java.io.IOException: failed log splitting for cluster-r5n12.bloomberg.com,60200,1480632863218, will retry
> 	at org.apache.hadoop.hbase.master.handler.MetaServerShutdownHandler.process(MetaServerShutdownHandler.java:84)
> 	at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:129)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Filesystem closed
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
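
For illustration only: the log in the issue description shows the failure arriving as a nested {{Caused by: java.io.IOException: Filesystem closed}}. Below is a minimal sketch of how a caller might recognize that condition by walking the exception cause chain. The class name {{FilesystemClosedCheck}} and the method {{isFilesystemClosed}} are hypothetical and are not taken from any attached patch. Matching on the message text is an assumption based solely on the log line above, and a real fix might detect the condition differently.

```java
import java.io.IOException;

// Hypothetical helper, not part of HBase or of the attached patches.
final class FilesystemClosedCheck {
    // Message text copied from the "Caused by" line in the quoted master log.
    static final String CLOSED_MESSAGE = "Filesystem closed";

    /**
     * Returns true if t, or any exception in its cause chain, is an
     * IOException whose message is exactly "Filesystem closed".
     */
    static boolean isFilesystemClosed(Throwable t) {
        int depth = 0; // bound the walk in case of a cyclic cause chain
        for (Throwable cur = t; cur != null && depth < 32; cur = cur.getCause(), depth++) {
            if (cur instanceof IOException && CLOSED_MESSAGE.equals(cur.getMessage())) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Mirrors the shape of the logged exception: an outer "will retry"
        // IOException wrapping the "Filesystem closed" cause.
        IOException wrapped = new IOException(
            "failed log splitting for <server>, will retry",
            new IOException(CLOSED_MESSAGE));
        System.out.println(isFilesystemClosed(wrapped));                    // true
        System.out.println(isFilesystemClosed(new IOException("timeout"))); // false
    }
}
```

A handler that currently retries on this error could, under the same assumptions, use such a check to stop retrying and escalate instead, since a closed filesystem object will not recover on its own.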