[ 
https://issues.apache.org/jira/browse/HBASE-19794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16333295#comment-16333295
 ] 

stack commented on HBASE-19794:
-------------------------------

While Jira was down I spent some time on this last night. The backup Master 
tries to become active during cluster shutdown but only gets this far:

 
{code:java}
Thread 1542 (M:1;asf903:32967):
  State: TIMED_WAITING
  Blocked count: 178
  Waited count: 389
  Stack:
    java.lang.Object.wait(Native Method)
    org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:168)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
    org.apache.hadoop.hbase.MetaTableAccessor.getTableState(MetaTableAccessor.java:1117)
    org.apache.hadoop.hbase.client.ConnectionImplementation.getTableState(ConnectionImplementation.java:1960)
    org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.getTableState(ConnectionUtils.java:131)
    org.apache.hadoop.hbase.client.ConnectionImplementation.isTableDisabled(ConnectionImplementation.java:573)
    org.apache.hadoop.hbase.client.ConnectionUtils$ShortCircuitingClusterConnection.isTableDisabled(ConnectionUtils.java:131)
    org.apache.hadoop.hbase.client.RegionServerCallable.prepare(RegionServerCallable.java:219)
    org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries(RpcRetryingCallerImpl.java:105)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:388)
    org.apache.hadoop.hbase.client.HTable.get(HTable.java:362)
    org.apache.hadoop.hbase.master.TableNamespaceManager.get(TableNamespaceManager.java:139)
    org.apache.hadoop.hbase.master.TableNamespaceManager.isTableAvailableAndInitialized(TableNamespaceManager.java:276)
    org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:101)
    org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:62)
    org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
    org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1059)
    org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:921)
{code}
 

The backup Master will just be stuck here until all retries have been 
exhausted. This is a variant on an issue seen elsewhere, where a client hosted 
in a server is trying to contact a server or region that is not going to show 
up, usually because the cluster is going down. We need a means of signaling the 
client that it should give up because its host is going away. We probably also 
need to move client communication off the main thread so the main thread 
remains available and can react to shutdown.
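
To make the "signal the client to give up" idea concrete, here is a minimal sketch of a retry loop that checks a stop flag before every attempt. It is not the actual RpcRetryingCallerImpl code; the class StoppableRetry, the exception HostStoppingException, and the hostStopping supplier are all hypothetical names invented for illustration, and a real fix would also need to interrupt a caller that is already waiting between retries.

```java
import java.util.concurrent.Callable;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch only; this is NOT HBase's RpcRetryingCallerImpl. */
public class StoppableRetry {

  /** Hypothetical exception: the hosting server is stopping, so we gave up early. */
  public static class HostStoppingException extends Exception {
    public HostStoppingException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  /**
   * Runs the call up to maxAttempts times, pausing pauseMs between failures.
   * Before each attempt it asks hostStopping (an assumed hook supplied by the
   * hosting server) whether to abandon the remaining retries.
   */
  public static <T> T callWithRetries(Callable<T> call, int maxAttempts, long pauseMs,
      BooleanSupplier hostStopping) throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (hostStopping.getAsBoolean()) {
        // Host is going away: stop retrying instead of exhausting all attempts.
        throw new HostStoppingException(
            "host stopping; gave up before attempt " + attempt, last);
      }
      try {
        return call.call();
      } catch (Exception e) {
        last = e; // remember the failure and retry
      }
      if (attempt < maxAttempts) {
        Thread.sleep(pauseMs);
      }
    }
    if (last == null) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    throw last; // retries exhausted: surface the last failure
  }
}
```

With something like this, a Master that is asked to stop while it is still initializing would fail fast out of the get against meta rather than sit in TIMED_WAITING until the retries run out.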

Concurrent w/ my digging, [~Apache9] was digging too and arrived at the same 
place (offline because Jira was down). He came up w/ a better workaround for 
now than my cutting down on retries. He suggested MiniHBaseCluster should put 
down the backup Masters first, before we do the active Master. (Thinking on it, 
it may not work... damage may already have been done before we get to the 
shutdown sequence... The backup Master may have already started in on the 
shutdown sequence.)
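
For illustration only, the ordering Duo suggested could look roughly like the sketch below. MasterHandle and backupsFirst are hypothetical stand-ins, not MiniHBaseCluster API, and the sketch does nothing about the caveat above: a backup that has already started taking over would still be stuck.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of "stop backups before the active Master". */
public class ShutdownOrder {

  /** Hypothetical stand-in for a master thread in a mini cluster. */
  public static final class MasterHandle {
    private final String name;
    private final boolean active;

    public MasterHandle(String name, boolean active) {
      this.name = name;
      this.active = active;
    }

    public String getName() {
      return name;
    }

    public boolean isActive() {
      return active;
    }
  }

  /** Returns a copy ordered so that backup Masters come before the active one. */
  public static List<MasterHandle> backupsFirst(List<MasterHandle> masters) {
    List<MasterHandle> ordered = new ArrayList<>(masters);
    // Boolean.FALSE sorts before Boolean.TRUE, and List.sort is stable,
    // so backups keep their relative order and the active Master goes last.
    ordered.sort(Comparator.comparing(MasterHandle::isActive));
    return ordered;
  }
}
```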

Let me work up a patch based on Duo's 
[https://github.com/Apache9/hbase/commit/97e030584504cc6019ef06462f6d44ca40125c45].
Let me add a timeout, Duo's suggestion, and some other cleanup I came across 
digging last night. Will also file an issue to deal better w/ the root problem 
of clients stuck in retry though the cluster has been asked to go down.

> TestZooKeeper hangs
> -------------------
>
>                 Key: HBASE-19794
>                 URL: https://issues.apache.org/jira/browse/HBASE-19794
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>             Fix For: 2.0.0-beta-2
>
>         Attachments: org.apache.hadoop.hbase.TestZooKeeper-output.txt
>
>
> Seems like it is TestZKAsyncRegistry that hangs in shutdown.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
