Viraj Jasani created HBASE-28428:
------------------------------------

             Summary: ConnectionRegistry APIs should have timeout
                 Key: HBASE-28428
                 URL: https://issues.apache.org/jira/browse/HBASE-28428
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.5.8, 3.0.0-beta-1, 2.4.17
            Reporter: Viraj Jasani


Came across a couple of instances where an active master failover happens around 
the same time as a Zookeeper leader failover, leaving the HBase client stuck 
because one of its threads is blocked on a ConnectionRegistry rpc call. 
ConnectionRegistry APIs are wrapped with CompletableFuture, but their usages do 
not apply any timeout, so a single hung call can leave the entire client stuck 
indefinitely because we take global locks while waiting. For instance, 
_getKeepAliveMasterService()_ takes _masterLock_, so if retrieving the active 
master from _masterAddressZNode_ gets stuck, we block every admin operation that 
needs _getKeepAliveMasterService()_.
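To make the failure mode concrete, here is a minimal sketch of the pattern 
(simplified, hypothetical stand-ins for the ConnectionImplementation internals, 
not the actual code): an unbounded _CompletableFuture.get()_ executed while 
holding a global lock, so one hung registry call blocks every other caller of 
that lock.
{code:java}
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified sketch of the pattern described above; not the
// actual ConnectionImplementation code.
class RegistryBlockingSketch {
  private final Object masterLock = new Object();

  // Stand-in for a ConnectionRegistry call such as getActiveMaster(): the
  // future may never complete if active master failover coincides with
  // Zookeeper leader failover.
  private CompletableFuture<String> getActiveMaster() {
    return new CompletableFuture<>(); // never completes in the failure scenario
  }

  String getKeepAliveMasterService() throws Exception {
    synchronized (masterLock) {
      // No timeout on get(): if the registry call hangs, this thread parks
      // forever while holding masterLock, so every admin operation that needs
      // the lock is blocked indefinitely as well.
      return getActiveMaster().get();
    }
  }
}
{code}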
 
Sample stacktrace from a thread that blocked all client operations requiring a 
table descriptor from Admin:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.get
org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
org.apache.hadoop.hbase.client.MasterCallable.prepare
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
org.apache.hadoop.hbase.client.HTable.getDescriptor
org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
org.apache.phoenix.execute.MutationState.sendBatch
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.commit
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.call.CallRunner.run
org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
Another similar incident is captured in PHOENIX-7233. In that case, retrieving 
the clusterId from its ZNode got stuck, which blocked the client from creating 
any new HBase Connection. Stacktrace for reference:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply  {code}
We should provide a configurable timeout for all ConnectionRegistry APIs.
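A rough sketch of what a bounded wait could look like (the config key, default, 
and helper name below are hypothetical, purely for illustration):
{code:java}
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.conf.Configuration;

class RegistryTimeoutSketch {
  // Hypothetical config key and default; the real name and default value
  // would be settled as part of the fix.
  static final String REGISTRY_CALL_TIMEOUT_KEY =
      "hbase.client.registry.call.timeout.ms";
  static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000L;

  static <T> T getWithTimeout(CompletableFuture<T> future, Configuration conf)
      throws Exception {
    long timeoutMs =
        conf.getLong(REGISTRY_CALL_TIMEOUT_KEY, DEFAULT_REGISTRY_CALL_TIMEOUT_MS);
    try {
      // Bounded wait: a hung registry rpc surfaces as a TimeoutException the
      // caller can handle, instead of parking the thread (and any lock it
      // holds) forever.
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      future.cancel(true); // best effort; don't leave the call in flight
      throw e;
    }
  }
}
{code}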


