Viraj Jasani created HBASE-28428:
------------------------------------
Summary: ConnectionRegistry APIs should have timeout
Key: HBASE-28428
URL: https://issues.apache.org/jira/browse/HBASE-28428
Project: HBase
Issue Type: Improvement
Affects Versions: 2.5.8, 3.0.0-beta-1, 2.4.17
Reporter: Viraj Jasani
We came across a couple of instances where an active master failover happened around
the same time as a ZooKeeper leader failover, leaving the HBase client stuck because
one of its threads was blocked on a ConnectionRegistry RPC call.
ConnectionRegistry APIs return CompletableFuture, but their callers wait on the
futures without any timeout, which can leave the entire client stuck indefinitely
because global locks are held across the wait. For instance,
_getKeepAliveMasterService()_ takes
{_}masterLock{_}, so if retrieving the active master from _masterAddressZNode_ gets
stuck, every admin operation that needs
{_}getKeepAliveMasterService(){_} is blocked.
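The failure mode above can be reproduced in isolation. A minimal sketch (the never-completing future stands in for a registry call stuck on a ZK failover; this is illustrative, not HBase code):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedGetDemo {
    public static void main(String[] args) throws Exception {
        // A future that never completes, standing in for a ConnectionRegistry
        // call stuck waiting on ZooKeeper during a leader failover.
        CompletableFuture<String> stuck = new CompletableFuture<>();

        // stuck.get() with no timeout would park this thread forever while
        // the caller still holds its locks (e.g. masterLock). A timed get
        // fails fast instead:
        try {
            stuck.get(100, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

With the untimed `get()`, the thread in the stacktraces below parks in `CompletableFuture$Signaller.block` exactly as shown.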
A sample stacktrace from a hang that blocked all client operations requiring a
table descriptor from Admin:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.get
org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
org.apache.hadoop.hbase.client.MasterCallable.prepare
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
org.apache.hadoop.hbase.client.HTable.getDescriptor
org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
org.apache.phoenix.execute.MutationState.sendBatch
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.commit
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.call.CallRunner.run
org.apache.phoenix.jdbc.PhoenixConnection.commit {code}
A similar incident is captured in PHOENIX-7233. In that case, retrieving the
clusterId from its ZNode got stuck, which blocked the client from creating any
new HBase Connection. Stacktrace for reference:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply {code}
We should provide a configurable timeout for all ConnectionRegistry APIs.
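One possible shape for the fix, as a sketch: bound every registry future with a configured timeout and surface a clear failure instead of parking forever. The helper name, config default, and exception translation below are assumptions for illustration only; the actual key name and wiring would be decided in the patch.

```java
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RegistryTimeouts {
    // Hypothetical default; a real fix would read this from client config.
    static final long DEFAULT_TIMEOUT_MS = 10_000;

    // Generic helper: wait on a registry future with a bounded timeout
    // instead of an unbounded get().
    static <T> T getWithTimeout(CompletableFuture<T> future, long timeoutMs)
            throws IOException {
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // don't leave the call outstanding
            throw new IOException(
                "ConnectionRegistry call timed out after " + timeoutMs + " ms", e);
        } catch (Exception e) {
            throw new IOException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // Completed future: returns immediately.
        CompletableFuture<String> ok = CompletableFuture.completedFuture("cid-1234");
        System.out.println(getWithTimeout(ok, DEFAULT_TIMEOUT_MS));

        // Hung future: fails fast with an IOException instead of blocking
        // the caller (and any locks it holds) indefinitely.
        CompletableFuture<String> hung = new CompletableFuture<>();
        try {
            getWithTimeout(hung, 50);
        } catch (IOException e) {
            System.out.println("failed fast");
        }
    }
}
```

Callers such as _getKeepAliveMasterService()_ and _retrieveClusterId()_ would then release their locks after a bounded wait rather than blocking every dependent operation.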
--
This message was sent by Atlassian Jira
(v8.20.10#820010)