Viraj Jasani created HBASE-28428:
------------------------------------

             Summary: ConnectionRegistry APIs should have timeout
                 Key: HBASE-28428
                 URL: https://issues.apache.org/jira/browse/HBASE-28428
             Project: HBase
          Issue Type: Improvement
    Affects Versions: 2.5.8, 3.0.0-beta-1, 2.4.17
            Reporter: Viraj Jasani
Came across a couple of instances where an active master failover happens around the same time as a ZooKeeper leader failover, leaving the HBase client stuck because one of its threads is blocked on a ConnectionRegistry RPC call.

ConnectionRegistry APIs are wrapped with CompletableFuture, but their usages do not apply any timeout, which can leave the entire client stuck indefinitely because we take global locks while waiting. For instance, _getKeepAliveMasterService()_ takes {_}masterLock{_}, so if fetching the active master from _masterAddressZNode_ gets stuck, we can block any admin operation that needs {_}getKeepAliveMasterService(){_}.

Sample stack trace from a thread that blocked all client operations requiring a table descriptor from Admin:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.get
org.apache.hadoop.hbase.client.ConnectionImplementation.access$?
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStubNoRetries
org.apache.hadoop.hbase.client.ConnectionImplementation$MasterServiceStubMaker.makeStub
org.apache.hadoop.hbase.client.ConnectionImplementation.getKeepAliveMasterService
org.apache.hadoop.hbase.client.ConnectionImplementation.getMaster
org.apache.hadoop.hbase.client.MasterCallable.prepare
org.apache.hadoop.hbase.client.RpcRetryingCallerImpl.callWithRetries
org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable
org.apache.hadoop.hbase.client.HBaseAdmin.getTableDescriptor
org.apache.hadoop.hbase.client.HTable.getDescriptor
org.apache.phoenix.query.ConnectionQueryServicesImpl.getTableDescriptor
org.apache.phoenix.query.DelegateConnectionQueryServices.getTableDescriptor
org.apache.phoenix.util.IndexUtil.isGlobalIndexCheckerEnabled
org.apache.phoenix.execute.MutationState.filterIndexCheckerMutations
org.apache.phoenix.execute.MutationState.sendBatch
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.send
org.apache.phoenix.execute.MutationState.commit
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.jdbc.PhoenixConnection$?.call
org.apache.phoenix.call.CallRunner.run
org.apache.phoenix.jdbc.PhoenixConnection.commit
{code}
Another similar incident is captured in PHOENIX-7233. In that case, retrieving the clusterId from its ZNode got stuck, which blocked the client from creating any more HBase Connections. Stack trace for reference:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
{code}
We should provide a configurable timeout for all ConnectionRegistry APIs.
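As a rough illustration (not the actual HBase change), a helper like the one below could bound every registry wait. The config key {{hbase.client.registry.call.timeout.ms}}, its default value, and the {{RegistryFutureUtil}} class are hypothetical names for this sketch:
{code:java}
import java.io.IOException;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

import org.apache.hadoop.conf.Configuration;

public final class RegistryFutureUtil {

  // Hypothetical config key and default, for illustration only.
  public static final String REGISTRY_CALL_TIMEOUT_KEY =
      "hbase.client.registry.call.timeout.ms";
  public static final long DEFAULT_REGISTRY_CALL_TIMEOUT_MS = 10_000;

  private RegistryFutureUtil() {
  }

  /**
   * Waits on a ConnectionRegistry future, but only up to the configured
   * timeout, so a stuck ZK/master lookup surfaces as an exception instead
   * of parking the caller (and any lock it holds) forever.
   */
  public static <T> T get(CompletableFuture<T> future, Configuration conf)
      throws IOException {
    long timeoutMs = conf.getLong(REGISTRY_CALL_TIMEOUT_KEY,
        DEFAULT_REGISTRY_CALL_TIMEOUT_MS);
    try {
      return future.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      // Cancel the future so the caller does not leave the call dangling.
      future.cancel(true);
      throw new IOException(
          "ConnectionRegistry call timed out after " + timeoutMs + " ms", e);
    } catch (ExecutionException e) {
      throw new IOException(e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IOException("Interrupted while waiting on registry call", e);
    }
  }
}
{code}
Call sites such as _retrieveClusterId()_ and _makeStubNoRetries()_ could then replace their bare _future.get()_ with a guarded variant along these lines, so a wedged ZooKeeper read fails fast instead of pinning {_}masterLock{_} indefinitely.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)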