Viraj Jasani created PHOENIX-7233:
-------------------------------------
Summary: CQSI openConnection should timeout to unblock other
connection threads
Key: PHOENIX-7233
URL: https://issues.apache.org/jira/browse/PHOENIX-7233
Project: Phoenix
Issue Type: Improvement
Affects Versions: 5.1.3
Reporter: Viraj Jasani
PhoenixDriver initializes ConnectionQueryServices objects and caches them in
connectionQueryServicesCache. As part of CQSI initialization, a connection to
the HBase server is opened using the ConnectionFactory provided by the HBase
client, which returns a Connection object to the caller. The Connection object
provided by HBase allows clients to share the ZooKeeper connection, the meta
cache, and remote connections to regionservers and master daemons. The
Connection object is used to perform Table CRUD operations as well as
administrative actions on the cluster.
Initializing an HBase Connection object requires the ClusterId, which is
maintained in ZooKeeper, in the Master daemons, or in both. The client
retrieves it from one or the other depending on whether it is configured to use
ZKConnectionRegistry or MasterRegistry/RpcConnectionRegistry.
With ZKConnectionRegistry, we ran into an edge case in which the connection to
the ZooKeeper server was stuck for more than 12 hours. When the client tried to
connect to the ZooKeeper quorum to retrieve the ClusterId, the ZooKeeper leader
switched from one server to another. Why the leader switch resulted in a stuck
connection still requires an RCA, but it is not appropriate for the
Phoenix/HBase client to wait indefinitely for a response from ZooKeeper without
any connection timeout.
For the Phoenix client, if one thread is stuck opening a connection during
CQSI#init, every other thread trying to create a connection also gets stuck,
because we take a class-level lock before opening the connection. This can lead
to termination or degradation of the client JVM.
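The blocking behaviour described above can be reproduced in isolation. The following is only a simplified sketch, not Phoenix code: ClassLockStallSketch, initUnderClassLock and secondCallerState are hypothetical names, and a CountDownLatch stands in for the blocking open call made while the class-level lock is held.

```java
import java.util.concurrent.CountDownLatch;

class ClassLockStallSketch {

    // Hypothetical stand-in for an init path that holds a class-level lock
    // across a blocking call.
    static void initUnderClassLock(CountDownLatch entered, CountDownLatch unblock)
            throws InterruptedException {
        synchronized (ClassLockStallSketch.class) {
            entered.countDown(); // signal that the lock is now held
            unblock.await();     // simulates the blocking open call
        }
    }

    // Starts a "stuck" thread and a second caller, then returns the thread
    // state observed for the second caller while the first holds the lock.
    static Thread.State secondCallerState() throws InterruptedException {
        CountDownLatch entered = new CountDownLatch(1);
        CountDownLatch unblock = new CountDownLatch(1);
        Runnable init = () -> {
            try {
                initUnderClassLock(entered, unblock);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread stuck = new Thread(init, "stuck-opener");
        Thread waiter = new Thread(init, "second-caller");
        stuck.start();
        entered.await(); // the first thread now holds the class-level lock
        waiter.start();

        // Poll for up to two seconds until the second thread is waiting on the monitor.
        long deadline = System.nanoTime() + 2_000_000_000L;
        while (waiter.getState() != Thread.State.BLOCKED
                && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = waiter.getState();

        unblock.countDown(); // release the simulated stuck call so both threads finish
        stuck.join();
        waiter.join();
        return observed;
    }
}
```

In this sketch the second caller never reaches the simulated open call until the first one returns, which mirrors how a single stuck CQSI#init could stall every other connection thread.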
Although the HBase client should also use a timeout, not having one on the
Phoenix client side has far worse consequences. As part of this Jira, we should
introduce a way for CQSI#openConnection to time out, either by using the
CompletableFuture API or by using our preconfigured thread pool.
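One possible shape for the thread-pool option is sketched below. It is only an illustration under assumptions, not the proposed patch: OpenConnectionTimeoutSketch and callWithTimeout are hypothetical names, the Callable stands in for the real open call, and the timeout value would have to come from a new or existing client configuration. It relies only on the standard ExecutorService and Future APIs.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class OpenConnectionTimeoutSketch {

    // Hypothetical helper: runs a blocking "open" call on a worker thread and
    // gives up after timeoutMs instead of waiting indefinitely.
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> opener, long timeoutMs)
            throws Exception {
        Future<T> future = pool.submit(opener);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Request interruption of the worker. Whether the underlying call
            // actually stops depends on whether it responds to interrupts.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the opener's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            if (cause instanceof Exception) {
                throw (Exception) cause;
            }
            throw e;
        }
    }
}
```

A caller would pass its existing pool and a lambda that wraps the open call, then treat TimeoutException as a failed connection attempt so that any lock held by the caller is released. Note that the timeout only unblocks the waiting thread; the worker may stay occupied if the underlying call ignores interruption.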
Stack trace for reference:
{code:java}
jdk.internal.misc.Unsafe.park
java.util.concurrent.locks.LockSupport.park
java.util.concurrent.CompletableFuture$Signaller.block
java.util.concurrent.ForkJoinPool.managedBlock
java.util.concurrent.CompletableFuture.waitingGet
java.util.concurrent.CompletableFuture.get
org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
java.lang.reflect.Constructor.newInstance
org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
java.security.AccessController.doPrivileged
javax.security.auth.Subject.doAs
org.apache.hadoop.security.UserGroupInformation.doAs
org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
org.apache.phoenix.util.PhoenixContextExecutor.call
org.apache.phoenix.query.ConnectionQueryServicesImpl.init
org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
{code}