[ 
https://issues.apache.org/jira/browse/PHOENIX-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani reassigned PHOENIX-7233:
-------------------------------------

    Assignee:     (was: Divneet Kaur)

> CQSI openConnection should timeout to unblock other connection threads
> ----------------------------------------------------------------------
>
>                 Key: PHOENIX-7233
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7233
>             Project: Phoenix
>          Issue Type: Improvement
>    Affects Versions: 5.1.3
>            Reporter: Viraj Jasani
>            Priority: Major
>
> PhoenixDriver initializes ConnectionQueryServices objects and caches them in 
> connectionQueryServicesCache. As part of CQSI initialization, a connection 
> to the HBase cluster is opened through the ConnectionFactory provided by the 
> HBase client, which returns a Connection object. This Connection object lets 
> clients share the Zookeeper connection, the meta cache, and the remote 
> connections to the regionserver and master daemons. It is used to perform 
> Table CRUD operations as well as administrative actions on the cluster.
> Initializing the HBase Connection object requires the ClusterId, which is 
> maintained in Zookeeper, in the Master daemons, or both. The client retrieves 
> it from one or the other depending on whether it is configured to use 
> ZKConnectionRegistry or MasterRegistry/RpcConnectionRegistry.
> With ZKConnectionRegistry, we ran into an edge case in which the connection 
> to the Zookeeper server was stuck for more than 12 hours. While the client 
> was creating a connection to the Zookeeper quorum to retrieve the ClusterId, 
> the Zookeeper leader switched from one server to another. Why the leader 
> switch resulted in a stuck connection still requires an RCA, but the 
> Phoenix/HBase client should not wait indefinitely for a response from 
> Zookeeper without any connection timeout.
> For the Phoenix client, if one thread is stuck opening a connection during 
> CQSI#init, every other thread trying to create a connection also gets stuck, 
> because we take a class-level lock before opening the connection. This can 
> lead to termination or degradation of the client JVM.
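
To illustrate the blocking described in the paragraph above, here is a minimal, self-contained Java sketch. It is not Phoenix code: the class name InitLockContentionSketch, the thread names, and the latch that stands in for a hung openConnection call are all hypothetical. It only shows that a second thread stays BLOCKED on a class-level monitor for as long as the first thread holds it.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the locking pattern described above; not Phoenix code.
final class InitLockContentionSketch {

    /** Returns the state of a second thread while the first is stuck holding the class-level lock. */
    static Thread.State stateOfWaiterWhileHolderIsStuck() throws InterruptedException {
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (InitLockContentionSketch.class) {
                holderHasLock.countDown();
                try {
                    release.await(); // simulates an openConnection call that does not return
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "stuck-opener");

        Thread waiter = new Thread(() -> {
            synchronized (InitLockContentionSketch.class) {
                // A real client would open its own connection here.
            }
        }, "waiting-opener");

        holder.start();
        holderHasLock.await(); // make sure the holder owns the lock first
        waiter.start();

        // Poll (bounded to 5 seconds) until the waiter is parked on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (waiter.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
        }
        Thread.State observed = waiter.getState();

        release.countDown(); // let the simulated stuck call finish so both threads exit
        holder.join();
        waiter.join();
        return observed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("waiter state while holder is stuck: "
                + stateOfWaiterWhileHolderIsStuck());
    }
}
```

In the sketch the holder is released after the observation; in the reported incident nothing released it, so every later caller would have remained in that state.
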
> While the HBase client should also use a timeout, not having one on the 
> Phoenix client side has far worse consequences. As part of this Jira, we 
> should introduce a way for CQSI#openConnection to time out, either by using 
> the CompletableFuture API or by using our preconfigured thread pool.
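
One possible shape for the thread-pool variant of this proposal is sketched below. This is an illustration under stated assumptions, not the actual Phoenix change: OpenConnectionTimeoutSketch, openWithTimeout, and the timeout values are hypothetical names and numbers, and a plain Callable stands in for the real openConnection call. The idea is to run the blocking call on a pool thread and wait for it with a bounded Future.get, so the caller gets a TimeoutException instead of waiting forever.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; a Callable stands in for the real connection-opening call.
final class OpenConnectionTimeoutSketch {

    /** Runs opener on the executor and waits at most timeoutMs for its result. */
    static <T> T openWithTimeout(Callable<T> opener, ExecutorService executor, long timeoutMs)
            throws Exception {
        Future<T> future = executor.submit(opener);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Ask the worker to stop; this only helps if the blocked call is interruptible.
            future.cancel(true);
            throw e;
        } catch (ExecutionException e) {
            // Surface the opener's own failure rather than the wrapper.
            Throwable cause = e.getCause();
            throw (cause instanceof Exception) ? (Exception) cause : e;
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            String result = openWithTimeout(() -> "connected", pool, 1_000);
            System.out.println(result);

            CountDownLatch never = new CountDownLatch(1); // never counted down
            try {
                openWithTimeout(() -> {
                    never.await(); // simulates a connection attempt that hangs
                    return "unreachable";
                }, pool, 200);
            } catch (TimeoutException e) {
                System.out.println("open timed out; caller is unblocked");
            }
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The sketch uses ExecutorService.submit rather than CompletableFuture because Future.cancel(true) on a submitted task interrupts the worker thread, whereas cancelling a CompletableFuture does not interrupt the running computation. Whether the real blocked call would react to that interrupt is an assumption that would need to be checked against the HBase client.
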
>  
> Stacktrace for reference:
>  
> {code:java}
> jdk.internal.misc.Unsafe.park
> java.util.concurrent.locks.LockSupport.park
> java.util.concurrent.CompletableFuture$Signaller.block
> java.util.concurrent.ForkJoinPool.managedBlock
> java.util.concurrent.CompletableFuture.waitingGet
> java.util.concurrent.CompletableFuture.get
> org.apache.hadoop.hbase.client.ConnectionImplementation.retrieveClusterId
> org.apache.hadoop.hbase.client.ConnectionImplementation.<init>
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance?
> jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance
> jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance
> java.lang.reflect.Constructor.newInstance
> org.apache.hadoop.hbase.client.ConnectionFactory.lambda$createConnection$?
> org.apache.hadoop.hbase.client.ConnectionFactory$$Lambda$?.run
> java.security.AccessController.doPrivileged
> javax.security.auth.Subject.doAs
> org.apache.hadoop.security.UserGroupInformation.doAs
> org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.hadoop.hbase.client.ConnectionFactory.createConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.openConnection
> org.apache.phoenix.query.ConnectionQueryServicesImpl.access$?
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl$?.call
> org.apache.phoenix.util.PhoenixContextExecutor.call
> org.apache.phoenix.query.ConnectionQueryServicesImpl.init
> org.apache.phoenix.jdbc.PhoenixDriver.getConnectionQueryServices
> org.apache.phoenix.jdbc.HighAvailabilityGroup.connectToOneCluster
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.getConnection
> org.apache.phoenix.jdbc.ParallelPhoenixConnection.lambda$new$?
> org.apache.phoenix.jdbc.ParallelPhoenixConnection$$Lambda$?.get
> org.apache.phoenix.jdbc.ParallelPhoenixContext.lambda$chainOnConnClusterContext$?
> org.apache.phoenix.jdbc.ParallelPhoenixContext$$Lambda$?.apply
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
