[ 
https://issues.apache.org/jira/browse/HBASE-28741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18074385#comment-18074385
 ] 

Nick Dimiduk commented on HBASE-28741:
--------------------------------------

ConnectionRegistryRpcStubHolder.fetchClusterIdAndCreateStubs() can permanently 
orphan its CompletableFuture. The listener callback's success path (creating 
the RPC client and stubs) is not wrapped in a try-catch. If 
RpcClientFactory.createClient() or createStubs() throws, the exception 
propagates into FutureUtils.addListener's catch-all, which logs "Unexpected 
error caught when processing CompletableFuture" but cannot complete the future. 
Neither complete() nor completeExceptionally() is ever called, so every caller 
up the chain hangs indefinitely. A second path exists: if ClusterIdFetcher 
construction throws, the future is assigned to addr2StubFuture but the method 
throws before returning, leaving addr2StubFuture pointing at a zombie future 
that subsequent getStubs() calls will return.
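
A minimal sketch of the guard the success path appears to need, written 
against plain java.util.concurrent types rather than the actual HBase 
classes. The names GuardedListener, createStubsGuarded and stubFactory are 
hypothetical and stand in for the listener body and the 
RpcClientFactory.createClient()/createStubs() calls; the point is only that 
every exit from the callback completes the future one way or the other.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical illustration, not HBase code.
final class GuardedListener {

  /**
   * Chains a stub-creating step onto a cluster-id future. Whatever happens
   * inside the callback, the returned future is always completed.
   */
  public static <T> CompletableFuture<T> createStubsGuarded(
      CompletableFuture<String> clusterIdFuture, Function<String, T> stubFactory) {
    CompletableFuture<T> result = new CompletableFuture<>();
    clusterIdFuture.whenComplete((clusterId, error) -> {
      if (error != null) {
        // Failure path: propagate the fetch error to the caller.
        result.completeExceptionally(error);
        return;
      }
      try {
        // Success path: stubFactory stands in for client and stub creation.
        result.complete(stubFactory.apply(clusterId));
      } catch (Throwable t) {
        // Without this catch the exception is only logged and result never completes.
        result.completeExceptionally(t);
      }
    });
    return result;
  }

  public static void main(String[] args) {
    CompletableFuture<String> clusterId = new CompletableFuture<>();
    CompletableFuture<Object> stubs = createStubsGuarded(clusterId, id -> {
      throw new UnsupportedOperationException("simulated client constructor failure");
    });
    clusterId.complete("test-cluster");
    // The caller sees a failed future instead of waiting forever.
    System.out.println(stubs.isCompletedExceptionally());
  }
}
```

The second path would presumably need analogous handling: either construct 
the fetcher before publishing the future to addr2StubFuture, or clear the 
field and complete the future exceptionally before rethrowing.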

Separately, ClusterIdFetcher creates its RPC channel with rpcTimeout=0. The 
code comment says the timeout doesn't matter because it's "only a preamble 
connection header," but the preamble still requires a TCP connect and 
potentially a TLS negotiation, either of which can hang.
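
As a rough illustration of how the wait could be bounded no matter where the 
hang occurs, here is a Java 8 compatible watchdog sketch in the spirit of the 
timer-based option in the issue description. FutureDeadline and withDeadline 
are hypothetical names, and the ScheduledExecutorService is a stand-in for 
whatever timer the client already has. This only limits how long callers 
wait; it does not address the connect or TLS stall itself.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration, not HBase code.
final class FutureDeadline {

  /** Fails the future with TimeoutException if it is still pending after the delay. */
  public static <T> CompletableFuture<T> withDeadline(CompletableFuture<T> future,
      long timeout, TimeUnit unit, ScheduledExecutorService timer) {
    // completeExceptionally is a no-op if the future has already finished.
    ScheduledFuture<?> watchdog = timer.schedule(
      () -> future.completeExceptionally(
        new TimeoutException("no response within " + timeout + " " + unit)),
      timeout, unit);
    // Cancel the watchdog once the future finishes for any reason.
    future.whenComplete((value, error) -> watchdog.cancel(false));
    return future;
  }

  public static void main(String[] args) throws Exception {
    ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    try {
      // Simulates a cluster-id fetch whose connection never gets past the preamble.
      CompletableFuture<String> stalled = new CompletableFuture<>();
      withDeadline(stalled, 100, TimeUnit.MILLISECONDS, timer);
      try {
        stalled.get(5, TimeUnit.SECONDS);
      } catch (ExecutionException e) {
        System.out.println("failed with: " + e.getCause().getClass().getSimpleName());
      }
    } finally {
      timer.shutdownNow();
    }
  }
}
```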

Both issues were observed together in production: a certificate authority 
sidecar returning HTTP 500 caused the RPC client constructor to throw 
UnsupportedOperationException, orphaning the future and hanging the connection 
for 300 seconds until the framework killed the thread.

(written with AI)

> Rpc ConnectionRegistry APIs should have timeout
> -----------------------------------------------
>
>                 Key: HBASE-28741
>                 URL: https://issues.apache.org/jira/browse/HBASE-28741
>             Project: HBase
>          Issue Type: Improvement
>    Affects Versions: 2.6.0, 2.4.18, 2.5.10
>            Reporter: Viraj Jasani
>            Assignee: Nick Dimiduk
>            Priority: Major
>
> The ConnectionRegistry APIs are some of the most basic metadata APIs; they 
> determine how clients can interact with the servers after getting the 
> required metadata. These APIs should time out quickly if they cannot serve 
> metadata in time.
> Similar to HBASE-28428, which introduced timeouts for the ZooKeeper 
> ConnectionRegistry APIs, we should also introduce timeouts (with the same 
> timeout values) for the Rpc ConnectionRegistry APIs. RpcConnectionRegistry 
> uses the HBase RPC framework in hedged read fanout mode.
> We have two options for introducing a timeout:
>  # Use RetryTimer to keep watch on the CompletableFuture and make it 
> complete exceptionally if the timeout is reached (similar to the proposal 
> in HBASE-28428).
>  # Introduce a separate Rpc timeout config for 
> AbstractRpcBasedConnectionRegistry, since the rpc timeout for generic RPC 
> operations (hbase.rpc.timeout) could be higher.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
