JeongMin Ju created HBASE-30058:
-----------------------------------

             Summary: Snapshot operations create unnecessary short-lived 
connections causing excessive KDC requests in Kerberos environments
                 Key: HBASE-30058
                 URL: https://issues.apache.org/jira/browse/HBASE-30058
             Project: HBase
          Issue Type: Bug
          Components: security, snapshots
            Reporter: JeongMin Ju


In a Kerberos-secured HBase cluster, each snapshot operation triggers two 
unnecessary {{ConnectionFactory.createConnection(conf)}} calls in 
{{SnapshotDescriptionUtils.validate()}}, which is invoked from 
{{MasterRpcServices.snapshot()}}. These short-lived connections are created and 
immediately closed, but each creation involves establishing a new ZooKeeper 
session with GSSAPI authentication, resulting in KDC requests for service 
tickets.

When batch snapshot jobs process many tables in a short period, this generates 
a large volume of KDC requests. The KDC may interpret this traffic as a 
brute-force or DDoS attack and block the HBase Master's IP. Once blocked, the 
Master can no longer authenticate any Kerberos operations, effectively 
rendering it non-functional and eventually causing it to fail.

h3. Root Cause

{{SnapshotDescriptionUtils.validate()}} calls:

1. {{isSecurityAvailable(conf)}}: creates a full {{Connection}} + {{Admin}} 
just to check whether the {{hbase:acl}} table exists
2. {{writeAclToSnapshotDescription()}}: calls 
{{PermissionStorage.getTablePermissions(conf, tableName)}}, which in turn calls 
{{getPermissions()}} with {{Table t = null}}, creating another {{Connection}} 
to read from {{hbase:acl}}

Each {{ConnectionFactory.createConnection(conf)}} with the default 
{{ZKConnectionRegistry}} creates a new {{ReadOnlyZKClient}}, which establishes 
a ZK session with GSSAPI (Kerberos) SASL authentication. Since each connection 
gets a new JAAS {{LoginContext}} with a new {{Subject}} (in 
{{org.apache.zookeeper.Login}}), service tickets are not cached across 
connections, and every connection triggers a TGS request to the KDC.

{{isSecurityAvailable()}} is also called from {{RestoreSnapshotProcedure}} and 
{{CloneSnapshotProcedure}}, so the same issue affects snapshot restore/clone 
operations.
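
The caching argument above can be illustrated with a toy model. This is only a sketch and not HBase or ZooKeeper code: {{FakeKdc}}, {{FakeSubject}} and the principal names are hypothetical stand-ins. It assumes that a service ticket is cached per {{Subject}} and that the KDC is contacted only on a cache miss.

```java
import java.util.HashMap;
import java.util.Map;

public class TicketCacheSketch {
    /** Hypothetical stand-in for a KDC that counts service-ticket (TGS) requests. */
    static final class FakeKdc {
        int tgsRequests = 0;

        String issueServiceTicket(String servicePrincipal) {
            tgsRequests++;
            return "ticket-for-" + servicePrincipal + "#" + tgsRequests;
        }
    }

    /** Hypothetical stand-in for a JAAS Subject holding cached service tickets. */
    static final class FakeSubject {
        private final Map<String, String> serviceTickets = new HashMap<>();

        String ticketFor(String servicePrincipal, FakeKdc kdc) {
            // The KDC is contacted only when this Subject has no cached ticket.
            return serviceTickets.computeIfAbsent(servicePrincipal, kdc::issueServiceTicket);
        }
    }

    /** Models one new Subject per connection: nothing is ever reused. */
    static int connectWithFreshSubjects(FakeKdc kdc, int connections) {
        for (int i = 0; i < connections; i++) {
            new FakeSubject().ticketFor("zookeeper/zk-host", kdc);
        }
        return kdc.tgsRequests;
    }

    /** Models all connections authenticating under one shared Subject. */
    static int connectWithSharedSubject(FakeKdc kdc, int connections) {
        FakeSubject shared = new FakeSubject();
        for (int i = 0; i < connections; i++) {
            shared.ticketFor("hbase/master-host", kdc);
        }
        return kdc.tgsRequests;
    }

    public static void main(String[] args) {
        // 100 snapshots x 2 short-lived connections each.
        System.out.println(connectWithFreshSubjects(new FakeKdc(), 200)); // prints 200
        System.out.println(connectWithSharedSubject(new FakeKdc(), 200)); // prints 1
    }
}
```

In this model the number of TGS requests grows linearly with the number of connections when each one gets its own {{Subject}}, and stays at one when the {{Subject}} is shared.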

h3. Workaround

Setting 
{{hbase.client.registry.impl=org.apache.hadoop.hbase.client.RpcConnectionRegistry}}
 mitigates the issue. With {{RpcConnectionRegistry}}, new connections use 
RPC-based SASL authentication which runs under the server's shared UGI 
{{Subject}} (via {{ugi.doAs()}}). This allows service tickets to be cached in 
the shared {{Subject}} and reused across connections, eliminating repeated KDC 
requests after the initial authentication.

By contrast, {{ZKConnectionRegistry}} creates a new JAAS {{LoginContext}} with a 
new {{Subject}} per ZK session (in {{org.apache.zookeeper.Login}}), so service 
tickets are never shared. Note that this workaround only reduces KDC traffic; it 
does not address the unnecessary connection creation itself.
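
As a configuration fragment, the workaround might look like the following. The property name and value are the ones given above; that it goes into the {{hbase-site.xml}} loaded by the Master is an assumption, and a given deployment may need additional registry settings.

```xml
<!-- Assumed location: hbase-site.xml as read by the HBase Master -->
<property>
  <name>hbase.client.registry.impl</name>
  <value>org.apache.hadoop.hbase.client.RpcConnectionRegistry</value>
</property>
```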

h3. Proposed Fix

1. Replace {{isSecurityAvailable(conf)}} with 
{{User.isHBaseSecurityEnabled(conf)}}, which checks the 
{{hbase.security.authentication}} configuration value instead of creating a 
connection to verify that the {{hbase:acl}} table exists. In a properly configured 
cluster, security being enabled implies that {{hbase:acl}} exists.

2. For {{writeAclToSnapshotDescription()}}, avoid creating a new {{Connection}} 
by obtaining a {{Table}} instance from an existing connection (e.g., the 
Master's shared connection) and passing it to 
{{PermissionStorage.getPermissions()}}. Currently {{null}} is passed as the 
{{Table}} parameter, which forces the method to create a new {{Connection}} 
internally. Note that a similar pattern in {{PermissionStorage.loadAll()}} 
already has a {{TODO}} comment acknowledging this issue: {{// TODO: Pass in a 
Connection rather than create one each time.}}
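
The two changes above can be sketched with a small self-contained model. This is not the actual patch: {{FakeConnection}}, {{isSecurityEnabled}} and {{getPermissions}} are hypothetical stand-ins, and the comparison against the value {{kerberos}} is an assumption about what the configuration check looks like.

```java
import java.util.Map;

public class SnapshotValidateSketch {
    /** Hypothetical connection type that counts how many instances were opened. */
    static final class FakeConnection implements AutoCloseable {
        static int opened = 0;

        FakeConnection() {
            opened++;
        }

        String readAcl(String table) {
            return "acl-of-" + table;
        }

        @Override
        public void close() {
            // Nothing to release in this model.
        }
    }

    /** Fix 1: decide from configuration alone, without opening a connection. */
    static boolean isSecurityEnabled(Map<String, String> conf) {
        // Assumption: "kerberos" is the value that marks security as enabled.
        return "kerberos".equalsIgnoreCase(conf.get("hbase.security.authentication"));
    }

    /** A null argument forces a short-lived connection; fix 2 passes an existing one. */
    static String getPermissions(FakeConnection existing, String table) {
        if (existing != null) {
            return existing.readAcl(table);
        }
        try (FakeConnection temporary = new FakeConnection()) {
            return temporary.readAcl(table);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of("hbase.security.authentication", "kerberos");
        System.out.println("security enabled: " + isSecurityEnabled(conf));

        FakeConnection shared = new FakeConnection(); // long-lived, opened once
        String[] tables = {"t1", "t2", "t3"};

        int before = FakeConnection.opened;
        for (String t : tables) {
            getPermissions(null, t); // current pattern: one new connection per call
        }
        System.out.println("opened with null: " + (FakeConnection.opened - before));

        before = FakeConnection.opened;
        for (String t : tables) {
            getPermissions(shared, t); // proposed pattern: reuse the shared connection
        }
        System.out.println("opened with shared: " + (FakeConnection.opened - before));
    }
}
```

Passing the existing connection keeps the number of newly opened connections independent of how many tables are processed, which is the property the proposed fix relies on.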



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
