[ 
https://issues.apache.org/jira/browse/HBASE-30058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

JeongMin Ju updated HBASE-30058:
--------------------------------
    Description: 
In a Kerberos-secured HBase cluster, each snapshot operation triggers two 
unnecessary {{ConnectionFactory.createConnection(conf)}} calls in 
{{SnapshotDescriptionUtils.validate()}}, which is invoked from 
{{MasterRpcServices.snapshot()}}. These short-lived connections are created and 
immediately closed, but each creation involves establishing a new ZooKeeper 
session with GSSAPI authentication, resulting in KDC requests for service 
tickets.

When batch snapshot jobs process many tables in a short period, these 
connections generate a large volume of KDC requests. The KDC may interpret this 
traffic as a brute-force or DDoS attack and block the HBase Master's IP. Once 
blocked, the Master can no longer authenticate through Kerberos, which renders 
it effectively non-functional and eventually causes it to fail.

h3. Root Cause

{{SnapshotDescriptionUtils.validate()}} calls:

# {{isSecurityAvailable(conf)}} — creates a full {{Connection}} + {{Admin}} 
just to check if the {{hbase:acl}} table exists
# {{writeAclToSnapshotDescription()}} — calls 
{{PermissionStorage.getTablePermissions(conf, tableName)}} which calls 
{{getPermissions()}} with {{Table t = null}}, creating another {{Connection}} 
to read from {{hbase:acl}}

Each {{ConnectionFactory.createConnection(conf)}} with the default 
{{ZKConnectionRegistry}} creates a new {{ReadOnlyZKClient}}, which establishes 
a ZK session with GSSAPI (Kerberos) SASL authentication. Since each connection 
gets a new JAAS {{LoginContext}} with a new {{Subject}} (in 
{{org.apache.zookeeper.Login}}), service tickets are not cached across 
connections, and every connection triggers a TGS request to the KDC.

{{isSecurityAvailable()}} is also called from {{RestoreSnapshotProcedure}} and 
{{CloneSnapshotProcedure}}, so the same issue affects snapshot restore/clone 
operations.

h3. Workaround

Setting 
{{hbase.client.registry.impl=org.apache.hadoop.hbase.client.RpcConnectionRegistry}}
mitigates the issue. With {{RpcConnectionRegistry}}, new connections use 
RPC-based SASL authentication which runs under the server's shared UGI 
{{Subject}} (via {{ugi.doAs()}}). This allows service tickets to be cached in 
the shared {{Subject}} and reused across connections, eliminating repeated KDC 
requests after the initial authentication.

By contrast, {{ZKConnectionRegistry}} creates a new JAAS {{LoginContext}} with 
a new {{Subject}} per ZK session (in {{org.apache.zookeeper.Login}}), so service 
tickets are never shared between sessions. Note that this workaround only 
reduces KDC traffic; it does not remove the unnecessary connection creation 
itself.
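
The difference between the two registries can be illustrated with a toy model. This is only a sketch: {{ToySubject}}, {{tgsRequests}} and the service names are hypothetical stand-ins, not real JAAS, ZooKeeper or HBase classes, and ticket expiry is ignored. It assumes a ticket is requested from the KDC only when the {{Subject}} in use does not already hold one for that service.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: none of these types are real Kerberos, ZooKeeper, or HBase classes. */
final class TicketCacheModel {
  /** Counts simulated TGS requests sent to the KDC. */
  static final AtomicInteger tgsRequests = new AtomicInteger();

  /** Stand-in for a JAAS Subject: holds the service tickets obtained so far. */
  static final class ToySubject {
    private final Set<String> serviceTickets = new HashSet<>();

    /** Contacts the simulated KDC only when no ticket for the service is cached. */
    void authenticateTo(String service) {
      if (serviceTickets.add(service)) {
        tgsRequests.incrementAndGet();
      }
    }
  }

  /** Each connection logs in with its own fresh Subject (the ZKConnectionRegistry case). */
  static int freshSubjectPerConnection(int connections) {
    tgsRequests.set(0);
    for (int i = 0; i < connections; i++) {
      new ToySubject().authenticateTo("zookeeper/zk-host");
    }
    return tgsRequests.get();
  }

  /** All connections reuse one process-wide Subject (the RpcConnectionRegistry case). */
  static int sharedSubject(int connections) {
    tgsRequests.set(0);
    ToySubject shared = new ToySubject();
    for (int i = 0; i < connections; i++) {
      shared.authenticateTo("hbase/master-host");
    }
    return tgsRequests.get();
  }
}
```

In this model the number of simulated KDC requests grows linearly with the number of connections when every connection has its own {{Subject}}, and stays at one when the {{Subject}} is shared.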

h3. Analysis: Ambiguity in Original Design Intent

During analysis of the fix, we encountered several design ambiguities in the 
original code that are worth documenting.

*1. {{isSecurityAvailable(conf)}} checks {{hbase:acl}} table existence, not 
configuration*

The original {{isSecurityAvailable}} creates a connection and checks whether 
the {{hbase:acl}} table exists via {{admin.tableExists()}}. The comment above 
its call site says _"set the acl to snapshot if security feature is enabled"_, 
which suggests it should check whether the security feature is enabled. 
However, the implementation checks table existence instead.

*2. {{hbase:acl}} table creation is decoupled from 
{{hbase.security.authorization}}*

The {{hbase:acl}} table is created by {{AccessController.postStartMaster()}} 
whenever {{AccessController}} is loaded as a master coprocessor, regardless of 
the {{hbase.security.authorization}} setting. This means:

||Scenario||{{hbase:acl}} exists||Authorization enforced||
|{{authorization=true}} + {{AccessController}} loaded|Yes|Yes|
|{{authorization=false}} + {{AccessController}} loaded|Yes|No 
({{NoopAccessChecker}} is used)|
|{{authorization=true}} + {{AccessController}} not loaded|No|Partially 
({{AccessChecker}} is used but {{hbase:acl}} is missing)|

This decoupling creates ambiguity: when {{hbase.security.authorization=false}} 
but {{AccessController}} is loaded, the {{hbase:acl}} table exists and ACL data 
is managed (grant/revoke work), but permissions are not enforced 
({{NoopAccessChecker}} allows everything).

*3. Purpose of storing ACL in snapshot descriptor*

The ACL stored in the snapshot descriptor via 
{{writeAclToSnapshotDescription()}} is consumed by 
{{RestoreSnapshotHelper.restoreSnapshotAcl()}} during snapshot restore/clone 
operations. It restores the original table's permissions to the newly created 
table. This is separate from {{SnapshotScannerHDFSAclController}}, which reads 
permissions directly from {{hbase:acl}} at snapshot time for HDFS-level ACL 
synchronization.

*4. Considered alternatives for {{isSecurityAvailable}}*

We considered two configuration-based alternatives to eliminate the connection 
creation:

- {{User.isHBaseSecurityEnabled(conf)}} — checks 
{{hbase.security.authentication=kerberos}}. But Kerberos authentication can be 
enabled without authorization (no {{hbase:acl}} table), so this would be too 
broad.
- {{AccessChecker.isAuthorizationSupported(conf)}} — checks 
{{hbase.security.authorization=true}}. But as noted above, {{hbase:acl}} can 
exist even when {{authorization=false}}, so this would miss the case where ACL 
data exists and should be preserved in snapshots.

Note that {{AccessChecker.isAuthorizationSupported(conf)}} is already checked 
in {{validate()}} just before {{isSecurityAvailable()}} (for setting the 
snapshot owner). Reusing it in place of {{isSecurityAvailable()}} would make the 
second condition redundant with the first.

Neither configuration-based approach perfectly matches the original behavior of 
checking {{hbase:acl}} table existence.
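
The mismatch can be made concrete by encoding the scenario table from point 2 as predicates. This is a hedged sketch rather than HBase code: the class and method names are hypothetical, and {{aclTableExists}} simply assumes, as the table states, that {{hbase:acl}} exists exactly when {{AccessController}} is loaded.

```java
/** Toy model of the scenario table; not HBase code. */
final class SecurityCheckComparison {
  /** Per the table: hbase:acl exists whenever AccessController is loaded. */
  static boolean aclTableExists(boolean authorization, boolean accessControllerLoaded) {
    return accessControllerLoaded;
  }

  /** What a check based only on hbase.security.authorization would return. */
  static boolean configBasedCheck(boolean authorization, boolean accessControllerLoaded) {
    return authorization;
  }

  /** True when the configuration-based check gives the same answer as the original check. */
  static boolean checksAgree(boolean authorization, boolean accessControllerLoaded) {
    return aclTableExists(authorization, accessControllerLoaded)
        == configBasedCheck(authorization, accessControllerLoaded);
  }
}
```

Under these assumptions the two checks agree only in the first row of the table; they diverge when {{authorization=false}} with {{AccessController}} loaded, and when {{authorization=true}} without it.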

h3. Proposed Fix

Given the ambiguity, we chose to preserve the original behavior exactly while 
eliminating the unnecessary connection creation:

# Change {{isSecurityAvailable(Configuration conf)}} to 
{{isSecurityAvailable(Connection conn)}} — reuse an existing connection (e.g., 
the Master's shared connection) to check {{hbase:acl}} table existence via 
{{admin.tableExists()}}. This is functionally identical to the original, just 
without creating a new connection.
# Change {{writeAclToSnapshotDescription()}} to accept a {{Connection}} 
parameter and pass an existing {{Table}} instance to 
{{PermissionStorage.getTablePermissions()}} instead of {{null}}. This avoids 
the second unnecessary connection creation. Note that a similar pattern in 
{{PermissionStorage.loadAll()}} already has a {{TODO}} comment acknowledging 
this issue: {{// TODO: Pass in a Connection rather than create one each time.}}
# Update all callers of {{validate()}} and {{isSecurityAvailable()}} to pass 
through the available connection:
#* {{MasterRpcServices.snapshot()}} — {{server.getConnection()}}
#* {{RestoreSnapshotProcedure}} — {{env.getMasterServices().getConnection()}}
#* {{CloneSnapshotProcedure}} — {{env.getMasterServices().getConnection()}}

This approach makes zero behavioral changes — the same check is performed, the 
same data is read, the same ACL is written to snapshots — while completely 
eliminating the per-operation connection creation that caused the KDC flooding.
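
The shape of the change can be sketched with stand-in types. This is not the actual patch: {{ToyConnection}}, {{createConnection()}} and the {{Before}}/{{After}} method names are hypothetical, chosen only so the example runs without HBase on the classpath. The point it illustrates is that the check itself is unchanged and only the source of the connection differs.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Sketch of the proposed shape using stand-in types; not the actual HBase patch. */
final class SnapshotValidationSketch {
  /** Stand-in for org.apache.hadoop.hbase.client.Connection. */
  interface ToyConnection {
    boolean tableExists(String tableName);
  }

  /** Counts how many connections the code under test creates. */
  static final AtomicInteger connectionsCreated = new AtomicInteger();

  /** Stand-in for ConnectionFactory.createConnection(conf). */
  static ToyConnection createConnection() {
    connectionsCreated.incrementAndGet();
    return tableName -> "hbase:acl".equals(tableName);
  }

  /** Before: the check opens its own connection on every call. */
  static boolean isSecurityAvailableBefore() {
    return createConnection().tableExists("hbase:acl");
  }

  /** After: the caller passes in a connection it already holds. */
  static boolean isSecurityAvailableAfter(ToyConnection conn) {
    return conn.tableExists("hbase:acl");
  }
}
```

With one shared connection, repeated calls to the "after" variant create no further connections, while each call to the "before" variant creates one.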

  was:
In a Kerberos-secured HBase cluster, each snapshot operation triggers two 
unnecessary {{ConnectionFactory.createConnection(conf)}} calls in 
{{SnapshotDescriptionUtils.validate()}}, which is invoked from 
{{MasterRpcServices.snapshot()}}. These short-lived connections are created and 
immediately closed, but each creation involves establishing a new ZooKeeper 
session with GSSAPI authentication, resulting in KDC requests for service 
tickets.

When batch snapshot jobs process many tables in a short period, this generates 
a large volume of KDC requests. The KDC may interpret this traffic as a 
brute-force or DDoS attack and block the HBase Master's IP. Once blocked, the 
Master can no longer authenticate any Kerberos operations, effectively 
rendering it non-functional and eventually causing it to fail.

h3. Root Cause

{{SnapshotDescriptionUtils.validate()}} calls:

1. {{isSecurityAvailable(conf)}} — creates a full {{Connection}} + {{Admin}} 
just to check if the {{hbase:acl}} table exists
2. {{writeAclToSnapshotDescription()}} — calls 
{{PermissionStorage.getTablePermissions(conf, tableName)}} which calls 
{{getPermissions()}} with {{Table t = null}}, creating another {{Connection}} 
to read from {{hbase:acl}}

Each {{ConnectionFactory.createConnection(conf)}} with the default 
{{ZKConnectionRegistry}} creates a new {{ReadOnlyZKClient}}, which establishes 
a ZK session with GSSAPI (Kerberos) SASL authentication. Since each connection 
gets a new JAAS {{LoginContext}} with a new {{Subject}} (in 
{{org.apache.zookeeper.Login}}), service tickets are not cached across 
connections, and every connection triggers a TGS request to the KDC.

{{isSecurityAvailable()}} is also called from {{RestoreSnapshotProcedure}} and 
{{CloneSnapshotProcedure}}, so the same issue affects snapshot restore/clone 
operations.

h3. Workaround

Setting 
{{hbase.client.registry.impl=org.apache.hadoop.hbase.client.RpcConnectionRegistry}}
 mitigates the issue. With {{RpcConnectionRegistry}}, new connections use 
RPC-based SASL authentication which runs under the server's shared UGI 
{{Subject}} (via {{ugi.doAs()}}). This allows service tickets to be cached in 
the shared {{Subject}} and reused across connections, eliminating repeated KDC 
requests after the initial authentication.

However, {{ZKConnectionRegistry}} creates a new JAAS {{LoginContext}} with a 
new {{Subject}} per ZK session (in {{org.apache.zookeeper.Login}}), so service 
tickets are never shared. This workaround does not address the unnecessary 
connection creation itself.

h3. Proposed Fix

1. Replace {{isSecurityAvailable(conf)}} with 
{{AccessChecker.isAuthorizationSupported(conf)}} — this checks the 
{{hbase.security.authorization}} configuration value instead of creating a 
connection to verify {{hbase:acl}} table existence. When authorization is 
enabled, {{hbase:acl}} table is guaranteed to exist as it is created by 
{{AccessController}} coprocessor.

2. For {{writeAclToSnapshotDescription()}}, avoid creating a new {{Connection}} 
by obtaining a {{Table}} instance from an existing connection (e.g., the 
Master's shared connection) and passing it to 
{{PermissionStorage.getTablePermissions()}}. Currently {{null}} is passed as 
the {{Table}} parameter, which forces the method to create a new {{Connection}} 
internally. Note that a similar pattern in {{PermissionStorage.loadAll()}} 
already has a {{TODO}} comment acknowledging this issue: {{// TODO: Pass in a 
Connection rather than create one each time.}}


> Snapshot operations create unnecessary short-lived connections causing 
> excessive KDC requests in Kerberos environments
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30058
>                 URL: https://issues.apache.org/jira/browse/HBASE-30058
>             Project: HBase
>          Issue Type: Bug
>          Components: security, snapshots
>            Reporter: JeongMin Ju
>            Assignee: JeongMin Ju
>            Priority: Major


