soowan4147 created HIVE-29584:
---------------------------------

             Summary: ObjectStore.handleDirectSqlError() leaks broken JDBC connection wrapper in PM ThreadLocal cache after MySQL transient outage
                 Key: HIVE-29584
                 URL: https://issues.apache.org/jira/browse/HIVE-29584
             Project: Hive
          Issue Type: Bug
          Components: Metastore, Standalone Metastore
    Affects Versions: 4.2.0, 4.0.0, 3.1.3
         Environment:   Hive: 3.1.3 (also verified affected: 4.0.0, 4.1.0)
  Standalone Metastore: yes
  Backend DB: MySQL 8.0
  JDBC Driver: mysql-connector-j 8.0.30
  Connection Pool: HikariCP 2.6.1
  ORM: DataNucleus 4.1.19 + datanucleus-core 4.1.17 + datanucleus-api-jdo 4.2.4
  JVM: OpenJDK 1.8.0_412
  Affected Client: Apache Amoro (long-lived thrift connection client)
            Reporter: soowan4147


## Summary

  After a brief MySQL outage (e.g., 4-second network glitch from a planned
  DBA operation), one HMS Thrift worker thread can permanently retain a broken
  HikariProxyConnection in its ObjectStore.pm ThreadLocal cache, leading to
  indefinite reuse of the same broken wrapper for hours until HMS is restarted.

  ## Root Cause

  `MetaStoreDirectSql.prepareTxn()` executes `SET @@session.sql_mode=ANSI_QUOTES`
  on every transaction. When this fails on a broken connection (e.g.,
  "Connection is closed" SQLException after MySQL transient outage),
  `ObjectStore.handleDirectSqlError()` falls back to ORM mode but does NOT
  invalidate the PersistenceManager. As a result:

  1. Same `ObjectStore.pm` is reused on next RPC
  2. Same broken HikariProxyConnection wrapper is reused (held via ThreadLocal)
  3. HikariCP cannot evict the in-use connection per design
  4. Only HMS restart releases the wrapper
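The retention pattern described above can be illustrated with a toy model. This is only a sketch: `FakeConnection`, `FakePm`, and `StickyHandleDemo` are hypothetical stand-ins, not Hive, DataNucleus, or HikariCP classes, and the real caching in `ObjectStore` is more involved. It shows that a per-thread cached handle keeps failing on every call until something explicitly drops it.

```java
public final class StickyHandleDemo {

    /** Hypothetical stand-in for a pooled connection wrapper. */
    static final class FakeConnection {
        private volatile boolean closed;

        void breakIt() { closed = true; }

        void execute(String sql) {
            if (closed) {
                throw new IllegalStateException("Connection is closed");
            }
        }
    }

    /** Hypothetical stand-in for a PersistenceManager that pins one connection. */
    static final class FakePm {
        final FakeConnection conn = new FakeConnection();
    }

    // One cached handle per worker thread, as in the failure scenario.
    private static final ThreadLocal<FakePm> PM = new ThreadLocal<>();

    static FakePm pm() {
        FakePm p = PM.get();
        if (p == null) {
            p = new FakePm();
            PM.set(p);
        }
        return p;
    }

    /** One simulated RPC; returns true if the per-transaction setup statement succeeded. */
    static boolean rpc(boolean invalidateOnError) {
        try {
            pm().conn.execute("SET @@session.sql_mode=ANSI_QUOTES");
            return true;
        } catch (IllegalStateException e) {
            if (invalidateOnError) {
                PM.remove(); // drop the cached handle so the next call builds a fresh one
            }
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("healthy: " + rpc(false));
        pm().conn.breakIt(); // simulate the transient outage
        System.out.println("no invalidation: " + rpc(false) + ", " + rpc(false));
        System.out.println("with invalidation: " + rpc(true) + ", " + rpc(true));
    }
}
```

Without invalidation, every later call on the same thread hits the same broken handle. With invalidation, the first failing call discards it and the following call recovers.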

  ## Production Incident Evidence (2026-04-26)

  - pool-6-thread-93872 retained broken wrapper for ~5 hours
  - 41,302 audit RPCs all from same client IP, all failing with same error
  - 6,047 "Falling back to ORM" + 14,309 ERROR logs in 5 hours
  - master02 normal threads' RPC throughput dropped 90%+ during incident
  - catalogd's ALTER_TABLE processing stalled for 1-3 minutes per event
  - Resolved only by master HMS restart (3h 10m total impact)

  ## Source References

  Verified the defect exists in:
  - 3.1.3: `ObjectStore.java#L3646-L3697` (handleDirectSqlError)
  - 4.0.0: `ObjectStore.java#L4449-L4495` (same defect, no PM cleanup)
  - `MetaStoreDirectSql.java#L2026-L2034` (prepareTxn trigger)

  ## Steps to Reproduce

  1. Set up HMS 3.1.3+ with HikariCP backed by MySQL
  2. Create a long-lived metastore client that maps permanently to one HMS
     worker thread (e.g., Apache Amoro pod, Spark Thrift Server)
  3. Briefly disconnect MySQL (4 seconds via iptables drop or KILL CONNECTION)
  4. Observe: one worker thread continues to reuse the broken wrapper indefinitely
  5. Verify: log shows continuous "Falling back to ORM path due to direct SQL
     failure: Error setting ansi quotes: Connection is closed" from same thread

  ## Proposed Fix

  In `ObjectStore.handleDirectSqlError()`, when the cause is a connection-level
  SQLException, invalidate the PM:

  ```java
  if (isConnectionLevelError(ex)) {
      if (pm != null) {
          try {
              if (pm.currentTransaction().isActive()) {
                  pm.currentTransaction().rollback();
              }
              pm.close();   // releases HikariCP wrapper to pool
          } catch (Exception e) {
              // best effort
          }
          pm = null;
          directSql = null;
      }
  }
  ```

  This forces a fresh PM (and thus a fresh connection) on the next RPC,
  allowing the broken connection to be properly evicted by HikariCP.
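
The proposed fix relies on an `isConnectionLevelError(ex)` check that does not exist in Hive today. One possible shape for it is sketched below, assuming it is enough to walk the cause chain and look for the standard `java.sql` connection exception types, an SQLState in class `08` (connection exception), or the "Connection is closed" message seen in the logs above. The class name `ConnectionErrors` and the depth limit are illustrative assumptions, and the message match is a heuristic that may need tuning per driver and pool.

```java
import java.sql.SQLException;
import java.sql.SQLNonTransientConnectionException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientConnectionException;

public final class ConnectionErrors {

    // Guard against pathological or cyclic cause chains (arbitrary limit).
    private static final int MAX_DEPTH = 32;

    private ConnectionErrors() {}

    /** Hypothetical helper: true if any cause in the chain looks connection-level. */
    public static boolean isConnectionLevelError(Throwable ex) {
        int depth = 0;
        for (Throwable t = ex; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            if (t instanceof SQLNonTransientConnectionException
                    || t instanceof SQLTransientConnectionException
                    || t instanceof SQLRecoverableException) {
                return true;
            }
            if (t instanceof SQLException) {
                // SQLState class "08" is the SQL standard's connection exception class.
                String state = ((SQLException) t).getSQLState();
                if (state != null && state.startsWith("08")) {
                    return true;
                }
                // Heuristic: a closed wrapper may report no SQLState at all.
                String msg = t.getMessage();
                if (msg != null && msg.contains("Connection is closed")) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

Errors unrelated to the connection, such as syntax or constraint failures, return false and would keep the existing fallback behaviour.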

  ## Workarounds (currently in use)

  - Client-side: shorten `hive.metastore.client.socket.timeout` on long-lived
  clients (e.g., Amoro) so they auto-reconnect every few minutes, breaking
  the permanent thread mapping
  - Operational: enable HikariCP `leakDetectionThreshold`, alarm on
  "Connection leak detection triggered" log, and auto-restart the affected HMS

  ## Related JIRAs (none directly fix this)

  - HIVE-22804 (sessionVariables workaround) — does not prevent the leak
  - HIVE-20192 (PM cleanup at thread exit) — different mechanism
  - HIVE-28788 (commit failure → starvation) — different trigger
  - HIVE-28839 (DataNucleus connection starvation) — different code path

  To my knowledge, this specific defect (PM ThreadLocal retaining broken
  wrapper after SQLException in handleDirectSqlError) has not been reported
  before.

  ---



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
