soowan4147 created HIVE-29584:
---------------------------------
Summary: ObjectStore.handleDirectSqlError() leaks broken JDBC
connection wrapper in PM ThreadLocal cache after MySQL transient outage
Key: HIVE-29584
URL: https://issues.apache.org/jira/browse/HIVE-29584
Project: Hive
Issue Type: Bug
Components: Metastore, Standalone Metastore
Affects Versions: 4.2.0, 4.0.0, 3.1.3
Environment: Hive: 3.1.3 (also verified affected: 4.0.0, 4.1.0)
Standalone Metastore: yes
Backend DB: MySQL 8.0
JDBC Driver: mysql-connector-j 8.0.30
Connection Pool: HikariCP 2.6.1
ORM: DataNucleus 4.1.19 + datanucleus-core 4.1.17 + datanucleus-api-jdo 4.2.4
JVM: OpenJDK 1.8.0_412
Affected Client: Apache Amoro (long-lived thrift connection client)
Reporter: soowan4147
## Summary
After a brief MySQL outage (e.g., 4-second network glitch from a planned
DBA operation), one HMS Thrift worker thread can permanently retain a broken
HikariProxyConnection in its ObjectStore.pm ThreadLocal cache, leading to
indefinite reuse of the same broken wrapper for hours until HMS is restarted.
## Root Cause
`MetaStoreDirectSql.prepareTxn()` executes `SET @@session.sql_mode=ANSI_QUOTES`
on every transaction. When this fails on a broken connection (e.g.,
"Connection is closed" SQLException after MySQL transient outage),
`ObjectStore.handleDirectSqlError()` falls back to ORM mode but does NOT
invalidate the PersistenceManager. As a result:
1. Same `ObjectStore.pm` is reused on next RPC
2. Same broken HikariProxyConnection wrapper is reused (held via ThreadLocal)
3. By design, HikariCP cannot evict a connection that is still checked out
4. Only HMS restart releases the wrapper
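The retention mechanism can be illustrated with a toy model. This is a minimal sketch, not Hive code: `StickyWrapperDemo`, `FakeConnection`, `FakePm`, and `handleRpc` are hypothetical stand-ins for the HikariCP wrapper, the per-thread PersistenceManager, and an RPC that runs the per-transaction `SET` statement.

```java
public final class StickyWrapperDemo {
    /** Hypothetical stand-in for a pooled connection wrapper (not a HikariCP class). */
    static final class FakeConnection {
        boolean broken;

        void execute(String sql) {
            if (broken) {
                throw new IllegalStateException("Connection is closed");
            }
        }
    }

    /** Hypothetical stand-in for a PersistenceManager that pins one connection. */
    static final class FakePm {
        final FakeConnection conn = new FakeConnection();
    }

    // One cached PM per worker thread, modelling the per-thread cache described above.
    private static final ThreadLocal<FakePm> PM = new ThreadLocal<>();

    /** Simulates the outage: marks the calling thread's cached connection as broken. */
    public static void breakCurrentConnection() {
        FakePm pm = PM.get();
        if (pm != null) {
            pm.conn.broken = true;
        }
    }

    /** Drops the calling thread's cached PM so each scenario starts clean. */
    public static void reset() {
        PM.remove();
    }

    /**
     * Simulates one RPC and returns true if the per-transaction statement succeeded.
     * With invalidateOnError=false the broken PM stays cached (the reported behavior);
     * with true it is dropped, so the next call builds a fresh one (the proposed behavior).
     */
    public static boolean handleRpc(boolean invalidateOnError) {
        FakePm pm = PM.get();
        if (pm == null) {
            pm = new FakePm();
            PM.set(pm);
        }
        try {
            pm.conn.execute("SET @@session.sql_mode=ANSI_QUOTES");
            return true;
        } catch (IllegalStateException e) {
            if (invalidateOnError) {
                PM.remove();
            }
            return false;
        }
    }

    public static void main(String[] args) {
        handleRpc(false);
        breakCurrentConnection();
        System.out.println("no invalidation: " + handleRpc(false) + ", " + handleRpc(false));

        reset();
        handleRpc(true);
        breakCurrentConnection();
        System.out.println("with invalidation: " + handleRpc(true) + ", " + handleRpc(true));
    }
}
```

In this model, the thread without invalidation fails on every call after the outage, while the thread with invalidation fails once and then recovers on a fresh object.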
## Production Incident Evidence (2026-04-26)
- pool-6-thread-93872 retained broken wrapper for ~5 hours
- 41,302 audit RPCs all from same client IP, all failing with same error
- 6,047 "Falling back to ORM" + 14,309 ERROR logs in 5 hours
- master02 normal threads' RPC throughput dropped 90%+ during incident
- catalogd's ALTER_TABLE processing stalled for 1 to 3 minutes per event
- Resolved only by master HMS restart (3h 10m total impact)
## Source References
Verified the defect exists in:
- 3.1.3: `ObjectStore.java#L3646-L3697` (handleDirectSqlError)
- 4.0.0: `ObjectStore.java#L4449-L4495` (same defect, no PM cleanup)
- `MetaStoreDirectSql.java#L2026-L2034` (prepareTxn trigger)
## Steps to Reproduce
1. Set up HMS 3.1.3+ with HikariCP backed by MySQL
2. Create a long-lived metastore client that maps permanently to one HMS
worker thread (e.g., Apache Amoro pod, Spark Thrift Server)
3. Briefly disconnect MySQL (4 seconds via iptables drop or KILL CONNECTION)
4. Observe: one worker thread continues to reuse the broken wrapper
indefinitely
5. Verify: log shows continuous "Falling back to ORM path due to direct SQL
failure: Error setting ansi quotes: Connection is closed" from same thread
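The verification in step 5 can be automated by counting the fallback message per thread. The following is a rough sketch: the class name `FallbackLogScan` is hypothetical, and it assumes the log layout puts the thread name in the first `[...]` group on each line, which depends on the Log4j pattern in use.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class FallbackLogScan {
    // Message fragments taken from the log line quoted in this report.
    private static final String FALLBACK = "Falling back to ORM path due to direct SQL failure";
    private static final String CLOSED = "Connection is closed";

    // Assumption: the thread name is the first bracketed token on the line.
    private static final Pattern THREAD = Pattern.compile("\\[([^\\]]+)\\]");

    private FallbackLogScan() {}

    /** Counts fallback lines caused by a closed connection, grouped by thread name. */
    public static Map<String, Integer> countByThread(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            if (!line.contains(FALLBACK) || !line.contains(CLOSED)) {
                continue;
            }
            Matcher m = THREAD.matcher(line);
            String thread = m.find() ? m.group(1) : "unknown";
            counts.merge(thread, 1, Integer::sum);
        }
        return counts;
    }
}
```

A single thread whose count keeps growing long after the database has recovered matches the pattern described in this report.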
## Proposed Fix
In `ObjectStore.handleDirectSqlError()`, when the cause is a connection-level
SQLException, invalidate the PM:
```java
if (isConnectionLevelError(ex)) {
  if (pm != null) {
    try {
      if (pm.currentTransaction().isActive()) {
        pm.currentTransaction().rollback();
      }
      pm.close(); // releases HikariCP wrapper to pool
    } catch (Exception e) {
      // best effort
    }
    pm = null;
    directSql = null;
  }
}
```
This forces a fresh PM (and thus a fresh connection) on the next RPC,
allowing the broken connection to be properly evicted by HikariCP.
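
The proposed fix calls `isConnectionLevelError(ex)` without defining it. One possible shape for that helper is sketched below; the class name `ConnectionErrors` and the matching rules are assumptions for illustration, not existing Hive code. SQLSTATE class `08` is the SQL-standard class for connection exceptions, and the match on "Connection is closed" is a heuristic based on the log text quoted in this report, needed when the exception carries no SQLSTATE.

```java
import java.sql.SQLException;
import java.sql.SQLNonTransientConnectionException;
import java.sql.SQLRecoverableException;
import java.sql.SQLTransientConnectionException;

public final class ConnectionErrors {
    // Guard against pathological or cyclic cause chains.
    private static final int MAX_DEPTH = 32;

    private ConnectionErrors() {}

    /** Returns true if any exception in the cause chain looks like a connection-level failure. */
    public static boolean isConnectionLevelError(Throwable ex) {
        int depth = 0;
        for (Throwable t = ex; t != null && depth < MAX_DEPTH; t = t.getCause(), depth++) {
            // JDBC 4 exception subclasses that signal a lost or unusable connection.
            if (t instanceof SQLNonTransientConnectionException
                || t instanceof SQLTransientConnectionException
                || t instanceof SQLRecoverableException) {
                return true;
            }
            if (t instanceof SQLException) {
                // SQLSTATE class 08 means "connection exception".
                String state = ((SQLException) t).getSQLState();
                if (state != null && state.startsWith("08")) {
                    return true;
                }
            }
            // Heuristic: match the message seen in the logs quoted in this report.
            String msg = t.getMessage();
            if (msg != null && msg.contains("Connection is closed")) {
                return true;
            }
        }
        return false;
    }
}
```

Walking the cause chain matters because the `SQLException` is typically wrapped by other exception types before it reaches the error handler. Whether message matching is acceptable in the real fix is a design decision for the Hive maintainers.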
## Workarounds (currently in use)
- Client-side: shorten `hive.metastore.client.socket.timeout` on long-lived
clients (e.g., Amoro) so they auto-reconnect every few minutes, breaking
the permanent thread mapping
- Operational: enable HikariCP `leakDetectionThreshold`, alarm on
"Connection leak detection triggered" log, and auto-restart the affected HMS
## Related JIRAs (none directly fix this)
- HIVE-22804 (sessionVariables workaround) — does not prevent the leak
- HIVE-20192 (PM cleanup at thread exit) — different mechanism
- HIVE-28788 (commit failure → starvation) — different trigger
- HIVE-28839 (DataNucleus connection starvation) — different code path
To my knowledge, this specific defect (PM ThreadLocal retaining broken
wrapper after SQLException in handleDirectSqlError) has not been reported
before.