GitHub user guangyy opened a pull request:
https://github.com/apache/hive/pull/453
Fix the leak of lock during concurrent partition drop
We have seen a leaked lock on the Hive metastore DB that caused all
PARTITION insertions to fail with a lock wait timeout until the
metastore service was restarted.
A transaction dump on the DB shows a thread in the Sleep state that
potentially holds the lock:
trx_id: 33603171058
trx_state: RUNNING
trx_started: 2018-10-23 06:43:22
trx_requested_lock_id: NULL
trx_wait_started: NULL
trx_weight: 70298
trx_mysql_thread_id: 275402202
trx_query: NULL
trx_operation_state: NULL
trx_tables_in_use: 0
trx_tables_locked: 0
trx_lock_structs: 21286
trx_lock_memory_bytes: 2881064
trx_rows_locked: 98810
trx_rows_modified: 49012
trx_concurrency_tickets: 0
trx_isolation_level: READ COMMITTED
trx_unique_checks: 1
trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
trx_adaptive_hash_latched: 0
trx_adaptive_hash_timeout: 0
trx_is_read_only: 0
trx_autocommit_non_locking: 0
ID: 275402202
USER: metastore_gold
HOST: 10.37.182.82:36684
DB: metastoregold
COMMAND: Sleep
TIME: 1
STATE:
INFO: NULL
duration: 1316
Using the HOST IP, we traced the connection back to the Hive metastore
instance and found the following exception:
2018-10-23 06:43:22,805 WARN DataNucleus.Persistence: Exception thrown by StateManager.isLoaded
No such database row
org.datanucleus.exceptions.NucleusObjectNotFoundException: No such database row
at org.datanucleus.store.rdbms.request.FetchRequest.execute(FetchRequest.java:357)
at org.datanucleus.store.rdbms.RDBMSPersistenceHandler.fetchObject(RDBMSPersistenceHandler.java:324)
at org.datanucleus.state.AbstractStateManager.loadFieldsFromDatastore(AbstractStateManager.java:1120)
at org.datanucleus.state.JDOStateManager.loadSpecifiedFields(JDOStateManager.java:2916)
at org.datanucleus.state.JDOStateManager.isLoaded(JDOStateManager.java:3219)
The problem is that the caller expects NULL when the partition does not
exist, but the convertToPart function throws an exception instead,
which leads to the leaked lock.
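The failure mode described above can be illustrated with a minimal, self-contained sketch. This is not the code from the pull request and does not use the real metastore classes: LockLeakSketch, openTransactions, PARTITIONS, getPartitionLeaky and getPartitionSafe are hypothetical names, and the convertToPart here is only a stand-in that throws when the row has been dropped concurrently. The counter models a transaction (and the locks it holds) that stays open when the exception skips the cleanup path.

```java
import java.util.HashMap;
import java.util.Map;

public class LockLeakSketch {
    // Hypothetical stand-in for open DB transactions that hold row locks.
    static int openTransactions = 0;

    // Hypothetical partition store; a missing key models a concurrently dropped row.
    static final Map<String, String> PARTITIONS = new HashMap<>();

    // Stand-in for convertToPart: throws instead of returning null when the row is gone.
    static String convertToPart(String name) {
        String row = PARTITIONS.get(name);
        if (row == null) {
            throw new IllegalStateException("No such database row: " + name);
        }
        return row;
    }

    // Leaky pattern: the exception skips the cleanup, so the transaction stays open.
    static String getPartitionLeaky(String name) {
        openTransactions++;                 // open transaction
        String part = convertToPart(name);  // may throw
        openTransactions--;                 // never reached on exception
        return part;
    }

    // Safer pattern: map the "row not found" case to null and always clean up.
    static String getPartitionSafe(String name) {
        openTransactions++;                 // open transaction
        try {
            return convertToPart(name);
        } catch (IllegalStateException e) {
            return null;                    // the caller expects null for a missing partition
        } finally {
            openTransactions--;             // close the transaction on every path
        }
    }

    public static void main(String[] args) {
        PARTITIONS.put("ds=2018-10-22", "partition-data");

        System.out.println(getPartitionSafe("ds=2018-10-23")); // prints "null"
        System.out.println(openTransactions);                  // prints "0"

        try {
            getPartitionLeaky("ds=2018-10-23");
        } catch (IllegalStateException e) {
            System.out.println(openTransactions);              // prints "1": the transaction leaked
        }
    }
}
```

In the sketch, the leaky variant leaves the counter at 1 after the exception, which corresponds to the RUNNING transaction on a Sleep connection seen in the dump; the safe variant returns null and leaves the counter at 0.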
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/guangyy/hive guang--fix-db-lock-leak
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/hive/pull/453.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #453
----
commit 7f6ace1146c32b7e6b8f175cc9c18489119c7613
Author: Guang Yang <guang.yang@...>
Date: 2018-10-24T22:28:09Z
Fix the leak of lock during concurrent partition drop
----
---