[jira] [Updated] (HIVE-28899) HMS 'get_table_meta' causes full scan on TBLS and increases MySQL CPU usage

Jihye Seo (Jira) Fri, 11 Apr 2025 00:05:06 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jihye Seo updated HIVE-28899:
-----------------------------
    Description: 
The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to a 
full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is 
indexed. This causes serious performance degradation, especially when the 
number of tables is large.

In our production environment, this resulted in *high MySQL CPU usage* during 
periods of frequent {{get_table_meta}} access.

 

Problematic Query:
{code:java}
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
`NUCLEUS_TYPE`,
       A0.CREATE_TIME,
       A0.LAST_ACCESS_TIME,
       A0.OWNER,
       A0.OWNER_TYPE,
       A0.RETENTION,
       A0.IS_REWRITE_ENABLED,
       A0.TBL_NAME,
       A0.TBL_TYPE,
       A0.WRITE_ID,
       A0.TBL_ID
FROM TBLS A0
LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE B0.CTLG_NAME = 'hive'
  AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
EXPLAIN Result:
{code:java}
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys    | key     | 
key_len | ref               | rows  | filtered | Extra                 |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
|  1 | SIMPLE      | A0    | NULL       | ALL    | TBLS_N49         | NULL    | 
NULL    | NULL              | 76998 |   100.00 | Using temporary       |
|  1 | SIMPLE      | B0    | NULL       | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY | 
8       | hive_c3s.A0.DB_ID |     1 |    98.75 | Using where; Distinct |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
In the plan above:
 * {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})

 * This is because the optimizer is forced to start with {{{}TBLS{}}}, as the 
{{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function

Removing {{LOWER()}} Fixes the Issue:

When we modify the query by removing the {{LOWER()}} function,
{code:java}
... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
We get the following improved plan.
{code:java}
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys                 
   | key             | key_len | ref               | rows | filtered | Extra    
                                 |
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | B0    | NULL       | range | 
PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389     | NULL             
 |    1 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | A0    | NULL       | ref   | TBLS_N49                      
   | TBLS_N49        | 9       | hive_c3s.B0.DB_ID |   45 |   100.00 | NULL     
                                 |
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
Here:
 * The optimizer is able to *start with {{DBS}}* using its {{NAME}} index

 * Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan

 

We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function when 
matching database and table name patterns, so that the underlying database 
engine (e.g., MySQL) can utilize indexes more effectively.

  was:
The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to a 
full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is 
indexed. This causes serious performance degradation, especially when the 
number of tables is large.

In our production environment, this resulted in *high MySQL CPU usage* during 
periods of frequent {{get_table_meta}} access.

 

🔎 Problematic Query:
{code:java}
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
`NUCLEUS_TYPE`,
       A0.CREATE_TIME,
       A0.LAST_ACCESS_TIME,
       A0.OWNER,
       A0.OWNER_TYPE,
       A0.RETENTION,
       A0.IS_REWRITE_ENABLED,
       A0.TBL_NAME,
       A0.TBL_TYPE,
       A0.WRITE_ID,
       A0.TBL_ID
FROM TBLS A0
LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE B0.CTLG_NAME = 'hive'
  AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
📈 EXPLAIN Result:
{code:java}
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| id | select_type | table | partitions | type   | possible_keys    | key     | 
key_len | ref               | rows  | filtered | Extra                 |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
|  1 | SIMPLE      | A0    | NULL       | ALL    | TBLS_N49         | NULL    | 
NULL    | NULL              | 76998 |   100.00 | Using temporary       |
|  1 | SIMPLE      | B0    | NULL       | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY | 
8       | hive_c3s.A0.DB_ID |     1 |    98.75 | Using where; Distinct |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
In the plan above:
 * {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})

 * This is because the optimizer is forced to start with {{{}TBLS{}}}, as the 
{{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function

✅ Removing {{LOWER()}} Fixes the Issue

When we modify the query by removing the {{LOWER()}} function:
{code:java}
... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
We get the following improved plan:
{code:java}
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type  | possible_keys                 
   | key             | key_len | ref               | rows | filtered | Extra    
                                 |
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
|  1 | SIMPLE      | B0    | NULL       | range | 
PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389     | NULL             
 |    1 |   100.00 | Using where; Using index; Using temporary |
|  1 | SIMPLE      | A0    | NULL       | ref   | TBLS_N49                      
   | TBLS_N49        | 9       | hive_c3s.B0.DB_ID |   45 |   100.00 | NULL     
                                 |
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
Here:
 * The optimizer is able to *start with {{DBS}}* using its {{NAME}} index

 * Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan

 

We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function when 
matching database and table name patterns, so that the underlying database 
engine (e.g., MySQL) can utilize indexes more effectively.


> HMS 'get_table_meta' causes full scan on TBLS and increases MySQL CPU usage
> ---------------------------------------------------------------------------
>
>                 Key: HIVE-28899
>                 URL: https://issues.apache.org/jira/browse/HIVE-28899
>             Project: Hive
>          Issue Type: Improvement
>    Affects Versions: 3.1.0
>            Reporter: Jihye Seo
>            Priority: Major
>
> The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to 
> a full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is 
> indexed. This causes serious performance degradation, especially when the 
> number of tables is large.
> In our production environment, this resulted in *high MySQL CPU usage* during 
> periods of frequent {{get_table_meta}} access.
>  
> Problematic Query:
> {code:java}
> SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS 
> `NUCLEUS_TYPE`,
>        A0.CREATE_TIME,
>        A0.LAST_ACCESS_TIME,
>        A0.OWNER,
>        A0.OWNER_TYPE,
>        A0.RETENTION,
>        A0.IS_REWRITE_ENABLED,
>        A0.TBL_NAME,
>        A0.TBL_TYPE,
>        A0.WRITE_ID,
>        A0.TBL_ID
> FROM TBLS A0
> LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
> WHERE B0.CTLG_NAME = 'hive'
>   AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
> EXPLAIN Result:
> {code:java}
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
> | id | select_type | table | partitions | type   | possible_keys    | key     
> | key_len | ref               | rows  | filtered | Extra                 |
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
> |  1 | SIMPLE      | A0    | NULL       | ALL    | TBLS_N49         | NULL    
> | NULL    | NULL              | 76998 |   100.00 | Using temporary       |
> |  1 | SIMPLE      | B0    | NULL       | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY 
> | 8       | hive_c3s.A0.DB_ID |     1 |    98.75 | Using where; Distinct |
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
> In the plan above:
>  * {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})
>  * This is because the optimizer is forced to start with {{{}TBLS{}}}, as the 
> {{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function
> Removing {{LOWER()}} Fixes the Issue:
> When we modify the query by removing the {{LOWER()}} function,
> {code:java}
> ... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
> We get the following improved plan.
> {code:java}
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
> | id | select_type | table | partitions | type  | possible_keys               
>      | key             | key_len | ref               | rows | filtered | 
> Extra                                     |
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
> |  1 | SIMPLE      | B0    | NULL       | range | 
> PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389     | NULL           
>    |    1 |   100.00 | Using where; Using index; Using temporary |
> |  1 | SIMPLE      | A0    | NULL       | ref   | TBLS_N49                    
>      | TBLS_N49        | 9       | hive_c3s.B0.DB_ID |   45 |   100.00 | NULL 
>                                      |
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
> Here:
>  * The optimizer is able to *start with {{DBS}}* using its {{NAME}} index
>  * Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan
>  
> We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function 
> when matching database and table name patterns, so that the underlying 
> database engine (e.g., MySQL) can utilize indexes more effectively.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-28899) HMS 'get_table_meta' causes full scan on TBLS and increases MySQL CPU usage

Reply via email to