[
https://issues.apache.org/jira/browse/HIVE-28899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jihye Seo updated HIVE-28899:
-----------------------------
Description:
The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to a
full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is
indexed. This causes serious performance degradation, especially when the
number of tables is large.
In our production environment, this resulted in *high MySQL CPU usage* during
periods of frequent {{get_table_meta}} access.
Problematic Query:
{code:java}
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS
`NUCLEUS_TYPE`,
A0.CREATE_TIME,
A0.LAST_ACCESS_TIME,
A0.OWNER,
A0.OWNER_TYPE,
A0.RETENTION,
A0.IS_REWRITE_ENABLED,
A0.TBL_NAME,
A0.TBL_TYPE,
A0.WRITE_ID,
A0.TBL_ID
FROM TBLS A0
LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE B0.CTLG_NAME = 'hive'
AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
EXPLAIN Result:
{code:java}
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key |
key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| 1 | SIMPLE | A0 | NULL | ALL | TBLS_N49 | NULL |
NULL | NULL | 76998 | 100.00 | Using temporary |
| 1 | SIMPLE | B0 | NULL | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY |
8 | hive_c3s.A0.DB_ID | 1 | 98.75 | Using where; Distinct |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
In the plan above:
* {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})
* This is because the optimizer is forced to start with {{{}TBLS{}}}, as the
{{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function
Removing {{LOWER()}} Fixes the Issue:
When we modify the query by removing the {{LOWER()}} function,
{code:java}
... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
We get the following improved plan.
{code:java}
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys
| key | key_len | ref | rows | filtered | Extra
|
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | B0 | NULL | range |
PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389 | NULL
| 1 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | A0 | NULL | ref | TBLS_N49
| TBLS_N49 | 9 | hive_c3s.B0.DB_ID | 45 | 100.00 | NULL
|
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
Here:
* The optimizer is able to *start with {{DBS}}* using its {{NAME}} index
* Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan
We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function when
matching database and table name patterns, so that the underlying database
engine (e.g., MySQL) can utilize indexes more effectively.
was:
The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to a
full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is
indexed. This causes serious performance degradation, especially when the
number of tables is large.
In our production environment, this resulted in *high MySQL CPU usage* during
periods of frequent {{get_table_meta}} access.
🔎 Problematic Query:
{code:java}
SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS
`NUCLEUS_TYPE`,
A0.CREATE_TIME,
A0.LAST_ACCESS_TIME,
A0.OWNER,
A0.OWNER_TYPE,
A0.RETENTION,
A0.IS_REWRITE_ENABLED,
A0.TBL_NAME,
A0.TBL_TYPE,
A0.WRITE_ID,
A0.TBL_ID
FROM TBLS A0
LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
WHERE B0.CTLG_NAME = 'hive'
AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
📈 EXPLAIN Result:
{code:java}
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| id | select_type | table | partitions | type | possible_keys | key |
key_len | ref | rows | filtered | Extra |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
| 1 | SIMPLE | A0 | NULL | ALL | TBLS_N49 | NULL |
NULL | NULL | 76998 | 100.00 | Using temporary |
| 1 | SIMPLE | B0 | NULL | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY |
8 | hive_c3s.A0.DB_ID | 1 | 98.75 | Using where; Distinct |
+----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
In the plan above:
* {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})
* This is because the optimizer is forced to start with {{{}TBLS{}}}, as the
{{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function
✅ Removing {{LOWER()}} Fixes the Issue
When we modify the query by removing the {{LOWER()}} function:
{code:java}
... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
We get the following improved plan:
{code:java}
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| id | select_type | table | partitions | type | possible_keys
| key | key_len | ref | rows | filtered | Extra
|
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
| 1 | SIMPLE | B0 | NULL | range |
PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389 | NULL
| 1 | 100.00 | Using where; Using index; Using temporary |
| 1 | SIMPLE | A0 | NULL | ref | TBLS_N49
| TBLS_N49 | 9 | hive_c3s.B0.DB_ID | 45 | 100.00 | NULL
|
+----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
Here:
* The optimizer is able to *start with {{DBS}}* using its {{NAME}} index
* Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan
We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function when
matching database and table name patterns, so that the underlying database
engine (e.g., MySQL) can utilize indexes more effectively.
> HMS 'get_table_meta' causes full scan on TBLS and increases MySQL CPU usage
> ---------------------------------------------------------------------------
>
> Key: HIVE-28899
> URL: https://issues.apache.org/jira/browse/HIVE-28899
> Project: Hive
> Issue Type: Improvement
> Affects Versions: 3.1.0
> Reporter: Jihye Seo
> Priority: Major
>
> The SQL query generated by Hive Metastore’s {{get_table_meta}} call leads to
> a full table scan on the {{TBLS}} table, even though the {{DB_ID}} column is
> indexed. This causes serious performance degradation, especially when the
> number of tables is large.
> In our production environment, this resulted in *high MySQL CPU usage* during
> periods of frequent {{get_table_meta}} access.
>
> Problematic Query:
> {code:java}
> SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MTable' AS
> `NUCLEUS_TYPE`,
> A0.CREATE_TIME,
> A0.LAST_ACCESS_TIME,
> A0.OWNER,
> A0.OWNER_TYPE,
> A0.RETENTION,
> A0.IS_REWRITE_ENABLED,
> A0.TBL_NAME,
> A0.TBL_TYPE,
> A0.WRITE_ID,
> A0.TBL_ID
> FROM TBLS A0
> LEFT OUTER JOIN DBS B0 ON A0.DB_ID = B0.DB_ID
> WHERE B0.CTLG_NAME = 'hive'
> AND LOWER(B0.NAME) LIKE 'test' ESCAPE '\\'; {code}
> EXPLAIN Result:
> {code:java}
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
> | id | select_type | table | partitions | type | possible_keys | key
> | key_len | ref | rows | filtered | Extra |
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+
> | 1 | SIMPLE | A0 | NULL | ALL | TBLS_N49 | NULL
> | NULL | NULL | 76998 | 100.00 | Using temporary |
> | 1 | SIMPLE | B0 | NULL | eq_ref | PRIMARY,CTLG_FK1 | PRIMARY
> | 8 | hive_c3s.A0.DB_ID | 1 | 98.75 | Using where; Distinct |
> +----+-------------+-------+------------+--------+------------------+---------+---------+-------------------+-------+----------+-----------------------+{code}
> In the plan above:
> * {{TBLS}} performs a *full scan* ({{{}type = ALL{}}})
> * This is because the optimizer is forced to start with {{{}TBLS{}}}, as the
> {{DBS.NAME}} filter cannot utilize the index due to the {{LOWER()}} function
> Removing {{LOWER()}} Fixes the Issue:
> When we modify the query by removing the {{LOWER()}} function,
> {code:java}
> ... WHERE B0.CTLG_NAME = 'hive' AND B0.NAME LIKE 'test' ESCAPE '\\';{code}
> We get the following improved plan.
> {code:java}
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
> | id | select_type | table | partitions | type | possible_keys
> | key | key_len | ref | rows | filtered |
> Extra |
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+
> | 1 | SIMPLE | B0 | NULL | range |
> PRIMARY,UNIQUE_DATABASE,CTLG_FK1 | UNIQUE_DATABASE | 389 | NULL
> | 1 | 100.00 | Using where; Using index; Using temporary |
> | 1 | SIMPLE | A0 | NULL | ref | TBLS_N49
> | TBLS_N49 | 9 | hive_c3s.B0.DB_ID | 45 | 100.00 | NULL
> |
> +----+-------------+-------+------------+-------+----------------------------------+-----------------+---------+-------------------+------+----------+-------------------------------------------+{code}
> Here:
> * The optimizer is able to *start with {{DBS}}* using its {{NAME}} index
> * Then it joins to {{TBLS}} using the {{DB_ID}} index, avoiding a full scan
>
> We’d like to ask if it’s feasible to avoid using the {{LOWER()}} function
> when matching database and table name patterns, so that the underlying
> database engine (e.g., MySQL) can utilize indexes more effectively.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)