[jira] [Updated] (HIVE-28363) Improve heuristics of FilterStatsRule without column stats

Shohei Okumiya (Jira) Mon, 08 Jul 2024 01:22:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-28363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shohei Okumiya updated HIVE-28363:
----------------------------------
    Description: 
HIVE-13287 gave a better estimation of the selectivity of IN operators, 
especially when column stats are available. This ticket would try to improve 
the case where column stats are unavailable.

 

This is an example. The table has ten rows and no column stats on `id`.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
...
+------------------------+-----------------------------+
|    column_property     |            value            |
+------------------------+-----------------------------+
| col_name               | id                          |
| data_type              | int                         |
| min                    |                             |
| max                    |                             |
| num_nulls              |                             |
| distinct_count         |                             |
| avg_col_len            |                             |
| max_col_len            |                             |
| num_trues              |                             |
| num_falses             |                             |
| bit_vector             |                             |
| comment                | from deserializer           |
| COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
+------------------------+-----------------------------+{code}
With a single needle, the estimated number becomes 10 * 0.5 = 5 because of the 
fallback heuristics.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1);
...
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id = 1) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id = 1) (type: boolean) |
|                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
COMPLETE Column stats: NONE | {code}
The size is estimated to be the original size with two or more needles. The 
heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I believe 
users expect to observe some reduction when using IN.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1, 2);
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id) IN (1, 2) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id) IN (1, 2) (type: boolean) |
|                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE | {code}
 

  was:
HIVE-13287 gave a better estimation on the selectivity of IN operators. This 
ticket would try

 

This is an example. The table has ten rows and no column stats on `id`.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
...
+------------------------+-----------------------------+
|    column_property     |            value            |
+------------------------+-----------------------------+
| col_name               | id                          |
| data_type              | int                         |
| min                    |                             |
| max                    |                             |
| num_nulls              |                             |
| distinct_count         |                             |
| avg_col_len            |                             |
| max_col_len            |                             |
| num_trues              |                             |
| num_falses             |                             |
| bit_vector             |                             |
| comment                | from deserializer           |
| COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
+------------------------+-----------------------------+{code}
With a single needle, the estimated number becomes 10 * 0.5 = 5 because of the 
fallback heuristics.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1);
...
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id = 1) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id = 1) (type: boolean) |
|                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
COMPLETE Column stats: NONE | {code}
The size is estimated to be the original size with two or more needles. The 
heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I believe 
users expect to observe some reduction when they use IN.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
WHERE id IN (1, 2);
|                 TableScan                          |
|                   alias: users                     |
|                   filterExpr: (id) IN (1, 2) (type: boolean) |
|                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE |
|                   Filter Operator                  |
|                     predicate: (id) IN (1, 2) (type: boolean) |
|                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
COMPLETE Column stats: NONE | {code}
 


> Improve heuristics of FilterStatsRule without column stats
> ----------------------------------------------------------
>
>                 Key: HIVE-28363
>                 URL: https://issues.apache.org/jira/browse/HIVE-28363
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 4.0.0
>            Reporter: Shohei Okumiya
>            Assignee: Shohei Okumiya
>            Priority: Major
>
> HIVE-13287 gave a better estimation of the selectivity of IN operators, 
> especially when column stats are available. This ticket would try to improve 
> the case where column stats are unavailable.
>  
> This is an example. The table has ten rows and no column stats on `id`.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> DESCRIBE FORMATTED users id;
> ...
> +------------------------+-----------------------------+
> |    column_property     |            value            |
> +------------------------+-----------------------------+
> | col_name               | id                          |
> | data_type              | int                         |
> | min                    |                             |
> | max                    |                             |
> | num_nulls              |                             |
> | distinct_count         |                             |
> | avg_col_len            |                             |
> | max_col_len            |                             |
> | num_trues              |                             |
> | num_falses             |                             |
> | bit_vector             |                             |
> | comment                | from deserializer           |
> | COLUMN_STATS_ACCURATE  | {\"BASIC_STATS\":\"true\"}  |
> +------------------------+-----------------------------+{code}
> With a single needle, the estimated number becomes 10 * 0.5 = 5 because of 
> the fallback heuristics.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1);
> ...
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id = 1) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id = 1) (type: boolean) |
> |                     Statistics: Num rows: 5 Data size: 5 Basic stats: 
> COMPLETE Column stats: NONE | {code}
> The size is estimated to be the original size with two or more needles. The 
> heuristics estimate the size as min(10, 10 * 0.5 * N) = 10. However, I 
> believe users expect to observe some reduction when using IN.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN SELECT * FROM users 
> WHERE id IN (1, 2);
> |                 TableScan                          |
> |                   alias: users                     |
> |                   filterExpr: (id) IN (1, 2) (type: boolean) |
> |                   Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE |
> |                   Filter Operator                  |
> |                     predicate: (id) IN (1, 2) (type: boolean) |
> |                     Statistics: Num rows: 10 Data size: 11 Basic stats: 
> COMPLETE Column stats: NONE | {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HIVE-28363) Improve heuristics of FilterStatsRule without column stats

Reply via email to