okumin created HIVE-28080:
-----------------------------
Summary: Propagate statistics from a source table to the
materialized CTE
Key: HIVE-28080
URL: https://issues.apache.org/jira/browse/HIVE-28080
Project: Hive
Issue Type: Improvement
Components: Query Planning
Affects Versions: 4.0.0-beta-1, 3.1.3, 2.3.8
Reporter: okumin
Assignee: okumin
Hive doesn't fill in the statistics of materialized CTEs, and the size of those
TableScans is underestimated. That causes Tez to run with fewer tasks or to
fail with OOM because MapJoin could be wrongly applied.
The following example shows Map 1 reading `src` generates 493 rows, but Map 3
reading `cte` is expected to scan only 1 row.
{code:java}
0: jdbc:hive2://hive-hiveserver2:10000/defaul> EXPLAIN WITH cte AS (
. . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM src
. . . . . . . . . . . . . . . . . . . . . . .> )
. . . . . . . . . . . . . . . . . . . . . . .> SELECT *
. . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
. . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
. . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
...
+----------------------------------------------------+
| Explain |
+----------------------------------------------------+
| Plan optimized by CBO. |
| |
| Vertex dependency in Stage-4 |
| Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
| Reducer 4 <- Map 3 (SIMPLE_EDGE) |
| |
| Stage-3 |
| Fetch Operator |
| limit:-1 |
| Stage-4 |
| Map 2 vectorized |
| File Output Operator [FS_69] |
| Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
|
Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
|
| <-Map 3 [BROADCAST_EDGE] vectorized |
| SHUFFLE [RS_61] |
| PartitionCols:_col0 |
| Select Operator [SEL_60] (rows=1 width=368) |
| Output:["_col0","_col1"] |
| Filter Operator [FIL_59] (rows=1 width=368) |
| predicate:key is not null |
| TableScan [TS_11] (rows=1 width=368) |
| default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
| <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
|
Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] |
| <-Reducer 4 [BROADCAST_EDGE] vectorized |
| BROADCAST [RS_64] |
| PartitionCols:_col0 |
| Select Operator [SEL_63] |
| Output:["_col0","_col1"] |
| <-Select Operator [SEL_66] (rows=1 width=368) |
| Output:["_col0","_col1"] |
| Filter Operator [FIL_65] (rows=1 width=368) |
| predicate:key is not null |
| TableScan [TS_5] (rows=1 width=368) |
| default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
| Stage-2 |
| Dependency Collection{} |
| Stage-1 |
| Map 1 vectorized |
| File Output Operator [FS_4] |
| table:{"name:":"default.cte"} |
| Select Operator [SEL_3] (rows=493 width=350) |
| Output:["_col0","_col1"] |
| TableScan [TS_0] (rows=493 width=350) |
|
default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
| Stage-0 |
| Move Operator |
| Please refer to the previous Stage-1 |
| |
+----------------------------------------------------+ {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)