[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE
[ https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Krisztian Kasa updated HIVE-28080: -- Fix Version/s: 4.1.0 Resolution: Fixed Status: Resolved (was: Patch Available) Merged to master. Thanks [~okumin] for the patch. > Propagate statistics from a source table to the materialized CTE > > > Key: HIVE-28080 > URL: https://issues.apache.org/jira/browse/HIVE-28080 > Project: Hive > Issue Type: Improvement > Components: Query Planning >Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1 >Reporter: okumin >Assignee: okumin >Priority: Major > Labels: pull-request-available > Fix For: 4.1.0 > > > Hive doesn't fill in the statistics of materialized CTEs, and the size of > those TableScans is underestimated. That causes Tez to run with fewer tasks > or to fail with OOM because MapJoin could be wrongly applied. > > The following example shows Map 1 reading `src` generates 493 rows, but Map 3 > reading `cte` is expected to scan only 1 row. > {code:java} > 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS ( > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM src > . . . . . . . . . . . . . . . . . . . . . . .> ) > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * > . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key) > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key); > ... > ++ > | Explain | > ++ > | Plan optimized by CBO. | > | | > | Vertex dependency in Stage-4 | > | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) | > | Reducer 4 <- Map 3 (SIMPLE_EDGE) | > | | > | Stage-3 | > | Fetch Operator | > | limit:-1 | > | Stage-4 | > | Map 2 vectorized | > | File Output Operator [FS_69] | > | Map Join Operator [MAPJOIN_68] (rows=1 width=444) | > | > Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"] > | > | <-Map 3 [BROADCAST_EDGE] vectorized | > | SHUFFLE [RS_61] | > | PartitionCols:_col0 | > | Select Operator [SEL_60] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_59] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_11] (rows=1 width=368) | > | > default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) | > | > Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] > | > | <-Reducer 4 [BROADCAST_EDGE] vectorized | > | BROADCAST [RS_64] | > | PartitionCols:_col0 | > | Select Operator [SEL_63] | > | Output:["_col0","_col1"] | > | <-Select Operator [SEL_66] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_65] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_5] (rows=1 width=368) | > | > default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-2 | > | Dependency Collection{} | > | Stage-1 | > | Map 1 vectorized | > | File Output Operator [FS_4] | > | table:{"name:":"default.cte"} | > | Select Operator [SEL_3] (rows=493 width=350) | > | Output:["_col0","_col1"] | > | TableScan [TS_0] (rows=493 width=350) | > | > default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-0 | > | Move Operator | > | Please refer to the previous Stage-1 | > |
[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE
[ https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HIVE-28080: -- Labels: pull-request-available (was: ) > Propagate statistics from a source table to the materialized CTE > > > Key: HIVE-28080 > URL: https://issues.apache.org/jira/browse/HIVE-28080 > Project: Hive > Issue Type: Improvement > Components: Query Planning >Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1 >Reporter: okumin >Assignee: okumin >Priority: Major > Labels: pull-request-available > > Hive doesn't fill in the statistics of materialized CTEs, and the size of > those TableScans is underestimated. That causes Tez to run with fewer tasks > or to fail with OOM because MapJoin could be wrongly applied. > > The following example shows Map 1 reading `src` generates 493 rows, but Map 3 > reading `cte` is expected to scan only 1 row. > {code:java} > 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS ( > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM src > . . . . . . . . . . . . . . . . . . . . . . .> ) > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * > . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key) > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key); > ... > ++ > | Explain | > ++ > | Plan optimized by CBO. | > | | > | Vertex dependency in Stage-4 | > | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) | > | Reducer 4 <- Map 3 (SIMPLE_EDGE) | > | | > | Stage-3 | > | Fetch Operator | > | limit:-1 | > | Stage-4 | > | Map 2 vectorized | > | File Output Operator [FS_69] | > | Map Join Operator [MAPJOIN_68] (rows=1 width=444) | > | > Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"] > | > | <-Map 3 [BROADCAST_EDGE] vectorized | > | SHUFFLE [RS_61] | > | PartitionCols:_col0 | > | Select Operator [SEL_60] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_59] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_11] (rows=1 width=368) | > | > default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) | > | > Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] > | > | <-Reducer 4 [BROADCAST_EDGE] vectorized | > | BROADCAST [RS_64] | > | PartitionCols:_col0 | > | Select Operator [SEL_63] | > | Output:["_col0","_col1"] | > | <-Select Operator [SEL_66] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_65] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_5] (rows=1 width=368) | > | > default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-2 | > | Dependency Collection{} | > | Stage-1 | > | Map 1 vectorized | > | File Output Operator [FS_4] | > | table:{"name:":"default.cte"} | > | Select Operator [SEL_3] (rows=493 width=350) | > | Output:["_col0","_col1"] | > | TableScan [TS_0] (rows=493 width=350) | > | > default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-0 | > | Move Operator | > | Please refer to the previous Stage-1 | > | | > ++ {code} -- This message was sent by
[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE
[ https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] okumin updated HIVE-28080: -- Status: Patch Available (was: Open) Pull Request https://github.com/apache/hive/pull/5089 > Propagate statistics from a source table to the materialized CTE > > > Key: HIVE-28080 > URL: https://issues.apache.org/jira/browse/HIVE-28080 > Project: Hive > Issue Type: Improvement > Components: Query Planning >Affects Versions: 4.0.0-beta-1, 3.1.3, 2.3.8 >Reporter: okumin >Assignee: okumin >Priority: Major > > Hive doesn't fill in the statistics of materialized CTEs, and the size of > those TableScans is underestimated. That causes Tez to run with fewer tasks > or to fail with OOM because MapJoin could be wrongly applied. > > The following example shows Map 1 reading `src` generates 493 rows, but Map 3 > reading `cte` is expected to scan only 1 row. > {code:java} > 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS ( > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * FROM src > . . . . . . . . . . . . . . . . . . . . . . .> ) > . . . . . . . . . . . . . . . . . . . . . . .> SELECT * > . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key) > . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key); > ... > ++ > | Explain | > ++ > | Plan optimized by CBO. | > | | > | Vertex dependency in Stage-4 | > | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) | > | Reducer 4 <- Map 3 (SIMPLE_EDGE) | > | | > | Stage-3 | > | Fetch Operator | > | limit:-1 | > | Stage-4 | > | Map 2 vectorized | > | File Output Operator [FS_69] | > | Map Join Operator [MAPJOIN_68] (rows=1 width=444) | > | > Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"] > | > | <-Map 3 [BROADCAST_EDGE] vectorized | > | SHUFFLE [RS_61] | > | PartitionCols:_col0 | > | Select Operator [SEL_60] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_59] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_11] (rows=1 width=368) | > | > default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) | > | > Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"] > | > | <-Reducer 4 [BROADCAST_EDGE] vectorized | > | BROADCAST [RS_64] | > | PartitionCols:_col0 | > | Select Operator [SEL_63] | > | Output:["_col0","_col1"] | > | <-Select Operator [SEL_66] (rows=1 width=368) | > | Output:["_col0","_col1"] | > | Filter Operator [FIL_65] (rows=1 width=368) | > | predicate:key is not null | > | TableScan [TS_5] (rows=1 width=368) | > | > default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-2 | > | Dependency Collection{} | > | Stage-1 | > | Map 1 vectorized | > | File Output Operator [FS_4] | > | table:{"name:":"default.cte"} | > | Select Operator [SEL_3] (rows=493 width=350) | > | Output:["_col0","_col1"] | > | TableScan [TS_0] (rows=493 width=350) | > | > default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] | > | Stage-0 | > | Move Operator | > | Please refer to the previous Stage-1 | > | | > ++ {code} -- This message was sent by Atlassian