[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

2024-02-27 Thread Krisztian Kasa (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Krisztian Kasa updated HIVE-28080:
--
Fix Version/s: 4.1.0
   Resolution: Fixed
   Status: Resolved  (was: Patch Available)

Merged to master. Thanks [~okumin] for the patch.

> Propagate statistics from a source table to the materialized CTE
> 
>
> Key: HIVE-28080
> URL: https://issues.apache.org/jira/browse/HIVE-28080
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.1.0
>
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of 
> those TableScans is underestimated. That causes Tez to run with fewer tasks 
> or to fail with OOM because MapJoin could be wrongly applied.
>  
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> ++
> |                      Explain                       |
> ++
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in Stage-4                       |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-3                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-4                                        |
> |       Map 2 vectorized                             |
> |       File Output Operator [FS_69]                 |
> |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |           
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
>  |
> |         <-Map 3 [BROADCAST_EDGE] vectorized        |
> |           SHUFFLE [RS_61]                          |
> |             PartitionCols:_col0                    |
> |             Select Operator [SEL_60] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_59] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_11] (rows=1 width=368) |
> |                   
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |             
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
>  |
> |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
> |             BROADCAST [RS_64]                      |
> |               PartitionCols:_col0                  |
> |               Select Operator [SEL_63]             |
> |                 Output:["_col0","_col1"]           |
> |           <-Select Operator [SEL_66] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_65] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_5] (rows=1 width=368) |
> |                   
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-2                                    |
> |           Dependency Collection{}                  |
> |             Stage-1                                |
> |               Map 1 vectorized                     |
> |               File Output Operator [FS_4]          |
> |                 table:{"name:":"default.cte"}      |
> |                 Select Operator [SEL_3] (rows=493 width=350) |
> |                   Output:["_col0","_col1"]         |
> |                   TableScan [TS_0] (rows=493 width=350) |
> |                     
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-0                                    |
> |           Move Operator                            |
> |              Please refer to the previous Stage-1  |
> |         

[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

2024-02-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HIVE-28080:
--
Labels: pull-request-available  (was: )

> Propagate statistics from a source table to the materialized CTE
> 
>
> Key: HIVE-28080
> URL: https://issues.apache.org/jira/browse/HIVE-28080
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Affects Versions: 2.3.8, 3.1.3, 4.0.0-beta-1
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>  Labels: pull-request-available
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of 
> those TableScans is underestimated. That causes Tez to run with fewer tasks 
> or to fail with OOM because MapJoin could be wrongly applied.
>  
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> ++
> |                      Explain                       |
> ++
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in Stage-4                       |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-3                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-4                                        |
> |       Map 2 vectorized                             |
> |       File Output Operator [FS_69]                 |
> |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |           
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
>  |
> |         <-Map 3 [BROADCAST_EDGE] vectorized        |
> |           SHUFFLE [RS_61]                          |
> |             PartitionCols:_col0                    |
> |             Select Operator [SEL_60] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_59] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_11] (rows=1 width=368) |
> |                   
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |             
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
>  |
> |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
> |             BROADCAST [RS_64]                      |
> |               PartitionCols:_col0                  |
> |               Select Operator [SEL_63]             |
> |                 Output:["_col0","_col1"]           |
> |           <-Select Operator [SEL_66] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_65] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_5] (rows=1 width=368) |
> |                   
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-2                                    |
> |           Dependency Collection{}                  |
> |             Stage-1                                |
> |               Map 1 vectorized                     |
> |               File Output Operator [FS_4]          |
> |                 table:{"name:":"default.cte"}      |
> |                 Select Operator [SEL_3] (rows=493 width=350) |
> |                   Output:["_col0","_col1"]         |
> |                   TableScan [TS_0] (rows=493 width=350) |
> |                     
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-0                                    |
> |           Move Operator                            |
> |              Please refer to the previous Stage-1  |
> |                                                    |
> ++ {code}



--
This message was sent by 

[jira] [Updated] (HIVE-28080) Propagate statistics from a source table to the materialized CTE

2024-02-19 Thread okumin (Jira)


 [ 
https://issues.apache.org/jira/browse/HIVE-28080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

okumin updated HIVE-28080:
--
Status: Patch Available  (was: Open)

Pull Request

https://github.com/apache/hive/pull/5089

> Propagate statistics from a source table to the materialized CTE
> 
>
> Key: HIVE-28080
> URL: https://issues.apache.org/jira/browse/HIVE-28080
> Project: Hive
>  Issue Type: Improvement
>  Components: Query Planning
>Affects Versions: 4.0.0-beta-1, 3.1.3, 2.3.8
>Reporter: okumin
>Assignee: okumin
>Priority: Major
>
> Hive doesn't fill in the statistics of materialized CTEs, and the size of 
> those TableScans is underestimated. That causes Tez to run with fewer tasks 
> or to fail with OOM because MapJoin could be wrongly applied.
>  
> The following example shows Map 1 reading `src` generates 493 rows, but Map 3 
> reading `cte` is expected to scan only 1 row.
> {code:java}
> 0: jdbc:hive2://hive-hiveserver2:1/defaul> EXPLAIN WITH cte AS (
> . . . . . . . . . . . . . . . . . . . . . . .>   SELECT * FROM src
> . . . . . . . . . . . . . . . . . . . . . . .> )
> . . . . . . . . . . . . . . . . . . . . . . .> SELECT *
> . . . . . . . . . . . . . . . . . . . . . . .> FROM cte a
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte b ON (a.key = b.key)
> . . . . . . . . . . . . . . . . . . . . . . .> JOIN cte c ON (a.key = c.key);
> ...
> ++
> |                      Explain                       |
> ++
> | Plan optimized by CBO.                             |
> |                                                    |
> | Vertex dependency in Stage-4                       |
> | Map 2 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE) |
> | Reducer 4 <- Map 3 (SIMPLE_EDGE)                   |
> |                                                    |
> | Stage-3                                            |
> |   Fetch Operator                                   |
> |     limit:-1                                       |
> |     Stage-4                                        |
> |       Map 2 vectorized                             |
> |       File Output Operator [FS_69]                 |
> |         Map Join Operator [MAPJOIN_68] (rows=1 width=444) |
> |           
> Conds:MAPJOIN_67._col0=RS_61._col0(Inner),Output:["_col0","_col1","_col2","_col3","_col4","_col5"]
>  |
> |         <-Map 3 [BROADCAST_EDGE] vectorized        |
> |           SHUFFLE [RS_61]                          |
> |             PartitionCols:_col0                    |
> |             Select Operator [SEL_60] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_59] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_11] (rows=1 width=368) |
> |                   
> default@cte,c,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         <-Map Join Operator [MAPJOIN_67] (rows=1 width=404) |
> |             
> Conds:SEL_66._col0=RS_64._col0(Inner),Output:["_col0","_col1","_col2","_col3"]
>  |
> |           <-Reducer 4 [BROADCAST_EDGE] vectorized  |
> |             BROADCAST [RS_64]                      |
> |               PartitionCols:_col0                  |
> |               Select Operator [SEL_63]             |
> |                 Output:["_col0","_col1"]           |
> |           <-Select Operator [SEL_66] (rows=1 width=368) |
> |               Output:["_col0","_col1"]             |
> |               Filter Operator [FIL_65] (rows=1 width=368) |
> |                 predicate:key is not null          |
> |                 TableScan [TS_5] (rows=1 width=368) |
> |                   
> default@cte,a,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-2                                    |
> |           Dependency Collection{}                  |
> |             Stage-1                                |
> |               Map 1 vectorized                     |
> |               File Output Operator [FS_4]          |
> |                 table:{"name:":"default.cte"}      |
> |                 Select Operator [SEL_3] (rows=493 width=350) |
> |                   Output:["_col0","_col1"]         |
> |                   TableScan [TS_0] (rows=493 width=350) |
> |                     
> default@src,src,Tbl:COMPLETE,Col:NONE,Output:["key","value"] |
> |         Stage-0                                    |
> |           Move Operator                            |
> |              Please refer to the previous Stage-1  |
> |                                                    |
> ++ {code}



--
This message was sent by Atlassian