[jira] [Commented] (HIVE-8526) Hive : CBO incorrect join order in TPC-DS Q45 as self join selectivity has incorrect CE

Mostafa Mokhtar (JIRA) Fri, 31 Oct 2014 12:03:05 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192259#comment-14192259
 ]


Mostafa Mokhtar commented on HIVE-8526:
---------------------------------------

[~rhbutani] [~hagleitn] [~jpullokkaran]

Issue resolved in latest build 
{code}
STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Map 1 <- Map 8 (BROADCAST_EDGE)
        Map 6 <- Map 5 (BROADCAST_EDGE), Map 9 (BROADCAST_EDGE)
        Map 9 <- Map 7 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 6 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
        Reducer 4 <- Reducer 3 (SIMPLE_EDGE)
      DagName: mmokhtar_20141031145858_4f12c0f7-13ef-46e8-9535-81b02cc8a937:1
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: customer_address
                  filterExpr: ca_address_sk is not null (type: boolean)
                  Statistics: Num rows: 40000000 Data size: 40595195284 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ca_address_sk is not null (type: boolean)
                    Statistics: Num rows: 40000000 Data size: 7640000000 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ca_address_sk (type: int), ca_county (type: 
string), ca_zip (type: string)
                      outputColumnNames: _col0, _col1, _col2
                      Statistics: Num rows: 40000000 Data size: 7640000000 
Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col0}
                          1 {_col1} {_col2}
                        keys:
                          0 _col1 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0, _col3, _col4
                        input vertices:
                          0 Map 8
                        Statistics: Num rows: 80000000 Data size: 14960000000 
Basic stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: int)
                          sort order: +
                          Map-reduce partition columns: _col0 (type: int)
                          Statistics: Num rows: 80000000 Data size: 14960000000 
Basic stats: COMPLETE Column stats: COMPLETE
                          value expressions: _col3 (type: string), _col4 (type: 
string)
            Execution mode: vectorized
        Map 5
            Map Operator Tree:
                TableScan
                  alias: date_dim
                  filterExpr: (((d_qoy = 2) and (d_year = 2000)) and d_date_sk 
is not null) (type: boolean)
                  Statistics: Num rows: 73049 Data size: 81741831 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (((d_qoy = 2) and (d_year = 2000)) and d_date_sk 
is not null) (type: boolean)
                    Statistics: Num rows: 635 Data size: 7620 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: d_date_sk (type: int)
                      outputColumnNames: _col0
                      Statistics: Num rows: 635 Data size: 2540 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: int)
                        Statistics: Num rows: 635 Data size: 2540 Basic stats: 
COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: int)
                        outputColumnNames: _col0
                        Statistics: Num rows: 635 Data size: 2540 Basic stats: 
COMPLETE Column stats: COMPLETE
                        Group By Operator
                          keys: _col0 (type: int)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 317 Data size: 1268 Basic 
stats: COMPLETE Column stats: COMPLETE
                          Dynamic Partitioning Event Operator
                            Target Input: web_sales
                            Partition key expr: ws_sold_date_sk
                            Statistics: Num rows: 317 Data size: 1268 Basic 
stats: COMPLETE Column stats: COMPLETE
                            Target column: ws_sold_date_sk
                            Target Vertex: Map 6
            Execution mode: vectorized
        Map 6
            Map Operator Tree:
                TableScan
                  alias: web_sales
                  filterExpr: (ws_item_sk is not null and ws_bill_customer_sk 
is not null) (type: boolean)
                  Statistics: Num rows: 21594638446 Data size: 2850189889652 
Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (ws_item_sk is not null and ws_bill_customer_sk 
is not null) (type: boolean)
                    Statistics: Num rows: 21591944812 Data size: 345449542868 
Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ws_item_sk (type: int), ws_bill_customer_sk 
(type: int), ws_sales_price (type: float), ws_sold_date_sk (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3
                      Statistics: Num rows: 21591944812 Data size: 345449542868 
Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        condition expressions:
                          0 {_col0} {_col1} {_col2}
                          1
                        keys:
                          0 _col3 (type: int)
                          1 _col0 (type: int)
                        outputColumnNames: _col0, _col1, _col2
                        input vertices:
                          1 Map 5
                        Statistics: Num rows: 24138881964 Data size: 
289666583568 Basic stats: COMPLETE Column stats: COMPLETE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          condition expressions:
                            0 {_col1} {_col2}
                            1
                          keys:
                            0 _col0 (type: int)
                            1 _col0 (type: int)
                          outputColumnNames: _col1, _col2
                          input vertices:
                            1 Map 9
                          Statistics: Num rows: 6034720256 Data size: 
48277762048 Basic stats: COMPLETE Column stats: COMPLETE
                          Select Operator
                            expressions: _col1 (type: int), _col2 (type: float)
                            outputColumnNames: _col1, _col2
                            Statistics: Num rows: 6034720256 Data size: 
48277762048 Basic stats: COMPLETE Column stats: COMPLETE
                            Reduce Output Operator
                              key expressions: _col1 (type: int)
                              sort order: +
                              Map-reduce partition columns: _col1 (type: int)
                              Statistics: Num rows: 6034720256 Data size: 
48277762048 Basic stats: COMPLETE Column stats: COMPLETE
                              value expressions: _col2 (type: float)
            Execution mode: vectorized
        Map 7
            Map Operator Tree:
                TableScan
                  alias: item
                  filterExpr: ((i_item_sk) IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 
29) and i_item_id is not null) (type: boolean)
                  Statistics: Num rows: 462000 Data size: 663862160 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ((i_item_sk) IN (2, 3, 5, 7, 11, 13, 17, 19, 23, 
29) and i_item_id is not null) (type: boolean)
                    Statistics: Num rows: 231000 Data size: 24024000 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: i_item_id (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 231000 Data size: 23100000 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 115500 Data size: 11550000 Basic 
stats: COMPLETE Column stats: COMPLETE
                        Reduce Output Operator
                          key expressions: _col0 (type: string)
                          sort order: +
                          Map-reduce partition columns: _col0 (type: string)
                          Statistics: Num rows: 115500 Data size: 11550000 
Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Map 8
            Map Operator Tree:
                TableScan
                  alias: customer
                  filterExpr: (c_current_addr_sk is not null and c_customer_sk 
is not null) (type: boolean)
                  Statistics: Num rows: 80000000 Data size: 68801615852 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (c_current_addr_sk is not null and c_customer_sk 
is not null) (type: boolean)
                    Statistics: Num rows: 80000000 Data size: 640000000 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: c_customer_sk (type: int), c_current_addr_sk 
(type: int)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 80000000 Data size: 640000000 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col1 (type: int)
                        sort order: +
                        Map-reduce partition columns: _col1 (type: int)
                        Statistics: Num rows: 80000000 Data size: 640000000 
Basic stats: COMPLETE Column stats: COMPLETE
                        value expressions: _col0 (type: int)
            Execution mode: vectorized
        Map 9
            Map Operator Tree:
                TableScan
                  alias: item
                  filterExpr: (i_item_id is not null and i_item_sk is not null) 
(type: boolean)
                  Statistics: Num rows: 462000 Data size: 663862160 Basic 
stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (i_item_id is not null and i_item_sk is not 
null) (type: boolean)
                    Statistics: Num rows: 462000 Data size: 48048000 Basic 
stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: i_item_sk (type: int), i_item_id (type: 
string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 462000 Data size: 48048000 Basic 
stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Left Semi Join 0 to 1
                        condition expressions:
                          0 {_col0}
                          1
                        keys:
                          0 _col1 (type: string)
                          1 _col0 (type: string)
                        outputColumnNames: _col0
                        input vertices:
                          1 Map 7
                        Statistics: Num rows: 343407 Data size: 1373628 Basic 
stats: COMPLETE Column stats: COMPLETE
                        Select Operator
                          expressions: _col0 (type: int)
                          outputColumnNames: _col0
                          Statistics: Num rows: 343407 Data size: 1373628 Basic 
stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            key expressions: _col0 (type: int)
                            sort order: +
                            Map-reduce partition columns: _col0 (type: int)
                            Statistics: Num rows: 343407 Data size: 1373628 
Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized
        Reducer 2
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                condition expressions:
                  0 {VALUE._col2} {VALUE._col3}
                  1 {VALUE._col1}
                outputColumnNames: _col3, _col4, _col7
                Statistics: Num rows: 137325482305 Data size: 26229167120255 
Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col4 (type: string), _col3 (type: string), 
_col7 (type: float)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 137325482305 Data size: 26229167120255 
Basic stats: COMPLETE Column stats: COMPLETE
                  Group By Operator
                    aggregations: sum(_col2)
                    keys: _col0 (type: string), _col1 (type: string)
                    mode: hash
                    outputColumnNames: _col0, _col1, _col2
                    Statistics: Num rows: 137325482305 Data size: 
26778469049475 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: 
string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 
(type: string)
                      Statistics: Num rows: 137325482305 Data size: 
26778469049475 Basic stats: COMPLETE Column stats: COMPLETE
                      value expressions: _col2 (type: double)
        Reducer 3
            Reduce Operator Tree:
              Group By Operator
                aggregations: sum(VALUE._col0)
                keys: KEY._col0 (type: string), KEY._col1 (type: string)
                mode: mergepartial
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 137325482305 Data size: 27190445496390 
Basic stats: COMPLETE Column stats: COMPLETE
                Select Operator
                  expressions: _col0 (type: string), _col1 (type: string), 
_col2 (type: double)
                  outputColumnNames: _col0, _col1, _col2
                  Statistics: Num rows: 137325482305 Data size: 27190445496390 
Basic stats: COMPLETE Column stats: COMPLETE
                  Reduce Output Operator
                    key expressions: _col0 (type: string), _col1 (type: string)
                    sort order: ++
                    Statistics: Num rows: 137325482305 Data size: 
27190445496390 Basic stats: COMPLETE Column stats: COMPLETE
                    TopN Hash Memory Usage: 0.04
                    value expressions: _col2 (type: double)
            Execution mode: vectorized
        Reducer 4
            Reduce Operator Tree:
              Select Operator
                expressions: KEY.reducesinkkey0 (type: string), 
KEY.reducesinkkey1 (type: string), VALUE._col0 (type: double)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 137325482305 Data size: 27190445496390 
Basic stats: COMPLETE Column stats: COMPLETE
                Limit
                  Number of rows: 100
                  Statistics: Num rows: 100 Data size: 19800 Basic stats: 
COMPLETE Column stats: COMPLETE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 100 Data size: 19800 Basic stats: 
COMPLETE Column stats: COMPLETE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Execution mode: vectorized

  Stage: Stage-0
    Fetch Operator
      limit: 100
      Processor Tree:
        ListSink
{code}

> Hive : CBO incorrect join order in TPC-DS Q45 as self join selectivity has 
> incorrect CE
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-8526
>                 URL: https://issues.apache.org/jira/browse/HIVE-8526
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 0.14.0
>            Reporter: Mostafa Mokhtar
>            Assignee: Harish Butani
>            Priority: Critical
>             Fix For: 0.14.0
>
>         Attachments: HIVE-8526.1.patch
>
>
> The join order has Item joined last where it should be joined first
> Query 
> {code}
> select  ca_zip, ca_county, sum(ws_sales_price)
>  from
>     web_sales
>     JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
>     JOIN customer_address ON customer.c_current_addr_sk = 
> customer_address.ca_address_sk 
>     JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>     JOIN item ON web_sales.ws_item_sk = item.i_item_sk 
>  where
>         ( item.i_item_id in (select i_item_id
>                              from item i2
>                              where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, 
> 19, 23, 29)
>                              )
>             )
>         and d_qoy = 2 and d_year = 2000
>  group by ca_zip, ca_county
>  order by ca_zip, ca_county
>  limit 100
> {code}
> Plan
> {code}
> 2014-10-20 18:43:16,521 DEBUG [main]: parse.SemanticAnalyzer 
> (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = 
> 1.710158597922807E7, cumulative cost = {7.169080587598123E10 rows, 
> 3.420317295845614E7 cpu, 0.0 io}, id = 579
>   HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 
> 1.710158597922807E7, cumulative cost = {6.827294821015483E10 rows, 
> 1.710158697922807E7 cpu, 0.0 io}, id = 577
>     HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = 
> 1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, 
> 0.0 io}, id = 575
>       HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = 
> 1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, 
> 0.0 io}, id = 573
>         HiveProjectRel($f0=[$2], $f1=[$1], $f2=[$0]): rowcount = 
> 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, 
> 0.0 io}, id = 571
>           HiveProjectRel(ws_sales_price=[$2], ca_county=[$7], ca_zip=[$8]): 
> rowcount = 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 
> rows, 1.0 cpu, 0.0 io}, id = 569
>             HiveFilterRel(condition=[AND(=($11, 2), =($10, 2000))]): rowcount 
> = 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 
> cpu, 0.0 io}, id = 567
>               SemiJoinRel(condition=[=($13, $14)], joinType=[inner]): 
> rowcount = 3.371069537368245E10, cumulative cost = {6.485509054432843E10 
> rows, 1.0 cpu, 0.0 io}, id = 565
>                 HiveProjectRel(ws_item_sk=[$0], ws_bill_customer_sk=[$1], 
> ws_sales_price=[$2], ws_sold_date_sk=[$3], c_customer_sk=[$9], 
> c_current_addr_sk=[$10], ca_address_sk=[$11], ca_county=[$12], ca_zip=[$13], 
> d_date_sk=[$6], d_year=[$7], d_qoy=[$8], i_item_sk=[$4], i_item_id=[$5]): 
> rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 
> rows, 0.0 cpu, 0.0 io}, id = 669
>                   HiveJoinRel(condition=[=($1, $9)], joinType=[inner]): 
> rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 
> rows, 0.0 cpu, 0.0 io}, id = 667
>                     HiveJoinRel(condition=[=($3, $6)], joinType=[inner]): 
> rowcount = 2.1594638446E10, cumulative cost = {4.3189811941E10 rows, 0.0 cpu, 
> 0.0 io}, id = 664
>                       HiveJoinRel(condition=[=($0, $4)], joinType=[inner]): 
> rowcount = 2.1594638446E10, cumulative cost = {2.1595100446E10 rows, 0.0 cpu, 
> 0.0 io}, id = 601
>                         HiveProjectRel(ws_item_sk=[$2], 
> ws_bill_customer_sk=[$3], ws_sales_price=[$20], ws_sold_date_sk=[$33]): 
> rowcount = 2.1594638446E10, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id 
> = 497
>                           
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): 
> rowcount = 2.1594638446E10, cumulative cost = {0}, id = 341
>                         HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): 
> rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 555
>                           
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 340
>                       HiveProjectRel(d_date_sk=[$0], d_year=[$6], 
> d_qoy=[$10]): rowcount = 73049.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 551
>                         
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): 
> rowcount = 73049.0, cumulative cost = {0}, id = 342
>                     HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): 
> rowcount = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 
> io}, id = 598
>                       HiveProjectRel(c_customer_sk=[$0], 
> c_current_addr_sk=[$4]): rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 500
>                         
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 343
>                       HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], 
> ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 547
>                         
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): 
> rowcount = 4.0E7, cumulative cost = {0}, id = 339
>                 HiveProjectRel(i_item_id=[$1]): rowcount = 1.05119214745814, 
> cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 563
>                   HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): rowcount = 
> 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 561
>                     HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13, 17, 
> 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 
> cpu, 0.0 io}, id = 559
>                       
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 340
> {code}
> Then I rewrote the query trying to force CBO to generate the correct join 
> order
> {code}
> with items as (select i_item_sk from 
> item  where
>         ( item.i_item_id in (select i_item_id
>                              from item i2
>                              where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, 
> 19, 23, 29)
>                              )
>             )
> )
> select  ca_zip, ca_county, sum(ws_sales_price)
>  from
>     web_sales
>     JOIN items ON web_sales.ws_item_sk = items.i_item_sk 
>     JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
>     JOIN customer_address ON customer.c_current_addr_sk = 
> customer_address.ca_address_sk 
>     JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>  where
>  d_qoy = 2 and d_year = 2000
>  group by ca_zip, ca_county
>  order by ca_zip, ca_county
>  limit 100
> {code}
> But the correct join order wasn't generated because CE for item x item + 
> filter has a selectivity of 1.
> {code}
> 2014-10-20 18:46:27,120 DEBUG [main]: parse.SemanticAnalyzer 
> (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {2.8364280421639153E10 rows, 
> 3.3190782577088475E7 cpu, 0.0 io}, id = 1291
>   HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {2.505357243157397E10 rows, 
> 1.6595391288544238E7 cpu, 0.0 io}, id = 1289
>     HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 
> 0.0 io}, id = 1287
>       HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 
> 0.0 io}, id = 1285
>         HiveProjectRel($f0=[$9], $f1=[$8], $f2=[$2]): rowcount = 
> 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 
> 0.0 io}, id = 1283
>           HiveProjectRel(ws_item_sk=[$5], ws_bill_customer_sk=[$6], 
> ws_sales_price=[$7], ws_sold_date_sk=[$8], i_item_sk=[$12], 
> c_customer_sk=[$0], c_current_addr_sk=[$1], ca_address_sk=[$2], 
> ca_county=[$3], ca_zip=[$4], d_date_sk=[$9], d_year=[$10], d_qoy=[$11]): 
> rowcount = 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 
> 0.0 cpu, 0.0 io}, id = 1380
>             HiveJoinRel(condition=[=($6, $0)], joinType=[inner]): rowcount = 
> 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 
> 0.0 io}, id = 1378
>               HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): rowcount 
> = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 io}, id = 
> 1309
>                 HiveProjectRel(c_customer_sk=[$0], c_current_addr_sk=[$4]): 
> rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1269
>                   
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 1035
>                 HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], 
> ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 1273
>                   
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): 
> rowcount = 4.0E7, cumulative cost = {0}, id = 1032
>               HiveJoinRel(condition=[=($0, $7)], joinType=[inner]): rowcount 
> = 3.856185436785714E7, cumulative cost = {2.16336624308125E10 rows, 0.0 cpu, 
> 0.0 io}, id = 1376
>                 HiveJoinRel(condition=[=($3, $4)], joinType=[inner]): 
> rowcount = 3.856185436785714E7, cumulative cost = {2.159463857644464E10 rows, 
> 0.0 cpu, 0.0 io}, id = 1316
>                   HiveProjectRel(ws_item_sk=[$2], ws_bill_customer_sk=[$3], 
> ws_sales_price=[$20], ws_sold_date_sk=[$33]): rowcount = 2.1594638446E10, 
> cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1205
>                     
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): 
> rowcount = 2.1594638446E10, cumulative cost = {0}, id = 1033
>                   HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]): 
> rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 1279
>                     HiveFilterRel(condition=[AND(=($10, 2), =($6, 2000))]): 
> rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 1277
>                       
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): 
> rowcount = 73049.0, cumulative cost = {0}, id = 1034
>                 HiveProjectRel(i_item_sk=[$0]): rowcount = 462000.0, 
> cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1265
>                   HiveFilterRel(condition=[=(1, 1)]): rowcount = 462000.0, 
> cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1263
>                     SemiJoinRel(condition=[=($1, $2)], joinType=[inner]): 
> rowcount = 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1261
>                       HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): 
> rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1253
>                         
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 1024
>                       HiveProjectRel(i_item_id=[$1]): rowcount = 
> 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1259
>                         HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): 
> rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 1257
>                           HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13, 
> 17, 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 
> 0.0 cpu, 0.0 io}, id = 1255
>                             
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 1024
> {code}
> This query generates the correct join order 
> {code}
>  with items as (select i_item_sk from 
> item  where
>          item.i_item_id in (select i_item_id
>                              from item i2
>                              where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, 
> 19, 23, 29)
>                              )
>             
> ),
>   ws as (
>  select ws_bill_customer_sk,ws_sales_price,ws_sold_date_sk
> from  web_sales
>     JOIN items ON web_sales.ws_item_sk = items.i_item_sk 
>  )
>  select  ca_zip, ca_county, sum(ws_sales_price)
>  from ws 
>     JOIN customer ON ws.ws_bill_customer_sk = customer.c_customer_sk
>     JOIN customer_address ON customer.c_current_addr_sk = 
> customer_address.ca_address_sk 
>     JOIN date_dim ON ws.ws_sold_date_sk = date_dim.d_date_sk
>  where d_qoy = 2 and d_year = 2000
>  group by ca_zip, ca_county
>  order by ca_zip, ca_county
>  limit 100
> {code}
> Plan 
> {code}
> 2014-10-20 19:13:15,989 DEBUG [main]: parse.SemanticAnalyzer 
> (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {4.99203570142713E10 rows, 
> 3.3190783577088475E7 cpu, 0.0 io}, id = 4367
>   HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {4.6609649024206116E10 rows, 
> 1.6595392288544238E7 cpu, 0.0 io}, id = 4365
>     HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4363
>       HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = 
> 1.6595391288544238E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4361
>         HiveProjectRel($f0=[$7], $f1=[$6], $f2=[$1]): rowcount = 
> 6.019767031014723E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4359
>           HiveProjectRel(ws_bill_customer_sk=[$5], ws_sales_price=[$6], 
> ws_sold_date_sk=[$7], c_customer_sk=[$0], c_current_addr_sk=[$1], 
> ca_address_sk=[$2], ca_county=[$3], ca_zip=[$4], d_date_sk=[$8], d_year=[$9], 
> d_qoy=[$10]): rowcount = 6.019767031014723E7, cumulative cost = 
> {4.329894103414093E10 rows, 1.0 cpu, 0.0 io}, id = 4426
>             HiveJoinRel(condition=[=($5, $0)], joinType=[inner]): rowcount = 
> 6.019767031014723E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4424
>               HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): rowcount 
> = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 io}, id = 
> 4392
>                 HiveProjectRel(c_customer_sk=[$0], c_current_addr_sk=[$4]): 
> rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4345
>                   
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): 
> rowcount = 8.0E7, cumulative cost = {0}, id = 4101
>                 HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], 
> ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 
> io}, id = 4349
>                   
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): 
> rowcount = 4.0E7, cumulative cost = {0}, id = 4099
>               HiveJoinRel(condition=[=($2, $3)], joinType=[inner]): rowcount 
> = 3.856185436785714E7, cumulative cost = {4.318973902344464E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4395
>                 HiveProjectRel(ws_bill_customer_sk=[$1], ws_sales_price=[$2], 
> ws_sold_date_sk=[$3]): rowcount = 2.1594638446E10, cumulative cost = 
> {2.1595100447E10 rows, 1.0 cpu, 0.0 io}, id = 4343
>                   HiveProjectRel(ws_item_sk=[$0], ws_bill_customer_sk=[$1], 
> ws_sales_price=[$2], ws_sold_date_sk=[$3], i_item_sk=[$4]): rowcount = 
> 2.1594638446E10, cumulative cost = {2.1595100447E10 rows, 1.0 cpu, 0.0 io}, 
> id = 4388
>                     HiveJoinRel(condition=[=($0, $4)], joinType=[inner]): 
> rowcount = 2.1594638446E10, cumulative cost = {2.1595100447E10 rows, 1.0 cpu, 
> 0.0 io}, id = 4383
>                       HiveProjectRel(ws_item_sk=[$2], 
> ws_bill_customer_sk=[$3], ws_sales_price=[$20], ws_sold_date_sk=[$33]): 
> rowcount = 2.1594638446E10, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id 
> = 4277
>                         
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): 
> rowcount = 2.1594638446E10, cumulative cost = {0}, id = 4096
>                       HiveProjectRel(i_item_sk=[$0]): rowcount = 462000.0, 
> cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 4339
>                         HiveFilterRel(condition=[=(1, 1)]): rowcount = 
> 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 4337
>                           SemiJoinRel(condition=[=($1, $2)], 
> joinType=[inner]): rowcount = 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 
> 0.0 io}, id = 4335
>                             HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): 
> rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4327
>                               
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 4088
>                             HiveProjectRel(i_item_id=[$1]): rowcount = 
> 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4333
>                               HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): 
> rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 4331
>                                 HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 
> 11, 13, 17, 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = 
> {0.0 rows, 0.0 cpu, 0.0 io}, id = 4329
>                                   
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = 
> 462000.0, cumulative cost = {0}, id = 4088
>                 HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]): 
> rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 4355
>                   HiveFilterRel(condition=[AND(=($10, 2), =($6, 2000))]): 
> rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, 
> id = 4353
>                     
> HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): 
> rowcount = 73049.0, cumulative cost = {0}, id = 4100
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-8526) Hive : CBO incorrect join order in TPC-DS Q45 as self join selectivity has incorrect CE

Reply via email to