[ https://issues.apache.org/jira/browse/HIVE-8526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14192295#comment-14192295 ]
Gunther Hagleitner commented on HIVE-8526: ------------------------------------------ HIVE-8653 fixes the problem. Closing as duplicate. > Hive : CBO incorrect join order in TPC-DS Q45 as self join selectivity has > incorrect CE > --------------------------------------------------------------------------------------- > > Key: HIVE-8526 > URL: https://issues.apache.org/jira/browse/HIVE-8526 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 0.14.0 > Reporter: Mostafa Mokhtar > Assignee: Harish Butani > Priority: Critical > Fix For: 0.14.0 > > Attachments: HIVE-8526.1.patch > > > The join order has Item joined last where it should be joined first > Query > {code} > select ca_zip, ca_county, sum(ws_sales_price) > from > web_sales > JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk > JOIN customer_address ON customer.c_current_addr_sk = > customer_address.ca_address_sk > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > JOIN item ON web_sales.ws_item_sk = item.i_item_sk > where > ( item.i_item_id in (select i_item_id > from item i2 > where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, > 19, 23, 29) > ) > ) > and d_qoy = 2 and d_year = 2000 > group by ca_zip, ca_county > order by ca_zip, ca_county > limit 100 > {code} > Plan > {code} > 2014-10-20 18:43:16,521 DEBUG [main]: parse.SemanticAnalyzer > (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = > 1.710158597922807E7, cumulative cost = {7.169080587598123E10 rows, > 3.420317295845614E7 cpu, 0.0 io}, id = 579 > HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = > 1.710158597922807E7, cumulative cost = {6.827294821015483E10 rows, > 1.710158697922807E7 cpu, 0.0 io}, id = 577 > HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = > 1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, > 0.0 io}, id = 575 > HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = > 1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, > 0.0 io}, id = 573 > HiveProjectRel($f0=[$2], $f1=[$1], $f2=[$0]): rowcount = > 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, > 0.0 io}, id = 571 > HiveProjectRel(ws_sales_price=[$2], ca_county=[$7], ca_zip=[$8]): > rowcount = 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 > rows, 1.0 cpu, 0.0 io}, id = 569 > HiveFilterRel(condition=[AND(=($11, 2), =($10, 2000))]): rowcount > = 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 > cpu, 0.0 io}, id = 567 > SemiJoinRel(condition=[=($13, $14)], joinType=[inner]): > rowcount = 3.371069537368245E10, cumulative cost = {6.485509054432843E10 > rows, 1.0 cpu, 0.0 io}, id = 565 > HiveProjectRel(ws_item_sk=[$0], ws_bill_customer_sk=[$1], > ws_sales_price=[$2], ws_sold_date_sk=[$3], c_customer_sk=[$9], > c_current_addr_sk=[$10], ca_address_sk=[$11], ca_county=[$12], ca_zip=[$13], > d_date_sk=[$6], d_year=[$7], d_qoy=[$8], i_item_sk=[$4], i_item_id=[$5]): > rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 > rows, 0.0 cpu, 0.0 io}, id = 669 > HiveJoinRel(condition=[=($1, $9)], joinType=[inner]): > rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 > rows, 0.0 cpu, 0.0 io}, id = 667 > HiveJoinRel(condition=[=($3, $6)], joinType=[inner]): > rowcount = 2.1594638446E10, cumulative cost = {4.3189811941E10 rows, 0.0 cpu, > 0.0 io}, id = 664 > HiveJoinRel(condition=[=($0, $4)], joinType=[inner]): > rowcount = 2.1594638446E10, cumulative cost = {2.1595100446E10 rows, 0.0 cpu, > 0.0 io}, id = 601 > HiveProjectRel(ws_item_sk=[$2], > ws_bill_customer_sk=[$3], ws_sales_price=[$20], ws_sold_date_sk=[$33]): > rowcount = 2.1594638446E10, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id > = 497 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): > rowcount = 2.1594638446E10, cumulative cost = {0}, id = 341 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): > rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 555 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 340 > HiveProjectRel(d_date_sk=[$0], d_year=[$6], > d_qoy=[$10]): rowcount = 73049.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 551 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): > rowcount = 73049.0, cumulative cost = {0}, id = 342 > HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): > rowcount = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 > io}, id = 598 > HiveProjectRel(c_customer_sk=[$0], > c_current_addr_sk=[$4]): rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 > cpu, 0.0 io}, id = 500 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): > rowcount = 8.0E7, cumulative cost = {0}, id = 343 > HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], > ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 547 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): > rowcount = 4.0E7, cumulative cost = {0}, id = 339 > HiveProjectRel(i_item_id=[$1]): rowcount = 1.05119214745814, > cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 563 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): rowcount = > 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 561 > HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13, 17, > 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 > cpu, 0.0 io}, id = 559 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 340 > {code} > Then I rewrote the query trying to force CBO to generate the correct join > order > {code} > with items as (select i_item_sk from > item where > ( item.i_item_id in (select i_item_id > from item i2 > where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, > 19, 23, 29) > ) > ) > ) > select ca_zip, ca_county, sum(ws_sales_price) > from > web_sales > JOIN items ON web_sales.ws_item_sk = items.i_item_sk > JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk > JOIN customer_address ON customer.c_current_addr_sk = > customer_address.ca_address_sk > JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk > where > d_qoy = 2 and d_year = 2000 > group by ca_zip, ca_county > order by ca_zip, ca_county > limit 100 > {code} > But the correct join order wasn't generated because CE for item x item + > filter has a selectivity of 1. > {code} > 2014-10-20 18:46:27,120 DEBUG [main]: parse.SemanticAnalyzer > (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = > 1.6595391288544238E7, cumulative cost = {2.8364280421639153E10 rows, > 3.3190782577088475E7 cpu, 0.0 io}, id = 1291 > HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = > 1.6595391288544238E7, cumulative cost = {2.505357243157397E10 rows, > 1.6595391288544238E7 cpu, 0.0 io}, id = 1289 > HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = > 1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, > 0.0 io}, id = 1287 > HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = > 1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, > 0.0 io}, id = 1285 > HiveProjectRel($f0=[$9], $f1=[$8], $f2=[$2]): rowcount = > 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, > 0.0 io}, id = 1283 > HiveProjectRel(ws_item_sk=[$5], ws_bill_customer_sk=[$6], > ws_sales_price=[$7], ws_sold_date_sk=[$8], i_item_sk=[$12], > c_customer_sk=[$0], c_current_addr_sk=[$1], ca_address_sk=[$2], > ca_county=[$3], ca_zip=[$4], d_date_sk=[$9], d_year=[$10], d_qoy=[$11]): > rowcount = 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, > 0.0 cpu, 0.0 io}, id = 1380 > HiveJoinRel(condition=[=($6, $0)], joinType=[inner]): rowcount = > 6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, > 0.0 io}, id = 1378 > HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): rowcount > = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 io}, id = > 1309 > HiveProjectRel(c_customer_sk=[$0], c_current_addr_sk=[$4]): > rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1269 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): > rowcount = 8.0E7, cumulative cost = {0}, id = 1035 > HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], > ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 1273 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): > rowcount = 4.0E7, cumulative cost = {0}, id = 1032 > HiveJoinRel(condition=[=($0, $7)], joinType=[inner]): rowcount > = 3.856185436785714E7, cumulative cost = {2.16336624308125E10 rows, 0.0 cpu, > 0.0 io}, id = 1376 > HiveJoinRel(condition=[=($3, $4)], joinType=[inner]): > rowcount = 3.856185436785714E7, cumulative cost = {2.159463857644464E10 rows, > 0.0 cpu, 0.0 io}, id = 1316 > HiveProjectRel(ws_item_sk=[$2], ws_bill_customer_sk=[$3], > ws_sales_price=[$20], ws_sold_date_sk=[$33]): rowcount = 2.1594638446E10, > cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1205 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): > rowcount = 2.1594638446E10, cumulative cost = {0}, id = 1033 > HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]): > rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 1279 > HiveFilterRel(condition=[AND(=($10, 2), =($6, 2000))]): > rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 1277 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): > rowcount = 73049.0, cumulative cost = {0}, id = 1034 > HiveProjectRel(i_item_sk=[$0]): rowcount = 462000.0, > cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1265 > HiveFilterRel(condition=[=(1, 1)]): rowcount = 462000.0, > cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1263 > SemiJoinRel(condition=[=($1, $2)], joinType=[inner]): > rowcount = 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1261 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): > rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1253 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 1024 > HiveProjectRel(i_item_id=[$1]): rowcount = > 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1259 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): > rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 1257 > HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13, > 17, 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows, > 0.0 cpu, 0.0 io}, id = 1255 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 1024 > {code} > This query generates the correct join order > {code} > with items as (select i_item_sk from > item where > item.i_item_id in (select i_item_id > from item i2 > where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, > 19, 23, 29) > ) > > ), > ws as ( > select ws_bill_customer_sk,ws_sales_price,ws_sold_date_sk > from web_sales > JOIN items ON web_sales.ws_item_sk = items.i_item_sk > ) > select ca_zip, ca_county, sum(ws_sales_price) > from ws > JOIN customer ON ws.ws_bill_customer_sk = customer.c_customer_sk > JOIN customer_address ON customer.c_current_addr_sk = > customer_address.ca_address_sk > JOIN date_dim ON ws.ws_sold_date_sk = date_dim.d_date_sk > where d_qoy = 2 and d_year = 2000 > group by ca_zip, ca_county > order by ca_zip, ca_county > limit 100 > {code} > Plan > {code} > 2014-10-20 19:13:15,989 DEBUG [main]: parse.SemanticAnalyzer > (SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount = > 1.6595391288544238E7, cumulative cost = {4.99203570142713E10 rows, > 3.3190783577088475E7 cpu, 0.0 io}, id = 4367 > HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount = > 1.6595391288544238E7, cumulative cost = {4.6609649024206116E10 rows, > 1.6595392288544238E7 cpu, 0.0 io}, id = 4365 > HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount = > 1.6595391288544238E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, > 0.0 io}, id = 4363 > HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount = > 1.6595391288544238E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, > 0.0 io}, id = 4361 > HiveProjectRel($f0=[$7], $f1=[$6], $f2=[$1]): rowcount = > 6.019767031014723E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, > 0.0 io}, id = 4359 > HiveProjectRel(ws_bill_customer_sk=[$5], ws_sales_price=[$6], > ws_sold_date_sk=[$7], c_customer_sk=[$0], c_current_addr_sk=[$1], > ca_address_sk=[$2], ca_county=[$3], ca_zip=[$4], d_date_sk=[$8], d_year=[$9], > d_qoy=[$10]): rowcount = 6.019767031014723E7, cumulative cost = > {4.329894103414093E10 rows, 1.0 cpu, 0.0 io}, id = 4426 > HiveJoinRel(condition=[=($5, $0)], joinType=[inner]): rowcount = > 6.019767031014723E7, cumulative cost = {4.329894103414093E10 rows, 1.0 cpu, > 0.0 io}, id = 4424 > HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): rowcount > = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 io}, id = > 4392 > HiveProjectRel(c_customer_sk=[$0], c_current_addr_sk=[$4]): > rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4345 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): > rowcount = 8.0E7, cumulative cost = {0}, id = 4101 > HiveProjectRel(ca_address_sk=[$0], ca_county=[$7], > ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 > io}, id = 4349 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]): > rowcount = 4.0E7, cumulative cost = {0}, id = 4099 > HiveJoinRel(condition=[=($2, $3)], joinType=[inner]): rowcount > = 3.856185436785714E7, cumulative cost = {4.318973902344464E10 rows, 1.0 cpu, > 0.0 io}, id = 4395 > HiveProjectRel(ws_bill_customer_sk=[$1], ws_sales_price=[$2], > ws_sold_date_sk=[$3]): rowcount = 2.1594638446E10, cumulative cost = > {2.1595100447E10 rows, 1.0 cpu, 0.0 io}, id = 4343 > HiveProjectRel(ws_item_sk=[$0], ws_bill_customer_sk=[$1], > ws_sales_price=[$2], ws_sold_date_sk=[$3], i_item_sk=[$4]): rowcount = > 2.1594638446E10, cumulative cost = {2.1595100447E10 rows, 1.0 cpu, 0.0 io}, > id = 4388 > HiveJoinRel(condition=[=($0, $4)], joinType=[inner]): > rowcount = 2.1594638446E10, cumulative cost = {2.1595100447E10 rows, 1.0 cpu, > 0.0 io}, id = 4383 > HiveProjectRel(ws_item_sk=[$2], > ws_bill_customer_sk=[$3], ws_sales_price=[$20], ws_sold_date_sk=[$33]): > rowcount = 2.1594638446E10, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id > = 4277 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): > rowcount = 2.1594638446E10, cumulative cost = {0}, id = 4096 > HiveProjectRel(i_item_sk=[$0]): rowcount = 462000.0, > cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 4339 > HiveFilterRel(condition=[=(1, 1)]): rowcount = > 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 4337 > SemiJoinRel(condition=[=($1, $2)], > joinType=[inner]): rowcount = 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, > 0.0 io}, id = 4335 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): > rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4327 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 4088 > HiveProjectRel(i_item_id=[$1]): rowcount = > 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 4333 > HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): > rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 4331 > HiveFilterRel(condition=[in($0, 2, 3, 5, 7, > 11, 13, 17, 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = > {0.0 rows, 0.0 cpu, 0.0 io}, id = 4329 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount = > 462000.0, cumulative cost = {0}, id = 4088 > HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]): > rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 4355 > HiveFilterRel(condition=[AND(=($10, 2), =($6, 2000))]): > rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, > id = 4353 > > HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): > rowcount = 73049.0, cumulative cost = {0}, id = 4100 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)