Mostafa Mokhtar created HIVE-8526:
-------------------------------------
Summary: Hive : CBO incorrect join order in TPC-DS Q45 as self
join selectivity has incorrect CE
Key: HIVE-8526
URL: https://issues.apache.org/jira/browse/HIVE-8526
Project: Hive
Issue Type: Bug
Components: CBO
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Harish Butani
Fix For: 0.14.0
The join order has Item joined last where it should be joined first
Query
{code}
select ca_zip, ca_county, sum(ws_sales_price)
from
web_sales
JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
JOIN customer_address ON customer.c_current_addr_sk =
customer_address.ca_address_sk
JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
JOIN item ON web_sales.ws_item_sk = item.i_item_sk
where
( item.i_item_id in (select i_item_id
from item i2
where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19,
23, 29)
)
)
and d_qoy = 2 and d_year = 2000
group by ca_zip, ca_county
order by ca_zip, ca_county
limit 100
{code}
Plan
{code}
2014-10-20 18:43:16,521 DEBUG [main]: parse.SemanticAnalyzer
(SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount =
1.710158597922807E7, cumulative cost = {7.169080587598123E10 rows,
3.420317295845614E7 cpu, 0.0 io}, id = 579
HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount =
1.710158597922807E7, cumulative cost = {6.827294821015483E10 rows,
1.710158697922807E7 cpu, 0.0 io}, id = 577
HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount =
1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, 0.0
io}, id = 575
HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount =
1.710158597922807E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu, 0.0
io}, id = 573
HiveProjectRel($f0=[$2], $f1=[$1], $f2=[$0]): rowcount =
6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu,
0.0 io}, id = 571
HiveProjectRel(ws_sales_price=[$2], ca_county=[$7], ca_zip=[$8]):
rowcount = 6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows,
1.0 cpu, 0.0 io}, id = 569
HiveFilterRel(condition=[AND(=($11, 2), =($10, 2000))]): rowcount =
6.0197670310147226E7, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu,
0.0 io}, id = 567
SemiJoinRel(condition=[=($13, $14)], joinType=[inner]): rowcount
= 3.371069537368245E10, cumulative cost = {6.485509054432843E10 rows, 1.0 cpu,
0.0 io}, id = 565
HiveProjectRel(ws_item_sk=[$0], ws_bill_customer_sk=[$1],
ws_sales_price=[$2], ws_sold_date_sk=[$3], c_customer_sk=[$9],
c_current_addr_sk=[$10], ca_address_sk=[$11], ca_county=[$12], ca_zip=[$13],
d_date_sk=[$6], d_year=[$7], d_qoy=[$8], i_item_sk=[$4], i_item_id=[$5]):
rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 rows,
0.0 cpu, 0.0 io}, id = 669
HiveJoinRel(condition=[=($1, $9)], joinType=[inner]):
rowcount = 3.371069537368245E10, cumulative cost = {6.485509054332843E10 rows,
0.0 cpu, 0.0 io}, id = 667
HiveJoinRel(condition=[=($3, $6)], joinType=[inner]):
rowcount = 2.1594638446E10, cumulative cost = {4.3189811941E10 rows, 0.0 cpu,
0.0 io}, id = 664
HiveJoinRel(condition=[=($0, $4)], joinType=[inner]):
rowcount = 2.1594638446E10, cumulative cost = {2.1595100446E10 rows, 0.0 cpu,
0.0 io}, id = 601
HiveProjectRel(ws_item_sk=[$2],
ws_bill_customer_sk=[$3], ws_sales_price=[$20], ws_sold_date_sk=[$33]):
rowcount = 2.1594638446E10, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id =
497
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): rowcount
= 2.1594638446E10, cumulative cost = {0}, id = 341
HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]):
rowcount = 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 555
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount =
462000.0, cumulative cost = {0}, id = 340
HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]):
rowcount = 73049.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 551
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): rowcount
= 73049.0, cumulative cost = {0}, id = 342
HiveJoinRel(condition=[=($1, $2)], joinType=[inner]):
rowcount = 7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0
io}, id = 598
HiveProjectRel(c_customer_sk=[$0],
c_current_addr_sk=[$4]): rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0
cpu, 0.0 io}, id = 500
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): rowcount
= 8.0E7, cumulative cost = {0}, id = 343
HiveProjectRel(ca_address_sk=[$0], ca_county=[$7],
ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io},
id = 547
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]):
rowcount = 4.0E7, cumulative cost = {0}, id = 339
HiveProjectRel(i_item_id=[$1]): rowcount = 1.05119214745814,
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 563
HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): rowcount =
1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 561
HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13, 17, 19,
23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu,
0.0 io}, id = 559
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount =
462000.0, cumulative cost = {0}, id = 340
{code}
Then I rewrote the query trying to force CBO to generate the correct join order
{code}
with items as (select i_item_sk from
item where
( item.i_item_id in (select i_item_id
from item i2
where i2.i_item_sk in (2, 3, 5, 7, 11, 13, 17, 19,
23, 29)
)
)
)
select ca_zip, ca_county, sum(ws_sales_price)
from
web_sales
JOIN items ON web_sales.ws_item_sk = items.i_item_sk
JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
JOIN customer_address ON customer.c_current_addr_sk =
customer_address.ca_address_sk
JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
where
d_qoy = 2 and d_year = 2000
group by ca_zip, ca_county
order by ca_zip, ca_county
limit 100
{code}
But the correct join order wasn't generated because CE for item x item + filter
has a selectivity of 1.
{code}
2014-10-20 18:46:27,120 DEBUG [main]: parse.SemanticAnalyzer
(SemanticAnalyzer.java:apply(12330)) - HiveSortRel(fetch=[100]): rowcount =
1.6595391288544238E7, cumulative cost = {2.8364280421639153E10 rows,
3.3190782577088475E7 cpu, 0.0 io}, id = 1291
HiveSortRel(sort0=[$0], sort1=[$1], dir0=[ASC], dir1=[ASC]): rowcount =
1.6595391288544238E7, cumulative cost = {2.505357243157397E10 rows,
1.6595391288544238E7 cpu, 0.0 io}, id = 1289
HiveProjectRel(ca_zip=[$0], ca_county=[$1], _o__c2=[$2]): rowcount =
1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu,
0.0 io}, id = 1287
HiveAggregateRel(group=[{0, 1}], agg#0=[sum($2)]): rowcount =
1.6595391288544238E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu,
0.0 io}, id = 1285
HiveProjectRel($f0=[$9], $f1=[$8], $f2=[$2]): rowcount =
6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 0.0
io}, id = 1283
HiveProjectRel(ws_item_sk=[$5], ws_bill_customer_sk=[$6],
ws_sales_price=[$7], ws_sold_date_sk=[$8], i_item_sk=[$12], c_customer_sk=[$0],
c_current_addr_sk=[$1], ca_address_sk=[$2], ca_county=[$3], ca_zip=[$4],
d_date_sk=[$9], d_year=[$10], d_qoy=[$11]): rowcount = 6.019767031014723E7,
cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 0.0 io}, id = 1380
HiveJoinRel(condition=[=($6, $0)], joinType=[inner]): rowcount =
6.019767031014723E7, cumulative cost = {2.174286444150879E10 rows, 0.0 cpu, 0.0
io}, id = 1378
HiveJoinRel(condition=[=($1, $2)], joinType=[inner]): rowcount =
7.064015632843196E7, cumulative cost = {1.2E8 rows, 0.0 cpu, 0.0 io}, id = 1309
HiveProjectRel(c_customer_sk=[$0], c_current_addr_sk=[$4]):
rowcount = 8.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1269
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer]]): rowcount
= 8.0E7, cumulative cost = {0}, id = 1035
HiveProjectRel(ca_address_sk=[$0], ca_county=[$7],
ca_zip=[$9]): rowcount = 4.0E7, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io},
id = 1273
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.customer_address]]):
rowcount = 4.0E7, cumulative cost = {0}, id = 1032
HiveJoinRel(condition=[=($0, $7)], joinType=[inner]): rowcount =
3.856185436785714E7, cumulative cost = {2.16336624308125E10 rows, 0.0 cpu, 0.0
io}, id = 1376
HiveJoinRel(condition=[=($3, $4)], joinType=[inner]): rowcount
= 3.856185436785714E7, cumulative cost = {2.159463857644464E10 rows, 0.0 cpu,
0.0 io}, id = 1316
HiveProjectRel(ws_item_sk=[$2], ws_bill_customer_sk=[$3],
ws_sales_price=[$20], ws_sold_date_sk=[$33]): rowcount = 2.1594638446E10,
cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1205
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.web_sales]]): rowcount
= 2.1594638446E10, cumulative cost = {0}, id = 1033
HiveProjectRel(d_date_sk=[$0], d_year=[$6], d_qoy=[$10]):
rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io},
id = 1279
HiveFilterRel(condition=[AND(=($10, 2), =($6, 2000))]):
rowcount = 130.44464285714287, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io},
id = 1277
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.date_dim]]): rowcount
= 73049.0, cumulative cost = {0}, id = 1034
HiveProjectRel(i_item_sk=[$0]): rowcount = 462000.0, cumulative
cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1265
HiveFilterRel(condition=[=(1, 1)]): rowcount = 462000.0,
cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1263
SemiJoinRel(condition=[=($1, $2)], joinType=[inner]):
rowcount = 462000.0, cumulative cost = {1.0 rows, 1.0 cpu, 0.0 io}, id = 1261
HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]): rowcount
= 462000.0, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1253
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount =
462000.0, cumulative cost = {0}, id = 1024
HiveProjectRel(i_item_id=[$1]): rowcount =
1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id = 1259
HiveProjectRel(i_item_sk=[$0], i_item_id=[$1]):
rowcount = 1.05119214745814, cumulative cost = {0.0 rows, 0.0 cpu, 0.0 io}, id
= 1257
HiveFilterRel(condition=[in($0, 2, 3, 5, 7, 11, 13,
17, 19, 23, 29)]): rowcount = 1.05119214745814, cumulative cost = {0.0 rows,
0.0 cpu, 0.0 io}, id = 1255
HiveTableScanRel(table=[[tpcds_bin_partitioned_orc_30000.item]]): rowcount =
462000.0, cumulative cost = {0}, id = 1024
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)