Mostafa Mokhtar created HIVE-8769:
-------------------------------------
Summary: Physical optimizer : Incorrect CE results in a shuffle
join instead of a Map join (PK/FK pattern not detected)
Key: HIVE-8769
URL: https://issues.apache.org/jira/browse/HIVE-8769
Project: Hive
Issue Type: Bug
Components: Physical Optimizer
Affects Versions: 0.14.0
Reporter: Mostafa Mokhtar
Assignee: Prasanth J
Fix For: 0.15.0
TPC-DS Q82 is running slower than hive 13 because the join type is not correct.
The estimate for item x inventory x date_dim is 227 Million rows while the
actual is 3K rows.
Hive 13 finishes in 753 seconds.
Hive 14 finishes in 1,267 seconds.
Hive 14 + force map join finished in 431 seconds.
Query
{code}
select i_item_id
,i_item_desc
,i_current_price
from item, inventory, date_dim, store_sales
where i_current_price between 30 and 30+30
and inv_item_sk = i_item_sk
and d_date_sk=inv_date_sk
and d_date between '2002-05-30' and '2002-07-30'
and i_manufact_id in (437,129,727,663)
and inv_quantity_on_hand between 100 and 500
and ss_item_sk = i_item_sk
group by i_item_id,i_item_desc,i_current_price
order by i_item_id
limit 100
{code}
Plan
{code}
STAGE PLANS:
Stage: Stage-1
Tez
Edges:
Map 7 <- Map 1 (BROADCAST_EDGE), Map 2 (BROADCAST_EDGE)
Reducer 4 <- Map 3 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE)
Reducer 5 <- Reducer 4 (SIMPLE_EDGE)
Reducer 6 <- Reducer 5 (SIMPLE_EDGE)
DagName: mmokhtar_20141106005353_7a2eb8df-12ff-4fe9-89b4-30f1e4e3fb90:1
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: item
filterExpr: ((i_current_price BETWEEN 30 AND 60 and
(i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type:
boolean)
Statistics: Num rows: 462000 Data size: 663862160 Basic
stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ((i_current_price BETWEEN 30 AND 60 and
(i_manufact_id) IN (437, 129, 727, 663)) and i_item_sk is not null) (type:
boolean)
Statistics: Num rows: 115500 Data size: 34185680 Basic
stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: i_item_sk (type: int), i_item_id (type:
string), i_item_desc (type: string), i_current_price (type: float)
outputColumnNames: _col0, _col1, _col2, _col3
Statistics: Num rows: 115500 Data size: 33724832 Basic
stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 115500 Data size: 33724832 Basic
stats: COMPLETE Column stats: COMPLETE
value expressions: _col1 (type: string), _col2 (type:
string), _col3 (type: float)
Execution mode: vectorized
Map 2
Map Operator Tree:
TableScan
alias: date_dim
filterExpr: (d_date BETWEEN '2002-05-30' AND '2002-07-30' and
d_date_sk is not null) (type: boolean)
Statistics: Num rows: 73049 Data size: 81741831 Basic stats:
COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (d_date BETWEEN '2002-05-30' AND '2002-07-30'
and d_date_sk is not null) (type: boolean)
Statistics: Num rows: 36524 Data size: 3579352 Basic stats:
COMPLETE Column stats: COMPLETE
Select Operator
expressions: d_date_sk (type: int)
outputColumnNames: _col0
Statistics: Num rows: 36524 Data size: 146096 Basic
stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 36524 Data size: 146096 Basic
stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: int)
outputColumnNames: _col0
Statistics: Num rows: 36524 Data size: 146096 Basic
stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: int)
mode: hash
outputColumnNames: _col0
Statistics: Num rows: 18262 Data size: 73048 Basic
stats: COMPLETE Column stats: COMPLETE
Dynamic Partitioning Event Operator
Target Input: inventory
Partition key expr: inv_date_sk
Statistics: Num rows: 18262 Data size: 73048 Basic
stats: COMPLETE Column stats: COMPLETE
Target column: inv_date_sk
Target Vertex: Map 7
Execution mode: vectorized
Map 3
Map Operator Tree:
TableScan
alias: store_sales
filterExpr: ss_item_sk is not null (type: boolean)
Statistics: Num rows: 82510879939 Data size: 6873789738208
Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: ss_item_sk is not null (type: boolean)
Statistics: Num rows: 82510879939 Data size: 330043519756
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: ss_item_sk (type: int)
outputColumnNames: _col0
Statistics: Num rows: 82510879939 Data size: 330043519756
Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: int)
sort order: +
Map-reduce partition columns: _col0 (type: int)
Statistics: Num rows: 82510879939 Data size:
330043519756 Basic stats: COMPLETE Column stats: COMPLETE
Execution mode: vectorized
Map 7
Map Operator Tree:
TableScan
alias: inventory
filterExpr: (inv_quantity_on_hand BETWEEN 100 AND 500 and
inv_item_sk is not null) (type: boolean)
Statistics: Num rows: 1,627,857,000 Data size: 19208695084
Basic stats: COMPLETE Column stats: COMPLETE
Filter Operator
predicate: (inv_quantity_on_hand BETWEEN 100 AND 500 and
inv_item_sk is not null) (type: boolean)
Statistics: Num rows: 813,928,500 Data size: 9604347540
Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: inv_item_sk (type: int), inv_date_sk (type:
int)
outputColumnNames: _col0, _col2
Statistics: Num rows: 813,928,500 Data size: 6511428000
Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col2}
1 {_col0} {_col1} {_col2} {_col3}
keys:
0 _col0 (type: int)
1 _col0 (type: int)
outputColumnNames: _col2, _col3, _col4, _col5, _col6
input vertices:
1 Map 1
Statistics: Num rows: 203,482,128 Data size:
59416781376 Basic stats: COMPLETE Column stats: COMPLETE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col3} {_col4} {_col5} {_col6}
1
keys:
0 _col2 (type: int)
1 _col0 (type: int)
outputColumnNames: _col3, _col4, _col5, _col6
input vertices:
1 Map 2
Statistics: Num rows: 227,514,273 Data size:
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col3 (type: int), _col4 (type:
string), _col5 (type: string), _col6 (type: float)
outputColumnNames: _col3, _col4, _col5, _col6
Statistics: Num rows: 227514273 Data size:
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col3 (type: int)
sort order: +
Map-reduce partition columns: _col3 (type: int)
Statistics: Num rows: 227514273 Data size:
66434167716 Basic stats: COMPLETE Column stats: COMPLETE
value expressions: _col4 (type: string), _col5
(type: string), _col6 (type: float)
Execution mode: vectorized
Reducer 4
Reduce Operator Tree:
Merge Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0
1 {VALUE._col3} {VALUE._col4} {VALUE._col5}
outputColumnNames: _col5, _col6, _col7
Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic
stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col5 (type: string), _col6 (type: string),
_col7 (type: float)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5156859392 Data size: 1485175504896
Basic stats: COMPLETE Column stats: COMPLETE
Group By Operator
keys: _col0 (type: string), _col1 (type: string), _col2
(type: float)
mode: hash
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5156859392 Data size: 1485175504896
Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string), _col1 (type:
string), _col2 (type: float)
sort order: +++
Map-reduce partition columns: _col0 (type: string), _col1
(type: string), _col2 (type: float)
Statistics: Num rows: 5156859392 Data size: 1485175504896
Basic stats: COMPLETE Column stats: COMPLETE
Reducer 5
Reduce Operator Tree:
Group By Operator
keys: KEY._col0 (type: string), KEY._col1 (type: string),
KEY._col2 (type: float)
mode: mergepartial
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic
stats: COMPLETE Column stats: COMPLETE
Select Operator
expressions: _col0 (type: string), _col1 (type: string),
_col2 (type: float)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5156859392 Data size: 1485175504896
Basic stats: COMPLETE Column stats: COMPLETE
Reduce Output Operator
key expressions: _col0 (type: string)
sort order: +
Statistics: Num rows: 5156859392 Data size: 1485175504896
Basic stats: COMPLETE Column stats: COMPLETE
TopN Hash Memory Usage: 0.04
value expressions: _col1 (type: string), _col2 (type: float)
Execution mode: vectorized
Reducer 6
Reduce Operator Tree:
Select Operator
expressions: KEY.reducesinkkey0 (type: string), VALUE._col0
(type: string), VALUE._col1 (type: float)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 5156859392 Data size: 1485175504896 Basic
stats: COMPLETE Column stats: COMPLETE
Limit
Number of rows: 100
Statistics: Num rows: 100 Data size: 28800 Basic stats:
COMPLETE Column stats: COMPLETE
File Output Operator
compressed: false
Statistics: Num rows: 100 Data size: 28800 Basic stats:
COMPLETE Column stats: COMPLETE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Execution mode: vectorized
Stage: Stage-0
Fetch Operator
limit: 100
Processor Tree:
ListSink
{code}
Actual rows counts
{code}
VERTICES TOTAL_TASKS FAILED_ATTEMPTS KILLED_TASKS DURATION_SECONDS
CPU_TIME_MILLIS GC_TIME_MILLIS INPUT_RECORDS OUTPUT_RECORDS
Map 1 1 0 0 1.01
1,280 0 462,000 65
Map 2 1 0 0 0.41
400 23 10,000 62
Map 3 2574 0 0 947.79
442,220,640 1,887,714 82,510,879,939 82,510,879,939
Map 7 9 0 0 869.22
42,490 1,215 56,133,127 3,081
Reducer 4 1009 0 0 389.75
69,471,510 1,149,529 82,510,883,020 33
Reducer 5 253 0 0 93.73
938,930 26,150 33 33
Reducer 6 1 0 0 2.08
730 10 33 33
{code}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)