[ https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218898#comment-14218898 ]
Szehon Ho commented on HIVE-8701:
---------------------------------
Hi Suhas, sure. The plan I see looks like this, for a modified version of
auto_join2 that forces the mapjoins to be in the same work:
{noformat}
STAGE DEPENDENCIES:
Stage-3 is a root stage
Stage-1 depends on stages: Stage-3
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-3
Spark
#### A masked pattern was here ####
Vertices:
Map 1
Map Operator Tree:
TableScan
alias: src2
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
Spark HashTable Sink Operator
condition expressions:
0 {key}
1
keys:
0 key (type: string)
1 key (type: string)
Local Work:
Map Reduce Local Work
Map 3
Map Operator Tree:
TableScan
alias: smalltable
Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
Filter Operator
predicate: UDFToDouble(key) is not null (type: boolean)
Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
Spark HashTable Sink Operator
condition expressions:
0 {_col0} {_col5}
1 {key}
keys:
0 (_col0 + _col5) (type: double)
1 UDFToDouble(key) (type: double)
Local Work:
Map Reduce Local Work
Stage: Stage-1
Spark
#### A masked pattern was here ####
Vertices:
Map 2
Map Operator Tree:
TableScan
alias: src1
Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: key is not null (type: boolean)
Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {key}
1 {key}
keys:
0 key (type: string)
1 key (type: string)
outputColumnNames: _col0, _col5
input vertices:
1 Map 1
Statistics: Num rows: 275 Data size: 2921 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: (_col0 + _col5) is not null (type: boolean)
Statistics: Num rows: 138 Data size: 1465 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {_col0} {_col5}
1 {key}
keys:
0 (_col0 + _col5) (type: double)
1 UDFToDouble(key) (type: double)
outputColumnNames: _col0, _col5, _col10
input vertices:
1 Map 3
Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: string), _col5 (type: string), _col10 (type: string)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
{noformat}
The issue is that there are two mapjoins in the same work, which is actually
good most of the time, but we should make sure we don't overwhelm the executor
memory in that case. The check should be straightforward in theory: when
calculating the table size, also include the size of any parent mapjoin that is
directly connected (i.e., with no RS or HTS in between).
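As a rough illustration of that check, here is a minimal sketch in Java. It
does not use Hive's real operator classes: Op, Kind, connectedMapJoinBytes and
fitsInMemory are all hypothetical names, and the byte sizes are only borrowed
from the Statistics lines of the plan above for illustration. It also assumes
a tree-shaped plan, so no operator is reached twice while walking up.

```java
import java.util.ArrayList;
import java.util.List;

public class MapJoinMemoryCheck {

    /** Hypothetical operator kinds; these are not Hive's real classes. */
    enum Kind { TABLE_SCAN, FILTER, MAP_JOIN, REDUCE_SINK, HASHTABLE_SINK }

    /** Minimal stand-in for one operator in a plan tree. */
    static final class Op {
        final Kind kind;
        final long smallTableBytes; // in-memory hash table size; 0 unless MAP_JOIN
        final List<Op> parents = new ArrayList<>();

        Op(Kind kind, long smallTableBytes, Op... parents) {
            this.kind = kind;
            this.smallTableBytes = smallTableBytes;
            for (Op p : parents) {
                this.parents.add(p);
            }
        }
    }

    /**
     * Sums the small-table size of this operator (if it is a map join) and of
     * every ancestor map join reachable without crossing an RS or HTS.
     */
    static long connectedMapJoinBytes(Op op) {
        if (op.kind == Kind.REDUCE_SINK || op.kind == Kind.HASHTABLE_SINK) {
            return 0L; // boundary: anything above belongs to another work
        }
        long total = (op.kind == Kind.MAP_JOIN) ? op.smallTableBytes : 0L;
        for (Op parent : op.parents) {
            total += connectedMapJoinBytes(parent);
        }
        return total;
    }

    /** True if all directly connected map joins together fit the threshold. */
    static boolean fitsInMemory(Op mapJoin, long thresholdBytes) {
        return connectedMapJoinBytes(mapJoin) <= thresholdBytes;
    }

    public static void main(String[] args) {
        // Same shape as Map 2 above: TS -> FIL -> MAPJOIN -> FIL -> MAPJOIN
        Op scan = new Op(Kind.TABLE_SCAN, 0);
        Op filter1 = new Op(Kind.FILTER, 0, scan);
        Op join1 = new Op(Kind.MAP_JOIN, 2656, filter1);
        Op filter2 = new Op(Kind.FILTER, 0, join1);
        Op join2 = new Op(Kind.MAP_JOIN, 30, filter2);

        System.out.println(connectedMapJoinBytes(join2)); // 2686
        System.out.println(fitsInMemory(join2, 2000));    // false
    }
}
```

With a threshold of 2000 bytes, the second mapjoin alone (30 bytes) would pass,
but the combined size of both connected mapjoins (2686 bytes) does not, which
is the case the check is meant to catch.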
> Combine nested map joins into the parent map join if possible [Spark Branch]
> ----------------------------------------------------------------------------
>
> Key: HIVE-8701
> URL: https://issues.apache.org/jira/browse/HIVE-8701
> Project: Hive
> Issue Type: Sub-task
> Components: Spark
> Reporter: Xuefu Zhang
> Assignee: Szehon Ho
>
> With the work in HIVE-8616 enabled, the generated plan shows that the nested
> map join operator isn't merged into its parent when possible. This is
> demonstrated in auto_join2.q. The MR plan shows that this optimization is in
> place. We should do the same for Spark.
> {code}
> STAGE PLANS:
> Stage: Stage-1
> Spark
> Edges:
> Map 2 <- Map 3 (NONE, 0)
> Map 3 <- Map 1 (NONE, 0)
> DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: src2
> Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: key is not null (type: boolean)
> Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: key (type: string)
> sort order: +
> Map-reduce partition columns: key (type: string)
> Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
> Map 2
> Map Operator Tree:
> TableScan
> alias: src3
> Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: UDFToDouble(key) is not null (type: boolean)
> Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {_col0}
> 1 {value}
> keys:
> 0 (_col0 + _col5) (type: double)
> 1 UDFToDouble(key) (type: double)
> outputColumnNames: _col0, _col11
> input vertices:
> 0 Map 3
> Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
> Select Operator
> expressions: _col0 (type: string), _col11 (type: string)
> outputColumnNames: _col0, _col1
> Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
> File Output Operator
> compressed: false
> Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Map 3
> Map Operator Tree:
> TableScan
> alias: src1
> Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: key is not null (type: boolean)
> Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> condition expressions:
> 0 {key}
> 1 {key}
> keys:
> 0 key (type: string)
> 1 key (type: string)
> outputColumnNames: _col0, _col5
> input vertices:
> 1 Map 1
> Statistics: Num rows: 31 Data size: 3196 Basic stats: COMPLETE Column stats: NONE
> Filter Operator
> predicate: (_col0 + _col5) is not null (type: boolean)
> Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
> Reduce Output Operator
> key expressions: (_col0 + _col5) (type: double)
> sort order: +
> Map-reduce partition columns: (_col0 + _col5) (type: double)
> Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
> value expressions: _col0 (type: string)
> {code}