[
https://issues.apache.org/jira/browse/HIVE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266681#comment-14266681
]
Chao commented on HIVE-9112:
----------------------------
Hi [~tedxu], it looks like this is related to Constant Propagation. Here is the
(partial) plan with this optimization enabled:
{noformat}
...
78 Stage: Stage-3
79 Map Reduce
80 Map Operator Tree:
81 TableScan
82 Reduce Output Operator
83 key expressions: _col1 (type: int), 1 (type: int)
84 sort order: ++
85 Map-reduce partition columns: _col1 (type: int)
86 Statistics: Num rows: 27 Data size: 3298 Basic stats: COMPLETE Column stats: NONE
87 value expressions: _col0 (type: int), _col3 (type: int)
88 TableScan
89 alias: lineitem
90 Statistics: Num rows: 100 Data size: 11999 Basic stats: COMPLETE Column stats: NONE
91 Filter Operator
92 predicate: ((((l_shipmode = 'AIR') and l_orderkey is not null) and l_linenumber is not null) and (l_linenumber = 1)) (type: boolean)
93 Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
94 Select Operator
95 expressions: l_orderkey (type: int), 1 (type: int)
96 outputColumnNames: _col0, _col1
97 Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
98 Group By Operator
99 keys: _col0 (type: int), _col1 (type: int)
100 mode: hash
101 outputColumnNames: _col0, _col1
102 Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
103 Reduce Output Operator
104 key expressions: _col0 (type: int), _col1 (type: int)
105 sort order: ++
106 Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
107 Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
108 Reduce Operator Tree:
109 Join Operator
110 condition map:
111 Left Semi Join 0 to 1
112 keys:
113 0 _col1 (type: int), _col4 (type: int)
114 1 _col0 (type: int), _col1 (type: int)
115 outputColumnNames: _col0, _col3
116 Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
117 Select Operator
118 expressions: _col0 (type: int), _col3 (type: int)
119 outputColumnNames: _col0, _col1
120 Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
121 File Output Operator
122 compressed: false
123 Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
124 table:
125 input format: org.apache.hadoop.mapred.TextInputFormat
126 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
127 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
...
{noformat}
And here is the diff for this part (the left side, marked {{<}}, is the plan without the optimization):
{noformat}
83c83
< key expressions: _col1 (type: int), _col4 (type: int)
---
> key expressions: _col1 (type: int), 1 (type: int)
85c85
< Map-reduce partition columns: _col1 (type: int), _col4 (type: int)
---
> Map-reduce partition columns: _col1 (type: int)
95c95
< expressions: l_orderkey (type: int), l_linenumber (type: int)
---
> expressions: l_orderkey (type: int), 1 (type: int)
{noformat}
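Notice that on line 85, the MR partition column {{_col4}} has been optimized
away, while the other join input (line 106) still partitions on two columns.
The two sides of the join are therefore partitioned inconsistently.
Later on, rows from the two inputs that share the same join key can be hashed
to different reducers, which produces wrong results.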
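To illustrate the effect, here is a rough Python sketch, not Hive code. It assumes the reducer is chosen by combining integer key values as {{h = 31 * h + key}} and taking the result modulo the number of reducers, which is a simplification of what Hive does. The key values (100 and 1) and all function names are made up for the example.

```python
def combined_hash(keys):
    """Combine integer keys as h = 31 * h + key (assumed, simplified scheme)."""
    h = 0
    for k in keys:
        h = (31 * h + k) & 0xFFFFFFFF  # keep the value within 32 bits
    return h


def reducer_for(keys, num_reducers):
    """Pick a reducer from the combined hash of the partition columns."""
    return (combined_hash(keys) & 0x7FFFFFFF) % num_reducers


# Hypothetical matching row: same join key (order key, line number) on both sides.
order_key, line_number = 100, 1


def assignments(num_reducers):
    # Left input after the optimization: partitioned on one column only.
    left = reducer_for([order_key], num_reducers)
    # Right input: still partitioned on both columns.
    right = reducer_for([order_key, line_number], num_reducers)
    return left, right


print(assignments(1))  # both sides land on the only reducer
print(assignments(3))  # the two sides can land on different reducers
```

With a single reducer the mismatch is hidden because every row goes to the same place, which would explain why the tests only fail once {{mapred.reduce.tasks}} is raised.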
I saw that [~navis] has a
[comment|https://issues.apache.org/jira/browse/HIVE-7232?focusedCommentId=14032106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032106]
about a similar issue; maybe it's related?
I'm not an expert in Constant Propagation. Could you take a look at this
issue? Thanks.
> Query may generate different results depending on the number of reducers
> ------------------------------------------------------------------------
>
> Key: HIVE-9112
> URL: https://issues.apache.org/jira/browse/HIVE-9112
> Project: Hive
> Issue Type: Bug
> Reporter: Chao
> Assignee: Chao
>
> Some queries may generate different results depending on the number of
> reducers, for example, tests like ppd_multi_insert.q, join_nullsafe.q,
> subquery_in.q, etc.
> Take subquery_in.q as an example: if we add
> {noformat}
> set mapred.reduce.tasks=3;
> {noformat}
> to this test file, the result will be different (and wrong):
> {noformat}
> @@ -903,5 +903,3 @@ where li.l_linenumber = 1 and
> POSTHOOK: type: QUERY
> POSTHOOK: Input: default@lineitem
> #### A masked pattern was here ####
> -108570 8571
> -4297 1798
> {noformat}