[
https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gopal V updated HIVE-7232:
--------------------------
Description:
After HIVE-7121, TPC-H query 5 returns incorrect results.
Thanks to [~navis], this has been tracked down to the auto-parallel settings,
which are initialized for ReduceSinkOperator but not for
VectorReduceSinkOperator. The vectorized version inherits from
ReduceSinkOperator, but it neither calls super.initializeOp() nor sets up the
variable correctly from ReduceSinkDesc.
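The inheritance pitfall described above can be reduced to a minimal sketch. Everything in it is hypothetical: SinkDesc, RowSink, BuggyVectorSink, FixedVectorSink and the toy partitionFor() arithmetic are simplified stand-ins, not the real Hive classes or hash functions. It also assumes, for illustration only, that the flag selects how a key is mapped to a partition, so a sink with an uninitialized flag routes the same key differently from a correctly initialized one.

```java
// Simplified, hypothetical stand-ins; these are not the real Hive classes.
class SinkDesc {
    final boolean autoParallel;

    SinkDesc(boolean autoParallel) {
        this.autoParallel = autoParallel;
    }
}

class RowSink {
    protected boolean autoParallel; // stays false until initializeOp() runs

    void initializeOp(SinkDesc desc) {
        this.autoParallel = desc.autoParallel;
    }

    // Assumed for illustration: the flag selects how a key maps to a partition.
    int partitionFor(int key, int numPartitions) {
        int hash = autoParallel ? key * 31 + 7 : key;
        return Math.floorMod(hash, numPartitions);
    }
}

class BuggyVectorSink extends RowSink {
    @Override
    void initializeOp(SinkDesc desc) {
        // Vector-specific setup only; super.initializeOp(desc) is never called,
        // so autoParallel keeps its default value.
    }
}

class FixedVectorSink extends RowSink {
    @Override
    void initializeOp(SinkDesc desc) {
        super.initializeOp(desc); // pick up the settings from the descriptor
        // Vector-specific setup would follow here.
    }
}

class InitDemo {
    public static void main(String[] args) {
        SinkDesc desc = new SinkDesc(true);
        RowSink row = new RowSink();
        RowSink buggy = new BuggyVectorSink();
        RowSink fixed = new FixedVectorSink();
        row.initializeOp(desc);
        buggy.initializeOp(desc);
        fixed.initializeOp(desc);

        // Same key, same partition count: only the buggy sink disagrees.
        System.out.println("row sink partition:   " + row.partitionFor(442, 4));
        System.out.println("buggy sink partition: " + buggy.partitionFor(442, 4));
        System.out.println("fixed sink partition: " + fixed.partitionFor(442, 4));
    }
}
```

If the two inputs of a join are routed by sinks that disagree like this, rows with equal keys can end up in different partitions, which is one way a join could silently drop matches.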
The query is TPC-H query 5, with extra NULL checks added as a precaution:
{code}
SELECT n_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer,
     orders,
     lineitem,
     supplier,
     nation,
     region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
  AND l_orderkey IS NOT NULL
  AND c_custkey IS NOT NULL
  AND l_suppkey IS NOT NULL
  AND c_nationkey IS NOT NULL
  AND s_nationkey IS NOT NULL
  AND n_regionkey IS NOT NULL
GROUP BY n_name
ORDER BY revenue DESC;
{code}
The reducer that exhibits the issue has the following plan:
{code}
Reducer 3
  Reduce Operator Tree:
    Join Operator
      condition map:
           Inner Join 0 to 1
      condition expressions:
        0 {KEY.reducesinkkey0} {VALUE._col2}
        1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
      outputColumnNames: _col0, _col3, _col10, _col11, _col14
      Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
      Reduce Output Operator
        key expressions: _col10 (type: int)
        sort order: +
        Map-reduce partition columns: _col10 (type: int)
        Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
        value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
{code}
was:
After HIVE-4867 was merged, some queries have exhibited an unexpected skew
towards NULL keys emitted from the ReduceSinkOperator.
Added extra logging to print expr.column() in ExprNodeColumnEvaluator and in
the reduce sink.
{code}
2014-06-14 00:37:19,186 INFO [TezChild]
org.apache.hadoop.hive.ql.exec.ReduceSinkOperator:
numDistributionKeys = 1 {null --> ExprNodeColumnEvaluator(_col10)}
key_row={"reducesinkkey0":442}
{code}
{code}
// Debug-only logging: dump the key evaluators whenever the distribution
// key is suspiciously short.
HiveKey firstKey = toHiveKey(cachedKeys[0], tag, null);
int distKeyLength = firstKey.getDistKeyLength();
if (distKeyLength <= 1) {
  StringBuffer x1 = new StringBuffer();
  x1.append("numDistributionKeys = " + numDistributionKeys + "\n");
  for (int i = 0; i < numDistributionKeys; i++) {
    x1.append(cachedKeys[0][i] + " --> " + keyEval[i] + "\n");
  }
  x1.append("key_row=" + SerDeUtils.getJSONString(row, keyObjectInspector));
  LOG.info("GOPAL: " + x1.toString());
}
{code}
The query is TPC-H query 5, with extra NULL checks added as a precaution:
{code}
SELECT n_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer,
     orders,
     lineitem,
     supplier,
     nation,
     region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
  AND l_orderkey IS NOT NULL
  AND c_custkey IS NOT NULL
  AND l_suppkey IS NOT NULL
  AND c_nationkey IS NOT NULL
  AND s_nationkey IS NOT NULL
  AND n_regionkey IS NOT NULL
GROUP BY n_name
ORDER BY revenue DESC;
{code}
The reducer that exhibits the issue has the following plan:
{code}
Reducer 3
  Reduce Operator Tree:
    Join Operator
      condition map:
           Inner Join 0 to 1
      condition expressions:
        0 {KEY.reducesinkkey0} {VALUE._col2}
        1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
      outputColumnNames: _col0, _col3, _col10, _col11, _col14
      Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
      Reduce Output Operator
        key expressions: _col10 (type: int)
        sort order: +
        Map-reduce partition columns: _col10 (type: int)
        Statistics: Num rows: 183333344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
        value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
{code}
Summary: VectorReduceSink is emitting incorrect JOIN keys (was:
ReduceSink is emitting NULL keys due to failed keyEval)
updated bug report with analysis
> VectorReduceSink is emitting incorrect JOIN keys
> ------------------------------------------------
>
> Key: HIVE-7232
> URL: https://issues.apache.org/jira/browse/HIVE-7232
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.14.0
> Reporter: Gopal V
> Assignee: Gopal V
> Attachments: HIVE-7232-extra-logging.patch, HIVE-7232.1.patch.txt,
> q5.explain.txt, q5.sql
>