[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034859#comment-14034859 ]

Navis commented on HIVE-7232:
-----------------------------

[~gopalv] Could you try the query on a build from before HIVE-7021? It seems to be caused by that.

ReduceSink is emitting NULL keys due to failed keyEval
------------------------------------------------------

Key: HIVE-7232
URL: https://issues.apache.org/jira/browse/HIVE-7232
Project: Hive
Issue Type: Bug
Components: Query Processor
Affects Versions: 0.14.0
Reporter: Gopal V
Assignee: Navis
Attachments: HIVE-7232-extra-logging.patch, q5.explain.txt, q5.sql

After HIVE-4867 was merged in, some queries have exhibited a very weird skew towards NULL keys emitted from the ReduceSinkOperator.

Added extra logging to print expr.column() in ExprNodeColumnEvaluator in reduce sink.

{code}
2014-06-14 00:37:19,186 INFO [TezChild] org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: numDistributionKeys = 1
{null -- ExprNodeColumnEvaluator(_col10)}
key_row={reducesinkkey0:442}
{code}

{code}
HiveKey firstKey = toHiveKey(cachedKeys[0], tag, null);
int distKeyLength = firstKey.getDistKeyLength();
// Dump the evaluated keys whenever the serialized distribution key is this short.
if (distKeyLength <= 1) {
  StringBuffer x1 = new StringBuffer();
  x1.append("numDistributionKeys = " + numDistributionKeys + "\n");
  for (int i = 0; i < numDistributionKeys; i++) {
    x1.append(cachedKeys[0][i] + " -- " + keyEval[i] + "\n");
  }
  x1.append("key_row=" + SerDeUtils.getJSONString(row, keyObjectInspector));
  LOG.info("GOPAL: " + x1.toString());
}
{code}

The query is tpc-h query5, with extra NULL checks just to be sure.

{code}
SELECT n_name,
       sum(l_extendedprice * (1 - l_discount)) AS revenue
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey
  AND r_name = 'ASIA'
  AND o_orderdate >= '1994-01-01'
  AND o_orderdate < '1995-01-01'
  and l_orderkey is not null
  and c_custkey is not null
  and l_suppkey is not null
  and c_nationkey is not null
  and s_nationkey is not null
  and n_regionkey is not null
GROUP BY n_name
ORDER BY revenue DESC;
{code}

The reducer which has the issue has the following plan

{code}
Reducer 3
  Reduce Operator Tree:
    Join Operator
      condition map:
           Inner Join 0 to 1
      condition expressions:
        0 {KEY.reducesinkkey0} {VALUE._col2}
        1 {VALUE._col0} {KEY.reducesinkkey0} {VALUE._col3}
      outputColumnNames: _col0, _col3, _col10, _col11, _col14
      Statistics: Num rows: 18344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
      Reduce Output Operator
        key expressions: _col10 (type: int)
        sort order: +
        Map-reduce partition columns: _col10 (type: int)
        Statistics: Num rows: 18344 Data size: 95229140992 Basic stats: COMPLETE Column stats: NONE
        value expressions: _col0 (type: int), _col3 (type: int), _col11 (type: int), _col14 (type: string)
{code}

--
This message was sent by Atlassian JIRA
(v6.2#6252)
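The diagnostic idea in the logging patch above can be tried outside Hive with a small stand-alone sketch. Everything here is an assumption for illustration: the class NullKeyDiagnostics and its methods are hypothetical, and plain Object[] and String[] arrays stand in for Hive's cachedKeys and keyEval. It only shows the shape of the check (detect a NULL key, then print each key next to the expression that produced it), not the ReduceSinkOperator code itself.

```java
class NullKeyDiagnostics {

    /** Returns true when any evaluated key is null (hypothetical helper). */
    static boolean hasNullKey(Object[] keys) {
        for (Object key : keys) {
            if (key == null) {
                return true;
            }
        }
        return false;
    }

    /** Lists each evaluated key next to the expression that produced it. */
    static String describeKeys(Object[] keys, String[] evaluators) {
        StringBuilder sb = new StringBuilder();
        sb.append("numDistributionKeys = ").append(keys.length).append("\n");
        for (int i = 0; i < keys.length; i++) {
            sb.append(keys[i]).append(" -- ").append(evaluators[i]).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in values modelled on the log line quoted in the report.
        Object[] keys = {null};
        String[] evaluators = {"ExprNodeColumnEvaluator(_col10)"};
        if (hasNullKey(keys)) {
            System.out.print(describeKeys(keys, evaluators));
        }
    }
}
```

Logging only when a NULL key is present keeps the output small enough to read on a large run, which is presumably why the original patch guards the log call as well.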
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034869#comment-14034869 ]

Gopal V commented on HIVE-7232:
-------------------------------

[~navis]: I tested this with git commit id 50f517a3930 - it has been broken from before HIVE-7121.

{code}
$ hive --version
Hive 0.14.0-SNAPSHOT
Subversion git://cn041.l42scl.hortonworks.com/grid/5/dev/gopalv/tez-autobuild/hive -r 50f517a3930da0a987e6f6e908a91a7705bf9c60
Compiled by gopal on Tue Jun 17 23:16:51 PDT 2014
From source with checksum 8f75b133edadf23e29096f5e9b5d0f99
{code}

Sorry about the reducesinkkey0 confusion. I have assigned this to myself for more investigation. I will edit the bug tomorrow to describe the actual issue, which is incorrect results.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034322#comment-14034322 ]

Gopal V commented on HIVE-7232:
-------------------------------

[~navis]: I found out that there are indeed o_orderkey entries which show up as 214800 in text, which lie outside the range of the TPC-H Identifier column spec.

I will reload the data using bigint for o_orderkey soon.

But I still want to locate and confirm the different results between MR and Tez here.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034675#comment-14034675 ]

Navis commented on HIVE-7232:
-----------------------------

Looks like something is wrong in the broadcast join. I'll look into this.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14034734#comment-14034734 ]

Navis commented on HIVE-7232:
-----------------------------

I've reproduced the problem. It occurs with the combination of mapjoin and vectorization.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032176#comment-14032176 ]

Navis commented on HIVE-7232:
-----------------------------

[~gopalv] Could you attach the full explain result and the query? The explain result on my notebook differs from yours (for me, it's Reducer 6, not Reducer 3). It seems hard to reproduce with small data (I used factor 1).
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032250#comment-14032250 ]

Navis commented on HIVE-7232:
-----------------------------

Could you print out the whole row instead of just the keys?

{noformat}
x1.append("row=" + SerDeUtils.getJSONString(row, rowInspector));
{noformat}

Thanks,
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033253#comment-14033253 ]

Navis commented on HIVE-7232:
-----------------------------

_col10 is null. It is VALUE._col0 of MAP2, which in turn is o_orderkey:

{noformat}
Reduce Output Operator
  key expressions: o_custkey (type: int)
  sort order: +
  Map-reduce partition columns: o_custkey (type: int)
  Statistics: Num rows: 1 Data size: 86571942280 Basic stats: COMPLETE Column stats: NONE
  value expressions: o_orderkey (type: int), o_orderdate (type: string)
{noformat}

Does the orders table contain nulls?
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14033435#comment-14033435 ]

Gopal V commented on HIVE-7232:
-------------------------------

[~navis]: TPC-H data shouldn't have any NULLs in the join keys. I will re-run the scans tomorrow.

I can see one case where the schema from HIVE-600 might be completely broken. The Integer requirement in TPC-H requires only -2,147,483,646 to 2,147,483,647.

Though rethinking this a bit, I think HIVE-600's schema has a bug: it assumed O_ORDERKEY would be an int (it might not fit in 32 bits anymore at 1Tb scale).

I will verify tomorrow that we're not overflowing that integer limit at a higher scale and producing nulls. I can confirm that.

But that aside, I am more concerned about the difference in output between Tez and MR. In MR, no stage with a reduce sink will have a key row fed by a reduce input.

I will debug this more tomorrow to narrow down the query to a pair of shuffle-joins and compare output between the MR and Tez plans.
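The overflow concern above can be illustrated with a minimal Java sketch. It assumes, purely for illustration, a reader that yields NULL when a text field cannot be parsed as the declared column type; whether Hive's text SerDe behaves this way for o_orderkey is exactly what the comment sets out to verify. The class IntKeyOverflow and both helper methods are hypothetical names, not Hive APIs.

```java
class IntKeyOverflow {

    /** Returns the value as a 32-bit int, or null when the text does not fit (hypothetical helper). */
    static Integer parseIntOrNull(String text) {
        try {
            return Integer.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null; // out of range for int, or not a number
        }
    }

    /** Returns the value as a 64-bit long, or null when the text is not a valid long (hypothetical helper). */
    static Long parseLongOrNull(String text) {
        try {
            return Long.valueOf(text.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String largestInt = "2147483647";   // Integer.MAX_VALUE
        String oneTooLarge = "2147483648";  // Integer.MAX_VALUE + 1

        System.out.println(parseIntOrNull(largestInt));   // fits in an int
        System.out.println(parseIntOrNull(oneTooLarge));  // does not fit: null
        System.out.println(parseLongOrNull(oneTooLarge)); // fits in a long
    }
}
```

Under that assumption, a key one past the int limit silently becomes NULL in an int column but survives in a bigint column, which would match the plan to reload o_orderkey as bigint.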
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032033#comment-14032033 ]

Gopal V commented on HIVE-7232:
-------------------------------

[~ashutoshc]: Incorrect results as well. I ran the same query with Tez and MR and got different results.

MR doesn't hit the same scenario because of the empty Map task, which doesn't have any input columns named reducesinkkey0.

Tez seems to hit a corner case where there are 2 shuffle joins one after the other - there is an input col named KEY.reducesinkkey0 and an output col named reducesinkkey0, which have no relation to each other.

{code}
$ diff -y -W 72 results/q5.tez.txt results/q5.mr.txt
CHINA 985314.0848        | VIETNAM 1.897236998313891E10
INDIA 819113.441801      | CHINA 1.894405687452681E10
VIETNAM 637407.2255      | INDONESIA 1.89306456994551
JAPAN 523754.9791        | JAPAN 1.892184676125508E10
INDONESIA 517900.1924    | INDIA 1.886882412417209E10
{code}
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032088#comment-14032088 ]

Ashutosh Chauhan commented on HIVE-7232:

It seems this can also be triggered on the MR path. I think the latest patch on HIVE-5771 is failing tests such as subquery_in.q because they are hitting this issue.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032106#comment-14032106 ]

Navis commented on HIVE-7232:

The failure of subquery_in.q in HIVE-5771 does not seem to be caused by HIVE-4867 directly, but it is strongly related to it, because HIVE-4867 (intentionally) broke an internal assumption about the keys/values of RS. With the constant propagation optimizer, subquery_in.q produces different keys for each alias of the join, which does not seem valid.

{code}
-- sq_1
Reduce Output Operator
  key expressions: _col1 (type: int)
  sort order: ++
  Map-reduce partition columns: _col1 (type: int)
{code}

and

{code}
-- others
Reduce Output Operator
  key expressions: _col0 (type: int), _col1 (type: int)
  sort order: ++
  Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
{code}
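The mismatch Navis points out for subquery_in.q (one key expression for sq_1, two for the other aliases) can be sketched as a simple arity check. This is only an illustration: JoinKeyAritySketch and sameKeyArity are hypothetical names, not part of Hive, and the map of key expressions per alias is typed in by hand from the two plan fragments quoted in that comment.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a hypothetical helper, not a Hive class. */
class JoinKeyAritySketch {

    /** Returns true when every join alias lists the same number of key expressions. */
    static boolean sameKeyArity(Map<String, List<String>> keysByAlias) {
        return keysByAlias.values().stream()
                .map(List::size)
                .distinct()
                .count() <= 1;
    }

    public static void main(String[] args) {
        // Key expressions copied from the two Reduce Output Operator fragments.
        Map<String, List<String>> keysByAlias = new LinkedHashMap<>();
        keysByAlias.put("sq_1", Arrays.asList("_col1"));
        keysByAlias.put("others", Arrays.asList("_col0", "_col1"));

        System.out.println(sameKeyArity(keysByAlias)); // prints "false"
    }
}
```

A check of this kind only compares how many keys each alias has. It says nothing about whether the key types or the sort order agree.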
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032108#comment-14032108 ]

Navis commented on HIVE-7232:

For this problem, I cannot understand how the RS, which is a child of the JOIN, can receive a ROW of this format:

{noformat}
{reducesinkkey0:442}
{noformat}

As I read it, the join should emit a ROW and rowOI labeled with the output columns, like below:

{noformat}
_col0  {KEY.reducesinkkey0}
_col3  {VALUE._col2}
_col10 {VALUE._col0}
_col11 {KEY.reducesinkkey0}
_col14 {VALUE._col3}
{noformat}

I don't have a hadoop-2 environment, so this is hard to verify and might take some time.
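Why a row labeled {reducesinkkey0:442} would lead to a NULL key can be sketched with a toy model. It assumes that a column evaluator resolves its column by field name and yields null when the row has no such field. That assumption is made for illustration only and is not taken from Hive's source. ColumnLookupSketch, evaluate and rowOf are hypothetical names, not Hive APIs.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model only: not Hive's ExprNodeColumnEvaluator or ObjectInspector API. */
class ColumnLookupSketch {

    /** Hypothetical stand-in for a column evaluator: looks a field up by name, null if absent. */
    static Object evaluate(Map<String, Object> row, String column) {
        return row.get(column);
    }

    /** Builds a single-field row; the label plays the role of the field name in the row's inspector. */
    static Map<String, Object> rowOf(String label, Object value) {
        Map<String, Object> row = new LinkedHashMap<>();
        row.put(label, value);
        return row;
    }

    public static void main(String[] args) {
        // Row labeled the way the log line shows it arriving at the reduce sink.
        Map<String, Object> keyLabeled = rowOf("reducesinkkey0", 442);
        // Row labeled with a join output column, as the plan's outputColumnNames suggest.
        Map<String, Object> joinLabeled = rowOf("_col10", 442);

        System.out.println(evaluate(keyLabeled, "_col10"));  // prints "null"
        System.out.println(evaluate(joinLabeled, "_col10")); // prints "442"
    }
}
```

Under that assumption, the value 442 is present in the row but unreachable under the name _col10, which would match the "null -- ExprNodeColumnEvaluator(_col10)" entry in the log.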
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032123#comment-14032123 ]

Gopal V commented on HIVE-7232:

[~navis]: I can run tests for you if you have a patch file with log lines. I can reproduce this issue consistently on all recent runs of this query.
[jira] [Commented] (HIVE-7232) ReduceSink is emitting NULL keys due to failed keyEval
[ https://issues.apache.org/jira/browse/HIVE-7232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031791#comment-14031791 ]

Ashutosh Chauhan commented on HIVE-7232:

[~gopalv] Is this producing wrong results (because a NULL key was emitted incorrectly), or is it lowering performance (because it caused a skew towards NULL)?