[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval
[ https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-4781:
---
Attachment: HIVE-4781.txt

I got an error when using arc:
{quote}
Parse Exception: Expected a hunk header, like 'Index: /path/to/file.ext' (svn), 'Property changes on: /path/to/file.ext' (svn properties), 'commit 59bcc3ad6775562f845953cf01624225' (git show), 'diff --git' (git diff), or '--- filename' (unified diff).
1
{quote}
It seems to be caused by the data added for the new test query, so I am uploading this patch here.

LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval
--
Key: HIVE-4781
URL: https://issues.apache.org/jira/browse/HIVE-4781
Project: Hive
Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
Attachments: HIVE-4781.txt, wrong_semi_join.txt

Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}
When the number of t2 rows for a single key is larger than hive.join.emit.interval, JoinOperator emits the matching row from t1 more than once, which results in redundant output.

Let's say t1 is
{code}
1
{code}
and t2 is
{code}
1
1
1
1
{code}
When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be
{code}
1
{code}
This problem cannot be reproduced by the regular unit tests: because a GBY operator is inserted before JoinOperator and there is only 1 mapper, the output of the map phase contains only distinct keys.

Please apply the attached patch 'wrong_semi_join.txt' and use
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
{code}
to reproduce the problem. The wrong result can be found in
{code}
hive_root_dir/build/ql/test/logs/clientpositive
{code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
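
The symptom described in the issue can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Hive's JoinOperator code: the function names `semi_join_expected` and `semi_join_with_early_emit` are hypothetical, and the "flush every emit_interval right-side rows" rule is an assumed simplification of how hive.join.emit.interval triggers early emission.

```python
from collections import defaultdict


def semi_join_expected(left, right):
    """Reference LEFT SEMI JOIN: a left row is output at most once
    if its key occurs anywhere in the right table."""
    right_keys = set(right)
    return [k for k in left if k in right_keys]


def semi_join_with_early_emit(left, right, emit_interval):
    """Toy model of the reported behaviour (assumption, not Hive code):
    the join flushes its buffer every `emit_interval` right-side rows,
    and every flush outputs the matching left row again."""
    right_by_key = defaultdict(list)
    for k in right:
        right_by_key[k].append(k)

    out = []
    for k in left:
        buffered = 0
        for _ in right_by_key.get(k, []):
            buffered += 1
            if buffered == emit_interval:
                out.append(k)  # early flush: left row emitted again
                buffered = 0
        if buffered:
            out.append(k)  # final flush for the remaining buffered rows
    return out


t1 = [1]
t2 = [1, 1, 1, 1]
print(semi_join_expected(t1, t2))               # [1]
print(semi_join_with_early_emit(t1, t2, 1))     # [1, 1, 1, 1]
print(semi_join_with_early_emit(t1, t2, 1000))  # [1]
```

With the data from the issue (t1 has one row, t2 has four rows with the same key) and an interval of 1, the toy model reproduces the four output rows reported above, while a large interval hides the problem.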
[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval
[ https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-4781:
---
Status: Patch Available (was: Open)

I ran all unit tests before the commit of HIVE-4496. All unit tests pass.
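
The issue description notes that the regular unit tests do not expose the bug because a GBY operator runs before JoinOperator and there is only 1 mapper. A minimal sketch of that reasoning follows. It assumes, as a simplification, that each mapper deduplicates the keys of its own input split and that the reducer sees the concatenation of all mapper outputs; the function name `map_side_distinct` is hypothetical and nothing here is taken from Hive's code.

```python
def map_side_distinct(splits):
    """Toy model of a map-side GBY (assumption, not Hive code):
    each mapper outputs the distinct keys of its own split,
    and the reducer receives the outputs of all mappers."""
    rows = []
    for split in splits:
        rows.extend(sorted(set(split)))
    return rows


# One mapper: the join sees a single right-side row per key,
# so the emit interval is never exceeded.
print(map_side_distinct([[1, 1, 1, 1]]))    # [1]

# Two mappers: the same key arrives once per mapper,
# so several right-side rows per key can reach the join.
print(map_side_distinct([[1, 1], [1, 1]]))  # [1, 1]
```

Under these assumptions, only a run with more than one mapper can deliver several right-side rows for the same key to the join, which is presumably why the issue asks for TestMinimrCliDriver to reproduce the problem.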
[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval
[ https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-4781:
---
Description:
Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}
When the number of t2 rows for a single key is larger than hive.join.emit.interval, JoinOperator emits the matching row from t1 more than once, which results in redundant output.

Let's say t1 is
{code}
1
{code}
and t2 is
{code}
1
1
1
1
{code}
When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be
{code}
1
{code}
This problem cannot be reproduced by the regular unit tests: because a GBY operator is inserted before JoinOperator and there is only 1 mapper, the output of the map phase contains only distinct keys.

Please apply the attached patch 'wrong_semi_join.txt' and use
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
{code}
to reproduce the problem. The wrong result can be found in
{code}
hive_root_dir/build/ql/test/logs/clientpositive
{code}

was:
Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}
When the number of t2 rows for a single key is larger than hive.join.emit.interval, JoinOperator emits the matching row from t1 more than once, which results in redundant output.

Let's say t1 is
{code}
key
1
{code}
and t2 is
{code}
key
1
1
1
1
{code}
When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be
{code}
1
{code}
This problem cannot be reproduced by the regular unit tests: because a GBY operator is inserted before JoinOperator and there is only 1 mapper, the output of the map phase contains only distinct keys.

Please apply the attached patch 'wrong_semi_join.txt' and use
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
{code}
to reproduce the problem. The wrong result can be found in
{code}
hive_root_dir/build/ql/test/logs/clientpositive
{code}
[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval
[ https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-4781:
---
Summary: LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval (was: LEFT SEMI JOIN generates wrong results when )