[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval

2013-06-27 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-4781:
---

Attachment: HIVE-4781.txt

I got an error when using arc:
{quote}
Parse Exception: Expected a hunk header, like 'Index: /path/to/file.ext' (svn), 
'Property changes on: /path/to/file.ext' (svn properties), 'commit 
59bcc3ad6775562f845953cf01624225' (git show), 'diff --git' (git diff), or '--- 
filename' (unified diff).

   1   
{quote}

Could this be caused by the data added for the new test query?

So I am uploading this patch as an attachment.

 LEFT SEMI JOIN generates wrong results when the number of rows belonging to a 
 single key of the right table exceed hive.join.emit.interval
 --

 Key: HIVE-4781
 URL: https://issues.apache.org/jira/browse/HIVE-4781
 Project: Hive
  Issue Type: Bug
Affects Versions: 0.12.0
Reporter: Yin Huai
Assignee: Yin Huai
 Attachments: HIVE-4781.txt, wrong_semi_join.txt


 Suppose that we have the query shown below:
 {code:sql}
 SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
 {code}
 When the number of rows in t2 that share a single key is larger than 
 hive.join.emit.interval, JoinOperator will emit the matching rows from t1 
 more than once, which results in redundant output.
 Let's say t1 is
 {code}
 1
 {code}
 and t2 is
 {code}
 1
 1
 1
 1
 {code}
 When hive.join.emit.interval=1, the output of the above query will be
 {code}
 1
 1
 1
 1
 {code}
 The correct result should be 
 {code}
 1
 {code}
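 To make the symptom concrete, here is a small Python sketch of the behaviour described above. It is a toy model, not Hive's JoinOperator code: the function names and the buffering logic are assumptions made for illustration, chosen only so that the model reproduces the outputs listed in this description.

```python
def left_semi_join_expected(left_keys, right_keys):
    """Reference semantics: a left row is returned at most once,
    and only if its key occurs somewhere on the right side."""
    right = set(right_keys)
    return [k for k in left_keys if k in right]


def left_semi_join_with_interval_flush(left_keys, right_keys, emit_interval):
    """Toy model of the faulty behaviour (hypothetical, not Hive's code).

    Right-side rows for a key are counted in a buffer. Every time the
    buffer reaches emit_interval rows, the join result is flushed and the
    buffer is reset. Each flush emits the matching left row again, which
    is what produces the redundant output.
    """
    output = []
    for key in left_keys:
        buffered = 0
        for right_key in right_keys:
            if right_key != key:
                continue
            buffered += 1
            if buffered >= emit_interval:
                output.append(key)  # intermediate flush
                buffered = 0
        if buffered > 0:
            output.append(key)  # final flush for the remaining rows
    return output


t1 = [1]
t2 = [1, 1, 1, 1]

print(left_semi_join_expected(t1, t2))                # [1]
print(left_semi_join_with_interval_flush(t1, t2, 1))  # [1, 1, 1, 1]
```

 In this model, an interval at least as large as the number of t2 rows for the key yields the correct single row, which matches the condition stated in the issue summary.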
 This problem cannot be found by the unit tests: because a GBY operator is 
 inserted before JoinOperator and the tests use only one mapper, the output of 
 the map phase contains only distinct keys.
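 The following Python sketch illustrates why a single mapper hides the problem. It assumes that the map-side GBY removes duplicate keys only within one mapper's input, so that duplicates can survive when t2 is split across several mappers. The function names and the two-way split are hypothetical.

```python
def map_side_distinct(split):
    """Model of the map-side GBY: one mapper forwards each key at most once."""
    return list(dict.fromkeys(split))  # keeps first occurrence, preserves order


def rows_reaching_join(splits):
    """Concatenate the output of all mappers, as the join would receive it."""
    rows = []
    for split in splits:
        rows.extend(map_side_distinct(split))
    return rows


# One mapper sees all of t2: only one row per key reaches the join,
# so the number of rows per key never exceeds the emit interval.
print(rows_reaching_join([[1, 1, 1, 1]]))    # [1]

# Two mappers each see part of t2: the key arrives once per mapper,
# so the join can receive more than one row for the same key.
print(rows_reaching_join([[1, 1], [1, 1]]))  # [1, 1]
```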
 To reproduce the problem, please apply the attached patch 
 'wrong_semi_join.txt' and run 
 {code}
 ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
 {code}
 The wrong result can be found in 
 {code}
 hive_root_dir/build/ql/test/logs/clientpositive
 {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval

2013-06-27 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-4781:
---

Status: Patch Available  (was: Open)

I ran all unit tests on the code as it was before the commit of HIVE-4496. All unit tests pass.


[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval

2013-06-26 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-4781:
---

Description: 
Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}

When the number of rows in t2 that share a single key is larger than 
hive.join.emit.interval, JoinOperator will emit the matching rows from t1 
more than once, which results in redundant output.

Let's say t1 is
{code}
1
{code}
and t2 is
{code}
1
1
1
1
{code}

When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be 
{code}
1
{code}

This problem cannot be found by the unit tests: because a GBY operator is 
inserted before JoinOperator and the tests use only one mapper, the output of 
the map phase contains only distinct keys.

To reproduce the problem, please apply the attached patch 
'wrong_semi_join.txt' and run 
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
{code}
The wrong result can be found in 
{code}
hive_root_dir/build/ql/test/logs/clientpositive
{code}

  was:
Suppose that we have the query shown below:
{code:sql}
SELECT key FROM t1 LEFT SEMI JOIN t2 ON (t1.key=t2.key);
{code}

When the number of rows in t2 that share a single key is larger than 
hive.join.emit.interval, JoinOperator will emit the matching rows from t1 
more than once, which results in redundant output.

Let's say t1 is
{code}
key

1
{code}
and t2 is
{code}
key

1
1
1
1
{code}

When hive.join.emit.interval=1, the output of the above query will be
{code}
1
1
1
1
{code}
The correct result should be 
{code}
1
{code}

This problem cannot be found by the unit tests: because a GBY operator is 
inserted before JoinOperator and the tests use only one mapper, the output of 
the map phase contains only distinct keys.

To reproduce the problem, please apply the attached patch 
'wrong_semi_join.txt' and run 
{code}
ant test -Dtestcase=TestMinimrCliDriver -Dqfile=left_semi_join.q -Dtest.silent=false
{code}
The wrong result can be found in 
{code}
hive_root_dir/build/ql/test/logs/clientpositive
{code}



[jira] [Updated] (HIVE-4781) LEFT SEMI JOIN generates wrong results when the number of rows belonging to a single key of the right table exceed hive.join.emit.interval

2013-06-21 Thread Yin Huai (JIRA)


 [ 
https://issues.apache.org/jira/browse/HIVE-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated HIVE-4781:
---

Summary: LEFT SEMI JOIN generates wrong results when the number of rows 
belonging to a single key of the right table exceed hive.join.emit.interval  
(was: LEFT SEMI JOIN generates wrong results when )

