[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-24 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902004#action_12902004
 ] 

Ning Zhang commented on HIVE-741:
-

+1. Will commit if tests pass.

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: join_nulls.q.out, patch-741-1.txt, patch-741-2.txt, 
> patch-741-3.txt, patch-741-4.txt, patch-741-5.txt, patch-741-6.txt, 
> patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-23 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901515#action_12901515
 ] 

Ning Zhang commented on HIVE-741:
-

Amareshwari, aside from Yongqiang comment, join_null.q's result is not 
deterministic -- the SMB joins result in different orders. Can you make it 
deterministic by adding a 'order by' clause at the end of the queries? I'll 
attach my test run results

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, 
> patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-23 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901508#action_12901508
 ] 

He Yongqiang commented on HIVE-741:
---

also about Ning's comments:
>>2) SMBMapJoinOperator.compareKey() is called for each row so it is critical 
>>for performance. In your code the hasNullElement() could be called 4 times in 
>>the worse case. If you cache the result it can be called only twice.
Agree. Not sure how much overhead is there, will try to estimate the overhead 
over production running. That will be great if you can try to cache the null 
check results, so that it can only happen one time for each key. 

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, 
> patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-23 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901503#action_12901503
 ] 

He Yongqiang commented on HIVE-741:
---

+1. The patch looks good to me. 
(Only have one minor comment on the name of "hasNullElements", should we rename 
it since this function is used to determine all keys are null?)



> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, 
> patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-22 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901275#action_12901275
 ] 

Ning Zhang commented on HIVE-741:
-

Looks good except one mintor thing: SerDeUtils.java:369 should return true? 
Amareshwari, can you upload a new patch and I'll run unit tests. 

Yongqiang, can you test this patch on the production SMB join queries?



> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, 
> patch-741-4.txt, patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-20 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900941#action_12900941
 ] 

Ning Zhang commented on HIVE-741:
-

The SMB test case still has a minor issue: the tables was created as 2 buckets 
but there is only 1 file in the tables. This is conflicting to the table 
schema. If a table is defined as bucketd 2, there should be 2 files in the 
partition or table. They SMB join takes the 1st file in T1 join the 1st file in 
T2, and 2nd file in T1 join 2nd file in T2. So the test case should cover this 
use case. 

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, 
> patch-741.txt, smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-19 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900570#action_12900570
 ] 

Ning Zhang commented on HIVE-741:
-

Looks good in general. Some minor comments:
 1) add more test cases for SMB joins. Currently the only test case has only 1 
bucket which does not cover the most common use case. Can you add more test 
cases for more buckets? You can take a look at bucketed join queries included 
in the client positive tests.
 2) SMBMapJoinOperator.compareKey() is called for each row so it is critical 
for performance. In your code the hasNullElement() could be called 4 times in 
the worse case. If you cache the result it can be called only twice. 

Yongqiang, any further comments?

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741.txt, 
> smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-19 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900198#action_12900198
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

All the tests passed with latest patch : patch-741-2.txt

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741-1.txt, patch-741-2.txt, patch-741.txt, 
> smbjoin_nulls.q.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-17 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899753#action_12899753
 ] 

Ning Zhang commented on HIVE-741:
-

@Amareshwari, sorry the syntax was wrong for the 3 table joins. Below is the 
correct syntax and plan.


{code}
explain select * from src a left outer join src b on (a.value=b.value) right 
outer join src c on (b.value=c.value);
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_RIGHTOUTERJOIN (TOK_LEFTOUTERJOIN (TOK_TABREF src 
a) (TOK_TABREF src b) (= (. (TOK_TABLE_OR_COL a) value) (. (TOK_TABLE_OR_COL b) 
value))) (TOK_TABREF src c) (= (. (TOK_TABLE_OR_COL b) value) (. 
(TOK_TABLE_OR_COL c) value (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
Map Reduce
  Alias -> Map Operator Tree:
a 
  TableScan
alias: a
Reduce Output Operator
  key expressions:
expr: value
type: string
  sort order: +
  Map-reduce partition columns:
expr: value
type: string
  tag: 0
  value expressions:
expr: key
type: string
expr: value
type: string
b 
  TableScan
alias: b
Reduce Output Operator
  key expressions:
expr: value
type: string
  sort order: +
  Map-reduce partition columns:
expr: value
type: string
  tag: 1
  value expressions:
expr: key
type: string
expr: value
type: string
c 
  TableScan
alias: c
Reduce Output Operator
  key expressions:
expr: value
type: string
  sort order: +
  Map-reduce partition columns:
expr: value
type: string
  tag: 2
  value expressions:
expr: key
type: string
expr: value
type: string
  Reduce Operator Tree:
Join Operator
  condition map:
   Left Outer Join0 to 1
   Right Outer Join1 to 2
  condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0} {VALUE._col1}
2 {VALUE._col0} {VALUE._col1}
  handleSkewJoin: false
  outputColumnNames: _col0, _col1, _col4, _col5, _col8, _col9
  Select Operator
expressions:
  expr: _col0
  type: string
  expr: _col1
  type: string
  expr: _col4
  type: string
  expr: _col5
  type: string
  expr: _col8
  type: string
  expr: _col9
  type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5
File Output Operator
  compressed: true
  GlobalTableId: 0
  table:
  input format: org.apache.hadoop.mapred.TextInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
Fetch Operator
  limit: -1
{code}

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-17 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

Thanks Ning for your comments.

bq.  select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value;
bq.  select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value;
This is happening because I'm assuming nr.get(0) in JoinOperator is the 
join-key. It seems it not always true that key is the first element in the 
ArrayList. When I modified a the code to the following, above queries are 
giving correct results.
{code}
  StructObjectInspector soi = (StructObjectInspector) 
inputObjInspectors[tag];
  StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY
  .toString());
  Object keyObject = soi.getStructFieldData(row, sf);
  if (SerDeUtils.isNullObject(keyObject, soi)) {
endGroup();
startGroup();
 }
{code}
Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object 
passed is representing a NULL object.

bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join 
myinput1 c ON a.value = b.value and b.value = c.value;
Looking at Stage-1 of "explain" for the above query:
{noformat}
Stage: Stage-1
Map Reduce
  Alias -> Map Operator Tree:
a
  TableScan
alias: a
Reduce Output Operator
  sort order:
  tag: 0
  value expressions:
expr: key
type: int
expr: value
type: int
b
  TableScan
alias: b
Reduce Output Operator
  sort order:
  tag: 1
  value expressions:
expr: key
type: int
expr: value
type: int
  Reduce Operator Tree:
Join Operator
  condition map:
   Left Outer Join0 to 1
  condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0} {VALUE._col1}
  handleSkewJoin: false
  outputColumnNames: _col0, _col1, _col4, _col5
  Filter Operator
predicate:
expr: (_col1 = _col5)
type: boolean
File Output Operator
  compressed: false
  GlobalTableId: 0
  table:
  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
{noformat}
Join happens without join key?  Here, join output is the Cartesian product of a 
and b on which FilterOperator is applied, Am I right? I see the semantics of 
inner/outer join on two tables without join condition is to produce Cartesian 
product. As a side note: "MySql does not allow outer joins without join 
condition". 
If Join is allowed without join condition to produce Cartesian product of the 
two tables, then my patch should be changed to consider if join-key is defined 
for the join or not.  I could reproduce it by simple query "select * FROM 
myinput1 a JOIN myinput1 b". I think the same applies to MapJoin as well. 

bq. Verified that SMBMapJoinOperator already filters nulls properly. 
bq. Can you also add one or few tests for sort merge join? 
It seems my verification was wrong here, I thought if the table is sorted and 
hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge are set to 
true, MapJoin uses SMBMapJoinOperator. But it was using MapJoinOperator it 
self. When I created a table with "sorted by" column, I see it using 
SMBMapJoinOperator. Currently if there are any nulls in the input table, 
SMBJoin fails with NullPointerException:
{noformat}
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60)
at 
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389)
at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438)
at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698)
at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479)
... 17 more
{noformat}
Will look into this. 

bq. For inner, left and right outer joins, a simpler fix wou

[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-16 Thread Namit Jain (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899177#action_12899177
 ] 

Namit Jain commented on HIVE-741:
-

For inner, left and right outer joins, a simpler fix would be to add a filter 
on top.

For example, for 

A join B on A.c1 = B.c2


add a filter before A (A.c1 is not null) and B (B.c2 is not null)

For 

A left outer join B on A.c1 = B.c2, the filter before A is not needed and 
similarly, for 
A right outer join B on A.c1 = B.c2, the filter before B is not needed



Some special handling might still be needed for full outer joins and sort-merge 
joins

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-16 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899044#action_12899044
 ] 

Ning Zhang commented on HIVE-741:
-

@Amareshwari, aside from adding new test cases for sort merge join, this patch 
also has some bugs. 

For example in your test data:

{code}
hive> select * from myinput1
NULL356
484 NULL
10  10

-- incorrect result below
hive> select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value; 
484 NULL484 NULL
10  10  10  10
NULL356 NULLNULL

hive> select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value;
484 NULL484 NULL
10  10  10  10
NULLNULLNULL356

hive> select * FROM myinput1 a left outer JOIN myinput1 b right outer join 
myinput1 c ON a.value = b.value and b.value = c.value;
NULLNULLNULLNULL484 NULL
NULLNULLNULLNULL10  10
NULLNULLNULLNULLNULL356
{code}

Can you take a look? I'm not sure whether ending a group and starting a new 
group for each null-keyed row works for all cases particularly in joins 
involving more than 2 tables and mixture of left and right outer joins.  

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-16 Thread He Yongqiang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898996#action_12898996
 ] 

He Yongqiang commented on HIVE-741:
---

the change looks good to me. Can you also add one or few tests for sort merge 
join?

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
> Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-15 Thread Ted Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898772#action_12898772
 ] 

Ted Xu commented on HIVE-741:
-

bq. a left outer join preserves rows on the left side regardless of whether the 
ON clause evaluates true
That's right, thanks.

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-15 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898756#action_12898756
 ] 

Ning Zhang commented on HIVE-741:
-

@Amareshwar, currently we already distinguish different join types with 
different functions (take a look at CommonJoinOperator.joinObjects()). I look 
forward to seeing your proposal to avoid grouping null-keyed rows.

@Ted, I agree with Amareshwar and John that we cannot avoid rows (or the value 
part of the key-value pairs) with null as a key. However you have a point in 
that if we know the join operator does not involve outer join at all (we 
already have a flag noOuterJoin in JoinDesc), then we could avoid sending rows 
will null keys from the mappers to the reducers. This will save bandwidth as 
well as processing time. Could you open another JIRA and be able to submit a 
patch?


> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-13 Thread John Sichi (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898374#action_12898374
 ] 

John Sichi commented on HIVE-741:
-

@Ted:  as Amareshwari mentioned, a left outer join preserves rows on the left 
side regardless of whether the ON clause evaluates true.  So in that case (and 
similar cases for right/full outer join), we can't filter out the rows with 
null join keys.


> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-13 Thread Ted Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898194#action_12898194
 ] 

Ted Xu commented on HIVE-741:
-

Sorry I'm mistakenly expressed my idea, I mean values with NULL join keys shall 
be filtered out in mappers. 

I think values with NULL join keys shall be filtered out because NULL equals 
nothing, and Hive only support equal join.

Please correct me if I'm wrong. 

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-13 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898141#action_12898141
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

bq. but it will be great if we can filter the NULL values in mappers.
Ted, we should not filter out the NULL values in mappers. Because for outer 
joins, these rows should be cartesian producted with nulls as shown in expected 
output in the [comment | 
https://issues.apache.org/jira/browse/HIVE-741?focusedCommentId=12896789&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12896789]

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-13 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898136#action_12898136
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

Thanks Ning for the details.

To summarize the implementation of join: 
* In reduce-side join, rows with same join keys are grouped together; and  in 
MapSide join, rows with same join keys are added the same entry in the hash 
table. 
* CommonJoinOperator.checkAndGenObject: The rows with same join key are 
cartesian producted with each other(i.e. with rows of different aliases). If 
there are no rows in one table alias, the rows of other table alias are ignored 
(for inner join) or cartesian producted with nulls (outer joins). 

The above implementation works fine except for null join keys ; Since these 
rows are grouped together/hashed to same entry, the current issue exists.
 
bq. I think the fix would be to check the NULL value of the join keys and do 
proper output based on the semantics of different types of joins.
This would need special handling for each type of join (inner, left outer, 
right outer, full outer  an etc.). So, I'm thinking the better solution is not 
group rows with null join keys together. Then the above join algorithm works 
correctly for all types of joins.

Currently they are grouped together because HiveKey.compare compares the bytes 
of the key (in case of reduce-side join) and MapJoinObjectKey.equals returns 
true if both keys are null (in case of map-side join). I'm trying to see if can 
come up with a solution which does not group rows with null join keys together. 
Please correct me if am wrong.

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-13 Thread Ted Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898132#action_12898132
 ] 

Ted Xu commented on HIVE-741:
-

Thanks for Ning and Amareshwari, we are looking forward to see the bug fixed. I 
think it's Okay to solve it by modifying the *JoinOperators, but it will be 
great if we can filter the NULL values in mappers, say, in ReduceSinkOperator, 
provided if we can know which part of the reduce sink key is from join (other 
than from group by, distinct, etc,.).  

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-10 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896974#action_12896974
 ] 

Ning Zhang commented on HIVE-741:
-

The joins are implemented in the JoinOperator and CommonJoinOperators for 
regular reduce-side joins. The map-side joins are implemented in the 
MapJoinOperator. 

In the reduce side joins, the join keys are treated as distribution keys from 
the mappers to the reducers so that each group (marked by beginGroup() and 
endGroup()) will consists of rows with the same join keys. The reduce-side 
joins will cache all rows within a group except the last one (aka streaming 
table), which is scanned and cartesian producted with the cached rows of the 
other tables. I think the fix would be to check the NULL value of the join keys 
and do proper output based on the semantics of different types of joins. 

For the map-side join, it's basically a hash join where the small table is read 
in entirety in a hash table and probed while scanning the streaming table. 

There are other types of joins (bucketed map-side join, sort merge join etc.), 
but they all rely on the 3 classes mentioned above. 

Let me know if you have further questions for you to get started. 

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-10 Thread Ning Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896793#action_12896793
 ] 

Ning Zhang commented on HIVE-741:
-

I'm not actively working on it. Please feel free to take it.

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-10 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896792#action_12896792
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

By adding logs to ExecReducer, I see that the input to reduce is the following:
{noformat}
{"key":{"joinkey0":null},"value":{"_col0":null,"_col1":35},"alias":0}
{"key":{"joinkey0":null},"value":{"_col0":12,"_col1":null},"alias":1}
{"key":{"joinkey0":10},"value":{"_col0":10,"_col1":1000},"alias":0}
{"key":{"joinkey0":10},"value":{"_col0":10,"_col1":100},"alias":0}
{"key":{"joinkey0":12},"value":{"_col0":12,"_col1":null},"alias":0}
{"key":{"joinkey0":35},"value":{"_col0":null,"_col1":35},"alias":1}
{"key":{"joinkey0":100},"value":{"_col0":100,"_col1":100},"alias":0}
{"key":{"joinkey0":100},"value":{"_col0":10,"_col1":100},"alias":1}
{"key":{"joinkey0":100},"value":{"_col0":100,"_col1":100},"alias":1}
{"key":{"joinkey0":1000},"value":{"_col0":10,"_col1":1000},"alias":1}
{noformat}

And joinkey with null values are processed under same group, I think they 
should be processed in different groups, because comparison between nulls is 
not defined. 

> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (HIVE-741) NULL is not handled correctly in join

2010-08-10 Thread Amareshwari Sriramadasu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896789#action_12896789
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
--

bq. When 'null' is replaced by '' it works.
I see the same result even if 'null' is replaced with ''.

To reproduce the above, I created a table input1 with
{code}
hive> create table input1 (key int, value int);
{code}

Loaded the following input using Load data command
{noformat}
in.txt
^A35
12^A
10^A1000
10^A100
100^A100
{noformat}

I see the following output for join queries executed.
{noformat}

hive> select * from input1 a join input1 b on a.key=b.value;

Output:
NULL35  12  NULL
100 100 10  100
100 100 100 100

Expected Output:
100 100 10  100
100 100 100 100

hive> select * from input1 a left outer join input1 b on a.key=b.value;

OUTPUT:
NULL35  12  NULL
10  1000NULLNULL
10  100 NULLNULL
12  NULLNULLNULL
100 100 10  100
100 100 100 100

Expected Output:
NULL35  NULLNULL
10  1000NULLNULL
10  100 NULLNULL
12  NULLNULLNULL
100 100 10  100
100 100 100 100

hive> select * from input1 a right outer join input1 b on a.key=b.value;

OUTPUT:
NULL35  12  NULL
NULLNULLNULL35
100 100 10  100
100 100 100 100
NULLNULL10  1000

ExpectedOutput:
NULLNULLNULL35
NULLNULL12  NULL
100 100 10  100
100 100 100 100
NULLNULL10  1000

{noformat}

Expected output is obtained from mysql db for a similar query.

Ning, if you are not working on the fix for this, I would like to contribute. 
Would need your help understanding join code also, as I'm a new to hive. 


> NULL is not handled correctly in join
> -
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
>  Issue Type: Bug
>Reporter: Ning Zhang
>Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> KeyValue
> --   
> NULL 325
> 18  NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL32518   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.