[ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898136#action_12898136
 ] 

Amareshwari Sriramadasu commented on HIVE-741:
----------------------------------------------

Thanks Ning for the details.

To summarize the implementation of join: 
* In reduce-side join, rows with same join keys are grouped together; and  in 
MapSide join, rows with same join keys are added the same entry in the hash 
table. 
* CommonJoinOperator.checkAndGenObject: The rows with same join key are 
cartesian producted with each other(i.e. with rows of different aliases). If 
there are no rows in one table alias, the rows of other table alias are ignored 
(for inner join) or cartesian producted with nulls (outer joins). 

The above implementation works fine except for null join keys ; Since these 
rows are grouped together/hashed to same entry, the current issue exists.
 
bq. I think the fix would be to check the NULL value of the join keys and do 
proper output based on the semantics of different types of joins.
This would need special handling for each type of join (inner, left outer, 
right outer, full outer  an etc.). So, I'm thinking the better solution is not 
group rows with null join keys together. Then the above join algorithm works 
correctly for all types of joins.

Currently they are grouped together because HiveKey.compare compares the bytes 
of the key (in case of reduce-side join) and MapJoinObjectKey.equals returns 
true if both keys are null (in case of map-side join). I'm trying to see if can 
come up with a solution which does not group rows with null join keys together. 
Please correct me if am wrong.

> NULL is not handled correctly in join
> -------------------------------------
>
>                 Key: HIVE-741
>                 URL: https://issues.apache.org/jira/browse/HIVE-741
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Amareshwari Sriramadasu
>
> With the following data in table input4_cb:
> Key        Value
> ------       --------
> NULL     325
> 18          NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL    325    18   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to