[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896974#action_12896974 ]
Ning Zhang commented on HIVE-741:
---------------------------------
Regular reduce-side joins are implemented in JoinOperator and
CommonJoinOperator. Map-side joins are implemented in MapJoinOperator.
In reduce-side joins, the join keys serve as the distribution keys from the
mappers to the reducers, so each group (delimited by beginGroup() and
endGroup()) consists of rows with the same join keys. A reduce-side join
caches the rows within a group for every table except the last one (the
streaming table), whose rows are scanned and combined in a Cartesian product
with the cached rows of the other tables. I think the fix would be to check the
join keys for NULL and produce the proper output according to the semantics of
each join type.
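To make the proposed check concrete, here is a minimal sketch of how one key
group might be handled for a two-table join. It is not Hive's JoinOperator:
the class ReduceGroupJoinSketch, the JoinType enum, the string rows, and the
" | " output format are all hypothetical, chosen only to show an inner join
dropping a NULL-key group while a left outer join pads it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Hive code: joins one key group of a two-table
// reduce-side join, where the right table is the streaming table.
class ReduceGroupJoinSketch {
    enum JoinType { INNER, LEFT_OUTER }

    static List<String> joinGroup(Object joinKey, List<String> cachedLeft,
                                  List<String> streamedRight, JoinType type) {
        List<String> out = new ArrayList<>();
        // NULL = NULL is not true in SQL, so a NULL-key group has no matches.
        boolean noMatch = joinKey == null || streamedRight.isEmpty();
        if (noMatch) {
            if (type == JoinType.LEFT_OUTER) {
                for (String left : cachedLeft) {
                    out.add(left + " | NULL"); // keep left row, pad right side
                }
            }
            return out; // INNER emits nothing for an unmatched group
        }
        // Non-NULL key: Cartesian product of each streamed row with the cache.
        for (String right : streamedRight) {
            for (String left : cachedLeft) {
                out.add(left + " | " + right);
            }
        }
        return out;
    }
}
```

Right and full outer joins would need the symmetric padding for the streamed
rows; they are left out to keep the sketch short.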
The map-side join is basically a hash join: the small table is read in its
entirety into a hash table, which is probed while scanning the streaming table.
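As a rough illustration of that build-and-probe pattern, the sketch below does
an inner hash join over in-memory rows and skips NULL keys on both sides. It
is not MapJoinOperator: the class name, the {joinKey, payload} row layout, and
the output format are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a map-side hash join, not Hive code.
// Each row is {joinKey, payload}; joinKey may be null.
class MapSideHashJoinSketch {
    static List<String> innerJoin(List<Object[]> smallTable,
                                  List<Object[]> streamingTable) {
        // Build phase: load the whole small table into a hash table.
        Map<Object, List<Object[]>> hashTable = new HashMap<>();
        for (Object[] row : smallTable) {
            if (row[0] == null) {
                continue; // a NULL key can never satisfy an equi-join
            }
            hashTable.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        // Probe phase: scan the streaming table once and look up each key.
        List<String> out = new ArrayList<>();
        for (Object[] probe : streamingTable) {
            if (probe[0] == null) {
                continue; // same rule on the probe side
            }
            List<Object[]> matches = hashTable.get(probe[0]);
            if (matches == null) {
                continue;
            }
            for (Object[] match : matches) {
                out.add(match[1] + " | " + probe[1]);
            }
        }
        return out;
    }
}
```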
There are other types of joins (bucketed map-side join, sort-merge join, etc.),
but they all rely on the three classes mentioned above.
Let me know if you have further questions as you get started.
> NULL is not handled correctly in join
> -------------------------------------
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
> Issue Type: Bug
> Reporter: Ning Zhang
> Assignee: Ning Zhang
>
> With the following data in table input4_cb:
> Key    Value
> ------ --------
> NULL   325
> 18     NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL 325 18 NULL
> The correct result should be an empty set.
> When 'null' is replaced by '' it works.
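The report can be modeled in a few lines. The sketch below is a hypothetical
in-memory stand-in for input4_cb, not Hive code, and it assumes (without
confirmation from the issue) that the wrong row comes from a comparison that
treats two NULL keys as equal. With that comparison the self-join yields the
reported row; with SQL equality it yields nothing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical in-memory model of input4_cb; columns are (key, value).
class NullJoinRepro {
    static final Integer[][] INPUT4_CB = { {null, 325}, {18, null} };

    // Self-join on a.key = b.value. nullsMatch = true mimics a comparison
    // that treats NULL = NULL as true (assumed cause, not confirmed).
    static List<String> selfJoin(boolean nullsMatch) {
        List<String> out = new ArrayList<>();
        for (Integer[] a : INPUT4_CB) {
            for (Integer[] b : INPUT4_CB) {
                boolean match = nullsMatch
                        ? Objects.equals(a[0], b[1])
                        : a[0] != null && a[0].equals(b[1]); // SQL equality
                if (match) {
                    out.add(a[0] + " " + a[1] + " " + b[0] + " " + b[1]);
                }
            }
        }
        return out;
    }
}
```

Calling selfJoin(true) returns the single row "null 325 18 null", and
selfJoin(false) returns an empty list.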