[
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365
]
Amareshwari Sriramadasu commented on HIVE-741:
----------------------------------------------
Thanks Ning for your comments.
bq. select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value;
bq. select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value;
This is happening because I'm assuming nr.get(0) in JoinOperator is the
join-key. It seems it not always true that key is the first element in the
ArrayList. When I modified a the code to the following, above queries are
giving correct results.
{code}
StructObjectInspector soi = (StructObjectInspector)
inputObjInspectors[tag];
StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY
.toString());
Object keyObject = soi.getStructFieldData(row, sf);
if (SerDeUtils.isNullObject(keyObject, soi)) {
endGroup();
startGroup();
}
{code}
Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object
passed is representing a NULL object.
bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join
myinput1 c ON a.value = b.value and b.value = c.value;
Looking at Stage-1 of "explain" for the above query:
{noformat}
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
sort order:
tag: 0
value expressions:
expr: key
type: int
expr: value
type: int
b
TableScan
alias: b
Reduce Output Operator
sort order:
tag: 1
value expressions:
expr: key
type: int
expr: value
type: int
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0} {VALUE._col1}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col4, _col5
Filter Operator
predicate:
expr: (_col1 = _col5)
type: boolean
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
{noformat}
Join happens without join key? Here, join output is the Cartesian product of a
and b on which FilterOperator is applied, Am I right? I see the semantics of
inner/outer join on two tables without join condition is to produce Cartesian
product. As a side note: "MySql does not allow outer joins without join
condition".
If Join is allowed without join condition to produce Cartesian product of the
two tables, then my patch should be changed to consider if join-key is defined
for the join or not. I could reproduce it by simple query "select * FROM
myinput1 a JOIN myinput1 b". I think the same applies to MapJoin as well.
bq. Verified that SMBMapJoinOperator already filters nulls properly.
bq. Can you also add one or few tests for sort merge join?
It seems my verification was wrong here, I thought if the table is sorted and
hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge are set to
true, MapJoin uses SMBMapJoinOperator. But it was using MapJoinOperator it
self. When I created a table with "sorted by" column, I see it using
SMBMapJoinOperator. Currently if there are any nulls in the input table,
SMBJoin fails with NullPointerException:
{noformat}
Caused by: java.lang.NullPointerException
at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60)
at
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
at
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389)
at
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438)
at
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698)
at
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45)
at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
at
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479)
... 17 more
{noformat}
Will look into this.
bq. For inner, left and right outer joins, a simpler fix would be to add a
filter on top.
Now, I agree it would be simpler :). Will consider this also and see if i can
do some special handling for full outer joins.
> NULL is not handled correctly in join
> -------------------------------------
>
> Key: HIVE-741
> URL: https://issues.apache.org/jira/browse/HIVE-741
> Project: Hadoop Hive
> Issue Type: Bug
> Reporter: Ning Zhang
> Assignee: Amareshwari Sriramadasu
> Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> Key Value
> ------ --------
> NULL 325
> 18 NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL 325 18 NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.