[jira] Commented: (HIVE-741) NULL is not handled correctly in join

Amareshwari Sriramadasu (JIRA) Tue, 17 Aug 2010 03:16:02 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365
 ]


Amareshwari Sriramadasu commented on HIVE-741:
----------------------------------------------

Thanks Ning for your comments.

bq.  select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value;
bq.  select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value;
This is happening because I'm assuming nr.get(0) in JoinOperator is the 
join-key. It seems it not always true that key is the first element in the 
ArrayList. When I modified a the code to the following, above queries are 
giving correct results.
{code}
      StructObjectInspector soi = (StructObjectInspector) 
inputObjInspectors[tag];
      StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY
          .toString());
      Object keyObject = soi.getStructFieldData(row, sf);
      if (SerDeUtils.isNullObject(keyObject, soi)) {
        endGroup();
        startGroup();
     }
{code}
Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object 
passed is representing a NULL object.

bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join 
myinput1 c ON a.value = b.value and b.value = c.value;
Looking at Stage-1 of "explain" for the above query:
{noformat}
Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a
          TableScan
            alias: a
            Reduce Output Operator
              sort order:
              tag: 0
              value expressions:
                    expr: key
                    type: int
                    expr: value
                    type: int
        b
          TableScan
            alias: b
            Reduce Output Operator
              sort order:
              tag: 1
              value expressions:
                    expr: key
                    type: int
                    expr: value
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col1 = _col5)
                type: boolean
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
{noformat}
Join happens without join key?  Here, join output is the Cartesian product of a 
and b on which FilterOperator is applied, Am I right? I see the semantics of 
inner/outer join on two tables without join condition is to produce Cartesian 
product. As a side note: "MySql does not allow outer joins without join 
condition". 
If Join is allowed without join condition to produce Cartesian product of the 
two tables, then my patch should be changed to consider if join-key is defined 
for the join or not.  I could reproduce it by simple query "select * FROM 
myinput1 a JOIN myinput1 b". I think the same applies to MapJoin as well. 

bq. Verified that SMBMapJoinOperator already filters nulls properly. 
bq. Can you also add one or few tests for sort merge join? 
It seems my verification was wrong here, I thought if the table is sorted and 
hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge are set to 
true, MapJoin uses SMBMapJoinOperator. But it was using MapJoinOperator it 
self. When I created a table with "sorted by" column, I see it using 
SMBMapJoinOperator. Currently if there are any nulls in the input table, 
SMBJoin fails with NullPointerException:
{noformat}
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60)
        at 
org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115)
        at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389)
        at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438)
        at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698)
        at 
org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45)
        at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458)
        at 
org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479)
        ... 17 more
{noformat}
Will look into this. 

bq. For inner, left and right outer joins, a simpler fix would be to add a 
filter on top. 
Now, I agree it would be simpler :). Will consider this also and see if i can 
do some special handling for full outer joins.

> NULL is not handled correctly in join
> -------------------------------------
>
>                 Key: HIVE-741
>                 URL: https://issues.apache.org/jira/browse/HIVE-741
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: Ning Zhang
>            Assignee: Amareshwari Sriramadasu
>         Attachments: patch-741.txt
>
>
> With the following data in table input4_cb:
> Key        Value
> ------       --------
> NULL     325
> 18          NULL
> The following query:
> {code}
> select * from input4_cb a join input4_cb b on a.key = b.value;
> {code}
> returns the following result:
> NULL    325    18   NULL
> The correct result should be empty set.
> When 'null' is replaced by '' it works.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-741) NULL is not handled correctly in join

Reply via email to