[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12902004#action_12902004 ] Ning Zhang commented on HIVE-741: - +1. Will commit if tests pass. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: join_nulls.q.out, patch-741-1.txt, patch-741-2.txt, > patch-741-3.txt, patch-741-4.txt, patch-741-5.txt, patch-741-6.txt, > patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901515#action_12901515 ] Ning Zhang commented on HIVE-741: - Amareshwari, aside from Yongqiang comment, join_null.q's result is not deterministic -- the SMB joins result in different orders. Can you make it deterministic by adding a 'order by' clause at the end of the queries? I'll attach my test run results > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, > patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901508#action_12901508 ] He Yongqiang commented on HIVE-741: --- also about Ning's comments: >>2) SMBMapJoinOperator.compareKey() is called for each row so it is critical >>for performance. In your code the hasNullElement() could be called 4 times in >>the worse case. If you cache the result it can be called only twice. Agree. Not sure how much overhead is there, will try to estimate the overhead over production running. That will be great if you can try to cache the null check results, so that it can only happen one time for each key. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, > patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901503#action_12901503 ] He Yongqiang commented on HIVE-741: --- +1. The patch looks good to me. (Only have one minor comment on the name of "hasNullElements", should we rename it since this function is used to determine all keys are null?) > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, > patch-741-4.txt, patch-741-5.txt, patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901275#action_12901275 ] Ning Zhang commented on HIVE-741: - Looks good except one mintor thing: SerDeUtils.java:369 should return true? Amareshwari, can you upload a new patch and I'll run unit tests. Yongqiang, can you test this patch on the production SMB join queries? > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, > patch-741-4.txt, patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900941#action_12900941 ] Ning Zhang commented on HIVE-741: - The SMB test case still has a minor issue: the tables was created as 2 buckets but there is only 1 file in the tables. This is conflicting to the table schema. If a table is defined as bucketd 2, there should be 2 files in the partition or table. They SMB join takes the 1st file in T1 join the 1st file in T2, and 2nd file in T1 join 2nd file in T2. So the test case should cover this use case. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741-3.txt, > patch-741.txt, smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900570#action_12900570 ] Ning Zhang commented on HIVE-741: - Looks good in general. Some minor comments: 1) add more test cases for SMB joins. Currently the only test case has only 1 bucket which does not cover the most common use case. Can you add more test cases for more buckets? You can take a look at bucketed join queries included in the client positive tests. 2) SMBMapJoinOperator.compareKey() is called for each row so it is critical for performance. In your code the hasNullElement() could be called 4 times in the worse case. If you cache the result it can be called only twice. Yongqiang, any further comments? > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741.txt, > smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900198#action_12900198 ] Amareshwari Sriramadasu commented on HIVE-741: -- All the tests passed with latest patch : patch-741-2.txt > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741-1.txt, patch-741-2.txt, patch-741.txt, > smbjoin_nulls.q.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899753#action_12899753 ] Ning Zhang commented on HIVE-741: - @Amareshwari, sorry the syntax was wrong for the 3 table joins. Below is the correct syntax and plan. {code} explain select * from src a left outer join src b on (a.value=b.value) right outer join src c on (b.value=c.value); OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_RIGHTOUTERJOIN (TOK_LEFTOUTERJOIN (TOK_TABREF src a) (TOK_TABREF src b) (= (. (TOK_TABLE_OR_COL a) value) (. (TOK_TABLE_OR_COL b) value))) (TOK_TABREF src c) (= (. (TOK_TABLE_OR_COL b) value) (. (TOK_TABLE_OR_COL c) value (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator key expressions: expr: value type: string sort order: + Map-reduce partition columns: expr: value type: string tag: 0 value expressions: expr: key type: string expr: value type: string b TableScan alias: b Reduce Output Operator key expressions: expr: value type: string sort order: + Map-reduce partition columns: expr: value type: string tag: 1 value expressions: expr: key type: string expr: value type: string c TableScan alias: c Reduce Output Operator key expressions: expr: value type: string sort order: + Map-reduce partition columns: expr: value type: string tag: 2 value expressions: expr: key type: string expr: value type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 Right Outer Join1 to 2 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} {VALUE._col1} 2 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4, _col5, _col8, _col9 Select Operator expressions: expr: _col0 type: string expr: _col1 type: string expr: _col4 type: string expr: _col5 type: string expr: _col8 type: string expr: _col9 type: string outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 File Output Operator compressed: true GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 {code} > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899365#action_12899365 ] Amareshwari Sriramadasu commented on HIVE-741: -- Thanks Ning for your comments. bq. select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value; bq. select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value; This is happening because I'm assuming nr.get(0) in JoinOperator is the join-key. It seems it not always true that key is the first element in the ArrayList. When I modified a the code to the following, above queries are giving correct results. {code} StructObjectInspector soi = (StructObjectInspector) inputObjInspectors[tag]; StructField sf = soi.getStructFieldRef(Utilities.ReduceField.KEY .toString()); Object keyObject = soi.getStructFieldData(row, sf); if (SerDeUtils.isNullObject(keyObject, soi)) { endGroup(); startGroup(); } {code} Added method SerDeUtils.isNullObject(keyObject, soi) to know if the object passed is representing a NULL object. bq. select * FROM myinput1 a left outer JOIN myinput1 b right outer join myinput1 c ON a.value = b.value and b.value = c.value; Looking at Stage-1 of "explain" for the above query: {noformat} Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator sort order: tag: 0 value expressions: expr: key type: int expr: value type: int b TableScan alias: b Reduce Output Operator sort order: tag: 1 value expressions: expr: key type: int expr: value type: int Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4, _col5 Filter Operator predicate: expr: (_col1 = _col5) type: boolean File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat {noformat} Join happens without join key? Here, join output is the Cartesian product of a and b on which FilterOperator is applied, Am I right? I see the semantics of inner/outer join on two tables without join condition is to produce Cartesian product. As a side note: "MySql does not allow outer joins without join condition". If Join is allowed without join condition to produce Cartesian product of the two tables, then my patch should be changed to consider if join-key is defined for the join or not. I could reproduce it by simple query "select * FROM myinput1 a JOIN myinput1 b". I think the same applies to MapJoin as well. bq. Verified that SMBMapJoinOperator already filters nulls properly. bq. Can you also add one or few tests for sort merge join? It seems my verification was wrong here, I thought if the table is sorted and hive.optimize.bucketmapjoin, hive.optimize.bucketmapjoin.sortedmerge are set to true, MapJoin uses SMBMapJoinOperator. But it was using MapJoinOperator it self. When I created a table with "sorted by" column, I see it using SMBMapJoinOperator. Currently if there are any nulls in the input table, SMBJoin fails with NullPointerException: {noformat} Caused by: java.lang.NullPointerException at org.apache.hadoop.io.IntWritable.compareTo(IntWritable.java:60) at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:115) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.compareKeys(SMBMapJoinOperator.java:389) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processKey(SMBMapJoinOperator.java:438) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.processOp(SMBMapJoinOperator.java:205) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:698) at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:45) at org.apache.hadoop.hive.ql.exec.Operator.process(Operator.java:458) at org.apache.hadoop.hive.ql.exec.SMBMapJoinOperator.fetchOneRow(SMBMapJoinOperator.java:479) ... 17 more {noformat} Will look into this. bq. For inner, left and right outer joins, a simpler fix wou
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899177#action_12899177 ] Namit Jain commented on HIVE-741: - For inner, left and right outer joins, a simpler fix would be to add a filter on top. For example, for A join B on A.c1 = B.c2 add a filter before A (A.c1 is not null) and B (B.c2 is not null) For A left outer join B on A.c1 = B.c2, the filter before A is not needed and similarly, for A right outer join B on A.c1 = B.c2, the filter before B is not needed Some special handling might still be needed for full outer joins and sort-merge joins > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899044#action_12899044 ] Ning Zhang commented on HIVE-741: - @Amareshwari, aside from adding new test cases for sort merge join, this patch also has some bugs. For example in your test data: {code} hive> select * from myinput1 NULL356 484 NULL 10 10 -- incorrect result below hive> select * FROM myinput1 a left outer JOIN myinput1 b ON a.value = b.value; 484 NULL484 NULL 10 10 10 10 NULL356 NULLNULL hive> select * FROM myinput1 a right outer JOIN myinput1 b ON a.value = b.value; 484 NULL484 NULL 10 10 10 10 NULLNULLNULL356 hive> select * FROM myinput1 a left outer JOIN myinput1 b right outer join myinput1 c ON a.value = b.value and b.value = c.value; NULLNULLNULLNULL484 NULL NULLNULLNULLNULL10 10 NULLNULLNULLNULLNULL356 {code} Can you take a look? I'm not sure whether ending a group and starting a new group for each null-keyed row works for all cases particularly in joins involving more than 2 tables and mixture of left and right outer joins. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898996#action_12898996 ] He Yongqiang commented on HIVE-741: --- the change looks good to me. Can you also add one or few tests for sort merge join? > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > Attachments: patch-741.txt > > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898772#action_12898772 ] Ted Xu commented on HIVE-741: - bq. a left outer join preserves rows on the left side regardless of whether the ON clause evaluates true That's right, thanks. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898756#action_12898756 ] Ning Zhang commented on HIVE-741: - @Amareshwar, currently we already distinguish different join types with different functions (take a look at CommonJoinOperator.joinObjects()). I look forward to seeing your proposal to avoid grouping null-keyed rows. @Ted, I agree with Amareshwar and John that we cannot avoid rows (or the value part of the key-value pairs) with null as a key. However you have a point in that if we know the join operator does not involve outer join at all (we already have a flag noOuterJoin in JoinDesc), then we could avoid sending rows will null keys from the mappers to the reducers. This will save bandwidth as well as processing time. Could you open another JIRA and be able to submit a patch? > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898374#action_12898374 ] John Sichi commented on HIVE-741: - @Ted: as Amareshwari mentioned, a left outer join preserves rows on the left side regardless of whether the ON clause evaluates true. So in that case (and similar cases for right/full outer join), we can't filter out the rows with null join keys. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898194#action_12898194 ] Ted Xu commented on HIVE-741: - Sorry I'm mistakenly expressed my idea, I mean values with NULL join keys shall be filtered out in mappers. I think values with NULL join keys shall be filtered out because NULL equals nothing, and Hive only support equal join. Please correct me if I'm wrong. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898141#action_12898141 ] Amareshwari Sriramadasu commented on HIVE-741: -- bq. but it will be great if we can filter the NULL values in mappers. Ted, we should not filter out the NULL values in mappers. Because for outer joins, these rows should be cartesian producted with nulls as shown in expected output in the [comment | https://issues.apache.org/jira/browse/HIVE-741?focusedCommentId=12896789&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12896789] > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898136#action_12898136 ] Amareshwari Sriramadasu commented on HIVE-741: -- Thanks Ning for the details. To summarize the implementation of join: * In reduce-side join, rows with same join keys are grouped together; and in MapSide join, rows with same join keys are added the same entry in the hash table. * CommonJoinOperator.checkAndGenObject: The rows with same join key are cartesian producted with each other(i.e. with rows of different aliases). If there are no rows in one table alias, the rows of other table alias are ignored (for inner join) or cartesian producted with nulls (outer joins). The above implementation works fine except for null join keys ; Since these rows are grouped together/hashed to same entry, the current issue exists. bq. I think the fix would be to check the NULL value of the join keys and do proper output based on the semantics of different types of joins. This would need special handling for each type of join (inner, left outer, right outer, full outer an etc.). So, I'm thinking the better solution is not group rows with null join keys together. Then the above join algorithm works correctly for all types of joins. Currently they are grouped together because HiveKey.compare compares the bytes of the key (in case of reduce-side join) and MapJoinObjectKey.equals returns true if both keys are null (in case of map-side join). I'm trying to see if can come up with a solution which does not group rows with null join keys together. Please correct me if am wrong. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898132#action_12898132 ] Ted Xu commented on HIVE-741: - Thanks for Ning and Amareshwari, we are looking forward to see the bug fixed. I think it's Okay to solve it by modifying the *JoinOperators, but it will be great if we can filter the NULL values in mappers, say, in ReduceSinkOperator, provided if we can know which part of the reduce sink key is from join (other than from group by, distinct, etc,.). > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Amareshwari Sriramadasu > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896974#action_12896974 ] Ning Zhang commented on HIVE-741: - The joins are implemented in the JoinOperator and CommonJoinOperators for regular reduce-side joins. The map-side joins are implemented in the MapJoinOperator. In the reduce side joins, the join keys are treated as distribution keys from the mappers to the reducers so that each group (marked by beginGroup() and endGroup()) will consists of rows with the same join keys. The reduce-side joins will cache all rows within a group except the last one (aka streaming table), which is scanned and cartesian producted with the cached rows of the other tables. I think the fix would be to check the NULL value of the join keys and do proper output based on the semantics of different types of joins. For the map-side join, it's basically a hash join where the small table is read in entirety in a hash table and probed while scanning the streaming table. There are other types of joins (bucketed map-side join, sort merge join etc.), but they all rely on the 3 classes mentioned above. Let me know if you have further questions for you to get started. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Ning Zhang > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896793#action_12896793 ] Ning Zhang commented on HIVE-741: - I'm not actively working on it. Please feel free to take it. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Ning Zhang > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896792#action_12896792 ] Amareshwari Sriramadasu commented on HIVE-741: -- By adding logs to ExecReducer, I see that the input to reduce is the following: {noformat} {"key":{"joinkey0":null},"value":{"_col0":null,"_col1":35},"alias":0} {"key":{"joinkey0":null},"value":{"_col0":12,"_col1":null},"alias":1} {"key":{"joinkey0":10},"value":{"_col0":10,"_col1":1000},"alias":0} {"key":{"joinkey0":10},"value":{"_col0":10,"_col1":100},"alias":0} {"key":{"joinkey0":12},"value":{"_col0":12,"_col1":null},"alias":0} {"key":{"joinkey0":35},"value":{"_col0":null,"_col1":35},"alias":1} {"key":{"joinkey0":100},"value":{"_col0":100,"_col1":100},"alias":0} {"key":{"joinkey0":100},"value":{"_col0":10,"_col1":100},"alias":1} {"key":{"joinkey0":100},"value":{"_col0":100,"_col1":100},"alias":1} {"key":{"joinkey0":1000},"value":{"_col0":10,"_col1":1000},"alias":1} {noformat} And joinkey with null values are processed under same group, I think they should be processed in different groups, because comparison between nulls is not defined. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Ning Zhang > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (HIVE-741) NULL is not handled correctly in join
[ https://issues.apache.org/jira/browse/HIVE-741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12896789#action_12896789 ] Amareshwari Sriramadasu commented on HIVE-741: -- bq. When 'null' is replaced by '' it works. I see the same result even if 'null' is replaced with ''. To reproduce the above, I created a table input1 with {code} hive> create table input1 (key int, value int); {code} Loaded the following input using Load data command {noformat} in.txt ^A35 12^A 10^A1000 10^A100 100^A100 {noformat} I see the following output for join queries executed. {noformat} hive> select * from input1 a join input1 b on a.key=b.value; Output: NULL35 12 NULL 100 100 10 100 100 100 100 100 Expected Output: 100 100 10 100 100 100 100 100 hive> select * from input1 a left outer join input1 b on a.key=b.value; OUTPUT: NULL35 12 NULL 10 1000NULLNULL 10 100 NULLNULL 12 NULLNULLNULL 100 100 10 100 100 100 100 100 Expected Output: NULL35 NULLNULL 10 1000NULLNULL 10 100 NULLNULL 12 NULLNULLNULL 100 100 10 100 100 100 100 100 hive> select * from input1 a right outer join input1 b on a.key=b.value; OUTPUT: NULL35 12 NULL NULLNULLNULL35 100 100 10 100 100 100 100 100 NULLNULL10 1000 ExpectedOutput: NULLNULLNULL35 NULLNULL12 NULL 100 100 10 100 100 100 100 100 NULLNULL10 1000 {noformat} Expected output is obtained from mysql db for a similar query. Ning, if you are not working on the fix for this, I would like to contribute. Would need your help understanding join code also, as I'm a new to hive. > NULL is not handled correctly in join > - > > Key: HIVE-741 > URL: https://issues.apache.org/jira/browse/HIVE-741 > Project: Hadoop Hive > Issue Type: Bug >Reporter: Ning Zhang >Assignee: Ning Zhang > > With the following data in table input4_cb: > KeyValue > -- > NULL 325 > 18 NULL > The following query: > {code} > select * from input4_cb a join input4_cb b on a.key = b.value; > {code} > returns the following result: > NULL32518 NULL > The correct result should be empty set. > When 'null' is replaced by '' it works. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.