[jira] [Assigned] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

Diksha (Jira) Thu, 06 Jul 2023 02:49:04 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Diksha reassigned HIVE-27324:
-----------------------------

    Assignee: Diksha

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-27324
>                 URL: https://issues.apache.org/jira/browse/HIVE-27324
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive
>    Affects Versions: All Versions, 3.1.0
>            Reporter: Shobika Selvaraj
>            Assignee: Diksha
>            Priority: Major
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> -------
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> -------
>  
> Then executed the below query: 
> ------
> select * from t3
> where age not in (select distinct(age) age from t1);
> ------
>  
> The result should be as below:
> {code:java}
> +--------+----------+---------+
> | t3.id  | t3.name  | t3.age  |
> +--------+----------+---------+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> +--------+----------+---------+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ......
> .....
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> +--------+----------+---------+
> | t3.id  | t3.name  | t3.age  |
> +--------+----------+---------+
> +--------+----------+---------+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> {code:java}
> select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age 
> from t1); {code}
>  
> *SECOND ISSUE:* 
> Also while testing multiple scenario's i found one more issue as well.
> When the sub query table (t1) doesn't contain any null values then the query 
> is giving result but it is ignoring the null values of the main table(t3) .
>  
> For example: Created another table t4 and inserted the data's without any 
> null values:
> create table t4 (id int,name string, age int);
> insert into t4 
> values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> Now i tested with the below query and it gives 5 records. The count should be 
> six and it omitted the null value of the table t3:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t4);
> INFO  : Compiling 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c): 
> select * from t3
> where age not in (select distinct(age) age from t4)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ......
> ......
> INFO  : Completed executing 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c); 
> Time taken: 17.724 seconds
> INFO  : OK
> +--------+----------+---------+
> | t3.id  | t3.name  | t3.age  |
> +--------+----------+---------+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> +--------+----------+---------+
> 5 rows selected (19.674 seconds) {code}
> But disabling CBO fixed this issue:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> set hive.cbo.enable=false;
> No rows affected (0.02 seconds)
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t4);
> INFO  : Compiling 
> command(queryId=hive_20230427165029_6c3167b0-c706-4230-a32f-2e4bc528be8f): 
> select * from t3
> where age not in (select distinct(age) age from t4)
> INFO  : Warning: Map Join MAPJOIN[41][bigTable=t3] in task 'Map 1' is a cross 
> product
> INFO  : Semantic Analysis Completed (retrial = false)
> .......
> .......
> INFO  : Completed executing 
> command(queryId=hive_20230427165029_6c3167b0-c706-4230-a32f-2e4bc528be8f); 
> Time taken: 13.738 seconds
> INFO  : OK
> +--------+----------+---------+
> | t3.id  | t3.name  | t3.age  |
> +--------+----------+---------+
> | 2      | Sultan   | NULL    |
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> +--------+----------+---------+
> 6 rows selected (14.526 seconds) {code}
> I am not sure this how hive works or not by omitting the null values. Even if 
> it omits the null values then it should give the other records{*}(FOR ISSUE 
> 1){*} for the sub query table which contains null values.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Assigned] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

Reply via email to