[ https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Diksha reassigned HIVE-27324: ----------------------------- Assignee: Diksha > Hive query with NOT IN condition is giving incorrect results when the sub > query table contains the null value. > -------------------------------------------------------------------------------------------------------------- > > Key: HIVE-27324 > URL: https://issues.apache.org/jira/browse/HIVE-27324 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: All Versions, 3.1.0 > Reporter: Shobika Selvaraj > Assignee: Diksha > Priority: Major > > Hive query giving empty results when the sub query table contains the null > value. > We encountered two issues here. > 1) The query - "select * from t3 where age not in (select distinct(age) age > from t1);" is giving empty results when the table t1 contains a null value. > Disabling cbo didn't helped here. > 2) Let's consider the table t3 has null value and table t1 doesn't have any > null values. Now if we run the above query it is returning other data's but > not the null value from t3. If we disable the cbo then it's giving result > with null value. > > *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:* > *FIRST ISSUE:* > Create two tables and insert data with null values as below: > ------- > create table t3 (id int,name string, age int); > > insert into t3 > values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45); > > create table t1 (id int,name string, age int); > insert into t1 > values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23); > ------- > > Then executed the below query: > ------ > select * from t3 > where age not in (select distinct(age) age from t1); > ------ > > The result should be as below: > {code:java} > +--------+----------+---------+ > | t3.id | t3.name | t3.age | > +--------+----------+---------+ > | 9 | ron | 3 | > | 10 | Sam | 22 | > | 11 | nick | 19 | > | 12 | fed | 18 | > | 13 | kong | 13 | > +--------+----------+---------+ > 5 rows selected (35.897 seconds) {code} > But when we run the above query it is giving zero records: > {code:java} > 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3 > . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select > distinct(age) age from t1); > INFO : Compiling > command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): > select * from t3 > where age not in (select distinct(age) age from t1) > INFO : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross > product > ...... > ..... > INFO : Completed executing > command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); > Time taken: 10.191 seconds > INFO : OK > +--------+----------+---------+ > | t3.id | t3.name | t3.age | > +--------+----------+---------+ > +--------+----------+---------+ > No rows selected (12.17 seconds) {code} > The query works fine when we use nvl function or not null condition. > > So as a workaround we can use nvl function for both main and sub query as > below: > {code:java} > select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age > from t1); {code} > > *SECOND ISSUE:* > Also while testing multiple scenario's i found one more issue as well. > When the sub query table (t1) doesn't contain any null values then the query > is giving result but it is ignoring the null values of the main table(t3) . > > For example: Created another table t4 and inserted the data's without any > null values: > create table t4 (id int,name string, age int); > insert into t4 > values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23); > Now i tested with the below query and it gives 5 records. The count should be > six and it omitted the null value of the table t3: > {code:java} > 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3 > . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select > distinct(age) age from t4); > INFO : Compiling > command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c): > select * from t3 > where age not in (select distinct(age) age from t4) > INFO : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross > product > ...... > ...... > INFO : Completed executing > command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c); > Time taken: 17.724 seconds > INFO : OK > +--------+----------+---------+ > | t3.id | t3.name | t3.age | > +--------+----------+---------+ > | 9 | ron | 3 | > | 10 | Sam | 22 | > | 11 | nick | 19 | > | 12 | fed | 18 | > | 13 | kong | 13 | > +--------+----------+---------+ > 5 rows selected (19.674 seconds) {code} > But disabling CBO fixed this issue: > {code:java} > 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> set hive.cbo.enable=false; > No rows affected (0.02 seconds) > 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3 > . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select > distinct(age) age from t4); > INFO : Compiling > command(queryId=hive_20230427165029_6c3167b0-c706-4230-a32f-2e4bc528be8f): > select * from t3 > where age not in (select distinct(age) age from t4) > INFO : Warning: Map Join MAPJOIN[41][bigTable=t3] in task 'Map 1' is a cross > product > INFO : Semantic Analysis Completed (retrial = false) > ....... > ....... > INFO : Completed executing > command(queryId=hive_20230427165029_6c3167b0-c706-4230-a32f-2e4bc528be8f); > Time taken: 13.738 seconds > INFO : OK > +--------+----------+---------+ > | t3.id | t3.name | t3.age | > +--------+----------+---------+ > | 2 | Sultan | NULL | > | 9 | ron | 3 | > | 10 | Sam | 22 | > | 11 | nick | 19 | > | 12 | fed | 18 | > | 13 | kong | 13 | > +--------+----------+---------+ > 6 rows selected (14.526 seconds) {code} > I am not sure this how hive works or not by omitting the null values. Even if > it omits the null values then it should give the other records{*}(FOR ISSUE > 1){*} for the sub query table which contains null values. -- This message was sent by Atlassian Jira (v8.20.10#820010)