[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-11-03 Thread Sankar Hariappan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17782632#comment-17782632
 ] 

Sankar Hariappan commented on HIVE-27324:
-

[~diksha193]: Pls backport this fix to branch-3 as well.

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: 3.1.3
>Reporter: Shobika Selvaraj
>Assignee: Diksha
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: sql_queries.png
>
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> {code:java}
> select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age 
> from t1); {code}
>  
> *SECOND ISSUE:* 
> Also while testing multiple scenario's i found one more issue as well.
> When the sub query table (t1) doesn't contain any null values then the query 
> is giving result but it is ignoring the null values of the main table(t3) .
>  
> For example: Created another table t4 and inserted the data's without any 
> null values:
> create table t4 (id int,name string, age int);
> insert into t4 
> values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> Now i tested with the below query and it gives 5 records. The count should be 
> six and it omitted the null value of the table t3:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t4);
> INFO  : Compiling 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c): 
> select * from t3
> where age not in (select distinct(age) age from t4)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> ..
> INFO  : Completed executing 
> 

[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-09-16 Thread Diksha (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765949#comment-17765949
 ] 

Diksha commented on HIVE-27324:
---

{quote}!sql_queries.png!
{quote}

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: All Versions, 3.1.0
>Reporter: Shobika Selvaraj
>Assignee: Diksha
>Priority: Major
>  Labels: pull-request-available
> Attachments: sql_queries.png
>
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> {code:java}
> select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age 
> from t1); {code}
>  
> *SECOND ISSUE:* 
> Also while testing multiple scenario's i found one more issue as well.
> When the sub query table (t1) doesn't contain any null values then the query 
> is giving result but it is ignoring the null values of the main table(t3) .
>  
> For example: Created another table t4 and inserted the data's without any 
> null values:
> create table t4 (id int,name string, age int);
> insert into t4 
> values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> Now i tested with the below query and it gives 5 records. The count should be 
> six and it omitted the null value of the table t3:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t4);
> INFO  : Compiling 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c): 
> select * from t3
> where age not in (select distinct(age) age from t4)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> ..
> INFO  : Completed executing 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c); 
> 

[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-09-16 Thread Diksha (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765948#comment-17765948
 ] 

Diksha commented on HIVE-27324:
---

sql queries ran in online sql compiler:

create table t3 (id int,name varchar(255), age int);
insert into t3 
values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);

create table t4 (id int,name varchar(255), age int);
insert into t4 
values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);

create table t5 (id int,name varchar(255), ages int);
insert into t5 
values(1,'Sagar',23),(3,'Surya',NULL),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);

select * from t3
where age in (select distinct(age) age from t4)
order by age ;

select * from t3
where age not in (select distinct(age) age from t4  )
order by age ;

select * from t3
where age not in (select distinct(ages) ages from t5 where t5.ages is not null)
order by age ;

select count(*) from t3
where age not in (select distinct(ages) ages from t5 )
order by age ;

select count(*) from t3
where age not in (23,22, null );

select count(*) from t3
where age not in (select distinct(age)age from t3 t1 where t1.age > 10);

 

 

Outputs:
6|Ramya|5
1|Sagar|23
3|Surya|23
5|Scott|23
7||23
8||23
4|Raman|45
14|hela|45
9|ron|3
13|kong|13
12|fed|18
11|nick|19
10|Sam|22
9|ron|3
13|kong|13
12|fed|18
11|nick|19
10|Sam|22

0
0

2

[Execution complete with exit code 0]

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: All Versions, 3.1.0
>Reporter: Shobika Selvaraj
>Assignee: Diksha
>Priority: Major
>  Labels: pull-request-available
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 

[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-09-16 Thread Sankar Hariappan (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765925#comment-17765925
 ] 

Sankar Hariappan commented on HIVE-27324:
-

[~diksha193] Pls attach the mysql output for queries in discussion.

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: All Versions, 3.1.0
>Reporter: Shobika Selvaraj
>Assignee: Diksha
>Priority: Major
>  Labels: pull-request-available
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> {code:java}
> select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age 
> from t1); {code}
>  
> *SECOND ISSUE:* 
> Also while testing multiple scenario's i found one more issue as well.
> When the sub query table (t1) doesn't contain any null values then the query 
> is giving result but it is ignoring the null values of the main table(t3) .
>  
> For example: Created another table t4 and inserted the data's without any 
> null values:
> create table t4 (id int,name string, age int);
> insert into t4 
> values(1,'Sagar',23),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> Now i tested with the below query and it gives 5 records. The count should be 
> six and it omitted the null value of the table t3:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t4);
> INFO  : Compiling 
> command(queryId=hive_20230427164745_f20f47ce-614d-493d-8910-99a118de089c): 
> select * from t3
> where age not in (select distinct(age) age from t4)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> ..
> INFO  : Completed executing 
> 

[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-05-09 Thread Seonggon Namgung (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720894#comment-17720894
 ] 

Seonggon Namgung commented on HIVE-27324:
-

I think the first issue is intended behavior while the second issue is bug.

1. Both parent query and subquery contain null.
According to the comment in QBSubQuery.java, Not-In query returns empty result 
if its subquery contains null.
In Hive (and other database systems), the result of something <> NULL is NULL, 
which is treated as false in conditional clause.
So "a not in (NULL, a, b, c, ...)" is always evaluated to false, and the result 
of parent query becomes empty.
(You can also refer to [this 
document|https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf].)

2. Only parent query contains null.
Because NULL not in (a, b, c, ...) is always false too, the correct result 
should be 5 rows.
(I also got 5 rows when I ran the second query in PostgreSQL.)
So we can conclude that Hive generates wrong query plan for Not-In query when 
CBO is disabled.
I compared the query plan for the second query with and without CBO, and I 
found that condition "t3.age is not null" is missing in non-CBO plan, which 
seems to be the root cause of this issue.

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: All Versions, 3.1.0
>Reporter: Shobika Selvaraj
>Priority: Major
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> {code:java}
> select * from t3 where nvl(age,'-') not in (select distinct(nvl(age,'-')) age 
> from t1); {code}
>  
> *SECOND ISSUE:* 
> Also while testing multiple scenario's i found one more issue as well.
> When the sub query table (t1) doesn't contain any null values then the query 
> is giving result but it is ignoring the null values of the main table(t3) .
>  
> For example: Created 

[jira] [Commented] (HIVE-27324) Hive query with NOT IN condition is giving incorrect results when the sub query table contains the null value.

2023-05-06 Thread Shobika Selvaraj (Jira)


[ 
https://issues.apache.org/jira/browse/HIVE-27324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17720099#comment-17720099
 ] 

Shobika Selvaraj commented on HIVE-27324:
-

Also created open source build and did a itests and found the same issue exists 
in OSS as well. Results of itests as below:

1) Created 2 tables and inserted null values in the sub query table and ran the 
itests and the output is as below:

 
{code:java}
PREHOOK: query: select * from t3 where age not in (select distinct(age) age 
from t1)
PREHOOK: type: QUERY
PREHOOK: Input: default@t1
PREHOOK: Input: default@t3
PREHOOK: Output: hdfs://### HDFS PATH ###
POSTHOOK: query: select * from t3 where age not in (select distinct(age) age 
from t1)
POSTHOOK: type: QUERY
POSTHOOK: Input: default@t1
POSTHOOK: Input: default@t3
POSTHOOK: Output: hdfs://### HDFS PATH ###
{code}
It didn't gave any result. To confirm this i did another test.

2) Created another two tables. And inserted data's without null value in the 
sub query table and ran the itests. Result as below:

 
{code:java}
PREHOOK: query: select * from t3 where age not in (select distinct(age) age 
from t4)
PREHOOK: type: QUERY
PREHOOK: Input: default@t3
PREHOOK: Input: default@t4
PREHOOK: Output: hdfs://### HDFS PATH ###
POSTHOOK: query: select * from t3 where age not in (select distinct(age) age 
from t4)
POSTHOOK: type: QUERY
POSTHOOK: Input: default@t3
POSTHOOK: Input: default@t4
POSTHOOK: Output: hdfs://### HDFS PATH ###
10  Sam 22
11  nick    19
12  fed 18
13  kong    13
9   ron 3
 
{code}
 

> Hive query with NOT IN condition is giving incorrect results when the sub 
> query table contains the null value.
> --
>
> Key: HIVE-27324
> URL: https://issues.apache.org/jira/browse/HIVE-27324
> Project: Hive
>  Issue Type: Bug
>  Components: Hive
>Affects Versions: All Versions, 3.1.0
>Reporter: Shobika Selvaraj
>Priority: Major
>
> Hive query giving empty results when the sub query table contains the null 
> value. 
> We encountered two issues here. 
> 1) The query - "select * from t3 where age not in (select distinct(age) age 
> from t1);" is giving empty results when the table t1 contains a null value. 
> Disabling cbo didn't helped here.
> 2) Let's consider the table t3 has null value and table t1 doesn't have any 
> null values. Now if we run the above query it is returning other data's but 
> not the null value from t3. If we disable the cbo then it's giving result 
> with null value. 
>  
> *REPRO STEPS WITH DETAILED EXPLANATION AS BELOW:*
> *FIRST ISSUE:*
> Create two tables and insert data with null values as below:
> ---
> create table t3 (id int,name string, age int);
>  
> insert into t3 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23),(9,'ron',3),(10,'Sam',22),(11,'nick',19),(12,'fed',18),(13,'kong',13),(14,'hela',45);
>  
> create table t1 (id int,name string, age int);
> insert into t1 
> values(1,'Sagar',23),(2,'Sultan',NULL),(3,'Surya',23),(4,'Raman',45),(5,'Scott',23),(6,'Ramya',5),(7,'',23),(8,'',23);
> ---
>  
> Then executed the below query: 
> --
> select * from t3
> where age not in (select distinct(age) age from t1);
> --
>  
> The result should be as below:
> {code:java}
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> | 9      | ron      | 3       |
> | 10     | Sam      | 22      |
> | 11     | nick     | 19      |
> | 12     | fed      | 18      |
> | 13     | kong     | 13      |
> ++--+-+
> 5 rows selected (35.897 seconds) {code}
> But when we run the above query it is giving zero records:
> {code:java}
> 0: jdbc:hive2://zk0-shobia.lhlexkfu3vfebcezzj> select * from t3
> . . . . . . . . . . . . . . . . . . . . . . .> where age not in (select 
> distinct(age) age from t1);
> INFO  : Compiling 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96): 
> select * from t3
> where age not in (select distinct(age) age from t1)
> INFO  : Warning: Map Join MAPJOIN[37][bigTable=?] in task 'Map 1' is a cross 
> product
> ..
> .
> INFO  : Completed executing 
> command(queryId=hive_20230427164202_e25b671a-f3bd-41e4-b364-844466305d96); 
> Time taken: 10.191 seconds
> INFO  : OK
> ++--+-+
> | t3.id  | t3.name  | t3.age  |
> ++--+-+
> ++--+-+
> No rows selected (12.17 seconds) {code}
> The query works fine when we use nvl function or not null condition. 
>  
> So as a workaround we can use nvl function for both main and sub query as 
> below:
> --
> *select * from t3 where nvl(age,'{-}') not in (select 
>