[ https://issues.apache.org/jira/browse/HIVE-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104919#comment-16104919 ]
Vlad Gudikov edited comment on HIVE-17148 at 7/28/17 1:07 PM: -------------------------------------------------------------- ROOT-CAUSE: The problem was with the predicates that were created according to HiveJoinAddNotNullRule. This rule is creating predicates from fields that take part in join filter, no matter if this fields are used as parameters of functions or not. SOLUTION: Create predicate based on functions that take part in filters as well as fields. The point is to check if left part and right part of the filter is not null, not just fields that are part of the join filter. I.e we have to tables test1(a1 int, a2 int) and test2(b1). When we execute following query *select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);* we get to predicates for filter operator: b1 is not null --- right part a1 is not null and a2 is not null -- left part Applying predicate for left part of join will result in data loss as we exclude rows with null fields. COALESCE is a good example for this case as the main purpose of COALESCE function is to get not null values from tables. To fix the data loss we need to check that coalesce won't bring us null values as we can't join nulls. My fix will check that left part and right part will look like: b1 is not null -- right part (still checking fields on null condition) COALESCE(a1,a2) is not null (checking that whole function won't bring us null values) In next patch I'm going to change related failed tests with the fixed stage plans. was (Author: allgoodok): ROOT-CAUSE: The problem was with the predicates that were created according to HiveJoinAddNotNullRule. This rule is creating predicates from fields that take part in join filter, no matter if this fields are used as parameters of functions or not. SOLUTION: Create predicate based on functions that take part in filters as well as fields. The point is to check if left part and right part of the filter is not null, not just fields that are part of the join filter. I.e we have to tables test1(a1 int, a2 int) and test2(b1). When we execute following query strong text*select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);*strong text* we get to predicates for filter operator: b1 is not null --- right part a1 is not null and a2 is not null -- left part Applying predicate for left part of join will result in data loss as we exclude rows with null fields. COALESCE is a good example for this case as the main purpose of COALESCE function is to get not null values from tables. To fix the data loss we need to check that coalesce won't bring us null values as we can't join nulls. My fix will check that left part and right part will look like: b1 is not null -- right part (still checking fields on null condition) COALESCE(a1,a2) is not null (checking that whole function won't bring us null values) In next patch I'm going to change related failed tests with the fixed stage plans. > Incorrect result for Hive join query with COALESCE in WHERE condition > --------------------------------------------------------------------- > > Key: HIVE-17148 > URL: https://issues.apache.org/jira/browse/HIVE-17148 > Project: Hive > Issue Type: Bug > Components: CBO > Affects Versions: 2.1.1 > Reporter: Vlad Gudikov > Assignee: Vlad Gudikov > Attachments: HIVE-17148.patch > > > The issue exists in Hive-2.1. In Hive-1.2 the query works fine with cbo > enabled: > STEPS TO REPRODUCE: > {code} > Step 1: Create a table ct1 > create table ct1 (a1 string,b1 string); > Step 2: Create a table ct2 > create table ct2 (a2 string); > Step 3 : Insert following data into table ct1 > insert into table ct1 (a1) values ('1'); > Step 4 : Insert following data into table ct2 > insert into table ct2 (a2) values ('1'); > Step 5 : Execute the following query > select * from ct1 c1, ct2 c2 where COALESCE(a1,b1)=a2; > {code} > ACTUAL RESULT: > {code} > The query returns nothing; > {code} > EXPECTED RESULT: > {code} > 1 NULL 1 > {code} > The issue seems to be because of the incorrect query plan. In the plan we can > see: > predicate:(a1 is not null and b1 is not null) > which does not look correct. As a result, it is filtering out all the rows is > any column mentioned in the COALESCE has null value. > Please find the query plan below: > {code} > Plan optimized by CBO. > Vertex dependency in root stage > Map 1 <- Map 2 (BROADCAST_EDGE) > Stage-0 > Fetch Operator > limit:-1 > Stage-1 > Map 1 > File Output Operator [FS_10] > Map Join Operator [MAPJOIN_15] (rows=1 width=4) > > Conds:SEL_2.COALESCE(_col0,_col1)=RS_7._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"] > <-Map 2 [BROADCAST_EDGE] > BROADCAST [RS_7] > PartitionCols:_col0 > Select Operator [SEL_5] (rows=1 width=1) > Output:["_col0"] > Filter Operator [FIL_14] (rows=1 width=1) > predicate:a2 is not null > TableScan [TS_3] (rows=1 width=1) > default@ct2,c2,Tbl:COMPLETE,Col:NONE,Output:["a2"] > <-Select Operator [SEL_2] (rows=1 width=4) > Output:["_col0","_col1"] > Filter Operator [FIL_13] (rows=1 width=4) > predicate:(a1 is not null and b1 is not null) > TableScan [TS_0] (rows=1 width=4) > default@ct1,c1,Tbl:COMPLETE,Col:NONE,Output:["a1","b1"] > {code} > This happens only if join is inner type, otherwise HiveJoinAddNotRule which > creates this problem is skipped. -- This message was sent by Atlassian JIRA (v6.4.14#64029)