[ https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alexander Pivovarov updated HIVE-10841: --------------------------------------- Component/s: Query Planning > [WHERE col is not null] does not work sometimes for queries with many JOIN > statements > ------------------------------------------------------------------------------------- > > Key: HIVE-10841 > URL: https://issues.apache.org/jira/browse/HIVE-10841 > Project: Hive > Issue Type: Bug > Components: Query Planning, Query Processor > Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0 > Reporter: Alexander Pivovarov > > The result from the following SELECT query is 3 rows but it should be 1 row. > I checked it in MySQL - it returned 1 row. > To reproduce the issue in Hive > 1. prepare tables > {code} > drop table if exists L; > drop table if exists LA; > drop table if exists FR; > drop table if exists A; > drop table if exists PI; > drop table if exists acct; > create table L as select 4436 id; > create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id; > create table FR as select 4436 loan_id; > create table A as select 4748 id; > create table PI as select 4415 id; > create table acct as select 4748 aid, 10 acc_n, 122 brn; > insert into table acct values(4748, null, null); > insert into table acct values(4748, null, null); > {code} > 2. run SELECT query > {code} > select > acct.ACC_N, > acct.brn > FROM L > JOIN LA ON L.id = LA.loan_id > JOIN FR ON L.id = FR.loan_id > JOIN A ON LA.aid = A.id > JOIN PI ON PI.id = LA.pi_id > JOIN acct ON A.id = acct.aid > WHERE > L.id = 4436 > and acct.brn is not null; > {code} > the result is 3 rows > {code} > 10 122 > NULL NULL > NULL NULL > {code} > but it should be 1 row > {code} > 10 122 > {code} > 2.1 "explain select ..." output for hive-1.3.0 MR > {code} > STAGE DEPENDENCIES: > Stage-12 is a root stage > Stage-9 depends on stages: Stage-12 > Stage-0 depends on stages: Stage-9 > STAGE PLANS: > Stage: Stage-12 > Map Reduce Local Work > Alias -> Map Local Tables: > a > Fetch Operator > limit: -1 > acct > Fetch Operator > limit: -1 > fr > Fetch Operator > limit: -1 > l > Fetch Operator > limit: -1 > pi > Fetch Operator > limit: -1 > Alias -> Map Local Operator Tree: > a > TableScan > alias: a > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column > stats: NONE > Filter Operator > predicate: id is not null (type: boolean) > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > HashTable Sink Operator > keys: > 0 _col5 (type: int) > 1 id (type: int) > 2 aid (type: int) > acct > TableScan > alias: acct > Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: aid is not null (type: boolean) > Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE > Column stats: NONE > HashTable Sink Operator > keys: > 0 _col5 (type: int) > 1 id (type: int) > 2 aid (type: int) > fr > TableScan > alias: fr > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column > stats: NONE > Filter Operator > predicate: (loan_id = 4436) (type: boolean) > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > HashTable Sink Operator > keys: > 0 4436 (type: int) > 1 4436 (type: int) > 2 4436 (type: int) > l > TableScan > alias: l > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column > stats: NONE > Filter Operator > predicate: (id = 4436) (type: boolean) > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > HashTable Sink Operator > keys: > 0 4436 (type: int) > 1 4436 (type: int) > 2 4436 (type: int) > pi > TableScan > alias: pi > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column > stats: NONE > Filter Operator > predicate: id is not null (type: boolean) > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > HashTable Sink Operator > keys: > 0 _col6 (type: int) > 1 id (type: int) > Stage: Stage-9 > Map Reduce > Map Operator Tree: > TableScan > alias: la > Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: (((loan_id is not null and aid is not null) and > pi_id is not null) and (loan_id = 4436)) (type: boolean) > Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE > Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > Inner Join 0 to 2 > keys: > 0 4436 (type: int) > 1 4436 (type: int) > 2 4436 (type: int) > outputColumnNames: _col5, _col6 > Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE > Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > Inner Join 1 to 2 > keys: > 0 _col5 (type: int) > 1 id (type: int) > 2 aid (type: int) > outputColumnNames: _col6, _col19, _col20 > Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE > Column stats: NONE > Map Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col6 (type: int) > 1 id (type: int) > outputColumnNames: _col19, _col20 > Statistics: Num rows: 4 Data size: 18 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col19 (type: int), _col20 (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 4 Data size: 18 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 4 Data size: 18 Basic stats: > COMPLETE Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Local Work: > Map Reduce Local Work > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > Time taken: 0.57 seconds, Fetched: 142 row(s) > {code} > 2.2. "explain select..." output for hive-0.13.1 Tez > {code} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > STAGE PLANS: > Stage: Stage-1 > Tez > Edges: > Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 > (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE) > Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 > (SIMPLE_EDGE) > DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1 > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: acct > Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE > Column stats: NONE > Reduce Output Operator > key expressions: aid (type: int) > sort order: + > Map-reduce partition columns: aid (type: int) > Statistics: Num rows: 1 Data size: 4 Basic stats: > COMPLETE Column stats: NONE > value expressions: acc_n (type: int), brn (type: int) > Map 4 > Map Operator Tree: > TableScan > alias: a > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: id (type: int) > sort order: + > Map-reduce partition columns: id (type: int) > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Map 5 > Map Operator Tree: > TableScan > alias: la > Statistics: Num rows: 28 Data size: 347 Basic stats: > COMPLETE Column stats: NONE > Filter Operator > predicate: (loan_id = 4436) (type: boolean) > Statistics: Num rows: 14 Data size: 173 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: loan_id (type: int) > sort order: + > Map-reduce partition columns: loan_id (type: int) > Statistics: Num rows: 14 Data size: 173 Basic stats: > COMPLETE Column stats: NONE > value expressions: aid (type: int), pi_id (type: int) > Map 7 > Map Operator Tree: > TableScan > alias: fr > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Filter Operator > predicate: (loan_id = 4436) (type: boolean) > Statistics: Num rows: 23 Data size: 93 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: loan_id (type: int) > sort order: + > Map-reduce partition columns: loan_id (type: int) > Statistics: Num rows: 23 Data size: 93 Basic stats: > COMPLETE Column stats: NONE > Map 8 > Map Operator Tree: > TableScan > alias: l > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Filter Operator > predicate: (id = 4436) (type: boolean) > Statistics: Num rows: 23 Data size: 93 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: id (type: int) > sort order: + > Map-reduce partition columns: id (type: int) > Statistics: Num rows: 23 Data size: 93 Basic stats: > COMPLETE Column stats: NONE > Map 9 > Map Operator Tree: > TableScan > alias: pi > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: id (type: int) > sort order: + > Map-reduce partition columns: id (type: int) > Statistics: Num rows: 46 Data size: 187 Basic stats: > COMPLETE Column stats: NONE > Reducer 2 > Reduce Operator Tree: > Join Operator > condition map: > Inner Join 0 to 1 > Inner Join 1 to 2 > condition expressions: > 0 {VALUE._col2} > 1 > 2 {VALUE._col1} {VALUE._col2} > outputColumnNames: _col2, _col15, _col16 > Statistics: Num rows: 110 Data size: 448 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col2 (type: int) > sort order: + > Map-reduce partition columns: _col2 (type: int) > Statistics: Num rows: 110 Data size: 448 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col15 (type: int), _col16 (type: int) > Reducer 3 > Reduce Operator Tree: > Join Operator > condition map: > Inner Join 0 to 1 > condition expressions: > 0 {VALUE._col1} {VALUE._col2} > 1 > outputColumnNames: _col1, _col2 > Statistics: Num rows: 121 Data size: 492 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: _col1 (type: int), _col2 (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 121 Data size: 492 Basic stats: > COMPLETE Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 121 Data size: 492 Basic stats: > COMPLETE Column stats: NONE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Reducer 6 > Reduce Operator Tree: > Join Operator > condition map: > Inner Join 0 to 1 > Inner Join 0 to 2 > condition expressions: > 0 > 1 {VALUE._col1} {VALUE._col2} > 2 > outputColumnNames: _col4, _col5 > Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE > Column stats: NONE > Reduce Output Operator > key expressions: _col4 (type: int) > sort order: + > Map-reduce partition columns: _col4 (type: int) > Statistics: Num rows: 50 Data size: 204 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col5 (type: int) > Stage: Stage-0 > Fetch Operator > limit: -1 > Time taken: 1.377 seconds, Fetched: 146 row(s) > {code} > 3. The workaround is to put "acct.brn is not null" to join condition > {code} > select > acct.ACC_N, > acct.brn > FROM L > JOIN LA ON L.id = LA.loan_id > JOIN FR ON L.id = FR.loan_id > JOIN A ON LA.aid = A.id > JOIN PI ON PI.id = LA.pi_id > JOIN acct ON A.id = acct.aid and acct.brn is not null > WHERE > L.id = 4436; > OK > 10 122 > Time taken: 23.479 seconds, Fetched: 1 row(s) > {code} > I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations > have the issue -- This message was sent by Atlassian JIRA (v6.3.4#6332)