[ https://issues.apache.org/jira/browse/HIVE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jesus Camacho Rodriguez updated HIVE-12465: ------------------------------------------- Attachment: (was: HIVE-12465.01.patch) > Hive might produce wrong results when (outer) joins are merged > -------------------------------------------------------------- > > Key: HIVE-12465 > URL: https://issues.apache.org/jira/browse/HIVE-12465 > Project: Hive > Issue Type: Bug > Affects Versions: 1.3.0, 2.0.0 > Reporter: Jesus Camacho Rodriguez > Assignee: Jesus Camacho Rodriguez > Priority: Blocker > Attachments: HIVE-12465.01.patch, HIVE-12465.patch > > > Consider the following query: > {noformat} > select * from > (select * from tab where tab.key = 0)a > full outer join > (select * from tab_part where tab_part.key = 98)b > join > tab_part c > on a.key = b.key and b.key = c.key; > {noformat} > Hive should execute the full outer join operation (without ON clause) and > then the join operation (ON a.key = b.key and b.key = c.key). Instead, it > merges both joins, generating the following plan: > {noformat} > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 depends on stages: Stage-1 > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Map Operator Tree: > TableScan > alias: tab > filterExpr: (key = 0) (type: boolean) > Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: (key = 0) (type: boolean) > Statistics: Num rows: 121 Data size: 11374 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: 0 (type: int), value (type: string), ds (type: > string) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 121 Data size: 11374 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 121 Data size: 11374 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col1 (type: string), _col2 (type: > string) > TableScan > alias: tab_part > filterExpr: (key = 98) (type: boolean) > Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE > Column stats: NONE > Filter Operator > predicate: (key = 98) (type: boolean) > Statistics: Num rows: 250 Data size: 23500 Basic stats: > COMPLETE Column stats: NONE > Select Operator > expressions: 98 (type: int), value (type: string), ds (type: > string) > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 250 Data size: 23500 Basic stats: > COMPLETE Column stats: NONE > Reduce Output Operator > key expressions: _col0 (type: int) > sort order: + > Map-reduce partition columns: _col0 (type: int) > Statistics: Num rows: 250 Data size: 23500 Basic stats: > COMPLETE Column stats: NONE > value expressions: _col1 (type: string), _col2 (type: > string) > TableScan > alias: c > Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE > Column stats: NONE > Reduce Output Operator > key expressions: key (type: int) > sort order: + > Map-reduce partition columns: key (type: int) > Statistics: Num rows: 500 Data size: 47000 Basic stats: > COMPLETE Column stats: NONE > value expressions: value (type: string), ds (type: string) > Reduce Operator Tree: > Join Operator > condition map: > Outer Join 0 to 1 > Inner Join 1 to 2 > keys: > 0 _col0 (type: int) > 1 _col0 (type: int) > 2 key (type: int) > outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, > _col7, _col8 > Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE > Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1100 Data size: 103400 Basic stats: > COMPLETE Column stats: NONE > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Stage: Stage-0 > Fetch Operator > limit: -1 > Processor Tree: > ListSink > {noformat} > That plan is equivalent to the following query, which is different than the > original one: > {noformat} > select * from > (select * from tab where tab.key = 0)a > full outer join > (select * from tab_part where tab_part.key = 98)b > on a.key = b.key > join > tab_part c > on b.key = c.key; > {noformat} > It seems to be a problem in the recognition of join operations that can be > merged into a single multijoin operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)