[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336 ]
John Sichi commented on HIVE-1342: ---------------------------------- Finallly got back to this one. Let me provide some specific examples to better explain what I wrote. First, latest trunk without any patch. {noformat} -- Q1.trunk: Without a nested select, the plan is correct for this query. -- (we're not allowed to push filter down into null-generating side of outer join) hive> explain > SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo > WHERE b.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Filter Operator predicate: expr: (_col5 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 -- Q2.trunk: For this equivalent query written using a nested select, the plan is incorrect. -- (filter got pushed down when it shouldn't; note that in the wrapping select, a.bar should resolve to b.bar in the nested select) hive> explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int a:b TableScan alias: b Filter Operator predicate: expr: (bar = 3) type: boolean Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: string outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: (_col2 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 -- Q3.trunk: However, for this semantically different case, the plan is correct. -- (we're allowed to push the filter down for an inner join) hive> > explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, a.bar > FROM pokes a JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Filter Operator predicate: expr: (bar = 3) type: boolean Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int expr: bar type: string a:b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col1 type: string outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: (_col2 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 {noformat} Now, repeating Q1/Q2/Q3 with the patch: {noformat} -- Q1.patch: this plan is good -- (same result as Q1.trunk, as expected) hive> explain > SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo > WHERE b.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Filter Operator predicate: expr: (_col5 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 -- Q2.patch: this time, the plan is good (no pushdown) -- (the patch fixes the bug exhibited on trunk) hive> explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int a:b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: string Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: string outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: (_col2 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 -- Q3.patch: whoops, now the plan is valid but suboptimal since the filter pushdown did not happen -- (whereas it did with trunk) hive> explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, a.bar > FROM pokes a JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int expr: bar type: string a:b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col1 type: string outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: (_col2 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 {noformat} So, we need a patch which takes care of Q2 while not causing a plan optimality regression for Q3. > Predicate push down get error result when sub-queries have the same alias > name > ------------------------------------------------------------------------------- > > Key: HIVE-1342 > URL: https://issues.apache.org/jira/browse/HIVE-1342 > Project: Hadoop Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.6.0 > Reporter: Ted Xu > Assignee: Ted Xu > Priority: Critical > Fix For: 0.7.0 > > Attachments: cmd.hql, explain, ppd_same_alias_1.patch, > ppd_same_alias_2.patch > > > Query is over-optimized by PPD when sub-queries have the same alias name, see > the query: > ------------------------------- > create table if not exists dm_fact_buyer_prd_info_d ( > category_id string > ,gmv_trade_num int > ,user_id int > ) > PARTITIONED BY (ds int); > set hive.optimize.ppd=true; > set hive.map.aggr=true; > explain select category_id1,category_id2,assoc_idx > from ( > select > category_id1 > , category_id2 > , count(distinct user_id) as assoc_idx > from ( > select > t1.category_id as category_id1 > , t2.category_id as category_id2 > , t1.user_id > from ( > select category_id, user_id > from dm_fact_buyer_prd_info_d > group by category_id, user_id ) t1 > join ( > select category_id, user_id > from dm_fact_buyer_prd_info_d > group by category_id, user_id ) t2 on > t1.user_id=t2.user_id > ) t1 > group by category_id1, category_id2 ) t_o > where category_id1 <> category_id2 > and assoc_idx > 2; > ----------------------------- > The query above will fail when execute, throwing exception: "can not cast > UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". > I explained the query and the execute plan looks really wired ( only Stage-1, > see the highlighted predicate): > ------------------------------- > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t_o:t1:t1:dm_fact_buyer_prd_info_d > TableScan > alias: dm_fact_buyer_prd_info_d > Filter Operator > predicate: > expr: *(category_id <> user_id)* > type: boolean > Select Operator > expressions: > expr: category_id > type: string > expr: user_id > type: bigint > outputColumnNames: category_id, user_id > Group By Operator > keys: > expr: category_id > type: string > expr: user_id > type: bigint > mode: hash > outputColumnNames: _col0, _col1 > Reduce Output Operator > key expressions: > expr: _col0 > type: string > expr: _col1 > type: bigint > sort order: ++ > Map-reduce partition columns: > expr: _col0 > type: string > expr: _col1 > type: bigint > tag: -1 > Reduce Operator Tree: > Group By Operator > keys: > expr: KEY._col0 > type: string > expr: KEY._col1 > type: bigint > mode: mergepartial > outputColumnNames: _col0, _col1 > Select Operator > expressions: > expr: _col0 > type: string > expr: _col1 > type: bigint > outputColumnNames: _col0, _col1 > File Output Operator > compressed: true > GlobalTableId: 0 > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > ---------------------------------- > If disabling predicate push down (set hive.optimize.ppd=true), the error is > gone; I tried disabling map side aggregate, the error is gone,too. > *Changing the alias of subquery 't1' (either the inner one or the join > result), the bug disappears, too.* -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.