[ https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884796#action_12884796 ]
John Sichi commented on HIVE-1342: ---------------------------------- Hmmm....I looked into this one some more. Let me summarize what I found. On trunk as it is today (without this patch), predicate pushdown does not (in general) get optimized when we have a nested select with a join. For example: {noformat} explain SELECT * FROM ( SELECT a.foo as foo1, b.foo as foo2, a.bar FROM pokes a LEFT OUTER JOIN pokes2 b ON a.foo=b.foo) z WHERE bar=3; ... STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: z:a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int expr: bar type: string z:b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col1, _col2 Select Operator expressions: expr: _col0 type: int expr: _col2 type: int expr: _col1 type: string outputColumnNames: _col0, _col1, _col2 Filter Operator predicate: expr: (_col2 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: string outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 {noformat} However, it does kick in (sometimes correctly, sometimes incorrectly) in the special case where aliases are reused. For example, it happens to work correctly for a query like this: {noformat} explain SELECT * FROM ( SELECT a.foo as foo1, b.foo as foo2, a.bar FROM pokes a LEFT OUTER JOIN pokes2 b ON a.foo=b.foo) a WHERE a.bar=3; {noformat} But in cases like the original ones in the bug reports, it gets applied incorrectly. Ted's patch attempts to limit the damage by uniformly preventing the optimization from applying for the pattern of nested select over join (regardless of whether aliases have been reused). If this is the best we can do for 0.6, then we'll have to live with that and then open another issue for correcting the real problem so that we can get full optimization (particularly for views). I don't think it's a question of keeping the implementation simple; the patch as is does not fix the optimization, it just disables it. > Predicate push down get error result when sub-queries have the same alias > name > ------------------------------------------------------------------------------- > > Key: HIVE-1342 > URL: https://issues.apache.org/jira/browse/HIVE-1342 > Project: Hadoop Hive > Issue Type: Bug > Components: Query Processor > Affects Versions: 0.6.0 > Reporter: Ted Xu > Assignee: Ted Xu > Priority: Critical > Fix For: 0.6.0 > > Attachments: cmd.hql, explain, ppd_same_alias_1.patch, > ppd_same_alias_2.patch > > > Query is over-optimized by PPD when sub-queries have the same alias name, see > the query: > ------------------------------- > create table if not exists dm_fact_buyer_prd_info_d ( > category_id string > ,gmv_trade_num int > ,user_id int > ) > PARTITIONED BY (ds int); > set hive.optimize.ppd=true; > set hive.map.aggr=true; > explain select category_id1,category_id2,assoc_idx > from ( > select > category_id1 > , category_id2 > , count(distinct user_id) as assoc_idx > from ( > select > t1.category_id as category_id1 > , t2.category_id as category_id2 > , t1.user_id > from ( > select category_id, user_id > from dm_fact_buyer_prd_info_d > group by category_id, user_id ) t1 > join ( > select category_id, user_id > from dm_fact_buyer_prd_info_d > group by category_id, user_id ) t2 on > t1.user_id=t2.user_id > ) t1 > group by category_id1, category_id2 ) t_o > where category_id1 <> category_id2 > and assoc_idx > 2; > ----------------------------- > The query above will fail when execute, throwing exception: "can not cast > UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". > I explained the query and the execute plan looks really wired ( only Stage-1, > see the highlighted predicate): > ------------------------------- > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > t_o:t1:t1:dm_fact_buyer_prd_info_d > TableScan > alias: dm_fact_buyer_prd_info_d > Filter Operator > predicate: > expr: *(category_id <> user_id)* > type: boolean > Select Operator > expressions: > expr: category_id > type: string > expr: user_id > type: bigint > outputColumnNames: category_id, user_id > Group By Operator > keys: > expr: category_id > type: string > expr: user_id > type: bigint > mode: hash > outputColumnNames: _col0, _col1 > Reduce Output Operator > key expressions: > expr: _col0 > type: string > expr: _col1 > type: bigint > sort order: ++ > Map-reduce partition columns: > expr: _col0 > type: string > expr: _col1 > type: bigint > tag: -1 > Reduce Operator Tree: > Group By Operator > keys: > expr: KEY._col0 > type: string > expr: KEY._col1 > type: bigint > mode: mergepartial > outputColumnNames: _col0, _col1 > Select Operator > expressions: > expr: _col0 > type: string > expr: _col1 > type: bigint > outputColumnNames: _col0, _col1 > File Output Operator > compressed: true > GlobalTableId: 0 > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > ---------------------------------- > If disabling predicate push down (set hive.optimize.ppd=true), the error is > gone; I tried disabling map side aggregate, the error is gone,too. > *Changing the alias of subquery 't1' (either the inner one or the join > result), the bug disappears, too.* -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.