[ 
https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12884796#action_12884796
 ] 

John Sichi commented on HIVE-1342:
----------------------------------

Hmmm....I looked into this one some more.  Let me summarize what I found.

On trunk as it is today (without this patch), predicate pushdown does not (in 
general) get optimized when we have a nested select with a join.  For example:

{noformat}
explain
SELECT * FROM (
    SELECT a.foo as foo1, b.foo as foo2, a.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b
    ON a.foo=b.foo) z
WHERE bar=3;

...

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        z:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        z:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col2
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

{noformat}

However, it does kick in (sometimes correctly, sometimes incorrectly) in the 
special case where aliases are reused.  For example, it happens to work 
correctly for a query like this:

{noformat}
explain
SELECT * FROM (
    SELECT a.foo as foo1, b.foo as foo2, a.bar
    FROM pokes a LEFT OUTER JOIN pokes2 b
    ON a.foo=b.foo) a
WHERE a.bar=3;
{noformat}

But in cases like the original ones in the bug reports, it gets applied 
incorrectly.

Ted's patch attempts to limit the damage by uniformly preventing the 
optimization from applying for the pattern of nested select over join 
(regardless of whether aliases have been reused).

If this is the best we can do for 0.6, then we'll have to live with that and 
then open another issue for correcting the real problem so that we can get full 
optimization (particularly for views).

I don't think it's a question of keeping the implementation simple; the patch 
as is does not fix the optimization, it just disables it.


> Predicate push down get error result when sub-queries have the same alias 
> name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.6.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, 
> ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see 
> the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
>               category_id string
>               ,gmv_trade_num  int
>               ,user_id    int
>               )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
>               select 
>                       category_id1
>                       , category_id2
>                       , count(distinct user_id) as assoc_idx
>               from (
>                       select 
>                               t1.category_id as category_id1
>                               , t2.category_id as category_id2
>                               , t1.user_id
>                       from (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t1
>                       join (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t2 on 
> t1.user_id=t2.user_id 
>                       ) t1
>                       group by category_id1, category_id2 ) t_o
>                       where category_id1 <> category_id2
>                       and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast 
> UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, 
> see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is 
> gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join 
> result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to