[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

John Sichi (JIRA) Thu, 16 Sep 2010 14:50:01 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910336#action_12910336
 ]


John Sichi commented on HIVE-1342:
----------------------------------

Finallly got back to this one.  Let me provide some specific examples to better 
explain what I wrote.

First, latest trunk without any patch.

{noformat}
-- Q1.trunk:  Without a nested select, the plan is correct for this query.
-- (we're not allowed to push filter down into null-generating side of outer 
join)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF 
pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) 
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
(. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) 
foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.trunk:  For this equivalent query written using a nested select, the plan 
is incorrect.
-- (filter got pushed down when it shouldn't; note that in the wrapping select, 
a.bar should resolve to b.bar in the nested select)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN 
(TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. 
(TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) 
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 1
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.trunk:  However, for this semantically different case, the plan is 
correct.
-- (we're allowed to push the filter down for an inner join)
hive> 
    > explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF 
pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. 
(TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) 
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Filter Operator
              predicate:
                  expr: (bar = 3)
                  type: boolean
              Reduce Output Operator
                key expressions:
                      expr: foo
                      type: int
                sort order: +
                Map-reduce partition columns:
                      expr: foo
                      type: int
                tag: 0
                value expressions:
                      expr: foo
                      type: int
                      expr: bar
                      type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

Now, repeating Q1/Q2/Q3 with the patch:

{noformat}

-- Q1.patch:  this plan is good
-- (same result as Q1.trunk, as expected)
hive> explain
    > SELECT a.foo as foo1, b.foo as foo2, b.bar
    > FROM pokes a LEFT OUTER JOIN pokes2 b
    > ON a.foo=b.foo
    > WHERE b.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF pokes a) (TOK_TABREF 
pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) 
(TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
(. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) 
foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL b) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Filter Operator
            predicate:
                expr: (_col5 = 3)
                type: boolean
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: _col4
                    type: int
                    expr: _col5
                    type: string
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 0
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q2.patch:  this time, the plan is good (no pushdown)
-- (the patch fixes the bug exhibited on trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
    >     FROM pokes a LEFT OUTER JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN 
(TOK_TABREF pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. 
(TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) 
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
               Left Outer Join0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col0} {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col4, _col5
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col5
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1

-- Q3.patch:  whoops, now the plan is valid but suboptimal since the filter 
pushdown did not happen
-- (whereas it did with trunk)
hive> explain
    > SELECT * FROM
    >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
    >     FROM pokes a JOIN pokes2 b
    >     ON a.foo=b.foo) a
    > WHERE a.bar=3;
OK
ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF 
pokes a) (TOK_TABREF pokes2 b) (= (. (TOK_TABLE_OR_COL a) foo) (. 
(TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) 
(TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. 
(TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. 
(TOK_TABLE_OR_COL a) bar) 3))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a:a 
          TableScan
            alias: a
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 0
              value expressions:
                    expr: foo
                    type: int
                    expr: bar
                    type: string
        a:b 
          TableScan
            alias: b
            Reduce Output Operator
              key expressions:
                    expr: foo
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: foo
                    type: int
              tag: 1
              value expressions:
                    expr: foo
                    type: int
      Reduce Operator Tree:
        Join Operator
          condition map:
               Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0} {VALUE._col1}
            1 {VALUE._col0}
          handleSkewJoin: false
          outputColumnNames: _col0, _col1, _col4
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col4
                  type: int
                  expr: _col1
                  type: string
            outputColumnNames: _col0, _col1, _col2
            Filter Operator
              predicate:
                  expr: (_col2 = 3)
                  type: boolean
              Select Operator
                expressions:
                      expr: _col0
                      type: int
                      expr: _col1
                      type: int
                      expr: _col2
                      type: string
                outputColumnNames: _col0, _col1, _col2
                File Output Operator
                  compressed: false
                  GlobalTableId: 0
                  table:
                      input format: org.apache.hadoop.mapred.TextInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

So, we need a patch which takes care of Q2 while not causing a plan optimality 
regression for Q3.


> Predicate push down get error result when sub-queries have the same alias 
> name 
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-1342
>                 URL: https://issues.apache.org/jira/browse/HIVE-1342
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Ted Xu
>            Assignee: Ted Xu
>            Priority: Critical
>             Fix For: 0.7.0
>
>         Attachments: cmd.hql, explain, ppd_same_alias_1.patch, 
> ppd_same_alias_2.patch
>
>
> Query is over-optimized by PPD when sub-queries have the same alias name, see 
> the query:
> -------------------------------
> create table if not exists dm_fact_buyer_prd_info_d (
>               category_id string
>               ,gmv_trade_num  int
>               ,user_id    int
>               )
> PARTITIONED BY (ds int);
> set hive.optimize.ppd=true;
> set hive.map.aggr=true;
> explain select category_id1,category_id2,assoc_idx
> from (
>               select 
>                       category_id1
>                       , category_id2
>                       , count(distinct user_id) as assoc_idx
>               from (
>                       select 
>                               t1.category_id as category_id1
>                               , t2.category_id as category_id2
>                               , t1.user_id
>                       from (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t1
>                       join (
>                               select category_id, user_id
>                               from dm_fact_buyer_prd_info_d
>                               group by category_id, user_id ) t2 on 
> t1.user_id=t2.user_id 
>                       ) t1
>                       group by category_id1, category_id2 ) t_o
>                       where category_id1 <> category_id2
>                       and assoc_idx > 2;
> -----------------------------
> The query above will fail when execute, throwing exception: "can not cast 
> UDFOpNotEqual(Text, IntWritable) to UDFOpNotEqual(Text, Text)". 
> I explained the query and the execute plan looks really wired ( only Stage-1, 
> see the highlighted predicate):
> -------------------------------
> Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         t_o:t1:t1:dm_fact_buyer_prd_info_d 
>           TableScan
>             alias: dm_fact_buyer_prd_info_d
>             Filter Operator
>               predicate:
>                   expr: *(category_id <> user_id)*
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: category_id
>                       type: string
>                       expr: user_id
>                       type: bigint
>                 outputColumnNames: category_id, user_id
>                 Group By Operator
>                   keys:
>                         expr: category_id
>                         type: string
>                         expr: user_id
>                         type: bigint
>                   mode: hash
>                   outputColumnNames: _col0, _col1
>                   Reduce Output Operator
>                     key expressions:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     sort order: ++
>                     Map-reduce partition columns:
>                           expr: _col0
>                           type: string
>                           expr: _col1
>                           type: bigint
>                     tag: -1
>       Reduce Operator Tree:
>         Group By Operator
>           keys:
>                 expr: KEY._col0
>                 type: string
>                 expr: KEY._col1
>                 type: bigint
>           mode: mergepartial
>           outputColumnNames: _col0, _col1
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: string
>                   expr: _col1
>                   type: bigint
>             outputColumnNames: _col0, _col1
>             File Output Operator
>               compressed: true
>               GlobalTableId: 0
>               table:
>                   input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                   output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>  ----------------------------------
> If disabling predicate push down (set hive.optimize.ppd=true), the error is 
> gone; I tried disabling map side aggregate, the error is gone,too. 
> *Changing the alias of subquery 't1' (either the inner one or the join 
> result), the bug disappears, too.*

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-1342) Predicate push down get error result when sub-queries have the same alias name

Reply via email to