[jira] [Created] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

dima machlin (JIRA) Tue, 10 Jun 2014 05:44:35 -0700

dima machlin created HIVE-7205:
----------------------------------

             Summary: Wrong results when union all of grouping followed by 
group by with correlation optimization
                 Key: HIVE-7205
                 URL: https://issues.apache.org/jira/browse/HIVE-7205
             Project: Hive
          Issue Type: Bug
    Affects Versions: 0.12.0
            Reporter: dima machlin
            Priority: Critical



use case :

table TBL (a string,b string) contains single row : 'a','a'
the following query :

select b, sum(cc) from (
        select b,count(1) as cc from TBL group by b
        union all
        select a as b,count(1) as cc from TBL group by a
) z
group by b

returns 

a 1
a 1
while set hive.optimize.correlation=true;

if we change set hive.optimize.correlation=false;
it returns correct results : a 2



The plan with correlation optimization :

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
(TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
(TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY 
(TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
(TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) 
z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
(TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
(TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery1:z-subquery1:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: b
                    type: string
              outputColumnNames: b
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: b
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: 0
                  value expressions:
                        expr: _col1
                        type: bigint
        null-subquery2:z-subquery2:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: a
                    type: string
              outputColumnNames: a
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: a
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: 1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Demux Operator
          Group By Operator
            aggregations:
                  expr: count(VALUE._col0)
            bucketGroup: false
            keys:
                  expr: KEY._col0
                  type: string
            mode: mergepartial
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: bigint
              outputColumnNames: _col0, _col1
              Union
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  outputColumnNames: _col0, _col1
                  Mux Operator
                    Group By Operator
                      aggregations:
                            expr: sum(_col1)
                      bucketGroup: false
                      keys:
                            expr: _col0
                            type: string
                      mode: complete
                      outputColumnNames: _col0, _col1
                      Select Operator
                        expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: bigint
                        outputColumnNames: _col0, _col1
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          table:
                              input format: 
org.apache.hadoop.mapred.TextInputFormat
                              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          Group By Operator
            aggregations:
                  expr: count(VALUE._col0)
            bucketGroup: false
            keys:
                  expr: KEY._col0
                  type: string
            mode: mergepartial
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: bigint
              outputColumnNames: _col0, _col1
              Union
                Select Operator
                  expressions:
                        expr: _col0
                        type: string
                        expr: _col1
                        type: bigint
                  outputColumnNames: _col0, _col1
                  Mux Operator
                    Group By Operator
                      aggregations:
                            expr: sum(_col1)
                      bucketGroup: false
                      keys:
                            expr: _col0
                            type: string
                      mode: complete
                      outputColumnNames: _col0, _col1
                      Select Operator
                        expressions:
                              expr: _col0
                              type: string
                              expr: _col1
                              type: bigint
                        outputColumnNames: _col0, _col1
                        File Output Operator
                          compressed: false
                          GlobalTableId: 0
                          table:
                              input format: 
org.apache.hadoop.mapred.TextInputFormat
                              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1


Plan without correlation optimization :

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_UNION (TOK_QUERY (TOK_FROM 
(TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR 
(TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL b)))) (TOK_QUERY 
(TOK_FROM (TOK_TABREF (TOK_TABNAME DB TBL))) (TOK_INSERT (TOK_DESTINATION 
(TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL a) b) 
(TOK_SELEXPR (TOK_FUNCTION count 1) cc)) (TOK_GROUPBY (TOK_TABLE_OR_COL a))))) 
z)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT 
(TOK_SELEXPR (TOK_TABLE_OR_COL b)) (TOK_SELEXPR (TOK_FUNCTION sum 
(TOK_TABLE_OR_COL cc)))) (TOK_GROUPBY (TOK_TABLE_OR_COL b))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1, Stage-3
  Stage-3 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery2:z-subquery2:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: a
                    type: string
              outputColumnNames: a
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: a
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        
maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10002
 
          TableScan
            Union
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                Group By Operator
                  aggregations:
                        expr: sum(_col1)
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
        
maprfs:/user/hadoop/tmp/hive/hive_2014-06-10_15-42-37_188_2403850033480056671-5/-mr-10003
 
          TableScan
            Union
              Select Operator
                expressions:
                      expr: _col0
                      type: string
                      expr: _col1
                      type: bigint
                outputColumnNames: _col0, _col1
                Group By Operator
                  aggregations:
                        expr: sum(_col1)
                  bucketGroup: false
                  keys:
                        expr: _col0
                        type: string
                  mode: hash
                  outputColumnNames: _col0, _col1
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                    sort order: +
                    Map-reduce partition columns:
                          expr: _col0
                          type: string
                    tag: -1
                    value expressions:
                          expr: _col1
                          type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        null-subquery1:z-subquery1:TBL 
          TableScan
            alias: TBL
            Select Operator
              expressions:
                    expr: b
                    type: string
              outputColumnNames: b
              Group By Operator
                aggregations:
                      expr: count(1)
                bucketGroup: false
                keys:
                      expr: b
                      type: string
                mode: hash
                outputColumnNames: _col0, _col1
                Reduce Output Operator
                  key expressions:
                        expr: _col0
                        type: string
                  sort order: +
                  Map-reduce partition columns:
                        expr: _col0
                        type: string
                  tag: -1
                  value expressions:
                        expr: _col1
                        type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1









--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Created] (HIVE-7205) Wrong results when union all of grouping followed by group by with correlation optimization

Reply via email to