multi group by with multiple distinct columns come to a wrong result

chenchun Thu, 13 Jun 2013 02:12:55 -0700

Currently if there are multiple distinct columns, I think the optimize of 
HIVE-609 won't work. But the default configure param 
"hive.optimize.multigroupby.common.distincts" is true so the query still 
spraying over the distinct columns first and gets an wrong result. 
Why not make it more intelligent to discover whether there are multiple 
distinct columns or not and also remove the configure param 
"hive.optimize.multigroupby.common.distincts"?


set hive.optimize.multigroupby.common.distincts=true;
explain 
from dim.city 
insert overwrite table city_common_distinct_1 select id, count(distinct name), 
count(distinct locationid) group by id 
insert overwrite table city_common_distinct_2 select cityid, count(distinct 
name), count(distinct locationid) group by cityid;


STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-3 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-3
  Stage-4 depends on stages: Stage-0
  Stage-5 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-5
  Stage-6 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        dim.city 
          TableScan
            alias: dim.city
            Reduce Output Operator
              key expressions:
                    expr: name
                    type: string
                    expr: locationid
                    type: bigint
              sort order: ++
              Map-reduce partition columns:
                    expr: name
                    type: string
                    expr: locationid
                    type: bigint
              tag: -1
              value expressions:
                    expr: id
                    type: int
                    expr: cityid
                    type: int
      Reduce Operator Tree:
        Forward
          Group By Operator
            aggregations:
                  expr: count(DISTINCT KEY._col0)
                  expr: count(DISTINCT KEY._col1)
            bucketGroup: false
            keys:
                  expr: VALUE._col0
                  type: int
            mode: hash
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
          Group By Operator
            aggregations:
                  expr: count(DISTINCT KEY._col0)
                  expr: count(DISTINCT KEY._col1)
            bucketGroup: false
            keys:
                  expr: VALUE._col1
                  type: int
            mode: hash
            outputColumnNames: _col0, _col1, _col2
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

  Stage: Stage-3
    Map Reduce
      Alias -> Map Operator Tree:
        
hdfs://hadoop00:9000/opt/tmp/hivelog/hive_2013-06-13_17-04-04_748_8386975577755424550/-mr-10004
 
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: int
              tag: -1
              value expressions:
                    expr: _col1
                    type: bigint
                    expr: _col2
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: test.city_common_distinct_1

  Stage: Stage-0
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: test.city_common_distinct_1

  Stage: Stage-4
    Stats-Aggr Operator

  Stage: Stage-5
    Map Reduce
      Alias -> Map Operator Tree:
        
hdfs://hadoop00:9000/opt/tmp/hivelog/hive_2013-06-13_17-04-04_748_8386975577755424550/-mr-10005
 
            Reduce Output Operator
              key expressions:
                    expr: _col0
                    type: int
              sort order: +
              Map-reduce partition columns:
                    expr: _col0
                    type: int
              tag: -1
              value expressions:
                    expr: _col1
                    type: bigint
                    expr: _col2
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(VALUE._col0)
                expr: count(VALUE._col1)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: int
          mode: final
          outputColumnNames: _col0, _col1, _col2
          Select Operator
            expressions:
                  expr: _col0
                  type: int
                  expr: _col1
                  type: bigint
                  expr: _col2
                  type: bigint
            outputColumnNames: _col0, _col1, _col2
            Select Operator
              expressions:
                    expr: _col0
                    type: int
                    expr: UDFToInteger(_col1)
                    type: int
                    expr: UDFToInteger(_col2)
                    type: int
              outputColumnNames: _col0, _col1, _col2
              File Output Operator
                compressed: false
                GlobalTableId: 2
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                    name: test.city_common_distinct_2

  Stage: Stage-1
    Move Operator
      tables:
          replace: true
          table:
              input format: org.apache.hadoop.mapred.TextInputFormat
              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
              serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
              name: test.city_common_distinct_2

  Stage: Stage-6
    Stats-Aggr Operator


-- 
chenchun

multi group by with multiple distinct columns come to a wrong result

Reply via email to