Currently if there are multiple distinct columns, I think the optimize of
HIVE-609 won't work. But the default configure param
"hive.optimize.multigroupby.common.distincts" is true so the query still
spraying over the distinct columns first and gets an wrong result.
Why not make it more intelligent to discover whether there are multiple
distinct columns or not and also remove the configure param
"hive.optimize.multigroupby.common.distincts"?
set hive.optimize.multigroupby.common.distincts=true;
explain
from dim.city
insert overwrite table city_common_distinct_1 select id, count(distinct name),
count(distinct locationid) group by id
insert overwrite table city_common_distinct_2 select cityid, count(distinct
name), count(distinct locationid) group by cityid;
STAGE DEPENDENCIES:
Stage-2 is a root stage
Stage-3 depends on stages: Stage-2
Stage-0 depends on stages: Stage-3
Stage-4 depends on stages: Stage-0
Stage-5 depends on stages: Stage-2
Stage-1 depends on stages: Stage-5
Stage-6 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
dim.city
TableScan
alias: dim.city
Reduce Output Operator
key expressions:
expr: name
type: string
expr: locationid
type: bigint
sort order: ++
Map-reduce partition columns:
expr: name
type: string
expr: locationid
type: bigint
tag: -1
value expressions:
expr: id
type: int
expr: cityid
type: int
Reduce Operator Tree:
Forward
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0)
expr: count(DISTINCT KEY._col1)
bucketGroup: false
keys:
expr: VALUE._col0
type: int
mode: hash
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col0)
expr: count(DISTINCT KEY._col1)
bucketGroup: false
keys:
expr: VALUE._col1
type: int
mode: hash
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-3
Map Reduce
Alias -> Map Operator Tree:
hdfs://hadoop00:9000/opt/tmp/hivelog/hive_2013-06-13_17-04-04_748_8386975577755424550/-mr-10004
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col1
type: bigint
expr: _col2
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: final
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: test.city_common_distinct_1
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: test.city_common_distinct_1
Stage: Stage-4
Stats-Aggr Operator
Stage: Stage-5
Map Reduce
Alias -> Map Operator Tree:
hdfs://hadoop00:9000/opt/tmp/hivelog/hive_2013-06-13_17-04-04_748_8386975577755424550/-mr-10005
Reduce Output Operator
key expressions:
expr: _col0
type: int
sort order: +
Map-reduce partition columns:
expr: _col0
type: int
tag: -1
value expressions:
expr: _col1
type: bigint
expr: _col2
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
expr: count(VALUE._col1)
bucketGroup: false
keys:
expr: KEY._col0
type: int
mode: final
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: bigint
expr: _col2
type: bigint
outputColumnNames: _col0, _col1, _col2
Select Operator
expressions:
expr: _col0
type: int
expr: UDFToInteger(_col1)
type: int
expr: UDFToInteger(_col2)
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 2
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: test.city_common_distinct_2
Stage: Stage-1
Move Operator
tables:
replace: true
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
name: test.city_common_distinct_2
Stage: Stage-6
Stats-Aggr Operator
--
chenchun