Hi,
They are rigorously equivalent.
You can see this with the following queries:
CREATE TABLE t1 (a INT, b INT, c INT) ;
EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;
EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;
Both queries will return the exact same query plan:
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2 vectorized
File Output Operator [FS_8]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_7]
| keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2
(type: int)
| outputColumnNames:["_col0","_col1","_col2"]
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_3]
key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
Map-reduce partition columns:_col0 (type: int), _col1
(type: int), _col2 (type: int)
sort order:+++
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
Group By Operator [GBY_2]
keys:a (type: int), b (type: int), c (type: int)
outputColumnNames:["_col0","_col1","_col2"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
However, these two queries are NOT equivalent:
SELECT COUNT(DISTINCT a,b,c)
FROM t1
;
SELECT COUNT(1)
FROM (
SELECT a,b,c
FROM t1
GROUP BY a,b,c
) T
;
In general, the first one is faster except that it can fail if Hive
optimize it poorly, while the second one is slower but more reliable.
Also, most importantly, they don't give the same results as COUNT(DISTINCT
a, b, c) will ignore any row where a, b or c is null.
Their respective query plans are :
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
compressed:false
Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_4]
| aggregations:["count(DISTINCT KEY._col0:0._col0,
KEY._col0:0._col1, KEY._col0:0._col2)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_3]
key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
sort order:+++
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
Group By Operator [GBY_2]
aggregations:["count(DISTINCT a, b, c)"]
keys:a (type: int), b (type: int), c (type: int)
outputColumnNames:["_col0","_col1","_col2","_col3"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
and
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3 vectorized
File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_12]
| aggregations:["count(VALUE._col0)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
|<-Reducer 2 [SIMP