Hi,
They are rigorously equivalent.
You can see this with the following queries:
CREATE TABLE t1 (a INT, b INT, c INT) ;
EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;
EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;
Both queries will return the exact same query plan:
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2 vectorized
File Output Operator [FS_8]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_7]
| keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2
(type: int)
| outputColumnNames:["_col0","_col1","_col2"]
| Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_3]
key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
Map-reduce partition columns:_col0 (type: int), _col1
(type: int), _col2 (type: int)
sort order:+++
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
Group By Operator [GBY_2]
keys:a (type: int), b (type: int), c (type: int)
outputColumnNames:["_col0","_col1","_col2"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
However, these two queries are NOT equivalent:
SELECT COUNT(DISTINCT a,b,c)
FROM t1
;
SELECT COUNT(1)
FROM (
SELECT a,b,c
FROM t1
GROUP BY a,b,c
) T
;
In general, the first one is faster except that it can fail if Hive
optimize it poorly, while the second one is slower but more reliable.
Also, most importantly, they don't give the same results as COUNT(DISTINCT
a, b, c) will ignore any row where a, b or c is null.
Their respective query plans are :
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 2
File Output Operator [FS_6]
compressed:false
Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_4]
| aggregations:["count(DISTINCT KEY._col0:0._col0,
KEY._col0:0._col1, KEY._col0:0._col2)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_3]
key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
sort order:+++
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
Group By Operator [GBY_2]
aggregations:["count(DISTINCT a, b, c)"]
keys:a (type: int), b (type: int), c (type: int)
outputColumnNames:["_col0","_col1","_col2","_col3"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
and
Stage-0
Fetch Operator
limit:-1
Stage-1
Reducer 3 vectorized
File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_12]
| aggregations:["count(VALUE._col0)"]
| outputColumnNames:["_col0"]
| Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
|<-Reducer 2 [SIMPLE_EDGE]
Reduce Output Operator [RS_8]
sort order:
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
value expressions:_col0 (type: bigint)
Group By Operator [GBY_7]
aggregations:["count(1)"]
outputColumnNames:["_col0"]
Statistics:Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
Select Operator [SEL_5]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
Group By Operator [GBY_4]
| keys:KEY._col0 (type: int), KEY._col1 (type:
int), KEY._col2 (type: int)
| outputColumnNames:["_col0","_col1","_col2"]
| Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
Reduce Output Operator [RS_3]
key expressions:_col0 (type: int), _col1
(type: int), _col2 (type: int)
Map-reduce partition columns:_col0 (type:
int), _col1 (type: int), _col2 (type: int)
sort order:+++
Statistics:Num rows: 1 Data size: 0 Basic
stats: PARTIAL Column stats: NONE
Group By Operator [GBY_2]
keys:a (type: int), b (type: int), c
(type: int)
outputColumnNames:["_col0","_col1","_col2"]
Statistics:Num rows: 1 Data size: 0 Basic
stats: PARTIAL Column stats: NONE
Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0
Basic stats: PARTIAL Column stats: NONE
TableScan [TS_0]
alias:t1
Statistics:Num rows: 1 Data size: 0
Basic stats: PARTIAL Column stats: NONE
On Mon, 2 Jul 2018 at 10:03, 孙志禹 <[email protected]> wrote:
> Dear all,
> Does the code1 below have a better efficiency than code2?
> Thanks!
> --------------------------------------------
> CODE1:
> select
> distinct a,b,c
> from table1
>
> CODE2:
> select
> a,b,c
> from table1
> group by
> a,b,c
>