Re: Question about efficiency of SELECT DISTINCT

2018-07-02 Thread Furcy Pin
Hi,

They are rigorously equivalent.

You can see this with the following queries:

CREATE TABLE t1 (a INT, b INT, c INT) ;

EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;

EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;


Both queries will return the exact same query plan:

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 2 vectorized
 File Output Operator [FS_8]
compressed:false
Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_7]
|  keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2
(type: int)
|  outputColumnNames:["_col0","_col1","_col2"]
|  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
   Reduce Output Operator [RS_3]
  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
  Map-reduce partition columns:_col0 (type: int), _col1
(type: int), _col2 (type: int)
  sort order:+++
  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
  Group By Operator [GBY_2]
 keys:a (type: int), b (type: int), c (type: int)
 outputColumnNames:["_col0","_col1","_col2"]
 Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
 Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
   alias:t1
   Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE





However, these two queries are NOT equivalent:

SELECT COUNT(DISTINCT a,b,c)
FROM t1
;

SELECT COUNT(1)
FROM (
  SELECT a,b,c
  FROM t1
  GROUP BY a,b,c
  ) T
;

In general, the first one is faster except that it can fail if Hive
optimize it poorly, while the second one is slower but more reliable.
Also, most importantly, they don't give the same results as COUNT(DISTINCT
a, b, c) will ignore any row where a, b or c is null.

Their respective query plans are :

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 2
 File Output Operator [FS_6]
compressed:false
Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [GBY_4]
|  aggregations:["count(DISTINCT KEY._col0:0._col0,
KEY._col0:0._col1, KEY._col0:0._col2)"]
|  outputColumnNames:["_col0"]
|  Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
|<-Map 1 [SIMPLE_EDGE]
   Reduce Output Operator [RS_3]
  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
  sort order:+++
  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
  Group By Operator [GBY_2]
 aggregations:["count(DISTINCT a, b, c)"]
 keys:a (type: int), b (type: int), c (type: int)
 outputColumnNames:["_col0","_col1","_col2","_col3"]
 Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
 Select Operator [SEL_1]
outputColumnNames:["a","b","c"]
Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
TableScan [TS_0]
   alias:t1
   Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE

and

Stage-0
   Fetch Operator
  limit:-1
  Stage-1
 Reducer 3 vectorized
 File Output Operator [FS_13]
compressed:false
Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
Group By Operator [OP_12]
|  aggregations:["count(VALUE._col0)"]
|  outputColumnNames:["_col0"]
|  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
|<-Reducer 2 [SIMP

Question about efficiency of SELECT DISTINCT

2018-07-02 Thread 孙志禹
Dear all,    Does the code1 below have a better efficiency than code2?    
Thanks!
    CODE1:        select            
 distinct a,b,c        from table1
    CODE2:        select             a,b,c        from table1        group by   
           a,b,c