Re: Question about efficiency of SELECT DISTINCT

Furcy Pin Mon, 02 Jul 2018 05:00:25 -0700

Hi,

They are rigorously equivalent.


You can see this with the following queries:

CREATE TABLE t1 (a INT, b INT, c INT) ;

EXPLAIN
SELECT DISTINCT a,b,c
FROM t1
;

EXPLAIN
SELECT a,b,c
FROM t1
GROUP BY a,b,c
;


Both queries will return the exact same query plan:

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2 vectorized
         File Output Operator [FS_8]
            compressed:false
            Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL Column
stats: NONE
            table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [OP_7]
            |  keys:KEY._col0 (type: int), KEY._col1 (type: int), KEY._col2
(type: int)
            |  outputColumnNames:["_col0","_col1","_col2"]
            |  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_3]
                  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
                  Map-reduce partition columns:_col0 (type: int), _col1
(type: int), _col2 (type: int)
                  sort order:+++
                  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
                  Group By Operator [GBY_2]
                     keys:a (type: int), b (type: int), c (type: int)
                     outputColumnNames:["_col0","_col1","_col2"]
                     Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                     Select Operator [SEL_1]
                        outputColumnNames:["a","b","c"]
                        Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                        TableScan [TS_0]
                           alias:t1
                           Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE





However, these two queries are NOT equivalent:

SELECT COUNT(DISTINCT a,b,c)
FROM t1
;

SELECT COUNT(1)
FROM (
  SELECT a,b,c
  FROM t1
  GROUP BY a,b,c
  ) T
;

In general, the first one is faster except that it can fail if Hive
optimize it poorly, while the second one is slower but more reliable.
Also, most importantly, they don't give the same results as COUNT(DISTINCT
a, b, c) will ignore any row where a, b or c is null.

Their respective query plans are :

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 2
         File Output Operator [FS_6]
            compressed:false
            Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
            table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [GBY_4]
            |  aggregations:["count(DISTINCT KEY._col0:0._col0,
KEY._col0:0._col1, KEY._col0:0._col2)"]
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 1 Data size: 16 Basic stats: COMPLETE
Column stats: NONE
            |<-Map 1 [SIMPLE_EDGE]
               Reduce Output Operator [RS_3]
                  key expressions:_col0 (type: int), _col1 (type: int),
_col2 (type: int)
                  sort order:+++
                  Statistics:Num rows: 1 Data size: 0 Basic stats: PARTIAL
Column stats: NONE
                  Group By Operator [GBY_2]
                     aggregations:["count(DISTINCT a, b, c)"]
                     keys:a (type: int), b (type: int), c (type: int)
                     outputColumnNames:["_col0","_col1","_col2","_col3"]
                     Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                     Select Operator [SEL_1]
                        outputColumnNames:["a","b","c"]
                        Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                        TableScan [TS_0]
                           alias:t1
                           Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE

and

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 3 vectorized
         File Output Operator [FS_13]
            compressed:false
            Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
            table:{"input
format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"}
            Group By Operator [OP_12]
            |  aggregations:["count(VALUE._col0)"]
            |  outputColumnNames:["_col0"]
            |  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
            |<-Reducer 2 [SIMPLE_EDGE]
               Reduce Output Operator [RS_8]
                  sort order:
                  Statistics:Num rows: 1 Data size: 8 Basic stats: COMPLETE
Column stats: NONE
                  value expressions:_col0 (type: bigint)
                  Group By Operator [GBY_7]
                     aggregations:["count(1)"]
                     outputColumnNames:["_col0"]
                     Statistics:Num rows: 1 Data size: 8 Basic stats:
COMPLETE Column stats: NONE
                     Select Operator [SEL_5]
                        Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                        Group By Operator [GBY_4]
                        |  keys:KEY._col0 (type: int), KEY._col1 (type:
int), KEY._col2 (type: int)
                        |  outputColumnNames:["_col0","_col1","_col2"]
                        |  Statistics:Num rows: 1 Data size: 0 Basic stats:
PARTIAL Column stats: NONE
                        |<-Map 1 [SIMPLE_EDGE]
                           Reduce Output Operator [RS_3]
                              key expressions:_col0 (type: int), _col1
(type: int), _col2 (type: int)
                              Map-reduce partition columns:_col0 (type:
int), _col1 (type: int), _col2 (type: int)
                              sort order:+++
                              Statistics:Num rows: 1 Data size: 0 Basic
stats: PARTIAL Column stats: NONE
                              Group By Operator [GBY_2]
                                 keys:a (type: int), b (type: int), c
(type: int)
                                 outputColumnNames:["_col0","_col1","_col2"]
                                 Statistics:Num rows: 1 Data size: 0 Basic
stats: PARTIAL Column stats: NONE
                                 Select Operator [SEL_1]
                                    outputColumnNames:["a","b","c"]
                                    Statistics:Num rows: 1 Data size: 0
Basic stats: PARTIAL Column stats: NONE
                                    TableScan [TS_0]
                                       alias:t1
                                       Statistics:Num rows: 1 Data size: 0
Basic stats: PARTIAL Column stats: NONE




On Mon, 2 Jul 2018 at 10:03, 孙志禹 <[email protected]> wrote:

> Dear all,
>     Does the code1 below have a better efficiency than code2?
>     Thanks!
> --------------------------------------------
>     CODE1：
>         select
>              distinct a,b,c
>         from table1
>
>     CODE2:
>         select
>              a,b,c
>         from table1
>         group by
>              a,b,c
>

Re: Question about efficiency of SELECT DISTINCT

Reply via email to