-----Original Message-----
From: pgsql-performance-ow...@postgresql.org 
[mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of David G Johnston
Sent: Saturday, September 13, 2014 7:34 AM
To: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] weird execution plan

>Not everyone does so its nice to make certain - especially since I'm not all 
>that familiar with the code involved.  But since no one else has answered I 
>will theorize.
>
>SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl )
>
>vs
>
>SELECT count(DISTINCT col) FROM tbl
>
>The code for "SELECT DISTINCT col" is likely highly efficient because it works 
>on complete sets of records.
>
>The code for "SELECT count(DISTINCT col)" is at a relative disadvantage since 
>it must evaluate one row at a time and remember whether it had seen the same 
>value previously before deciding whether to >increment a counter.
>
>With a large number of duplicate rows the process of making the row set 
>smaller before counting the end result will perform better since fewer rows 
>must be evaluated in the less efficient count(DISTINCT) >expression - the time 
>saved there more than offset by the fact that you are effectively passing over 
>that subset of the data a second time.
>
>HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows)
>
>David J.

Thanks David!

I am so surprised to the findings you put here. Just did an explain plan on the 
example you gave and pasted the result below, you're correct. 

"Select count(distinct col1)" is really  a very common SQL statement we write 
daily, in Postgres, we need to rewrite it so that the aggregate doesn't happen 
on a very large data sets... I am wondering if this is something to be improved 
from the optimizer ifself, instead of developers to rewrite SQL. Like having 
the optimizer just do the counting in the end instead of doing it each time.  I 
used Oracle before, never saw this issue...

But really thank you for pointing this out, very valuable lesson-learnt in PG 
SQL writing for me and our developers.

dev=# explain analyze select count(*) from (select distinct wid from 
terms_weekly) foo;
                                                                          QUERY 
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278656.00..1278656.01 rows=1 width=0) (actual 
time=24316.335..24316.336 rows=1 loops=1)
   ->  HashAggregate  (cost=1278651.50..1278653.50 rows=200 width=42) (actual 
time=23899.916..24242.010 rows=1298124 loops=1)
         ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual 
time=0.028..13631.898 rows=42765320 loops=1)
               ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 
width=516) (actual time=0.001..0.001 rows=0 loops=1)
               ->  Seq Scan on search_terms_weekly_20140503  
(cost=0.00..293352.90 rows=10702190 width=42) (actual time=0.026..2195.460 
rows=10702190 loops=1)
               ->  Seq Scan on search_terms_weekly_20140510  
(cost=0.00..298773.53 rows=10878953 width=42) (actual time=8.244..3163.087 
rows=10878953 loops=1)
               ->  Seq Scan on search_terms_weekly_20140517  
(cost=0.00..288321.17 rows=10537717 width=41) (actual time=7.345..2520.531 
rows=10537717 loops=1)
               ->  Seq Scan on search_terms_weekly_20140524  
(cost=0.00..291290.60 rows=10646460 width=41) (actual time=8.543..2693.833 
rows=10646460 loops=1)
 Total runtime: 24333.830 ms
(9 rows)

dev=# explain analyze select count(distinct wid) from terms_weekly;



                                                                       QUERY 
PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278651.50..1278651.51 rows=1 width=42) (actual 
time=585774.511..585774.511 rows=1 loops=1)
   ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual 
time=0.019..10656.782 rows=42765320 loops=1)
         ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 
width=516) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Seq Scan on search_terms_weekly_20140503  (cost=0.00..293352.90 
rows=10702190 width=42) (actual time=0.017..2225.397 rows=10702190 loops=1)
         ->  Seq Scan on search_terms_weekly_20140510  (cost=0.00..298773.53 
rows=10878953 width=42) (actual time=0.009..2244.918 rows=10878953 loops=1)
         ->  Seq Scan on search_terms_weekly_20140517  (cost=0.00..288321.17 
rows=10537717 width=41) (actual time=0.008..1822.088 rows=10537717 loops=1)
         ->  Seq Scan on search_terms_weekly_20140524  (cost=0.00..291290.60 
rows=10646460 width=41) (actual time=0.006..1561.229 rows=10646460 loops=1)
 Total runtime: 585774.568 ms
(8 rows)


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Reply via email to