[PERFORM] Slow Count-Distinct Query

2014-04-06 Thread Jeff Janes
On Friday, April 4, 2014, Varadharajan Mukundan srinath...@gmail.com
wrote:

 Hi Jeff,

 It looks like the original emailer wrote a query that the planner is not
 smart enough to plan properly (a known limitation of that kind of query).
 He then made a bunch of changes, none of which worked, and then re-wrote
 the query into a form that the planner does a better job on.  What we do
 not know is: what would happen if he undid all of those other changes and
 *just* used the new form of the query?



 I was also pairing up with Chris (the original emailer) on this issue. In
 order to reproduce it, I've created a two-column table with the following
 schema and 1.9 million dummy rows:


 a=# \d participants
                          Table "public.participants"
  Column |          Type          |                         Modifiers
 --------+------------------------+------------------------------------------------------------
  id     | integer                | not null default nextval('participants_id_seq'::regclass)
  email  | character varying(255) |

 I've tried out various scenarios on this table and recorded them in the
 transcript below (please read it top to bottom, like a shell script, to
 get the whole idea):

 *; Create table and Insert 1.9 Million rows*

  a=# create table participants(id serial, email varchar(255));
 NOTICE:  CREATE TABLE will create implicit sequence participants_id_seq
 for serial column participants.id
 CREATE TABLE
 a=# \d participants
                          Table "public.participants"
  Column |          Type          |                         Modifiers
 --------+------------------------+------------------------------------------------------------
  id     | integer                | not null default nextval('participants_id_seq'::regclass)
  email  | character varying(255) |

 a=# copy participants from '/tmp/a.csv' with csv;
 COPY 1900135


Thanks for the detailed response.  I don't have access to your /tmp/a.csv
of course, so I just used this:

insert into participants (email) select md5(floor(random()*1000000)::text)
from generate_series(1,2000000);

This makes each email show up about twice.




 a=# EXPLAIN (ANALYZE) select count(1) from (select email from participants
 where email=email group by email) x;
 QUERY PLAN

 ---
  Aggregate  (cost=37874.94..37874.96 rows=1 width=0) (actual
 time=1549.258..1549.258 rows=1 loops=1)
    ->  HashAggregate  (cost=37763.19..37812.86 rows=4967 width=16) (actual
 time=1168.114..1461.672 rows=1000000 loops=1)
          ->  Seq Scan on participants  (cost=0.00..37738.19 rows=9501
 width=16) (actual time=0.045..411.267 rows=1900135 loops=1)
Filter: ((email)::text = (email)::text)
  Total runtime: 1567.586 ms
 (5 rows)


What you have done here is trick the planner into thinking your query will
return 200 times fewer rows than it actually does, and thus that the hash
table will be 200 times smaller than it actually is and will fit in the
allowed memory.  This is effective at getting the more efficient hash
aggregate, but it is no safer than just giving the query explicit
permission to use that much memory by increasing work_mem 200-fold.
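
For comparison, a sketch of granting that permission explicitly for one
session (the 256MB figure is an assumption, not a measured requirement --
size it to the hash table you actually expect):

set work_mem = '256MB';
explain analyze
  select count(1) from (select email from participants group by email) x;
reset work_mem;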

I am kind of surprised that the planner is so easily fooled by that.




 *; Creation of idx on email field*

 a=# create index email_idx on participants(email);
 CREATE INDEX



 a=#  EXPLAIN (ANALYZE)  select count(1) from (select email from
 participants group by email) x;
  QUERY PLAN

 -
  Aggregate  (cost=48622.59..48622.60 rows=1 width=0) (actual
 time=1242.718..1242.718 rows=1 loops=1)
    ->  HashAggregate  (cost=26273.09..36206.20 rows=993311 width=16)
 (actual time=855.215..1150.781 rows=1000000 loops=1)
          ->  Seq Scan on participants  (cost=0.00..21273.25 rows=1900135
 width=16) (actual time=0.058..217.105 rows=1900135 loops=1)
  Total runtime: 1264.234 ms
 (4 rows)



I can't reproduce this at all, except by increasing work_mem.  The hash
table needed here is no smaller than the one needed before the index was
built.  Did you increase work_mem before getting the above plan?

Instead what I get is the index only scan (to provide order) feeding into a
Group.



 a=# drop index email_idx;
 DROP INDEX

 *; Creation of partial index on email *

 a=# create index email_idx on participants(email) where email=email;
 CREATE INDEX

 a=#  EXPLAIN (ANALYZE)  select count(distinct email) from participants
 where email=email;
QUERY PLAN

 

Re: [PERFORM] Slow Count-Distinct Query

2014-04-06 Thread Varadharajan Mukundan
Hi Jeff,

Instead what I get is the index only scan (to provide order) feeding into a
 Group.


That's interesting. We tested on two versions of Postgres (9.2 and 9.3) on
different Mac machines and ended up with the index-only scan only after
creating the partial index. I remember running VACUUM FULL ANALYZE after
each and every step.


 I usually get this plan without the cluster, too.  I think it depends on
 the luck of the sampling in the autoanalyze.


That's interesting as well. Would something like increasing the sample
size make it much better? We had to perform so many steps to get the
index-only scan working, whereas it's really obvious to anyone that it
should be the right approach. In a far corner of my mind, I'm also
wondering whether some OS-specific parameter (one that differs between
your system and mine) is taken into account when building plans and
choosing among them.
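
For what it's worth, the sample size can be raised per column; a sketch
(the 1000 target is an arbitrary choice, the default being 100):

alter table participants alter column email set statistics 1000;
analyze participants;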

-- 
Thanks,
M. Varadharajan



Experience is what you get when you didn't get what you wanted
   -By Prof. Randy Pausch in The Last Lecture

My Journal :- www.thinkasgeek.wordpress.com


Re: [PERFORM] Slow Count-Distinct Query

2014-04-01 Thread bricklen
On Sun, Mar 30, 2014 at 12:45 PM, Christopher Jackson crjac...@gmail.com
wrote:

   Hi all,

   tl;dr - How can I speed up my count-distinct query?


Depending on how often you need to run that query and how important it is
to you, if you are willing to accept a performance hit on
INSERT/UPDATE/DELETE of the participants table, you could create a summary
table, containing just the count of unique email addresses or the list of
unique email addresses, populated via trigger on INSERT/UPDATE/DELETE of
the participants table. Another option is to try out the new materialized
views (
http://www.postgresql.org/docs/current/static/sql-creatematerializedview.html)
available in 9.3.
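
For illustration, a minimal sketch of the trigger-maintained variant (all
names are hypothetical; the update-then-insert upsert is not safe under
concurrent inserts of a brand-new address without retry logic, and 9.5's
ON CONFLICT is not yet available here):

create table email_counts (
    email    character varying(255) primary key,
    refcount bigint not null
);

create or replace function maintain_email_counts() returns trigger as $$
begin
    if tg_op = 'UPDATE' or tg_op = 'DELETE' then
        if old.email is not null then
            -- drop one reference to the old value
            update email_counts set refcount = refcount - 1
             where email = old.email;
            delete from email_counts
             where email = old.email and refcount <= 0;
        end if;
    end if;
    if tg_op = 'INSERT' or tg_op = 'UPDATE' then
        if new.email is not null then
            -- add one reference to the new value
            update email_counts set refcount = refcount + 1
             where email = new.email;
            if not found then
                insert into email_counts (email, refcount)
                values (new.email, 1);
            end if;
        end if;
    end if;
    return null;  -- AFTER trigger: return value is ignored
end;
$$ language plpgsql;

create trigger participants_email_counts
after insert or update of email or delete on participants
for each row execute procedure maintain_email_counts();

The distinct count then becomes a scan of the (much smaller) summary
table: select count(*) from email_counts;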


Re: [PERFORM] Slow Count-Distinct Query

2014-04-01 Thread Christopher Jackson
Hi Bricklen,

Thanks for the feedback.  I'll play around with materialized views.  My
understanding is they have to be manually triggered for refresh and there's
an exclusive lock on the view while the refresh is taking place.  Is this
your understanding as well?  I'm using PG 9.3.3.  If this is true, I'm
curious what clever ways people have come up with to mitigate any issues
with the lock.

   Thanks again,
  Chris


On Tue, Apr 1, 2014 at 7:34 PM, bricklen brick...@gmail.com wrote:


 On Sun, Mar 30, 2014 at 12:45 PM, Christopher Jackson 
 crjac...@gmail.com wrote:

   Hi all,

   tl;dr - How can I speed up my count-distinct query?


 Depending on how often you need to run that query and how important it is
 to you, if you are willing to accept a performance hit on
 INSERT/UPDATE/DELETE of the participants table, you could create a summary
 table, containing just the count of unique email addresses or the list of
 unique email addresses, populated via trigger on INSERT/UPDATE/DELETE of
 the participants table. Another option is to try out the new materialized
 views (
 http://www.postgresql.org/docs/current/static/sql-creatematerializedview.html)
 available in 9.3.




Re: [PERFORM] Slow Count-Distinct Query

2014-04-01 Thread Michael Paquier
On Wed, Apr 2, 2014 at 1:22 PM, Christopher Jackson crjac...@gmail.com wrote:

 Hi Bricklen,

 Thanks for the feedback.  I'll play around with materialized views.  My
 understanding is they have to be manually triggered for refresh
Yep.

 and there's an exclusive lock on the view while the refresh is taking place.  
 Is this
 your understanding as well?
Re-yep.

 I'm using PG 9.3.3.  If this is true, I'm
 curious what clever ways people have come up with to mitigate any issues
 with the lock.
Kevin Grittner has implemented REFRESH MATERIALIZED VIEW CONCURRENTLY
in 9.4. A unique index is needed on the materialized view as well to
allow this concurrent operation. It has the merit of allowing SELECT
operations on the matview during the refresh.
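
For reference, a minimal sketch of both forms (the matview and index names
are made up; the CONCURRENTLY variant requires 9.4):

-- 9.3: works, but the refresh takes an exclusive lock, blocking readers
create materialized view distinct_emails as
    select email from participants where email is not null group by email;

select count(*) from distinct_emails;
refresh materialized view distinct_emails;

-- 9.4: with a unique index in place, SELECTs keep working during refresh
create unique index distinct_emails_email_idx on distinct_emails (email);
refresh materialized view concurrently distinct_emails;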
-- 
Michael




Re: [PERFORM] Slow Count-Distinct Query

2014-04-01 Thread Christopher Jackson
Hi Bricklen,

   Thanks again for the feedback.  The concurrent refresh sounds cool.  I
just saw the 9.4 release is tentatively scheduled for later this year.  Do
you know what people have been doing for view refreshes in the meantime?

   Thanks


On Tue, Apr 1, 2014 at 11:48 PM, Michael Paquier
michael.paqu...@gmail.com wrote:

 On Wed, Apr 2, 2014 at 1:22 PM, Christopher Jackson crjac...@gmail.com
 wrote:
 
  Hi Bricklen,
 
  Thanks for the feedback.  I'll play around with materialized views.
  My
  understanding is they have to be manually triggered for refresh
 Yep.

  and there's an exclusive lock on the view while the refresh is taking
 place.  Is this
  your understanding as well?
 Re-yep.

  I'm using PG 9.3.3.  If this is true, I'm
  curious what clever ways people have come up with to mitigate any issues
  with the lock.
 Kevin Grittner has implemented REFRESH MATERIALIZED VIEW CONCURRENTLY
 in 9.4. A unique index is needed on the materialized view as well to
 allow this concurrent operation. It has the merit of allowing SELECT
 operations on the matview during the refresh.
 --
 Michael



Re: [PERFORM] Slow Count-Distinct Query

2014-03-31 Thread Shaun Thomas
  tl;dr - How can I speed up my count-distinct query?  

You can't.

Doing a count(distinct x) is much different from a count(1), which can simply 
scan available indexes. To compute the distinct count, it has to build an 
in-memory hash of every valid email and count the distinct values therein. 
This will pretty much never be fast, especially with 2M rows involved.

I could be wrong about this, and the back-end folks might have a different 
answer, but I wouldn't hold my breath.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd | Suite 400 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com





Re: [PERFORM] Slow Count-Distinct Query

2014-03-31 Thread Tom Lane
Christopher Jackson crjac...@gmail.com writes:
   tl;dr - How can I speed up my count-distinct query?

EXPLAIN doesn't provide a lot of visibility into what the Aggregate plan
node is doing, but in this case what it's doing is an internal sort/uniq
operation to implement the DISTINCT.  You didn't say what value of
work_mem you're using, but it'd need to be probably 50-100MB to prevent
that sort from spilling to disk (and therefore being slow).

Note that the indexscan is actually *slower* than the seqscan so far as
the table access is concerned; if the table were big enough to not fit
in RAM, this would get very much worse.  So I'm not impressed with trying
to force the optimizer's hand as you've done here --- it might be a nice
plan now, but it's brittle.  See if a bigger work_mem improves matters
enough with the regular plan.

 *I'm concerned about setting the enable_bitmapscan and seq_page_cost values
 because I'm not yet sure what the consequences are.  Can anyone enlighten
 me on the recommended way to speed up this query?*

Turning off enable_bitmapscan globally would be a seriously bad idea.
Changing the cost settings to these values globally might be all right;
it would amount to optimizing for all-in-memory cases, which might or
might not be a good idea for your situation.  For that matter, greatly
increasing work_mem globally is usually not thought to be smart either;
remember that it's a per-sort-operation setting so you may need to
provision a considerable multiple of the setting as physical RAM,
depending on how many queries you expect to run concurrently.  So all in
all you might be well advised to just set special values for this one
query, whichever solution approach you use.

I doubt you need the where email=email hack, in any case.  That isn't
forcing the optimizer's decision in any meaningful fashion.
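
For instance, a sketch of scoping those special values to a single query
with a transaction-local setting (the 100MB figure just follows the
estimate above; SET LOCAL reverts automatically at commit):

begin;
set local work_mem = '100MB';
select count(distinct email) from participants;
commit;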

regards, tom lane




[PERFORM] Slow Count-Distinct Query

2014-03-30 Thread Christopher Jackson
  Hi all,

  tl;dr - How can I speed up my count-distinct query?

  I apologize in advance if this question has been asked already.  I'm
finding the mailing list hard to navigate.  I'm trying to speed up a query
that will find a count of distinct emails within a table using Postgres
9.3.3.  The name of the table is participants.  Our domain is set up such
that duplicate emails are allowed so long as a particular corresponding
column value is unique.


*TABLE participants*

*  id serial NOT NULL (primary key)*

*  email character varying(255)*

*  (other columns omitted)*



I have the following index defined:

*index_participants_on_email ON participants USING btree (email COLLATE
pg_catalog.default);*

The query I'm trying to run is select count(distinct email) from
participants.  I've also tried the *group by* equivalent.  *On a table size
of 2 million rows, the query takes about 1 minute to return.*  This is way
too long.  After running analyze, I see that the index is being ignored and
a full table scan is performed.

So, I tried running the following after dropping the index:
create index email_idx on participants(email) where email=email;
set enable_bitmapscan = false;
set seq_page_cost = 0.1;
set random_page_cost = 0.2;
create index email_idx_2 on participants(email);
cluster participants using email_idx_2;

With these settings in place, if I run *select count(distinct email) from
participants* I get

Aggregate  (cost=29586.20..29586.21 rows=1 width=18) (actual
time=54243.643..54243.644 rows=1 loops=1)
  ->  Seq Scan on participants  (cost=0.00..24586.18 rows=208
width=18) (actual time=0.030..550.296 rows=208 loops=1)
*Total runtime: 54243.669 ms*

When I run the following, I get MUCH better results
*select count(1) from (select email from participants where email=email
group by email) x;*

Aggregate  (cost=1856.36..1856.37 rows=1 width=0) (actual
time=1393.573..1393.573 rows=1 loops=1)
  Output: count(1)
  ->  Group  (cost=0.43..1731.36 rows=1 width=18) (actual
time=0.052..1205.977 rows=208 loops=1)
Output: participants.email
        ->  Index Only Scan using email_idx on public.participants
 (cost=0.43..1706.36 rows=1 width=18) (actual time=0.050..625.248
rows=208 loops=1)
  Output: participants.email
  Heap Fetches: 208
*Total runtime: 1393.599 ms*

This query has a weird where clause (email=email) because I'm trying to
force the planner's hand into using the index.

*I'm concerned about setting the enable_bitmapscan and seq_page_cost values
because I'm not yet sure what the consequences are.  Can anyone enlighten
me on the recommended way to speed up this query?*

 Thanks