[SQL] PG7.4.5: query not using index on date column
Hi Folks,

I was hoping someone could help me improve the performance of a query I've got that insists on doing a seq. scan on a large table. I'm trying to do some reporting based on my spam logs, which I've partly digested and stored in a table. Here are the particulars:

The messages table:

        Column       |         Type          |                              Modifiers
  -------------------+-----------------------+-----------------------------------------------------------------------
   message_id        | integer               | not null default nextval('spamreport.messages_message_id_seq'::text)
   received_date     | date                  | not null
   score             | numeric               | not null
   user_threshold    | numeric               | not null
   raw_email_address | character varying(64) | not null
   processing_time   | numeric               | not null
   size              | integer               | not null
   fuzzed_address    | character varying(64) | not null
   domain            | character varying(64) | not null
  Indexes:
      "messages_pkey" primary key, btree (message_id)
      "domain_idx" btree ("domain")
      "fuzzy_idx" btree (fuzzed_address)
      "received_date_idx" btree (received_date)

And here's the primary query I run, along with EXPLAIN ANALYZE output:

  explain analyze
  SELECT * FROM (
      SELECT domain,
             count(*) as count,
             max(score) as max_score,
             avg(score) as average_score,
             stddev(score) as stddev_score,
             sum(CASE WHEN score > user_threshold THEN 1 ELSE 0 END) as spam_count,
             avg(processing_time) as average_time,
             avg(size) as average_size
      FROM messages
      WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30'
      GROUP BY domain
  ) AS aggs
  ORDER BY count DESC;

                                                        QUERY PLAN
  ------------------------------------------------------------------------------------------------------------------------
   Sort  (cost=30303.51..30303.60 rows=35 width=226) (actual time=29869.716..29869.883 rows=69 loops=1)
     Sort Key: count
     ->  Subquery Scan aggs  (cost=30301.56..30302.61 rows=35 width=226) (actual time=29861.705..29869.240 rows=69 loops=1)
           ->  HashAggregate  (cost=30301.56..30302.26 rows=35 width=54) (actual time=29861.681..29868.261 rows=69 loops=1)
                 ->  Seq Scan on messages  (cost=0.00..21573.04 rows=436426 width=54) (actual time=5.523..6304.657 rows=462931 loops=1)
                       Filter: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
   Total runtime: 29870.437 ms

This database gets vacuumed nightly. Also, the query plan stays the same even if I restrict the received_date column down to a single day.

Thanks in advance,

--
Dave Steinberg
http://www.geekisp.com/
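A quick way to gauge how plausible the planner's seqscan choice is here is to compare the matched rows against the total — a minimal sketch using the table above (the schema-qualified name and the ::bigint cast are editorial conveniences, not from the original post):

  -- Planner's estimate of total rows in the table (cheap; no full scan)
  SELECT relname, reltuples::bigint AS approx_rows
  FROM pg_class
  WHERE relname = 'messages';

  -- Exact number of rows the date filter matches
  SELECT count(*)
  FROM spamreport.messages
  WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30';

If the second number is more than roughly a tenth of the first, a sequential scan is often the cheaper plan regardless of the index.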
Re: [SQL] PG7.4.5: query not using index on date column
Sorry for the delay in getting back on this thread -- painting my apartment meant no internet over this Thanksgiving break... :)

> How many rows in the table altogether? A rough guess is a few million
> based on the estimated seqscan cost. That would mean that this query is
> retrieving about 10% of the table, which is a large enough fraction that
> the planner will probably think a seqscan is best. It may be right. If
> you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE output
> change?

Right now it's a hair under 800k rows. With enable_seqscan = off, I get this as my EXPLAIN ANALYZE result:

                                                        QUERY PLAN
  ------------------------------------------------------------------------------------------------------------------------
   Sort  (cost=361676.23..361676.33 rows=38 width=226) (actual time=46076.756..46076.924 rows=69 loops=1)
     Sort Key: count
     ->  Subquery Scan aggs  (cost=361674.10..361675.24 rows=38 width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
           ->  HashAggregate  (cost=361674.10..361674.86 rows=38 width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
                 ->  Index Scan using received_date_idx on messages  (cost=0.00..349968.44 rows=585283 width=54) (actual time=20.988..15020.821 rows=589543 loops=1)
                       Index Cond: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
   Total runtime: 46091.315 ms

Quite a bit higher with these estimates. Using your 10%-of-the-table guess above, I did a breakdown by date and got these results:

  [EMAIL PROTECTED]:geekisp=> select received_date, count(received_date)
      from spamreport.messages group by received_date order by received_date asc;

   received_date | count
  ---------------+-------
   2004-10-20    |  7592
   2004-10-21    | 19699
   2004-10-22    | 17311
   2004-10-23    | 16730
   2004-10-24    | 18249
   2004-10-25    | 16718
   2004-10-26    | 16951
   2004-10-27    | 19818
   2004-10-28    | 19580
   2004-10-29    | 17610
   2004-10-30    | 16210
   2004-10-31    | 20468
   2004-11-01    | 12337
   2004-11-02    |  9012
   2004-11-03    | 20871
   2004-11-04    | 20103
   2004-11-05    | 18807
   2004-11-06    | 20131
   2004-11-07    | 22291
   2004-11-08    | 23041
   2004-11-09    | 20254
   2004-11-10    | 17810
   2004-11-11    | 21091
   2004-11-12    | 21976
   2004-11-13    | 18824
   2004-11-14    | 20543
   2004-11-15    | 18829
   2004-11-16    | 24248
   2004-11-17    | 18093
   2004-11-18    | 25675
   2004-11-19    | 27084
   2004-11-20    | 22362
   2004-11-21    | 25187
   2004-11-22    | 26451
   2004-11-23    | 26016
   2004-11-24    | 23147
   2004-11-25    | 25785
   2004-11-26    | 20584
   2004-11-27    | 25615
   2004-11-28    |  6931
   2004-11-29    |  6549
  (41 rows)

So an aggregation over two weeks' worth of data is already more than 10% of the table, which lines up with what you were saying. That also jibes with what I see when I re-enable seqscan and run the report for only one or two days of data: it uses the index (previously it did not -- perhaps there was not enough data then).

> If it's not right, you may want to try to adjust random_page_cost and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality. Beware of making such adjustments on the basis of
> only one test case, though.

I'll look in the manual and try playing with those options -- thanks for your suggestions! In the end, I'm guessing that if I need to retrieve a few hundred thousand rows from disk, faster disks might make the biggest impact on performance.

Thanks for your help!

--
Dave Steinberg
http://www.geekisp.com/
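A minimal sketch of the experiment suggested above. The values are illustrative, not recommendations; both settings can be tried per session with SET before touching postgresql.conf:

  -- Lower random_page_cost to make index scans look cheaper to the planner
  SET random_page_cost = 2.0;          -- server default is 4.0

  -- Tell the planner how much of the database the OS is likely caching;
  -- the unit is 8 kB disk pages, so 65536 is about 512 MB
  SET effective_cache_size = 65536;

  EXPLAIN ANALYZE
  SELECT domain, count(*)
  FROM messages
  WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30'
  GROUP BY domain;

  -- Undo the experiment for this session
  RESET random_page_cost;
  RESET effective_cache_size;

If the index plan only wins with implausibly aggressive settings, the planner was right and faster disks (or a narrower date range) are the better lever.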
Re: [SQL] [SQL NOTIFY Help] sending NOTIFY to the external program
On Wed, Jan 03, 2007 at 07:47:34PM +0530, [EMAIL PROTECTED] wrote:
> Yes, our application is supposed to know *immediately* that a change in
> the database has occurred.

Since that's the requirement, I'd recommend some sort of "whiteboard" layer or messaging fabric as an option for this sort of thing. It offloads the notification work from the database, lets you pass along any data you want to the listeners -- even the complete record -- and scales well to multiple listeners.

Anyway, just a thought...

Regards,

--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
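For comparison, the built-in route the original question asks about looks roughly like this -- a sketch with made-up channel, function, and table names, and note that NOTIFY in releases of this era carries no payload, so listeners must re-query to find out what actually changed:

  -- In the external program's database connection:
  LISTEN table_changed;

  -- Trigger function that signals listeners on every insert
  -- (assumes the PL/pgSQL language is installed in the database)
  CREATE FUNCTION notify_table_changed() RETURNS trigger AS $$
  BEGIN
      NOTIFY table_changed;
      RETURN NEW;
  END;
  $$ LANGUAGE plpgsql;

  CREATE TRIGGER some_table_notify
      AFTER INSERT ON some_table
      FOR EACH ROW EXECUTE PROCEDURE notify_table_changed();

The messaging-fabric approach avoids exactly this limitation: the publisher can put the whole record on the wire instead of just a wake-up signal.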
Re: [SQL] many-to-many relationship
Louis-David Mitterrand wrote:
> Hi,
>
> Say you have several objects (tables): person, location, event, etc.,
> all of which can have several images attached. What is the best way to
> manage the relations between a single 'image' table and these different
> objects? For now each 'image' row has pointers to id_person, id_location,
> id_event, etc. (only one of which is used for any given row). Is there a
> better, more elegant way to do it, without using redundant id_* pointers
> on each row, and yet still enforce foreign keys?

The typical way to do this is to make your image table just about images, and then isolate the relationship information into mapping tables, one per object type. Those would look like: image <=> person is (image_id, person_id), with the primary key being the pair of columns. In SQL, roughly:

  create table image_people_map (
      image_id  integer not null,
      person_id integer not null,
      primary key (image_id, person_id)
  );

Similarly, for locations it'd be (image_id, location_id), and for events (image_id, event_id). Then you can have a single image associated with any number of people, events, or locations.

Regards,

--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
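Spelling out the foreign-key part of the question -- a minimal sketch, assuming each base table has an integer surrogate key with the names shown (the table and column names are illustrative):

  -- Assumed base tables: image(image_id), person(person_id),
  -- location(location_id), event(event_id)

  create table image_people_map (
      image_id  integer not null references image (image_id),
      person_id integer not null references person (person_id),
      primary key (image_id, person_id)
  );

  create table image_location_map (
      image_id    integer not null references image (image_id),
      location_id integer not null references location (location_id),
      primary key (image_id, location_id)
  );

  create table image_event_map (
      image_id integer not null references image (image_id),
      event_id integer not null references event (event_id),
      primary key (image_id, event_id)
  );

The references clauses preserve the enforcement the original id_* columns provided; add ON DELETE CASCADE to a reference if a mapping row should disappear along with its person, location, or event.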