[SQL] PG7.4.5: query not using index on date column

2004-11-25 Thread Dave Steinberg
Hi Folks, I was hoping someone could help me to improve the performance of a 
query I've got that insists on doing a seq. scan on a large table.  I'm trying 
to do some reporting based on my spam logs which I've partly digested and 
stored in a table.  Here are the particulars:

The messages table:

       Column       |         Type          |                               Modifiers
--------------------+-----------------------+------------------------------------------------------------------------
 message_id         | integer               | not null default nextval('spamreport.messages_message_id_seq'::text)
 received_date      | date                  | not null
 score              | numeric               | not null
 user_threshold     | numeric               | not null
 raw_email_address  | character varying(64) | not null
 processing_time    | numeric               | not null
 size               | integer               | not null
 fuzzed_address     | character varying(64) | not null
 domain             | character varying(64) | not null
Indexes:
    "messages_pkey" primary key, btree (message_id)
    "domain_idx" btree ("domain")
    "fuzzy_idx" btree (fuzzed_address)
    "received_date_idx" btree (received_date)

And here's the primary query I run, along with explain analyze output:

explain analyze SELECT * FROM (
    SELECT
        domain,
        count(*) as count,
        max(score) as max_score,
        avg(score) as average_score,
        stddev(score) as stddev_score,
        sum(CASE WHEN score > user_threshold THEN 1 ELSE 0 END) as spam_count,
        avg(processing_time) as average_time,
        avg(size) as average_size
    FROM messages
    WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30'
    GROUP BY domain
) AS aggs
ORDER BY count DESC;

                                                               QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=30303.51..30303.60 rows=35 width=226) (actual time=29869.716..29869.883 rows=69 loops=1)
   Sort Key: count
   ->  Subquery Scan aggs  (cost=30301.56..30302.61 rows=35 width=226) (actual time=29861.705..29869.240 rows=69 loops=1)
         ->  HashAggregate  (cost=30301.56..30302.26 rows=35 width=54) (actual time=29861.681..29868.261 rows=69 loops=1)
               ->  Seq Scan on messages  (cost=0.00..21573.04 rows=436426 width=54) (actual time=5.523..6304.657 rows=462931 loops=1)
                     Filter: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
 Total runtime: 29870.437 ms

This database gets vacuumed nightly.  Also, the query plan stays the same even 
if I restrict the received_date column down to a single day.
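
For example, restricting to a single day means a query like the one below (the date is only illustrative, and I've trimmed the aggregate list down to count(*) for brevity) - and it still produces the same seq. scan plan:

explain analyze SELECT * FROM (
    SELECT domain, count(*) as count
    FROM messages
    WHERE received_date = '2004-11-25'
    GROUP BY domain
) AS aggs
ORDER BY count DESC;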

Thanks in advance,
-- 
Dave Steinberg
http://www.geekisp.com/



Re: [SQL] PG7.4.5: query not using index on date column

2004-11-29 Thread Dave Steinberg
Sorry for the delay in getting back on this thread; painting my
apartment meant no internet over this Thanksgiving break... :)

> How many rows in the table altogether?  A rough guess is a few million
> based on the estimated seqscan cost.  That would mean that this query
> is retrieving about 10% of the table, which is a large enough fraction
> that the planner will probably think a seqscan is best.  It may be right.
>
> If you do "set enable_seqscan = off", how does the EXPLAIN ANALYZE
> output change?

Right now it's a hair under 800k rows.  With enable_seqscan = off, I get
these EXPLAIN ANALYZE results:

 
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=361676.23..361676.33 rows=38 width=226) (actual time=46076.756..46076.924 rows=69 loops=1)
   Sort Key: count
   ->  Subquery Scan aggs  (cost=361674.10..361675.24 rows=38 width=226) (actual time=46068.621..46076.159 rows=69 loops=1)
         ->  HashAggregate  (cost=361674.10..361674.86 rows=38 width=54) (actual time=46068.596..46075.170 rows=69 loops=1)
               ->  Index Scan using received_date_idx on messages  (cost=0.00..349968.44 rows=585283 width=54) (actual time=20.988..15020.821 rows=589543 loops=1)
                     Index Cond: ((received_date >= '2004-11-01'::date) AND (received_date <= '2004-11-30'::date))
 Total runtime: 46091.315 ms

Quite a bit higher with these estimates.  Using your 10% of the table  
guess above, I did a breakdown by date and got these results:

[EMAIL PROTECTED]:geekisp=> select received_date, count(received_date)
    from spamreport.messages group by received_date order by received_date asc;
 received_date | count
---------------+-------
 2004-10-20    |  7592
 2004-10-21    | 19699
 2004-10-22    | 17311
 2004-10-23    | 16730
 2004-10-24    | 18249
 2004-10-25    | 16718
 2004-10-26    | 16951
 2004-10-27    | 19818
 2004-10-28    | 19580
 2004-10-29    | 17610
 2004-10-30    | 16210
 2004-10-31    | 20468
 2004-11-01    | 12337
 2004-11-02    |  9012
 2004-11-03    | 20871
 2004-11-04    | 20103
 2004-11-05    | 18807
 2004-11-06    | 20131
 2004-11-07    | 22291
 2004-11-08    | 23041
 2004-11-09    | 20254
 2004-11-10    | 17810
 2004-11-11    | 21091
 2004-11-12    | 21976
 2004-11-13    | 18824
 2004-11-14    | 20543
 2004-11-15    | 18829
 2004-11-16    | 24248
 2004-11-17    | 18093
 2004-11-18    | 25675
 2004-11-19    | 27084
 2004-11-20    | 22362
 2004-11-21    | 25187
 2004-11-22    | 26451
 2004-11-23    | 26016
 2004-11-24    | 23147
 2004-11-25    | 25785
 2004-11-26    | 20584
 2004-11-27    | 25615
 2004-11-28    |  6931
 2004-11-29    |  6549
(41 rows)

So it looks like an aggregation over two weeks of data is more than 10%
of the table, which lines up with what you were saying.  That also jibes
with what happens when I re-enable seqscans and run the report for only
one or two days of data: it uses the index (previously it did not -
perhaps there was not enough data then).
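
As a rough cross-check of that 10% figure, a query along these lines (nothing assumed beyond the table above; the alias is arbitrary) gives the fraction of rows the November range covers:

SELECT (SELECT count(*)
          FROM spamreport.messages
         WHERE received_date BETWEEN '2004-11-01' AND '2004-11-30')::numeric
       / count(*) AS november_fraction
  FROM spamreport.messages;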

> If it's not right, you may want to try to adjust random_page_cost and/or
> effective_cache_size so that the planner's estimated costs are more in
> line with reality.  Beware of making such adjustments on the basis of
> only one test case, though.

I'll look in the manual and try playing with these options - thanks for
your suggestions!  In the end, I'm guessing that if I need to retrieve
a few hundred thousand rows from disk, faster disks might make the
biggest impact on performance.
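
For anyone trying the same thing, both settings can be changed per session before re-running the EXPLAIN ANALYZE; the values below are purely illustrative, not recommendations:

SET random_page_cost = 2;          -- default is 4; lower values make index scans look cheaper to the planner
SET effective_cache_size = 50000;  -- measured in 8kB pages on 7.4, so 50000 is roughly 400MB
-- then re-run the reporting query under EXPLAIN ANALYZE and compare estimated vs. actual costs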

Thanks for your help!
--
Dave Steinberg
http://www.geekisp.com/


Re: [SQL] [SQL NOTIFY Help] sending NOTIFY to the external program

2007-01-03 Thread Dave Steinberg

On Wed, Jan 03, 2007 at 07:47:34PM +0530, [EMAIL PROTECTED] wrote:

> Yes, our application is supposed to know *immediately* that a change in
> the database has occurred since,


I'd recommend some sort of "whiteboard" layer or messaging fabric as an 
option for this sort of thing.  This offloads the notification element 
from the database, and lets you pass along any data you want to the 
listeners - even the complete record.


Also, this sort of thing scales well to multiple listeners.  Anyway, just
a thought...
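
For contrast, the built-in mechanism the subject line refers to looks roughly like this (the channel name here is made up); note that NOTIFY in releases of this era carries no payload, which is part of why a separate layer that can hand the listener the actual record is attractive:

-- in the external program's session (libpq clients pick notifications up via PQnotifies()):
LISTEN table_changed;

-- in the database, typically issued from a trigger or right after the change:
NOTIFY table_changed;

-- the listener only learns that "something happened on table_changed" and must
-- re-query the tables to find out what actually changed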


Regards,
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/



Re: [SQL] many-to-many relationship

2008-10-06 Thread Dave Steinberg

Louis-David Mitterrand wrote:

> Hi,
>
> Say you have several objects (tables): person, location, event, etc. all
> of which can have several images attached.
>
> What is the best way to manage relations between a single 'image' table
> and these different objects?
>
> For now each 'image' row has pointers to id_person, id_location,
> id_event, etc. (only one of which is used for any given row).
>
> Is there a better, more elegant way to do it, without using
> redundant id_* pointers on each row and yet still enforce foreign keys?


The typical way to do this would be to have your image table be just
about images, and then to isolate the relationship information into
mapping tables.  Those would look like:

image <=> people
(image_id, person_id), with the primary key being the pair of columns.
In SQL, roughly:
create table image_people_map (
  image_id integer not null,
  person_id integer not null,
  primary key (image_id, person_id)
);

Similarly, for locations it'd be (image_id, location_id), and for events 
(image_id, event_id).  Then you can have a single image associated with 
any number of people, events, or locations.
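
If you also want the foreign keys enforced (the point of the original id_* columns), the mapping tables can simply reference the underlying tables.  A sketch, assuming image, location, and event tables each keyed by an integer id column (adjust names to match your schema):

-- the composite primary key prevents duplicate links,
-- and the REFERENCES clauses enforce integrity against the parent tables
create table image_location_map (
  image_id    integer not null references image (image_id),
  location_id integer not null references location (location_id),
  primary key (image_id, location_id)
);

create table image_event_map (
  image_id integer not null references image (image_id),
  event_id integer not null references event (event_id),
  primary key (image_id, event_id)
);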


Regards,
--
Dave Steinberg
http://www.geekisp.com/
http://www.steinbergcomputing.com/
