I am attempting to write a spam management add-on for my corporate
server, I have the database and the apps written but one part is WAY to
slow to be usefull.

The following view:

        Column         |  Type   | Modifiers
-----------------------+---------+-----------
 env_sender_num        | integer |
 msg_from_claims_count | bigint  |

is built on the follow table:

     Column      |            Type             | Modifiers
-----------------+-----------------------------+-----------
 msg_sender_num  | integer                     |
 env_sender_num  | integer                     |
 msg_from_domain | character varying(255)      |
 env_from_domain | character varying(255)      |
 recorded_date   | timestamp without time zone |
Indexes: record_of_claims_env_sender_num_idx btree (env_sender_num),
         record_of_claims_msg_sender_num_idx btree (msg_sender_num)

With the following SQL:


SELECT record_of_claims.env_sender_num, count(DISTINCT
record_of_claims.msg_sender_num)
 AS msg_from_claims_count 
 FROM record_of_claims 
 GROUP BY record_of_claims.env_sender_num;

A sample dataset follows:

msg_sender_num | env_sender_num |    msg_from_domain    |
env_from_domain    |    recorded_date
----------------+----------------+-----------------------+--------------
---------+---------------------
              1 |              1 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:07
              2 |              2 | mx128.optinvc13y.com  |
mx128.optinvc13y.com  | 2003-11-18 13:21:16
              3 |              3 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:17
              4 |              4 | yahoo.com             | yahoo.com
| 2003-11-18 13:21:21
              5 |              5 | biomarketgroup.com    |
biomarketgroup.com    | 2003-11-18 13:21:24
              6 |              6 | sohu.com              | sohu.com
| 2003-11-18 13:21:28
              7 |              7 | lycos.com             | lycos.com
| 2003-11-18 13:21:38
              8 |              8 | mail.expressrx.info   |
mail.expressrx.info   | 2003-11-18 13:21:41
              9 |              9 | approveddeals.com     |
approveddeals.com     | 2003-11-18 13:21:41
             10 |             10 | conceptholidays.co.uk |
conceptholidays.co.uk | 2003-11-18 13:21:48


The msg_sender_num and env_sender_num come from another table of unique
names of senders. What I am attempting to do is see how many times
msg_sender_num 1 claims to be a different env_sender_num. So I have to
find all the entries in msg_sender_num equal to 1 and build a count of
the distinct numbers in env_sender_num. This number is then used later
to say that if a msg_sender_num claims to be more then 2
env_sender_num's then the sender is a spammer and gets added to a list. 

Everything is working fine except the SQL above. It takes WAY to long to
process on a 500000+ record database. Hell it takes 12 seconds or so on
a 50000 record database. I have included the query plan to show that the
indexes are being used.

Query Plan:

Aggregate  (cost=0.00..166.16 rows=264 width=8) (actual
time=0.98..7768.19 rows=62911 loops=1)
   ->  Group  (cost=0.00..159.57 rows=2635 width=8) (actual
time=0.56..3179.14 rows=80466 loops=1)
         ->  Index Scan using record_of_claims_env_sender_num_idx on
record_of_claims  (cost=0.00..152.99 rows=2635 width=8) (actual
time=0.55..2240.15 rows=80466 loops=1)

Total runtime: 7931.63 msec


Is there a better, read "Faster", way to achieve this?


Jerry Wintrode
Very Newbie Postgres User







---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])

Reply via email to