RE: slow sql bayes store

2006-08-10 Thread Gary W. Smith
In the past we have seen some slowness with both bayes and AWL (mostly AWL).
We found that after a couple million rows in AWL the system starts
getting really slow.  We set up a script to prune records that have a high
bayes threshold but a count of 1 (usually anonymous spammers).  They
will not use that IP/sender combination again anyway.  This keeps it
nice and tidy.
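
A minimal sketch of what such a pruning script might look like. The actual script is not shown in this thread, so everything here is an assumption: the table is modelled on the stock SpamAssassin awl layout (username, email, ip, count, totscore), "high bayes threshold" is read as a high stored totscore, the cutoff of 10.0 is hypothetical, and an in-memory SQLite database stands in for the real server.

```python
import sqlite3

# Hypothetical cutoff: the post does not say what counts as "high".
SCORE_THRESHOLD = 10.0


def prune_single_hit_rows(conn, threshold=SCORE_THRESHOLD):
    """Delete AWL rows seen exactly once whose stored score exceeds threshold."""
    cur = conn.execute(
        "DELETE FROM awl WHERE count = 1 AND totscore > ?", (threshold,)
    )
    conn.commit()
    return cur.rowcount  # number of rows removed


def demo():
    # In-memory SQLite stands in for the real AWL database (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE awl (username TEXT, email TEXT, ip TEXT, "
        "count INTEGER, totscore REAL, PRIMARY KEY (username, email, ip))"
    )
    rows = [
        ("mail", "spammer@example.net", "203.0", 1, 25.3),   # one hit, high score
        ("mail", "newcomer@example.com", "198.51", 1, 0.5),  # one hit, low score
        ("mail", "friend@example.org", "192.0", 40, -80.0),  # regular sender
    ]
    conn.executemany("INSERT INTO awl VALUES (?, ?, ?, ?, ?)", rows)
    removed = prune_single_hit_rows(conn)
    remaining = [r[0] for r in conn.execute("SELECT email FROM awl ORDER BY email")]
    return removed, remaining
```

Only the single-hit, high-score row is deleted; single-hit senders with a low score and regular senders are left alone.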

As for the bayes, you might want to manually expire the old tokens.
That might help.
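
The stock way to do this is sa-learn --force-expire. For illustration only, here is a simplified sketch of what an age-based expiry amounts to on a SQL store. It is not SpamAssassin's own expiry logic; the table is a cut-down stand-in for bayes_token (id, token, atime), the 12-hour window is a hypothetical value, and SQLite stands in for the real database.

```python
import sqlite3


def expire_old_tokens(conn, user_id, max_age_seconds):
    """Delete one user's tokens older than (newest atime - max_age_seconds)."""
    newest = conn.execute(
        "SELECT MAX(atime) FROM bayes_token WHERE id = ?", (user_id,)
    ).fetchone()[0]
    if newest is None:
        return 0  # no tokens stored for this user
    cutoff = newest - max_age_seconds
    cur = conn.execute(
        "DELETE FROM bayes_token WHERE id = ? AND atime < ?", (user_id, cutoff)
    )
    conn.commit()
    return cur.rowcount


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token (id INTEGER, token TEXT, atime INTEGER, "
        "PRIMARY KEY (id, token))"
    )
    now = 1_000_000  # arbitrary illustrative timestamp
    conn.executemany(
        "INSERT INTO bayes_token VALUES (1, ?, ?)",
        [("fresh", now), ("recent", now - 3600), ("stale", now - 90 * 86400)],
    )
    removed = expire_old_tokens(conn, user_id=1, max_age_seconds=43200)
    left = conn.execute("SELECT COUNT(*) FROM bayes_token").fetchone()[0]
    return removed, left
```

The cutoff is measured from the newest token rather than the wall clock, so a store that has been idle for a while is not emptied in one pass.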

But this is just a guess at this time.  What would be more useful are
things like the number of records you have in the db, the hardware of
the DB (memory, etc), and any other good information that might help
make a better guess.

Gary Wayne Smith

 -----Original Message-----
 From: David Morton [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 10, 2006 2:28 PM
 To: users@spamassassin.apache.org
 Subject: slow sql bayes store
 
 Greetings...
 
 On the Maia Mailguard mailing list, we have encountered a number of folks
 (myself included) that are seeing some slow performance in the bayes storage
 when using mysql (innodb engine), taking anywhere from .5 to 10 seconds to
 store/update all the tokens for a message.   Has anyone else seen this?
 
 
 
 --
 David Morton
 Maia Mailguard- http://www.maiamailguard.com
 Morton Software Design and Consulting - http://www.dgrmm.net


Re: slow sql bayes store

2006-08-10 Thread David Morton
This has been seen on a variety of systems, from my own small 1 GHz AMD system
to a dual Xeon with SCSI drives.

On my somewhat slowish system:

 sa-learn --dump magic
0.000  0           3  0  non-token data: bayes db version
0.000  0        2937  0  non-token data: nspam
0.000  0       30745  0  non-token data: nham
0.000  0      130608  0  non-token data: ntokens
0.000  0  1148246665  0  non-token data: oldest atime
0.000  0  1155262955  0  non-token data: newest atime
0.000  0           0  0  non-token data: last journal sync atime
0.000  0  1153091434  0  non-token data: last expiry atime
0.000  0     4847556  0  non-token data: last expire atime delta
0.000  0       13576  0  non-token data: last expire reduction count


On a fast system with 10k SATA Raptors:

sa-learn --dump magic

0.000  0           3  0  non-token data: bayes db version
0.000  0     9685479  0  non-token data: nspam
0.000  0      794330  0  non-token data: nham
0.000  0      143002  0  non-token data: ntokens
0.000  0  1155209840  0  non-token data: oldest atime
0.000  0  1155260496  0  non-token data: newest atime
0.000  0           0  0  non-token data: last journal sync atime
0.000  0  1155253048  0  non-token data: last expiry atime
0.000  0       43200  0  non-token data: last expire atime delta
0.000  0           0  0  non-token data: last expire reduction count
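
The atime values in these dumps are Unix timestamps, so the window of tokens each store holds can be read straight off them. A small sketch of pulling the numbers out of sa-learn --dump magic output; the function names are made up for illustration, and the parsing assumes the whitespace-separated layout shown above, with the value in the third column.

```python
SLOW_DUMP = """\
0.000  0  2937  0  non-token data: nspam
0.000  0  30745  0  non-token data: nham
0.000  0  130608  0  non-token data: ntokens
0.000  0  1148246665  0  non-token data: oldest atime
0.000  0  1155262955  0  non-token data: newest atime
"""

FAST_DUMP = """\
0.000  0  9685479  0  non-token data: nspam
0.000  0  794330  0  non-token data: nham
0.000  0  143002  0  non-token data: ntokens
0.000  0  1155209840  0  non-token data: oldest atime
0.000  0  1155260496  0  non-token data: newest atime
"""


def parse_magic(dump_text):
    """Return {label: int value} for the 'non-token data' lines of a dump."""
    values = {}
    for line in dump_text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        # For these lines the value sits in the third numeric column.
        values[label.strip()] = int(fields.split()[2])
    return values


def atime_span_days(values):
    """Days between the oldest and newest token access times."""
    return (values["newest atime"] - values["oldest atime"]) / 86400
```

Run against the two dumps, this gives a span of roughly 81 days on the slow system and roughly 14 hours on the fast one, even though both hold a similar number of tokens.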


We have experimented with auto expiry, and batch expiry at night. So far, I
haven't found a suitable answer.

One factor I'm testing, though I don't think it made a difference: I reversed
the order of the columns in the index, since all of our mail is stored under
one user, so user id = 1 isn't much of an index.  Still, it doesn't seem to help.
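
One way to put a number on "isn't much of an index" is to compare how many distinct values each key column has. The sketch below does that against a cut-down stand-in for the bayes_token table; the two-column layout, the generated tokens, and the use of in-memory SQLite are all assumptions for illustration, not the real schema or data.

```python
import sqlite3


def column_selectivity(conn):
    """Return (id selectivity, token selectivity) as distinct values / rows."""
    distinct_ids, distinct_tokens, total = conn.execute(
        "SELECT COUNT(DISTINCT id), COUNT(DISTINCT token), COUNT(*) "
        "FROM bayes_token"
    ).fetchone()
    if total == 0:
        return 0.0, 0.0
    return distinct_ids / total, distinct_tokens / total


def demo(num_tokens=1000):
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bayes_token (id INTEGER, token TEXT, "
        "PRIMARY KEY (id, token))"
    )
    # Every token is stored under the single user id 1, as described above.
    conn.executemany(
        "INSERT INTO bayes_token VALUES (1, ?)",
        ((f"tok{n}",) for n in range(num_tokens)),
    )
    return column_selectivity(conn)
```

With a single user the id column has a selectivity near zero while token is fully selective. Lookups that supply both id and token can use the full composite key in either column order, so the low-selectivity leading column may matter less than it first appears, which would be consistent with the reordering not helping.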

I'm intuitively guessing that it takes a while to write the new indexes, but I
don't have anything to substantiate that.


--
David Morton
Maia Mailguard- http://www.maiamailguard.com
Morton Software Design and Consulting - http://www.dgrmm.net


RE: slow sql bayes store

2006-08-10 Thread Gary W. Smith
Are you running dstat (or vmstat/iostat) on the SQL server?  I'd be
interested in seeing what the disk/mem/procs are doing under load.
We don't have 9m rows in ours, but with 1m rows and a simple
processor (HT) with 4 GB of RAM it works without any significant
problems.

At the outside we are seeing 3 seconds to process a message under load
(avg 5k), and about .5 sec normally.  The average is the same whether we
have 1 message or 5 messages coming through (per server; we have 4 of
them).

Is the database on the same box as SA?  Ours is not, so we have to count
a little latency in ours as well.

