Testing Bayes filters

2007-06-16 Thread Lindsay Haisley
I saw a number of posts on this list earlier indicating that Bayesian
filter learning and/or application of learned information wasn't working
properly if the Bayesian analysis data were stored in a MySQL database,
as is the case on my server at fmp.com.  I have a couple of questions.

What's the status of this bug, if it is one, or if it's a
misconfiguration issue, what should I know to avoid it?

Is there any simple method to test Bayesian filter learning and
filtering so that I can see the results in a spam score before and after
a spam is learned?
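
One way to run such a before/after check, sketched here rather than taken from this thread: score the message with `spamassassin -t`, learn it with `sa-learn --spam`, then score it again and compare which BAYES_* rules fire. The helper names below (`parse_spam_status`, `bayes_hits`, `score_message`, `before_after`) are hypothetical, and the parsing assumes the default `X-Spam-Status` header layout with `score=` and `tests=` fields. Note that Bayes rules do not fire at all until the database holds enough learned spam and ham (200 of each by default), so a nearly empty database will show no change.

```python
import re
import subprocess

# Matches the X-Spam-Status header plus any folded continuation lines.
STATUS_RE = re.compile(r"^X-Spam-Status:(.*(?:\n[ \t].*)*)", re.MULTILINE)


def parse_spam_status(message_text):
    """Return (score, [rule names]) from X-Spam-Status, or None if absent."""
    match = STATUS_RE.search(message_text)
    if match is None:
        return None
    header = re.sub(r"\r?\n[ \t]+", " ", match.group(1))  # unfold the header
    header = re.sub(r",\s+", ",", header)                  # rejoin the rule list
    score = re.search(r"score=(-?\d+(?:\.\d+)?)", header)
    tests = re.search(r"tests=(\S*)", header)
    rules = tests.group(1).split(",") if tests else []
    rules = [r for r in rules if r and r != "none"]
    return (float(score.group(1)) if score else None, rules)


def bayes_hits(rules):
    """Keep only the Bayes rules (BAYES_00 ... BAYES_99) from a rule list."""
    return [r for r in rules if r.startswith("BAYES_")]


def score_message(path):
    """Run `spamassassin -t` on one message file and parse the result."""
    with open(path, "rb") as fh:
        result = subprocess.run(["spamassassin", "-t"], stdin=fh,
                                capture_output=True, check=True)
    return parse_spam_status(result.stdout.decode("utf-8", "replace"))


def before_after(path):
    """Score a message, learn it as spam, then score it again."""
    before = score_message(path)
    subprocess.run(["sa-learn", "--spam", path], check=True)
    after = score_message(path)
    return before, after
```

Calling `before_after("sample-spam.eml")` (a hypothetical file name) as the same user that the mail system uses for Bayes returns two `(score, rules)` pairs; passing each rule list to `bayes_hits` shows whether the message moved to a higher BAYES_* bucket. `sa-learn --forget` on the same file undoes the learning afterwards.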

My SA installation here is on a commercial server, and is in beta until
I can determine whether or not it's working as expected.  My wife and I
are beta testers until I determine that everything is working properly,
at which point I'll turn it loose on my customers :-)

-- 
Lindsay Haisley       | In an open world,  | PGP public key
FMP Computer Services | who needs Windows  |  available at
512-259-1190          |  or Gates          | http://pubkeys.fmp.com
http://www.fmp.com    |                    |



Re: Testing Bayes filters

2007-06-16 Thread Alex Woick

> I saw a number of posts on this list earlier indicating that Bayesian
> filter learning and/or application of learned information wasn't working
> properly if the Bayesian analysis data were stored in a MySQL database
>
> What's the status of this bug, if it is one, or if it's a
> misconfiguration issue, what should I know to avoid it?


I have been using Bayes with MySQL for about 2 years and have found it to
work perfectly. I have experienced no bugs. By comparison, my previous
configuration with the default db files did not work well at all.


I installed according to the manual. It is not a big server (about 15 
users), so I use a global database with a fixed user.

My bayes-related and awl-related configuration from local.cf:

bayes_expiry_max_db_size 50
bayes_sql_override_username mail
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn  DBI:mysql:sa:my-server-name.domain.com
bayes_sql_username dbuser
bayes_sql_password dbpassw

bayes_ignore_header X-Account-Key
bayes_ignore_header X-UIDL
bayes_ignore_header X-Mozilla-Status
bayes_ignore_header X-Mozilla-Status2
bayes_ignore_header X-Spam-Flag
bayes_ignore_header X-Spam-Status

use_auto_whitelist 1
user_awl_sql_override_username mail
auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
user_awl_dsn DBI:mysql:sa:my-server.name.domain.com
user_awl_sql_username dbuser
user_awl_sql_password dbpassw
user_awl_sql_table   awl

My bayes and awl tables were created according to the manual, but I 
added a timestamp column to the awl table and to the bayes_seen table to 
be able to expire them by date.
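
A minimal sketch of date-based expiry against such a timestamp column, under stated assumptions: the column name `lastupdate`, the 90-day window, and the function name `expire_older_than` are all hypothetical, and the demo uses an in-memory SQLite table with a simplified schema as a stand-in for the real MySQL tables. Only the table names `awl` and `bayes_seen` come from this thread.

```python
import sqlite3
from datetime import datetime, timedelta

# Only these tables are named in the thread; the column name is hypothetical.
EXPIRABLE_TABLES = ("awl", "bayes_seen")
TIMESTAMP_COLUMN = "lastupdate"


def expire_older_than(conn, table, days, now=None):
    """Delete rows whose timestamp is more than `days` old; return the count."""
    if table not in EXPIRABLE_TABLES:
        raise ValueError(f"refusing to expire unknown table: {table!r}")
    now = now or datetime.now()
    cutoff = (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")
    cur = conn.cursor()
    # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
    cur.execute(f"DELETE FROM {table} WHERE {TIMESTAMP_COLUMN} < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    # Stand-in table with a simplified schema, for demonstration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE awl (email TEXT, lastupdate TEXT)")
    conn.executemany("INSERT INTO awl VALUES (?, ?)", [
        ("old@example.com", "2007-01-01 00:00:00"),
        ("new@example.com", "2007-06-10 00:00:00"),
    ])
    print(expire_older_than(conn, "awl", days=90, now=datetime(2007, 6, 16)))
```

Run from cron against the real database, something like this would trim old rows on a schedule; how long to keep entries is a policy choice that the thread does not settle.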


Additionally, I added a feature to learn from spam and nonspam imap 
folders, where I manually copy spam or ham that was not already auto-learnt.
I didn't change anything with the default scores: 5 is still the spam 
threshold and 3.5 is still the bayes_99 score when used together with 
network tests.
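
The IMAP-folder learning described above could look roughly like the following sketch. It is not Alex's actual implementation: the function names `learn_command` and `learn_folder` are hypothetical, each message is written to a temporary file and handed to `sa-learn` by path, and the `run` parameter exists only so the external call can be swapped out for a dry run.

```python
import imaplib
import subprocess
import tempfile


def learn_command(kind, path):
    """Build the sa-learn command line for one message file."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"kind must be 'spam' or 'ham', not {kind!r}")
    return ["sa-learn", f"--{kind}", path]


def learn_folder(imap, folder, kind, run=subprocess.run):
    """Feed every message in an IMAP folder to sa-learn; return the count."""
    imap.select(folder, readonly=True)  # read-only: leave message flags alone
    _, data = imap.search(None, "ALL")
    count = 0
    for num in data[0].split():
        _, parts = imap.fetch(num, "(RFC822)")
        raw_message = parts[0][1]
        with tempfile.NamedTemporaryFile(suffix=".eml") as tmp:
            tmp.write(raw_message)
            tmp.flush()
            run(learn_command(kind, tmp.name), check=True)
        count += 1
    return count
```

With a logged-in connection, for example `imap = imaplib.IMAP4_SSL("imap.example.com")` followed by `imap.login(user, password)` (host and credentials are placeholders), `learn_folder(imap, "spam", "spam")` and `learn_folder(imap, "nonspam", "ham")` would cover the two folders. `sa-learn` skips messages it has already learned, so rerunning over the same folder should be harmless.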


An interesting observation: spam messages that are half spam and half
mumbo-jumbo of unrelated random text, presumably meant to confuse bayes
filters, in fact almost always score bayes_99. I can only imagine that
the additional text is not really random but is taken from a fixed
library that is not very big and does not change very often.


Alex


Re: Testing Bayes filters

2007-06-16 Thread Lindsay Haisley
On Sun, 2007-06-17 at 01:41 +0200, Alex Woick wrote:
> My bayes and awl tables were created according to the manual, but I
> added a timestamp column to the awl table and to the bayes_seen table to
> be able to expire them by date.

I've added these fields, with default=CURRENT_TIMESTAMP.

When do you expire these records?

> Additionally, I added a feature to learn from spam and nonspam imap
> folders, where I manually copy spam or ham that was not already auto-learnt.
> I didn't change anything with the default scores: 5 is still the spam
> threshold and 3.5 is still the bayes_99 score when used together with
> network tests.

I've put together a similar setup using Courier's maildrop filtering and
some python scripts, still under development.

> An interesting observation: spam messages that are half spam and half
> mumbo-jumbo of unrelated random text, presumably meant to confuse bayes
> filters, in fact almost always score bayes_99. I can only imagine that
> the additional text is not really random but is taken from a fixed
> library that is not very big and does not change very often.

Interesting!
