Testing Bayes filters
I saw a number of posts on this list earlier indicating that Bayesian filter learning and/or application of learned information wasn't working properly if the Bayesian analysis data were stored in a MySQL database, as is the case on my server at fmp.com.

I have a couple of questions. What's the status of this bug, if it is one? Or, if it's a misconfiguration issue, what should I know to avoid it?

Is there any simple method to test Bayesian filter learning and filtering so that I can see the results in a spam score before and after a spam is learned?

My SA installation here is on a commercial server, and is in beta until I can determine whether or not it's working as expected. My wife and I are beta testers until I determine that everything is working properly, at which point I'll turn it loose on my customers :-)

-- 
Lindsay Haisley       | In an open world,  | PGP public key
FMP Computer Services | who needs Windows  | available at
512-259-1190          | or Gates           | http://pubkeys.fmp.com
http://www.fmp.com    |                    |
Re: Testing Bayes filters
> I saw a number of posts on this list earlier indicating that Bayesian filter learning and/or application of learned information wasn't working properly if the Bayesian analysis data were stored in a MySQL database
>
> What's the status of this bug, if it is one, or if it's a misconfiguration issue, what should I know to avoid it?

I have been using Bayes with MySQL for about 2 years and have found it to work perfectly. I have experienced no bugs. In comparison, my previous configuration with the default db files did not work well at all.

I installed according to the manual. It is not a big server (about 15 users), so I use a global database with a fixed user.

My bayes-related and awl-related configuration from local.cf:

    bayes_expiry_max_db_size 50
    bayes_sql_override_username mail
    bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
    bayes_sql_dsn DBI:mysql:sa:my-server-name.domain.com
    bayes_sql_username dbuser
    bayes_sql_password dbpassw
    bayes_ignore_header X-Account-Key
    bayes_ignore_header X-UIDL
    bayes_ignore_header X-Mozilla-Status
    bayes_ignore_header X-Mozilla-Status2
    bayes_ignore_header X-Spam-Flag
    bayes_ignore_header X-Spam-Status

    use_auto_whitelist 1
    user_awl_sql_override_username mail
    auto_whitelist_factory Mail::SpamAssassin::SQLBasedAddrList
    user_awl_dsn DBI:mysql:sa:my-server.name.domain.com
    user_awl_sql_username dbuser
    user_awl_sql_password dbpassw
    user_awl_sql_table awl

My bayes and awl tables were created according to the manual, but I added a timestamp column to the awl table and to the bayes_seen table to be able to expire them by date.

Additionally, I added a feature to learn from spam and nonspam imap folders, where I manually copy spam or ham that was not already auto-learnt. I didn't change anything with the default scores: 5 is still the spam threshold and 3.5 is still the bayes_99 score when used together with network tests.

An interesting observation: spam messages that are half spam and half unrelated random mumbo-jumbo, which is presumably meant to confuse Bayes filters, in fact almost always score bayes_99. I can only imagine that the additional random text is not really random but taken from a fixed library that is not very big and not changed very often.

Alex
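
The date-based expiry described in this message could look roughly like the following Python sketch. It makes several assumptions: the thread never names the added timestamp column, so `lastupdate` is hypothetical; the 90-day cutoff in the usage note is arbitrary; and SQLite stands in for MySQL only so the sketch runs without a server. The demo tables are simplified stand-ins, not the real awl and bayes_seen schemas.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumptions: "lastupdate" is a hypothetical name for the added timestamp
# column, and SQLite stands in for MySQL so the sketch is self-contained.
TS_COLUMN = "lastupdate"
TABLES = ("awl", "bayes_seen")


def expire_old_rows(conn, max_age_days, now=None):
    """Delete rows older than max_age_days; return rows deleted per table."""
    now = now or datetime.now()
    cutoff = (now - timedelta(days=max_age_days)).strftime("%Y-%m-%d %H:%M:%S")
    deleted = {}
    for table in TABLES:
        # Table and column names come from the constants above, never from
        # user input, so interpolating them into the statement is safe here.
        cur = conn.execute(
            f"DELETE FROM {table} WHERE {TS_COLUMN} < ?", (cutoff,)
        )
        deleted[table] = cur.rowcount
    conn.commit()
    return deleted


def make_demo_db():
    """Build an in-memory database with simplified stand-in tables."""
    conn = sqlite3.connect(":memory:")
    for table in TABLES:
        conn.execute(f"CREATE TABLE {table} (entry TEXT, {TS_COLUMN} TEXT)")
    return conn
```

Calling `expire_old_rows(conn, 90)` from a nightly cron job would be one way to use it. Against MySQL, the common Python drivers expect `%s` rather than `?` as the parameter placeholder.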
Re: Testing Bayes filters
On Sun, 2007-06-17 at 01:41 +0200, Alex Woick wrote:
> My bayes and awl tables were created according to the manual, but I added a timestamp column to the awl table and to the bayes_seen table to be able to expire them by date.

I've added these fields, with default=CURRENT_TIMESTAMP. When do you expire these records?

> Additionally, I added a feature to learn from spam and nonspam imap folders, where I manually copy spam or ham that was not already auto-learnt. I didn't change anything with the default scores: 5 is still the spam threshold and 3.5 is still the bayes_99 score when used together with network tests.

I've put together a similar setup using Courier's maildrop filtering and some Python scripts, still under development.

> An interesting observation: spam messages that are half spam and half unrelated random mumbo-jumbo, which is presumably meant to confuse Bayes filters, in fact almost always score bayes_99. I can only imagine that the additional random text is not really random but taken from a fixed library that is not very big and not changed very often.

Interesting!

-- 
Lindsay Haisley
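
On the original question of seeing a score before and after a spam is learned, one possible approach is sketched below in Python. It assumes the `spamassassin` and `sa-learn` commands are on the PATH and run as the user that owns the Bayes data, and that the message is saved to a file. The helper names are hypothetical. It scores the message with `spamassassin -t`, learns it with `sa-learn --spam`, then scores it again and compares the `X-Spam-Status` headers.

```python
import re
import subprocess

# Matches the X-Spam-Status header together with its folded continuation lines.
STATUS_RE = re.compile(r"^X-Spam-Status:(.*(?:\n[ \t].*)*)", re.MULTILINE)


def parse_spam_status(message_text):
    """Return (score, [test names]) from X-Spam-Status, or None if not found."""
    match = STATUS_RE.search(message_text)
    if not match:
        return None
    header = re.sub(r"\n[ \t]+", " ", match.group(1))  # unfold the header
    score = re.search(r"\bscore=(-?\d+(?:\.\d+)?)", header)
    tests = re.search(r"\btests=(.*?)(?:\s+\w+=|$)", header)
    if not score:
        return None
    names = []
    if tests:
        names = [t.strip() for t in tests.group(1).split(",") if t.strip()]
    if names == ["none"]:
        names = []
    return float(score.group(1)), names


def score_message(path):
    """Run `spamassassin -t` on one message file and parse the result."""
    with open(path, "rb") as fh:
        result = subprocess.run(
            ["spamassassin", "-t"], stdin=fh, capture_output=True, check=True
        )
    return parse_spam_status(result.stdout.decode("utf-8", errors="replace"))


def before_and_after(path):
    """Score a message, learn it as spam, then score it again."""
    before = score_message(path)
    subprocess.run(["sa-learn", "--spam", path], check=True)
    after = score_message(path)
    return before, after
```

If Bayes is working, a BAYES_* rule should show up in the returned test names. By default the Bayes rules stay silent until enough spam and ham have been learned (200 of each unless changed), so a fresh database will show no difference. `sa-learn --forget` on the same file undoes the training afterwards.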