Bayes implementation questions

NFN Smith Thu, 03 Jun 2010 14:15:38 -0700

After using SpamAssassin for a number of years, I'm finally getting around to implementing Bayesian filters. In my particular setup, the bulk of my users are non-technical and make POP connections (although some use IMAP clients, both offline and webmail). Thus, I want to apply Bayesian learning on a server-wide basis, rather than a per-user basis.

On this one, I do recognize the challenges of keeping both spamtraps and hamtraps adequately fed, and I've got a couple of ideas of how to facilitate that.

I have most of the basics figured out and working in a test situation, but I'm not yet confident enough in the details to try applying this to a production server.

My current setup is that my server is running Debian Lenny, and on that, I'm running sendmail 8.14.3, MIMEDefang 2.64, SpamAssassin 3.3.1 (taken from the Debian lenny-backports branch), and cyrus-imapd 2.2.13.

In this setup, I use MIMEDefang to call SpamAssassin, rather than using spamc/spamd, and my SA configuration lives in /etc/mail/sa-mimedefang.cf.

I've also found sa-learn-cyrus, and that appears to be working well, but I'm not sure if it's necessarily doing everything I need -- thus, if there's a different method of scanning Cyrus-format mailboxes, I'm quite willing to try that.


Areas of question:

1) I'm assuming that I want to run sa-learn-cyrus as the same user ID as is used to run SpamAssassin (mail:mail).

2) I'm struggling a bit with the location of the Bayesian database. In sa-mimedefang.cf, I have specified:


   bayes_path /var/spamassassin/bayes/bayes
   bayes_file_mode 0777

but it looks like sa-learn-cyrus is ignoring that, even though I have

   prefs_file = /etc/mail/sa-mimedefang.cf

included in /etc/spamassassin/sa-learn-cyrus.conf. As a result, when I run sa-learn-cyrus, the Bayesian data ends up in ~mail/.spamassassin, which on my system is /var/mail/.spamassassin. The data is all there correctly; it's just not where I would choose to put it, but maybe that's not a problem.
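
One way to check which database actually receives the tokens is to call sa-learn directly with its --dbpath option, bypassing sa-learn-cyrus. The sketch below only builds the argument list. The helper name sa_learn_command is hypothetical, and the default path is simply the bayes_path value quoted above. It does not settle whether sa-learn-cyrus itself can be made to honor that path.

```python
# Path copied from the bayes_path line above; adjust to the local setup.
BAYES_PATH = "/var/spamassassin/bayes/bayes"


def sa_learn_command(action, targets=(), dbpath=BAYES_PATH):
    """Build an sa-learn argument list that names the Bayes database explicitly.

    action is "spam", "ham" or "dump" (hypothetical helper, not part of
    SpamAssassin or sa-learn-cyrus).
    """
    if action == "dump":
        # "--dump magic" prints only the summary counters of the database.
        return ["sa-learn", "--dbpath", dbpath, "--dump", "magic"]
    if action not in ("spam", "ham"):
        raise ValueError("action must be 'spam', 'ham' or 'dump'")
    if not targets:
        raise ValueError("learning needs at least one file or directory")
    # "--spam" / "--ham" followed by the message files or directories to learn.
    return ["sa-learn", "--dbpath", dbpath, "--" + action] + list(targets)
```

Passing the result of sa_learn_command("dump") to subprocess.run as the mail user, once with the default path and once with dbpath="/var/mail/.spamassassin/bayes", should show which of the two databases has growing nspam and nham counters.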

3) For ongoing usage, I will offer my users who do make use of IMAP accounts the option of submitting spam and ham samples via learn.ham and learn.spam folders (as per documentation of sa-learn-cyrus). However, for POP users, I haven't yet figured out a way to let them make occasional sample submissions. My primary concern is for response on the occasions when a legitimate message gets scored with BAYES_99, and getting that cleared, but by the same token, I do want to allow for submissions of stuff that may be reaching live users, but not hitting my spamtraps.
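
One possible approach for POP users, which is an assumption and not something the setup above already does, is to have them forward a sample as an attachment to a dedicated collection address, and then unpack the attached original before learning from it. The sketch below shows only the unpacking step, using Python's standard email package. The function name extract_forwarded is hypothetical, and how the forwarding message reaches the script is left open.

```python
from email import message_from_bytes, policy


def extract_forwarded(raw_bytes):
    """Return the original messages attached to a forwarding message, as bytes.

    Only message/rfc822 attachments are picked up; a message forwarded
    inline has lost its original headers and is not recovered here.
    """
    wrapper = message_from_bytes(raw_bytes, policy=policy.default)
    originals = []
    for part in wrapper.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the embedded message object.
            embedded = part.get_payload(0)
            originals.append(embedded.as_bytes())
    return originals
```

Each returned message could then be filed into a learn.spam or learn.ham folder, so that the regular sa-learn-cyrus run picks it up like any IMAP submission.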

4) I run several servers in parallel. My spamtraps indicate that some spam operations hit user IDs on two or more of my servers, while other ops seem to have only user addresses on a single server. Is there a way of feeding the Bayesian data on one server to the other servers?

Most of the spam data I work from is in what's hitting my spamtraps, and I get by with a little judicious use of rules via an IMAP client (e.g., copying content to learn.ham folders on accounts on each server), but I'm wondering if there's an easier way. From my reading of previous discussions, I know that sharing database files is something that is best avoided, so I'm thinking that a better approach is simply getting the message traffic copied from one server to another, and then letting sa-learn-cyrus learn the content on each server. At that point, the question is how to get content copied from a Cyrus mailbox on one server to a mailbox on another server via scripting, rather than having to play with an IMAP client. But maybe that's a Cyrus-specific question.
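
Since Cyrus speaks IMAP, one server-neutral way to script such a copy is over IMAP itself. The following is a minimal sketch using Python's standard imaplib, not a Cyrus-specific tool. The function name copy_mailbox, the host names, and the credentials in the usage comment are all hypothetical, and the folder names depend on the namespace configured on each server.

```python
def copy_mailbox(src, dst, src_folder, dst_folder):
    """Copy every message in src_folder on src into dst_folder on dst.

    src and dst are already-authenticated imaplib connections.
    Returns the number of messages copied.
    """
    # Open the source read-only so that flags there are left untouched.
    status, _ = src.select(src_folder, readonly=True)
    if status != "OK":
        raise RuntimeError("cannot open source folder %r" % src_folder)
    status, data = src.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed on %r" % src_folder)
    copied = 0
    for num in data[0].split():
        status, parts = src.fetch(num, "(RFC822)")
        if status != "OK":
            continue
        raw = parts[0][1]  # full message, headers and body, as bytes
        status, _ = dst.append(dst_folder, None, None, raw)
        if status != "OK":
            raise RuntimeError("append to %r failed" % dst_folder)
        copied += 1
    return copied


# Usage (hypothetical hosts, account and password):
#   import imaplib
#   src = imaplib.IMAP4_SSL("mail1.example.com"); src.login("trap", "secret")
#   dst = imaplib.IMAP4_SSL("mail2.example.com"); dst.login("trap", "secret")
#   copy_mailbox(src, dst, "learn.spam", "learn.spam")
```

Run from cron ahead of sa-learn-cyrus on the receiving server, this would let each server learn from the same samples while keeping its own database, in line with the advice against sharing database files. It copies everything on each run, so the source folder would need to be emptied or tracked separately to avoid repeated copies.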


Thanks in advance for advice.

Smith
