After using SpamAssassin for a number of years, I'm finally getting around to implementing Bayesian filtering. Most of my users are non-technical and connect via POP, although some use IMAP (both offline clients and webmail). I therefore want to apply Bayesian learning server-wide rather than per-user.

I do recognize the challenge of keeping both spamtraps and hamtraps adequately fed, and I have a couple of ideas for how to handle that.

I have most of the basics figured out and working in a test setup, but I'm not yet confident enough in the details to apply this to a production server.

My server is running Debian Lenny with sendmail 8.14.3, MIMEDefang 2.64, SpamAssassin 3.3.1 (from the Debian lenny-backports branch), and cyrus-imapd 2.2.13.

I use MIMEDefang to call SpamAssassin, rather than spamc/spamd, and my SA configuration lives in /etc/mail/sa-mimedefang.cf.

I've also found sa-learn-cyrus, and that appears to be working well, but I'm not sure if it's necessarily doing everything I need -- thus, if there's a different method of scanning Cyrus-format mailboxes, I'm quite willing to try that.

Areas of question:

1) I'm assuming that I want to run sa-learn-cyrus as the same user ID as is used to run SpamAssassin (mail:mail).

2) I'm struggling a bit with location of the Bayesian database. In sa-mimedefang.cf, I have specified:

   bayes_path /var/spamassassin/bayes/bayes
   bayes_file_mode 0777

but it looks like sa-learn-cyrus is ignoring that, even though I have

   prefs_file = /etc/mail/sa-mimedefang.cf

included in /etc/spamassassin/sa-learn-cyrus.conf. As a result, when I run sa-learn-cyrus, the Bayesian data is being located in ~mail/.spamassassin, which on my system is /var/mail/.spamassassin. The data is all there correctly, it's just not where I would choose to put it, but maybe that's not a problem.
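To check which database a given prefs file actually resolves to, something like the following might help. It is only a sketch: it assumes sa-learn is on the PATH and relies on its -p (prefs file), --dbpath and --dump magic options. The helper names are made up for illustration.

```python
import subprocess

# Paths taken from the configuration quoted above.
PREFS = "/etc/mail/sa-mimedefang.cf"
BAYES = "/var/spamassassin/bayes/bayes"


def dump_magic_cmd(prefs=None, dbpath=None):
    """Build an sa-learn command line that prints the Bayes 'magic' counters."""
    cmd = ["sa-learn"]
    if prefs:
        cmd += ["-p", prefs]          # read settings from this prefs file
    if dbpath:
        cmd += ["--dbpath", dbpath]   # override the database location
    cmd += ["--dump", "magic"]
    return cmd


def dump_magic(prefs=None, dbpath=None):
    """Run the command and return its output. Run this as the mail user."""
    result = subprocess.run(
        dump_magic_cmd(prefs, dbpath),
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example (not executed here):
#   print(dump_magic())               # default location, ~mail/.spamassassin
#   print(dump_magic(prefs=PREFS))    # whatever bayes_path the prefs file yields
#   print(dump_magic(dbpath=BAYES))   # the location I intended
```

Comparing the nspam and nham counters across the three runs should show which set of files sa-learn-cyrus is really updating.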

3) Going forward, I will offer my IMAP users the option of submitting spam and ham samples via learn.ham and learn.spam folders (as per the sa-learn-cyrus documentation). For POP users, however, I haven't yet figured out a way to let them make occasional sample submissions. My primary concern is being able to respond when a legitimate message gets scored with BAYES_99 and getting that corrected, but I also want to allow submissions of spam that is reaching live users without hitting my spamtraps.
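One possibility for the POP users might be to have them forward the sample as an attachment to a dedicated submission address, and pull the original message back out on the server. The sketch below is untested against my setup and covers only the extraction step. It assumes the mail client attaches the sample as a message/rfc822 part, and the function name is hypothetical.

```python
import email
from email import policy


def extract_attached_messages(raw):
    """Return the raw bytes of every message/rfc822 attachment in `raw`.

    `raw` is the complete forwarded mail as bytes. Each returned item is
    the original sample with its own headers intact.
    """
    msg = email.message_from_bytes(raw, policy=policy.default)
    samples = []
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            payload = part.get_payload()
            # For message/rfc822 the payload is a one-element list
            # holding the embedded message object.
            if isinstance(payload, list) and payload:
                samples.append(payload[0].as_bytes())
    return samples
```

Each extracted sample could then be filed into the appropriate learn.ham or learn.spam folder, or piped to sa-learn --ham or --spam. Forwarding as an attachment rather than inline matters here because an inline forward loses the original headers.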

4) I run several servers in parallel. My spamtraps indicate that some spam operations hit user ids on two or more of my servers, while other ops seem to have only user addresses on a single server. Is there a way of feeding the Bayesian data on one server to the other servers?

Most of the spam data I work from is what's hitting my spamtraps, together with a little judicious use of rules via an IMAP client (e.g., copying content to learn.ham folders on accounts on each server), but I'm wondering if there's an easier way. From my reading of previous discussions, I know that sharing database files is best avoided, so I think a better approach is simply to copy the message traffic from one server to another and let sa-learn-cyrus learn the content on each server. The question then is how to copy content from a Cyrus mailbox on one server to a mailbox on another via scripting, rather than having to use an IMAP client. But maybe that's a Cyrus-specific question.
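For what it's worth, the scripted copy I have in mind would look roughly like the following, using Python's standard imaplib. This is a sketch only, not something I have run against my servers. The host names, credentials and folder names in the comments are placeholders, and the folder naming depends on how Cyrus namespaces are configured.

```python
import imaplib


def copy_folder(src, dst, src_folder, dst_folder):
    """Copy every message in src_folder (on src) into dst_folder (on dst).

    src and dst are logged-in imaplib.IMAP4 / IMAP4_SSL connections.
    Returns the number of messages appended on the destination.
    """
    # Open read-only so that fetching does not set \Seen on the source.
    typ, _ = src.select(src_folder, readonly=True)
    if typ != "OK":
        raise RuntimeError("cannot select %r" % src_folder)

    typ, data = src.search(None, "ALL")
    if typ != "OK" or not data or not data[0]:
        return 0

    copied = 0
    for num in data[0].split():
        typ, parts = src.fetch(num, "(RFC822)")
        if typ != "OK" or not parts or not isinstance(parts[0], tuple):
            continue
        raw = parts[0][1]                      # full message as bytes
        typ, _ = dst.append(dst_folder, None, None, raw)
        if typ == "OK":
            copied += 1
    return copied

# Example (placeholders, not executed here):
#   src = imaplib.IMAP4_SSL("mail1.example.com"); src.login("trap", "secret")
#   dst = imaplib.IMAP4_SSL("mail2.example.com"); dst.login("trap", "secret")
#   copy_folder(src, dst, "INBOX.learn.spam", "INBOX.learn.spam")
```

Run from cron shortly before sa-learn-cyrus on the receiving server, this would let each server learn from the same samples without sharing database files.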

Thanks in advance for advice.

Smith


