After using SpamAssassin for a number of years, I'm finally getting
around to implementing Bayesian filters. For my particular setup, the
bulk of my users are non-technical users who make POP connections
(although there are some that use IMAP clients, both offline and
webmail). Thus, I want to apply Bayesian learning on a server-wide
basis, rather than a per-user basis.
I do recognize the challenges of keeping both spamtraps and hamtraps
adequately fed, and I've got a couple of ideas of how to facilitate
that.
I have most of the basics figured out and working in a test situation,
but I'm not yet confident enough in the details to try applying this to
a production server.
My server is running Debian Lenny with sendmail 8.14.3, MIMEDefang 2.64,
SpamAssassin 3.3.1 (taken from the Debian lenny-backports branch), and
cyrus-imapd 2.2.13. I use MIMEDefang to call SpamAssassin, rather than
using spamc/spamd, and my SA configuration is in
/etc/mail/sa-mimedefang.cf.
I've also found sa-learn-cyrus, and that appears to be working well, but
I'm not sure if it's necessarily doing everything I need -- thus, if
there's a different method of scanning Cyrus-format mailboxes, I'm quite
willing to try that.
Areas of question:
1) I'm assuming that I want to run sa-learn-cyrus as the same user ID as
is used to run SpamAssassin (mail:mail).
2) I'm struggling a bit with location of the Bayesian database. In
sa-mimedefang.cf, I have specified:
    bayes_path /var/spamassassin/bayes/bayes
    bayes_file_mode 0777
but it looks like sa-learn-cyrus is ignoring that, even though I have
    prefs_file = /etc/mail/sa-mimedefang.cf
included in /etc/spamassassin/sa-learn-cyrus.conf. As a result, when I
run sa-learn-cyrus, the Bayesian data is being located in
~mail/.spamassassin, which on my system is /var/mail/.spamassassin. The
data is all there correctly, it's just not where I would choose to put
it, but maybe that's not a problem.
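For question 2, one way to confirm which location is actually being written is to look for the database files under each candidate prefix. The following is only a minimal sketch: it assumes the default DBM storage, where bayes_path is a filename prefix to which SpamAssassin appends suffixes such as _toks and _seen, and the function names are made up for illustration.

```python
import os


def bayes_db_files(bayes_path):
    """Return the DBM files expected for a given bayes_path prefix.

    Assumes the default DBM store, where bayes_path is a filename
    prefix and suffixes such as _toks and _seen are appended to it.
    """
    return [bayes_path + "_toks", bayes_path + "_seen"]


def find_bayes_dbs(candidate_paths):
    """Map each candidate bayes_path to the list of its files that exist."""
    found = {}
    for base in candidate_paths:
        found[base] = [f for f in bayes_db_files(base) if os.path.exists(f)]
    return found


def main():
    # Both prefixes are taken from the setup described above.
    candidates = [
        "/var/spamassassin/bayes/bayes",  # bayes_path in sa-mimedefang.cf
        "/var/mail/.spamassassin/bayes",  # default prefix under ~mail/.spamassassin
    ]
    for base, files in find_bayes_dbs(candidates).items():
        print(base, "->", files if files else "no database files")
```

Calling main() as the mail user before and after a sa-learn-cyrus run would show which prefix gained files. It does not explain why the other one is ignored.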
3) For ongoing usage, I will offer my users who do make use of IMAP
accounts the option of submitting spam and ham samples via learn.ham and
learn.spam folders (as per documentation of sa-learn-cyrus). However,
for POP users, I haven't yet figured out a way to let them make
occasional sample submissions. My primary concern is being able to
respond when a legitimate message gets scored with BAYES_99, and getting
that cleared, but by the same token, I do want to allow for submissions
of spam that is reaching live users but not hitting my spamtraps.
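For question 3, one possibility I can imagine for POP users is having them forward samples as attachments to a dedicated address, then unpacking the attached originals on the server before learning from them. This is only a sketch of the unpacking step using Python's standard email module. The function name is hypothetical, and how the unpacked messages get delivered into a learn.spam or learn.ham folder is left open.

```python
import email


def _collect(part, out):
    """Gather attached messages without descending into them."""
    if part.get_content_type() == "message/rfc822":
        # A message/rfc822 part holds the attached message as its payload.
        out.append(part.get_payload(0).as_bytes())
        return
    if part.is_multipart():
        for sub in part.get_payload():
            _collect(sub, out)


def extract_forwarded_samples(raw_bytes):
    """Return each message attached to a forwarded e-mail, as bytes.

    The attached messages are re-serialized by the email module, so the
    result is not guaranteed to be byte-identical to what was received.
    """
    wrapper = email.message_from_bytes(raw_bytes)
    samples = []
    _collect(wrapper, samples)
    return samples
```

This only helps if the mail client forwards "as attachment". An inline forward loses the original headers, so it returns an empty list for those.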
4) I run several servers in parallel. My spamtraps indicate that some
spam operations hit user ids on two or more of my servers, while other
ops seem to have only user addresses on a single server. Is there a way
of feeding the Bayesian data on one server to the other servers?
Most of the spam data I work from is what's hitting my spamtraps, which
I currently handle with a little judicious use of rules via an IMAP
client (e.g., copying content to learn.ham folders on accounts on each
server), but I'm wondering if there's an easier way. From my reading of
previous
discussions, I know that sharing database files is something that is
best avoided, so I'm thinking that a better approach is simply getting
the message traffic copied from one server to another, and then letting
sa-learn-cyrus learn the content on each server. At that point, the
question is how to get content copied from a Cyrus mailbox on one
server to a mailbox on another server via scripting, rather than having
to play with an IMAP client. But maybe that's a Cyrus-specific question.
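For the scripted copy in question 4, the sort of thing I have in mind is fetching each message over IMAP from one server and appending it to a folder on another, using Python's standard imaplib. This is a rough sketch, not something I have run against Cyrus. The function names are hypothetical, and connect() assumes the servers accept IMAP over SSL.

```python
import imaplib


def connect(host, user, password):
    """Open an authenticated IMAP connection (assumes IMAP over SSL)."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    return conn


def copy_mailbox(src, dst, src_mailbox, dst_mailbox):
    """Copy every message in src_mailbox to dst_mailbox; return the count."""
    status, _ = src.select(src_mailbox, readonly=True)
    if status != "OK":
        raise RuntimeError("cannot select %s" % src_mailbox)
    status, data = src.search(None, "ALL")
    if status != "OK":
        raise RuntimeError("search failed on %s" % src_mailbox)
    copied = 0
    for num in data[0].split():
        status, parts = src.fetch(num, "(RFC822)")
        if status != "OK":
            continue  # skip messages that cannot be fetched
        raw = parts[0][1]  # full message as bytes
        # No flags and no explicit date: the destination server picks its own.
        status, _ = dst.append(dst_mailbox, None, None, raw)
        if status == "OK":
            copied += 1
    return copied
```

Usage would be roughly copy_mailbox(connect(host_a, ...), connect(host_b, ...), "INBOX.learn.spam", "INBOX.learn.spam"). The hosts, credentials and exact folder names are placeholders that depend on each server's namespace settings. Re-running it copies the same messages again, so the source folder would need to be emptied or tracked separately.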
Thanks in advance for advice.
Smith