Hi Kevin, everyone,

Kevin Peuhkurinen wrote:

2.) There should be a framework within SpamAssassin that makes it easy for end-users to submit their spam for training. Currently there are all kinds of scripts available outside the main SpamAssassin distribution (I've written my own, too) that try to get the message out of the mail client or server, as close to the original as possible, to feed back to Bayes. Getting that right is close to impossible with some of the mail servers out there. SpamAssassin currently includes only half the Bayes interface: you can have auto-learning, but for manual learning or retraining you're largely on your own.

This I have to disagree with you on. SA is used on too many different types of systems in too many different environments for it to make any sort of sense to try to concoct a one-size-fits-all solution to learning. A better approach would be a one-stop source of information on how to implement learning in various environments, perhaps here: http://wiki.apache.org/spamassassin/BayesInSpamAssassin

I agree that this would be difficult, but right now we're all facing that difficulty on our own, so to speak. A more comprehensive Wiki would help, but my goal is to find a way of doing this that is independent of the rest of the mail-system, and can then become an integral part of SA.
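To give an idea of what such glue currently looks like, here is a rough Python sketch (not the script I actually use; the host, account and folder names are made up) that pulls messages out of a user's "report as spam" IMAP folder and pipes them to sa-learn:

#!/usr/bin/env python
# Sketch only: feed user-reported spam from an IMAP folder to sa-learn.
# The connection details below are placeholders; sa-learn must run as the
# account that owns the Bayes database.
import imaplib
import subprocess

IMAP_HOST = "imap.example.org"   # placeholder
IMAP_USER = "someuser"           # placeholder
IMAP_PASS = "secret"             # placeholder
SPAM_FOLDER = "Spam-to-learn"    # folder the user drags spam into

imap = imaplib.IMAP4_SSL(IMAP_HOST)
imap.login(IMAP_USER, IMAP_PASS)
imap.select(SPAM_FOLDER)

typ, data = imap.search(None, "ALL")
for num in data[0].split():
    typ, msg_data = imap.fetch(num, "(RFC822)")
    raw = msg_data[0][1]
    # Pipe the raw message to sa-learn; --spam trains Bayes on it.
    subprocess.run(["sa-learn", "--spam"], input=raw, check=True)
    # Mark the message as deleted once it has been learned.
    imap.store(num, "+FLAGS", "\\Deleted")

imap.expunge()
imap.logout()

Every site ends up writing some variation of this, which is exactly the duplication of effort I'd like to see go away.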


I agree that basing the classification on message IDs is "fragile", but I'm not sure that any other approach would be better. Perhaps an MD5 sum of the contents, not including headers or attachments? It would require a fair bit of testing of the various methods in real-world environments before you could authoritatively say that one method is clearly superior to the one currently used.

The current system works well if your mailbox is on the system where you run SpamAssassin and you can retrain from the command line. That covers only a small subset of all email users, though. Once the setup gets a bit more complicated and involves IMAP servers, forwarding and so on, you run into trouble.
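Just to make the alternative Kevin mentions concrete: a fingerprint could be taken over the decoded text parts only, ignoring headers and attachments, so the same message is still recognised after an IMAP server or a forwarding hop has rewritten its headers. A rough sketch, with MD5 chosen purely as an example digest:

import hashlib
from email import message_from_bytes

def body_fingerprint(raw_message):
    # Digest of the decoded text/* parts only. Headers and non-text
    # attachments are skipped, so header rewriting by IMAP servers or
    # forwarding MTAs does not change the fingerprint. Illustration only.
    msg = message_from_bytes(raw_message)
    md5 = hashlib.md5()
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers, attachments, images
        payload = part.get_payload(decode=True)  # undo base64 / quoted-printable
        if payload:
            # Normalise line endings so minor transport changes don't matter.
            md5.update(payload.replace(b"\r\n", b"\n"))
    return md5.hexdigest()

Whether something like this is actually more robust than message IDs is exactly the kind of thing that would need the real-world testing Kevin describes.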


4.) The Bayes subsystem should store this classification, and all the tokens it learned. That way we can be sure that we correctly unlearn an autolearned message. The entries in this database could be timestamped so they can be removed after some months, to prevent unlimited growth.

Sounds like a good idea. However, my Bayes database is already about 60MB. A significantly larger database may be a problem for some systems with limited storage space.

Fortunately, this would not increase the Bayes token database itself, only the Bayes_seen database, which is accessed only during (auto)learning, not during classification.
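To sketch what I mean by storing the classification and tokens with a timestamp (a made-up schema for illustration, not how the current Bayes_seen store works):

import sqlite3
import time

# Hypothetical side store; names and schema are invented for this example.
conn = sqlite3.connect("bayes_learned.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS learned (
        msg_id     TEXT,     -- message id or content fingerprint
        token      TEXT,     -- token that was (auto)learned
        is_spam    INTEGER,  -- 1 = learned as spam, 0 = learned as ham
        learned_at INTEGER   -- unix timestamp, used for expiry
    )
""")

def record_learning(msg_id, tokens, is_spam):
    # Remember exactly which tokens a message contributed.
    now = int(time.time())
    conn.executemany(
        "INSERT INTO learned VALUES (?, ?, ?, ?)",
        [(msg_id, tok, int(is_spam), now) for tok in tokens],
    )
    conn.commit()

def tokens_to_unlearn(msg_id):
    # Fetch the stored tokens so the message can be unlearned precisely.
    cur = conn.execute(
        "SELECT token, is_spam FROM learned WHERE msg_id = ?", (msg_id,))
    return cur.fetchall()

def expire(older_than_days=90):
    # Drop entries older than N days to keep this side database bounded.
    cutoff = int(time.time()) - older_than_days * 86400
    conn.execute("DELETE FROM learned WHERE learned_at < ?", (cutoff,))
    conn.commit()

Expiring entries after a few months keeps this side database bounded, while the token database itself stays the size it is today.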


Regards, Paul Boven.
