At 11:55 AM 5/16/04 -0500, Jim Sabatke wrote:
Should I save my ***SPAM**** flagged emails for training Spamassassin, or is this a waste of time as it already knows it's spam?

Training on spam that SA already recognizes is definitely worthwhile. Even if it hits BAYES_99, it's still worth training.


After all, if there was no point in training detected spam, why would SA have the autolearn feature?

The reason it's valuable is that the bayes engine doesn't learn whole emails; it learns the tokens (words, domains, etc.) in the emails.

Since spam tends to "mutate over time", it's quite common for an email to contain a newly obfuscated spelling of a spam word that SA hasn't seen before. There may be enough other high-scoring words for SA to call it spam and/or BAYES_99, but training on that email can still be valuable in helping SA recognize future emails with more obfuscations in them.
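To make the token argument concrete, here is a toy sketch in Python. It is not SpamAssassin's actual Bayes code; the tokenizer, the ToyBayes class, the smoothing and the sample messages are all made up for illustration. It only shows how training on an already-detected spam moves the score of a token the filter had never seen.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase runs of letters, digits and a few symbols.
TOKEN_RE = re.compile(r"[a-z0-9$@!]+")

def tokenize(text):
    """Return the set of distinct tokens in a message."""
    return set(TOKEN_RE.findall(text.lower()))

class ToyBayes:
    """Toy per-token learner; an assumption for illustration, not SA's algorithm."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def learn(self, text, is_spam):
        # Count each token once per message, in the spam or ham table.
        tokens = tokenize(text)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def token_spamminess(self, token):
        # Smoothed fraction of spam vs. ham messages containing the token.
        p_spam = (self.spam_tokens[token] + 1) / (self.spam_msgs + 2)
        p_ham = (self.ham_tokens[token] + 1) / (self.ham_msgs + 2)
        return p_spam / (p_spam + p_ham)

model = ToyBayes()
model.learn("cheap viagra pills online", is_spam=True)
model.learn("buy cheap pills now", is_spam=True)
model.learn("meeting notes attached", is_spam=False)
model.learn("lunch on friday", is_spam=False)

# "v1agra" is a new obfuscation the model has never seen.
before = model.token_spamminess("v1agra")

# This message would already be caught via "cheap", "pills" and "online",
# but training on it still teaches the new token.
model.learn("cheap v1agra pills online", is_spam=True)
after = model.token_spamminess("v1agra")

print(before, after)
```

The message was detectable before training, yet only after training does the obfuscated token carry any weight of its own for future mail.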

Now, I'd still give higher priority to training false negatives (spam that SA missed), but I'd train detected spam as well, especially since autolearning is so rare.

Notes:
1) sa-learn can handle marked-up spam without the markup being stripped first, as long as the markup was added by SpamAssassin and not some other tool.
2) Don't worry if some of your spam was already autolearned; SA will skip those messages when you try to re-learn them. (Of course, if you already keep autolearned mail separate, don't bother training it. There really is no point in that case.)
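
As a rough sketch of what feeding a folder of flagged mail to sa-learn might look like from a script, the Python below builds and runs the command. It assumes sa-learn is on your PATH; the helper names and the mbox filename are hypothetical, and only the standard --spam, --ham and --mbox options are used.

```python
import subprocess

def sa_learn_command(path, is_spam=True, mbox=False):
    """Build the sa-learn argument list (hypothetical helper)."""
    cmd = ["sa-learn", "--spam" if is_spam else "--ham"]
    if mbox:
        # Tell sa-learn the path is a single mbox file holding many messages.
        cmd.append("--mbox")
    cmd.append(path)
    return cmd

def train(path, is_spam=True, mbox=False):
    """Run sa-learn on the given file or directory; raises if it exits non-zero."""
    return subprocess.run(sa_learn_command(path, is_spam, mbox), check=True)

# Example (hypothetical filename): train("caught-spam.mbox", mbox=True)
```

Because of the two notes above, a script like this should not need to strip SA's markup first or filter out messages that were already learned.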


