Re: SA Problem: spam with random words to defeat Baysian filtering ...

Bob George 12 Feb 2004 03:01:16 -0000

jdow <[EMAIL PROTECTED]> wrote:
> After watching the Bayes filter "learn" to auto white list
> spam when first installed I disabled the auto white list
> feature and explicitly generated lists of ham and spam.


AWL works well for me, but that may have been due to a combination of add-on
rules and luck. I've left it enabled, but scoring of spam has swung to such
extremes (a good thing) thanks to bayes and other rules that it really hasn't
impacted things much one way or the other lately.

It does seem most of the auto-whitelist options are now missing from the
manpage (Mail::SpamAssassin::Conf) so perhaps they've been deprecated as of
late? (Must search archives.)

> When the Bayes filter kicked in after it had accumulated a couple
> hundred ham and spam messages the results were dramatic.

I learned my lesson and have begun storing a collection of 'borderline' spam
for training purposes. Thankfully, I had bayes trained before some of the more
clever spams began to hit, so none have gotten through lately, despite all
their attempts.

> Before then it was somewhat discouraging. I do believe I shall
> leave automatic learning and white listing turned off because
> it seems to false entirely too often for my tastes.

Now that I've read the latest manpage, I'm not really sure WHAT AWL is doing in
my case. I do see AWL score adjustments, but they tend to be slight... at least
in comparison to the massive scores most spam gets. Unless I'm mistaken, AWL
should NOT result in false positives unless spammers have forged the addresses
of real people I get good messages from.
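
For what it's worth, here is a rough Python sketch of how I understand the idea:
each message's score gets pulled partway toward the sender's long-term average.
The class name, the factor of 0.5 and the addresses are all assumptions for
illustration, not anything taken from the current manpage.

```python
from collections import defaultdict


class SenderHistory:
    """Running total of scores seen per sender (hypothetical helper)."""

    def __init__(self, factor=0.5):
        self.factor = factor  # assumed: how far to pull toward the mean
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    def adjust(self, sender, score):
        """Pull the score toward the sender's mean, then record the raw score."""
        if self.counts[sender]:
            mean = self.totals[sender] / self.counts[sender]
            final = score + (mean - score) * self.factor
        else:
            final = score  # no history yet, nothing to adjust
        self.totals[sender] += score
        self.counts[sender] += 1
        return final


history = SenderHistory()
for past_score in (0.5, 1.0, 0.0):  # a friend's earlier, clean messages
    history.adjust("friend@example.org", past_score)

# The friend forwards something spammy that scores 6.5 on its own.
spammy_forward = history.adjust("friend@example.org", 6.5)
print(spammy_forward)
```

With that history the forwarded message is pulled halfway back toward the
friend's average, while a sender with no history gets no adjustment at all.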

> (The concept also seems a little strange. If it already knows it's
> spam then train it that the message is spam. I'd rather teach
> it with the new spam that is not found than simply rack up
> higher scores by training it that material it knows is spam is
> indeed spam. What am I missing here?)

I think there's a difference between auto-whitelist (AWL) -- based on sender --
and bayes_auto_learn, which trains on content. AWL makes good sense...
especially for messages from my good friend who occasionally forwards spammy
stuff of interest. I've left the defaults for bayes_auto_learn (to autolearn
high-scoring spam), but I do augment it with training from my corpus of about
1,000 low-scoring spams that I verified by hand, and the (infrequent) false
negative.
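
As a simplified Python sketch of that split between autolearning and hand
training: only messages scoring at the extremes are learned automatically, and
everything in between is left for manual training. The threshold values and
names below are assumptions for illustration; the real thing applies further
conditions, so check your own configuration.

```python
# Assumed thresholds, for illustration only.
LEARN_AS_HAM_BELOW = 0.1
LEARN_AS_SPAM_ABOVE = 12.0


def autolearn_decision(score):
    """Return 'spam', 'ham', or None for a message with the given score."""
    if score >= LEARN_AS_SPAM_ABOVE:
        return "spam"
    if score <= LEARN_AS_HAM_BELOW:
        return "ham"
    return None  # borderline: leave it for the hand-verified corpus


for score in (-1.2, 4.8, 25.0):
    print(score, autolearn_decision(score))
```

The middle case is exactly the low-scoring spam that never gets autolearned,
which is why a hand-checked corpus is still worth keeping.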

I think bayes auto-learning is useful because the words in spam that DIDN'T trip
the score get added as well. If those same words appear
commonly in non-spam, they cancel out. But as was pointed out recently, if
spammers use random dictionary words that DON'T appear in non-spam, that itself
is a hint that it might be spammy. It adds to the "smell" of spam, which is why
I think bayes has been so effective at catching the random-word spams that
bypass so many rudimentary filters.
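
To make the "cancel out" point concrete, here is a toy Python sketch. It uses a
simplified naive-Bayes style combination, which is not necessarily what
SpamAssassin does internally, and the words, counts and clamping limits are all
made up for illustration.

```python
def token_spam_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | token) from how often the token showed up in each corpus."""
    if spam_hits + ham_hits == 0:
        return 0.5  # never-seen token: no evidence either way (assumed)
    s = spam_hits / n_spam
    h = ham_hits / n_ham
    p = s / (s + h)
    return min(0.99, max(0.01, p))  # clamp so one token can't decide alone


def combine(probs):
    """Naive-Bayes style combination of per-token probabilities."""
    spam = ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)


# Hypothetical counts: (seen in spam, seen in ham), out of 200 messages each.
counts = {
    "meeting": (20, 20),   # equally common in both, so it cancels out
    "zeppelin": (3, 0),    # random dictionary word seen only in spam
    "quixotic": (2, 0),    # likewise
}
probs = [token_spam_prob(s, h, 200, 200) for s, h in counts.values()]
verdict = combine(probs)
print(probs, verdict)
```

The everyday word lands at 0.5 and contributes nothing, while the two odd
words that only ever turned up in spam push the combined result close to 1.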

Then again, this may simply be an indicator that I subscribe to low-brow lists.
:)

- Bob
