jdow wrote:
> It looks like the Bayes filters will train on the jokes or the text
> extracted from even NASA web sites making the Bayes filter more prone
> to false positives.

I'm no statistical or bayes expert of any sort, but assuming that there are similarities between spamassassin's bayes implementation and bogofilter's, I would expect the following to hold true (paraphrasing, no doubt poorly, quotes from the bogofilter list): New words will register as "neutral" with bayes. That is, if someone sends you a NASA joke, and you normally don't get them, the words in the joke itself won't weigh the scoring much one way or the other. However, if amongst all the new words are:
1. Words and phrases previously associated with spam, THOSE will be significant.
2. Unique words or phrases previously associated with non-spam (e.g. list mail), THOSE will be significant.
3. Words that appear equally in both (statistically speaking), THOSE will NOT be significant. (A rough sketch of this scoring follows.)
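
To make the neutral-vs-significant distinction concrete, here's a minimal sketch of Graham-style per-token spamicity, the general idea behind these filters. It's illustrative only -- NOT bogofilter's or spamassassin's exact formula, and the counts are made up:

    # Graham-style spamicity sketch; counts are hypothetical, and this
    # is not bogofilter's or spamassassin's exact formula.
    def spamicity(spam_count, ham_count, n_spam_msgs, n_ham_msgs):
        """Estimate P(spam | token), clamped away from absolute 0 and 1."""
        if spam_count + ham_count == 0:
            return 0.5                      # never-seen token: dead neutral
        s = spam_count / n_spam_msgs        # frequency in the spam corpus
        h = ham_count / n_ham_msgs          # frequency in the ham corpus
        return min(0.99, max(0.01, s / (s + h)))

    print(spamicity(0, 0, 1000, 1000))      # new "NASA joke" word -> 0.5
    print(spamicity(300, 2, 1000, 1000))    # known spam buzzword -> 0.99
    print(spamicity(100, 100, 1000, 1000))  # equally common word -> 0.5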
So if Spammers:
1. Use words that are uncommon (for MY specific mail patterns), those won't help them. (Quoting Shakespeare won't help with my scoring.)
2. Still have to sneak in their buzzwords and phrases, those will give them away.
3. TARGET their message (e.g. to specific list subscribers) by using lots of normally-good words (for that list) and interspersing their message among them, that WILL help them. (Points 1 & 2 from the previous list cancel each other out; the arithmetic is sketched below.)
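
Here's the arithmetic behind that third strategy, using Graham's naive-Bayes combining rule to merge per-token spamicities. The probabilities are invented for illustration:

    # Dilution sketch: combine per-token spamicities naive-Bayes style
    # (Graham's combining rule), then pad with list-flavored "good" words.
    # All probabilities here are invented for illustration.
    from math import prod

    def combined(probs):
        s = prod(probs)                  # evidence for spam
        h = prod(1 - p for p in probs)   # evidence for ham
        return s / (s + h)

    buzzwords = [0.99, 0.98, 0.97]       # the payload they must sneak in
    print(combined(buzzwords))           # ~0.99999 -- caught

    padding = [0.05] * 6                 # harvested on-topic list words
    print(combined(buzzwords + padding)) # ~0.003 -- slips right past

Real implementations blunt this some by scoring only the most extreme N tokens per message (Graham suggests the 15 "most interesting"), but enough well-chosen padding can still tilt the verdict.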
The key to success with bayes is consistent training, especially on errors (a quick sketch of that below). So yes, if you just feed bayes a lot of stuff without CAREFUL training up front, it can easily fail. Even with consistent training, it's not a silver bullet. Strategy 3, I think, WILL let a spammer slip things in, but they have to monitor and snarf text from a list you're actually on (so lists are the most vulnerable). Hopefully, the limited target base will make this less appealing to them, and quality lists will be moderated to limit even this approach. That won't stop them from harvesting emails from those lists, but at this point it's -- hopefully -- "a lot of work."
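
As a concrete example of training on errors, a correction pass can be as simple as replaying misfiled messages through sa-learn (SA's bayes trainer). The folder names here are hypothetical stand-ins for wherever you file your misclassified mail:

    # Train-on-error sketch for SA's bayes; assumes sa-learn is on PATH
    # and the directory names are placeholders, not a real layout.
    import subprocess

    def retrain():
        # spam that slipped through: teach bayes it was spam
        subprocess.run(["sa-learn", "--spam", "Mail/missed-spam/"], check=True)
        # ham that got flagged: teach bayes it was ham
        subprocess.run(["sa-learn", "--ham", "Mail/false-positives/"], check=True)

    retrain()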
All of the above is why I particularly like spamassassin's multi-layered approach. Tell-tale giveaways that aren't significant to bayes can still expose spam, and the cumulative scoring is used to determine whether the message "smells" like spam. I'm currently experimenting with several bayes tools in addition to spamassassin's, and I'm using those as a 2nd-level test. I'm not to the point where I trust them exclusively yet. Despite that, they've been scoring right along with the SA rules lately, even for the "random words" and "random gibberish" messages, with a few exceptions either way. Accordingly, I've created some meta rules for SA (a config sketch follows the list):
* If a message is flagged by all 3 non-SA bayes tools, it gets a "smells like spam" added score, and gets fed into a queue for classifying for bayes training.
* If a message is listed in either razor or pyzor, and a bayes tool catches it, it gets a "burnt spam" added score (I sometimes don't agree with razor/pyzor on some spammy stuff), and gets fed into a queue for classifying for bayes training.
* If SA scores it over 12, it's definite spam, and it gets fed automatically into the bayes spam training queue (I still trust SA most, using a few of the add-on rule sets).
* If SA scores it well below threshold, it gets automatically fed into the bayes ham training queue.
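
For the curious, the first two of those could look something like this in local.cf. The header tests and scores are placeholders for however your external bayes tools mark up the mail (bogofilter writes an X-Bogosity header; the others vary) -- a sketch, not my exact config. RAZOR2_CHECK and PYZOR_CHECK are SA's stock rule names:

    # Hypothetical local.cf sketch -- rule names, header tests, and scores
    # are placeholders, not my actual setup. Assumes each external bayes
    # tool has already stamped the message with a verdict header.
    header   EXT_BAYES1  X-Bogosity =~ /^Spam/
    header   EXT_BAYES2  X-DSPAM-Result =~ /^Spam/
    header   EXT_BAYES3  X-Filter-Three =~ /^spam/

    # "smells like spam": all 3 non-SA bayes tools agree
    meta     SMELLS_LIKE_SPAM  (EXT_BAYES1 && EXT_BAYES2 && EXT_BAYES3)
    score    SMELLS_LIKE_SPAM  2.0
    describe SMELLS_LIKE_SPAM  All three external bayes filters flagged it

    # "burnt spam": a razor/pyzor hit confirmed by at least one bayes tool
    meta     BURNT_SPAM  ((RAZOR2_CHECK || PYZOR_CHECK) && (EXT_BAYES1 || EXT_BAYES2 || EXT_BAYES3))
    score    BURNT_SPAM  1.5
    describe BURNT_SPAM  Fingerprinted spam confirmed by a bayes tool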
Razor/pyzor make spammers tweak the content. Bayes makes them try to bury it. SA makes them constantly change the message they embed, and where they send it from. Any one of these would catch a lot. VERY FEW spams can escape multiple levels of checks.
If anyone's interested in bayes, in addition to this list, the bogofilter list is most informative (and gets WAY into the topic).
- Bob
