On Monday, September 29, 2003 at 11:03 PM, David wrote:

> I would be happy to supply a lot of spam messages to folks who need
> some samples for training purposes! ;-)

Thanks for the offer, but I get enough myself. :)

With all the talk about spam on the list lately, I thought I'd share some information I came across. Using others' messages may not be the best thing to do and, in some instances, might decrease the accuracy of the Bayesian filter, or so I was told on another mailing list. I was referred to www.paulgraham.com/spam.html and, from what I've gathered, that seems to be the case. I've been reading Paul Graham's "A Plan for Spam" and the newer "Better Bayesian Filtering". They make for interesting reading.

Bayesian filtering is based on the statistical probability of an e-mail being spam as it relates to *your* e-mail and not anyone else's. The probability that a particular word in an e-mail identifies it as spam may be different for me than for you. Take the word "click", for example. Suppose I'm on a list where people are always using the word "click", so the word is present in both spam and non-spam for me. You never receive any e-mail with the word "click" in it unless it's spam. The spam probability for "click" will be much higher for you than it would be for me.

You need both spam and ham to train a filter. It follows that if I use your e-mail to train my filter, I may generate false positives, because my e-mail is different from yours.

As a side benefit of this "personalization", Graham explains that it makes it difficult for spammers to fine-tune messages, since what would be fine-tuning for me wouldn't necessarily be fine-tuning for you. This limits the avenues that spammers have to alter their messages to get through the filters. Spammers can do it for rules-based programs, such as SpamAssassin (without Bayes), by looking at the rules and then coming up with methods to counteract them. In that situation, it's a game of tag: Rule --> Way around the rule --> New Rule --> Way around the new rule --> ad infinitum.

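To make the "click" example concrete, here is a rough Python sketch. It is not Graham's exact formula (his essays weight the counts differently); it only compares how often a word shows up in spam versus ham and clamps the result. The function name, the sample messages, and the neutral 0.5 for unseen words are all assumptions made up for illustration.

```python
def word_spam_probability(word, spam_msgs, ham_msgs):
    """Estimate how strongly `word` indicates spam in ONE user's mail.

    Simplified illustration: the fraction of spam messages containing the
    word, divided by the sum of the spam and ham fractions, clamped to
    the range 0.01 .. 0.99.
    """
    spam_rate = sum(word in m.lower().split() for m in spam_msgs) / len(spam_msgs)
    ham_rate = sum(word in m.lower().split() for m in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5  # word never seen: treated as neutral in this sketch
    p = spam_rate / (spam_rate + ham_rate)
    return max(0.01, min(0.99, p))


# Hypothetical sample mail. Both users receive the same spam.
shared_spam = [
    "click here for a free offer",
    "click now to win money",
    "cheap pills click today",
]

# My ham: I'm on a list where people say "click" all the time.
my_ham = [
    "click the reply button to answer",
    "just click on the folder menu",
    "see you at lunch tomorrow",
]

# Your ham: "click" never appears in your legitimate mail.
your_ham = [
    "see you at lunch tomorrow",
    "the report is attached",
    "meeting moved to friday",
]

mine = word_spam_probability("click", shared_spam, my_ham)
yours = word_spam_probability("click", shared_spam, your_ham)
print(f"click, trained on my mail:   {mine:.2f}")
print(f"click, trained on your mail: {yours:.2f}")
```

With this toy data, "click" scores much higher against your mail than against mine, so a filter trained on your corpus would be more likely to flag my legitimate list mail.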
Of course, I could be wrong about this, as it is my interpretation of Graham's writings. If I am, I hope someone will let me know.

--
Regards,
Terry

Using The Bat! v1.62r on Windows 2000 5.0 Build 2195 Service Pack 3

________________________________________________
Current version is 2.00.6 | "Using TBUDL" information:
http://www.silverstones.com/thebat/TBUDLInfo.html