Darren Coleman <[EMAIL PROTECTED]> wrote:

> Apologies if this has been discussed and/or dismissed before,
> but I was wondering if anyone had thought about rules or
> indeed modifying SA itself to detect Bayes poison? It strikes
> me that a lot of emails that include this poison tend to just
> have a series of words without any articles ("and", "an", "a",
> "the", etc).
It depends on WHICH bayes-poison technique they try. Rules to detect "much text without articles" might help, but that assumes non-spam messages are mostly English and not things like sample code, shell scripts and the like. And many spammers seem to be using extensive quotes from literature, magazines and so on. Still, a low score wouldn't be harmful, perhaps as part of a meta score.

Then again, bayes itself "detects" bayes poison (there have been many recent threads on this). If words show up that are NOT in "normal" messages (as defined by one's own bayes training), they tend to flag spam. More so as similar types of poison keep being used (i.e. if Tom Sawyer text starts showing up frequently in MY inbox, it's likely spam, though it might be fine for others). If nothing else, the score leans towards "less good".

> I would've thought it would be fairly trivial to attribute
> weight to a paragraph of text depending on how often (if at
> all) articles appear. Wouldn't this remove most of the Bayes
> poison we see?

Again, that might flag LOTS of things besides just bayes poison. Other patterns such as lack of capitalization, punctuation, etc. might be useful (i.e. a "NOT_ENGLISH_PARAGRAPH" score) but false-positive prone. I don't know that bayes is a good tool for that, though (speaking as a non-researcher end-user).

The bigger trick seems to be keeping lists like this one (with lots of poison samples quoted in messages) OUT of bayes training as non-spam. In my (admittedly limited) experience, bayes-poison attempts have NOT been successful.

- Bob
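P.S. For what it's worth, here is a very rough sketch (in Python, just for illustration; this is not actual SA rule syntax) of the kind of "few or no articles" check being talked about. The function name, the 5% ratio and the 30-word minimum are made-up assumptions, not anything SA ships:

    import re

    ARTICLES = {"a", "an", "the"}

    def looks_like_word_salad(paragraph, min_ratio=0.05, min_words=30):
        # Pull out lowercase word tokens; apostrophes are kept so
        # "don't" stays a single word.
        words = re.findall(r"[a-z']+", paragraph.lower())
        if len(words) < min_words:
            # Too little text to judge either way.
            return False
        ratio = sum(w in ARTICLES for w in words) / len(words)
        # Ordinary English prose nearly always contains SOME articles;
        # random-word poison usually has almost none.
        return ratio < min_ratio

Even a check like that would have to skip paragraphs of sample code, shell scripts, non-English text and so on to keep the false positives down, which is exactly where a NOT_ENGLISH_PARAGRAPH-type rule gets hairy.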