Darren Coleman <[EMAIL PROTECTED]> wrote:
> Apologies if this has been discussed and/or dismissed before,
> but I was wondering if anyone had thought about rules or
> indeed modifying SA itself to detect Bayes poison?  It strikes
> me that a lot of emails that include this poison tend to just
> have a series of words without any articles ("and", "an", "a",
> "the", etc).

It depends on WHICH bayes-poison technique they try. Rules to detect "much text
without articles" might help, but that assumes that non-spam messages are
mostly English prose, and not things like sample code, shell scripts and the
like. And many spammers seem to be using extensive quotes from literature and
magazines, which have a normal share of articles and would slip past such a
rule. A low score wouldn't be harmful, perhaps as part of a meta rule.
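
To make the false-positive worry concrete, here is a rough sketch of the
"few articles" check in Python. This is not SpamAssassin code; the function
names, the 20-word minimum and the 2% threshold are all made up for
illustration, and the word list is just the one from the question.

```python
import re

# Word list taken from the original question; purely illustrative.
ARTICLES = {"a", "an", "the", "and"}

def article_ratio(text):
    """Fraction of word tokens in `text` that appear in ARTICLES."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return sum(w in ARTICLES for w in words) / len(words)

def looks_like_word_salad(text, min_words=20, threshold=0.02):
    """Hypothetical check: enough words, and almost none of them articles."""
    words = re.findall(r"[a-z']+", text.lower())
    return len(words) >= min_words and article_ratio(text) < threshold
```

A run of random dictionary words trips this check, but so does a pasted shell
loop, and a quoted paragraph from a novel does not. That is why I would only
give it a low score.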

Then again, bayes itself "detects" bayes poison (there have been many recent
threads on this). If words show up that are NOT in "normal" (as defined by one's
own bayes training) messages, they tend to flag spam. More so as similar types
of poison are reused (e.g. if Tom Sawyer text starts showing up frequently in MY
inbox, it's likely spam, but it might be fine for others). If nothing else, the
score leans towards "less good".
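
A toy illustration of that effect in Python, assuming a simple per-token
estimate with smoothing. It is NOT how SpamAssassin's bayes code is actually
written; `train`, `token_spamminess` and the constants `s` and `x` are names
and values I picked for the sketch.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many of the given messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg.lower().split()))
    return counts

def token_spamminess(token, spam_counts, ham_counts, n_spam, n_ham,
                     s=1.0, x=0.5):
    """Smoothed estimate that a message containing `token` is spam.

    x is the assumed value for a never-seen token and s is how strongly
    to trust that assumption (both are illustrative defaults).
    """
    n = spam_counts[token] + ham_counts[token]
    if n == 0:
        return x  # no evidence either way
    spam_freq = spam_counts[token] / n_spam
    ham_freq = ham_counts[token] / n_ham
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)
```

Train it on a mailbox where "sawyer" only ever arrived in spam and that token
scores well above 0.5, while a token seen only in ham scores below 0.5. On
someone else's mailbox, with different training, the same word could come out
neutral.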

> I would've thought it would be fairly trivial to attribute
> weight to a paragraph of text depending on how often (if at
> all) articles appear. Wouldn't this remove most of the Bayes
> poison we see?

Again, that might flag LOTS of things besides just bayes poison. Other patterns
such as lack of capitalization, punctuation etc. might be useful (e.g. a
"NOT_ENGLISH_PARAGRAPH" rule) but would be false-positive prone. I don't know
that bayes is a good tool for that though (speaking as a non-researcher,
end-user).

The bigger trick seems to be keeping lists like this (with lots of poison
samples) OUT of bayes training as non-spam. In my (admittedly limited)
experience, bayes-poison attempts have NOT been successful.

- Bob
