On 21.01.2016 at 14:17, RW wrote:
On Thu, 21 Jan 2016 13:45:08 +0100
Christian Laußat wrote:

On 21.01.2016 13:19, Reindl Harald wrote:
not entirely when "Currently, SA's bayes tokens are single words" from
https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E
is still true

please review that response below and consider 2/4 word tokens
*additionally* in the SA-tokenizer and it will beat out the "new
magic" easily with a well trained bayes in all cases
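
A minimal sketch of what "additional multi-word tokens" could look like. This is not SpamAssassin's actual tokenizer: the function name `tokenize` and the parameter `max_words` are hypothetical, and "2/4 word tokens" is read here as runs of 2 up to 4 consecutive words, which is an assumption.

```python
def tokenize(text, max_words=1):
    """Return single-word tokens and, if max_words > 1, every run of
    2..max_words consecutive words joined into one extra token.

    Hypothetical helper for illustration only; whitespace splitting
    stands in for a real mail tokenizer.
    """
    words = text.lower().split()
    tokens = list(words)  # the single-word tokens are always kept
    for n in range(2, max_words + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


sample = "buy cheap pills online now"
single = tokenize(sample)               # single words only
multi = tokenize(sample, max_words=4)   # single words plus 2..4 word phrases

print(len(single), len(multi))
```

For this five-word sample the token count grows from 5 to 14, which is the database-size trade-off argued about in this thread.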

Bogofilter has an option to specify how many tokens to put into
bayes. Here is an analysis of how effective this was:
http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html

In my opinion it's not worth the effort. You'll blow up your database
for only a slightly better matching rate.

The FNs dropped from 287 to 69, which I'd call a four-fold improvement.

The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
arguably it just did a better job in detecting the embedded spam.

also see http://www.paulgraham.com/sofar.html

When the spammers do try to rewrite their messages, they'll probably do it by replacing individual spammy tokens with phrases of more neutral words. But multi-word filters will learn and catch these phrases too
_____________________________________

if anything, that "blown up database" can have the effect that you need fewer training samples for the same outcome
