On 21.01.2016 at 14:17, RW wrote:
> On Thu, 21 Jan 2016 13:45:08 +0100 Christian Laußat wrote:
>> On 21.01.2016 13:19, Reindl Harald wrote:
>>> not entirely when "Currently, SA's bayes tokens are single words" from
>>> https://mail-archives.apache.org/mod_mbox/spamassassin-dev/201211.mbox/%3c509d55a8.30...@gmail.com%3E
>>> is still true
>>>
>>> please review that response below and consider 2/4 word tokens
>>> *additionally* in the SA-tokenizer and it will easily beat the
>>> "new magic" with a well trained bayes in all cases
>>
>> Bogofilter has an option to specify how many tokens to put into bayes.
>> Here is an analysis of how effective this was:
>> http://www.bogofilter.org/pipermail/bogofilter-dev/2006q3/003349.html
>>
>> In my opinion it's not worth the effort. You'll blow up your database
>> for a little better matching rate.
>
> The FNs dropped from 287 to 69, which I'd call a four-fold improvement.
> The FPs rose from 0 to 1, but that mail was ham quoting a full spam, so
> arguably it just did a better job in detecting the embedded spam.
also see http://www.paulgraham.com/sofar.html
_____________________________________

When the spammers do try to rewrite their messages, they'll probably do it by replacing individual spammy tokens with phrases of more neutral words. But multi-word filters will learn and catch these phrases too
_____________________________________

if in doubt, that "blown up database" can also have the effect that you need fewer training samples for the same outcome
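
To make the trade-off above concrete, here is a rough sketch of what "single words plus multi-word tokens" could look like. It is not SpamAssassin's or Bogofilter's tokenizer; the function name `tokenize`, the `max_words` parameter, the word-splitting regex and the sample text are all made up for illustration. It only shows that adding 2-word tokens on top of single words gives the classifier phrases to learn, at the cost of more distinct tokens to store.

```python
import re

def tokenize(text, max_words=2):
    """Hypothetical tokenizer: single words plus phrases of up to
    max_words adjacent words (not the real SA/Bogofilter code)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for n in range(1, max_words + 1):
        # slide a window of n adjacent words over the message
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens

sample = "click here to claim your free prize"

single = set(tokenize(sample, max_words=1))      # words only
with_pairs = set(tokenize(sample, max_words=2))  # words + 2-word phrases

# distinct tokens the database would have to hold for this one message
print(len(single), len(with_pairs))  # prints: 7 13
```

For a message of N words this adds at most N-1 extra 2-word tokens, so the per-message token count roughly doubles; how much the database really grows over a whole corpus, and whether that pays off in accuracy, is exactly what the Bogofilter analysis linked above measured.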