On 29.05.2016 at 02:46, Dianne Skoll wrote:
> And also, two-word phrases can be stronger indicators than the individual words; "hot" and "sex" in isolation may not be strong spam indicators, but "hot sex" probably is stronger. Going from one-word tokens to one+two-word tokens will have a pretty big payoff, I think. I'm not so sure about two to three.
+1

The best result against many of the short spams that try to defeat Bayes would come from 2- or 3-word tokens. We currently complement Bayes with 1500 handcrafted body rules scored at 0.5/1.5/2.5/3.5/4.5 points, and the majority of those rules match 2 or 3 words.

The current tokens should stay as they are, with *additional* 2-word tokens generated from the same messages. With enough training data, that would boost Bayes to a completely different level.

One-word tokens are limited in many ways (although, to be fair, they do not work badly).
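
To make the idea concrete, here is a minimal sketch of keeping the existing one-word tokens and adding 2-word tokens from the same message. It is only an illustration: the function names and the simple regex word splitter are my own assumptions, not how any existing Bayes tokenizer actually works.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Assumption: a plain regex splitter; a real filter tokenizes
    far more carefully (headers, URLs, encodings, and so on).
    """
    return re.findall(r"[a-z0-9']+", text.lower())

def one_and_two_word_tokens(text):
    """Return the one-word tokens plus every adjacent 2-word token."""
    words = tokenize(text)
    # Pair each word with its successor to form the 2-word tokens.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    # The one-word tokens stay as they are; the pairs are additional.
    return words + bigrams

print(one_and_two_word_tokens("Hot sex tonight"))
# ['hot', 'sex', 'tonight', 'hot sex', 'sex tonight']
```

A message of n words yields n one-word tokens and n-1 two-word tokens, so the token database roughly doubles per message, which is why enough training data matters for the pairs to become useful.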