On Fri, 2015-12-11 at 09:05 -0800, Marc Perkel wrote:
> For example, I create rules that look for many phrases about a
> subject, and the subject becomes a token. For example:
> 
> JESUS
> ROYALTY
> MONEY
> 
> None of these is an indicator of spam by itself, but if you have all
> 3 then it's definitely spam. The idea is to look not at words but at
> the meaning of phrases.
>
This approach works well for me too, but doesn't need Bayes to make it
perform: just two or more portmanteau rules[*] that are combined by a
meta with a relatively high score. The idea is that the triggering
phrases are not spam indicators by themselves, but that the combination
is something that virtually never occurs in ham but is a reliable spam
indicator.

For instance, I have two portmanteau rules, SALE (contains sales
phrases like "huge discount") and PRODUCT (contains product phrases
like "fur coat"), that are ANDed by a meta called SALESPAM. The nice
thing about this approach is that, once the SALE and PRODUCT lists
have grown to a decent size, the SALESPAM meta starts to fire on
previously unseen combinations without generating FPs. The only
downside is that, unlike Bayes, you have to build the lists manually,
but that's probably no worse than building a hand-crafted Bayes DB
like Reyndl does.
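To give a stripped-down sketch of the shape of it (the phrases and
scores below are made up for illustration, not my real lists, which
are far longer):

  body      SALE      /\b(?:huge discount|limited time offer|act now)\b/i
  describe  SALE      Contains a sales phrase
  score     SALE      0.01

  body      PRODUCT   /\b(?:fur coat|replica watch|designer handbag)\b/i
  describe  PRODUCT   Contains a product phrase
  score     PRODUCT   0.01

  meta      SALESPAM  (SALE && PRODUCT)
  describe  SALESPAM  Sales phrase and product phrase in the same body
  score     SALESPAM  3.5

The low scores on SALE and PRODUCT mean they do next to nothing on
their own; only the meta carries a score that matters.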

[*] My term: a portmanteau rule is a rule with a very long regex
alternation list and a low score in the range 0.01 - 0.1. These things
are hard to read and maintain, so I have an awk script that generates
a syntactically correct SA rule from a file that names the rule, sets
the score, and has all the regexes written one per line.
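A minimal sketch of such a generator (not my actual script; it assumes
an input file where line 1 is the rule name, line 2 is the score, and
every remaining non-empty line is a regex) might look like:

  # genrule.awk - emit a SpamAssassin body rule from a phrase file
  NR == 1 { name = $0; next }          # first line: rule name
  NR == 2 { score = $0; next }         # second line: score
  NF      { alt = (alt == "" ? $0 : alt "|" $0) }   # build alternation
  END {
      printf "body      %s  /(?:%s)/i\n", name, alt
      printf "describe  %s  Auto-generated portmanteau rule\n", name
      printf "score     %s  %s\n", name, score
  }

Run as, say, "awk -f genrule.awk sale.phrases > 70_sale.cf" (file names
made up) and the output drops straight into the SA config directory.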

Martin
