HTML-based accented characters seem to be becoming more popular in my personal 
corpus. So far, the accents I've seen all match this basic regex. It isn't 
usable as a rule on its own, mind you, but it should be accounted for in 
keyword-based rules.

/&.{1}(acute|uml|ring|grave|circ|tilde);/

A quick and dirty egrep against both corpora (-c counts matching lines, 
-v -c counts non-matching lines):
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' spam-corpus -c
1817
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' spam-corpus -v -c
7161007
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' ham-corpus -c
3
egrep '&.{1}(acute|uml|ring|grave|circ|tilde);' ham-corpus -v -c
2052964

Of course, those are just counts of matching lines. Again, as a generic rule 
it isn't useful; however, the technique is being used to evade keyword 
matching, such as the anti-drug custom rules.
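
One way to account for this in keyword-based rules is to fold the entities 
back to their base letter before matching. The following is only a rough 
sketch, not a tested rule: the function name strip_accent_entities is made 
up for illustration, and unlike the regex above it assumes the single 
character after the & is an ASCII letter.

```shell
#!/bin/sh
# Hypothetical helper: replace entities such as &iacute; or &uuml; with the
# bare letter (i, u) so that keyword matching sees the unobfuscated word.
# Assumption: the character after '&' is an ASCII letter; the regex in this
# post uses '.' and would accept any character there.
strip_accent_entities() {
    sed -E 's/&([A-Za-z])(acute|uml|ring|grave|circ|tilde);/\1/g'
}

# Example: an obfuscated line run through the filter.
printf 'v&iacute;agra c&igrave;al&iuml;s\n' | strip_accent_entities
# prints: viagra cialis
```

Keyword matching could then be run against the filtered text instead of the 
raw body. Entities outside the six suffixes listed above (for example 
&amp;) are left untouched.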
