Re: UTF-8 Spam rules

David F. Skoll Fri, 20 Sep 2013 11:38:39 -0700

On Fri, 20 Sep 2013 14:20:58 -0400
"Kevin A. McGrail" <[email protected]> wrote:


> As of yet, I'm not using normalize_charset and researching what hits 
> things the best.

You won't like my answer, but...

You really *have* to normalize everything to Unicode (possibly using UTF-8
as the canonical on-disk format) before trying to apply rules or extract
Bayes tokens.  Then you can do nice things like blocking CJK spams
with a rule like:

header CJK_SUBJECT Subject =~ /\p{CJK_Unified_Ideographs}/

and have absolute confidence it will work no matter how the subject is
encoded.
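
To illustrate the idea outside SpamAssassin, here is a minimal Python
sketch (not SpamAssassin code, and not how its normalization is
implemented). It decodes RFC 2047 encoded-words in a Subject header to
a Unicode string and then applies one character-class test. The names
normalize_subject and subject_has_cjk are hypothetical, and because
Python's standard re module does not support \p{...}, the sketch
assumes the explicit range U+4E00..U+9FFF for the CJK Unified
Ideographs block.

```python
import re
from email.header import decode_header, make_header

# CJK Unified Ideographs block (U+4E00..U+9FFF), written as an explicit
# range because Python's re module has no \p{...} property classes.
CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def normalize_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words into a single Unicode string."""
    return str(make_header(decode_header(raw)))


def subject_has_cjk(raw: str) -> bool:
    """Return True if the decoded subject contains a CJK ideograph."""
    return CJK_RE.search(normalize_subject(raw)) is not None


if __name__ == "__main__":
    subjects = [
        "=?UTF-8?B?5Lit5paH?=",        # UTF-8, base64
        "=?GB2312?Q?=D6=D0=CE=C4?=",   # GB2312, quoted-printable
        "Plain ASCII subject",
    ]
    for raw in subjects:
        # The same test is applied regardless of the wire encoding.
        print(raw, subject_has_cjk(raw))
```

The first two subjects carry the same two ideographs in different
charsets and transfer encodings. The pattern cannot match either raw
header, since both are pure ASCII on the wire, but it matches both
once they are decoded.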

I haven't looked extremely closely at the SpamAssassin code so I'm not
sure how its normalization works nor whether it can do the necessary
transformations for a subject rule such as my example to work.

Regards,

David.
