Rules with non-ASCII characters and normalize_charset 1

Ben Winslow Fri, 31 Oct 2008 12:13:55 -0700

Hello,

I've been trying to figure out how to write rules matching
international spam with non-ASCII characters -- especially languages
with a completely different script (e.g. Cyrillic) -- for SA 3.2.5 on
Perl 5.10.0.  Writing rules to match words in 3+ character sets per
language is a maintenance nightmare and probably prone to false
positives.  It looks like 'normalize_charset 1' should instead allow me
to write the rules once in UTF-8, but that isn't how it's working.


I'm not entirely sure why it doesn't Just Work (without support for
case-insensitive matches) as the code stands right now.  If I add the
following body rule (the rule is written in UTF-8), it will fail to
match "Привет, мир!" in the body regardless of the body character set
(I tried UTF-8 and KOI8-R):

body TEST_RU                    /Привет, мир!/

If I change run_generic_tests to put 'use utf8;' at the beginning of the
generated test body (patch attached), UTF-8 rules work perfectly (and
the TEST_RU rule fires).  Is there a better way to do this?  If not, is
there any chance of the patch (or something similar) being incorporated
into SpamAssassin?
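
For reference, here is a minimal standalone sketch of what I suspect is
going on.  It assumes (I have not traced this all the way through) that
with normalize_charset the body arrives as decoded Perl characters, while
a rule compiled without 'use utf8;' is treated as a string of raw bytes.
The variable names are made up for the illustration and nothing here is
SpamAssassin code:

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes of the Cyrillic word from the rule, written as escapes so
# the encoding of this script itself does not matter.
my $bytes = "\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82";

# Assumption: the normalized body is a character string (6 characters).
my $chars = decode('UTF-8', $bytes);

# Pattern built from raw bytes: roughly what a UTF-8 rule looks like to
# perl when the enclosing code lacks 'use utf8;'.
my $byte_re = qr/\Q$bytes\E/;

# Pattern built from characters: roughly what 'use utf8;' produces.
my $char_re = qr/\Q$chars\E/;

printf "byte pattern vs decoded body: %s\n",
    ($chars =~ $byte_re ? "match" : "no match");
printf "char pattern vs decoded body: %s\n",
    ($chars =~ $char_re ? "match" : "no match");
```

The byte pattern reports "no match" and the character pattern reports
"match", which lines up with the behaviour I see before and after the
patch.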

-- 
Ben Winslow <[EMAIL PROTECTED]>

--- Check.pm.orig	2008-10-31 10:36:36.396760053 -0400
+++ Check.pm	2008-10-31 14:51:30.341921486 -0400
@@ -243,7 +243,7 @@
 
   # build up the eval string...
   $self->{evalstr} = $self->start_rules_plugin_code($ruletype, $priority);
-  $self->{evalstr2} = '';
+  $self->{evalstr2} = $self->{main}->{conf}->{normalize_charset} ? q{use utf8; no warnings 'utf8';} : '';
 
   # use %nopts for named parameter-passing; it's more friendly to future-proof
   # subclassing, since new parameters can be added without breaking third-party
