Hello, I've been trying to figure out how to write rules matching international spam with non-ASCII characters -- especially languages with a completely different script (e.g. Cyrillic) -- for SA 3.2.5 on perl 5.10.0. Rather than write rules to match words in 3+ character sets per language (which is a maintenance nightmare and probably prone to false positives), it looks like 'normalize_charset 1' should allow me to write the rules once in UTF-8, but that isn't how it works in practice.
I'm not entirely sure why it doesn't Just Work (without support for case-insensitive matches) as the code stands right now. If I add the following body rule (the rule is written in UTF-8), it will fail to match "Привет, мир!" in the body regardless of the body character set (I tried UTF-8 and KOI8-R):

    body TEST_RU /Привет, мир!/

If I change run_generic_tests to 'use utf8;' at the beginning of the test body (patch attached), UTF-8 rules work perfectly (and the TEST_RU rule fires).

Is there a better way to do this? If not, is there any chance of the patch (or something similar) being incorporated into SpamAssassin?

-- Ben Winslow <[EMAIL PROTECTED]>

--- Check.pm.orig 2008-10-31 10:36:36.396760053 -0400
+++ Check.pm 2008-10-31 14:51:30.341921486 -0400
@@ -243,7 +243,7 @@
     # build up the eval string...
     $self->{evalstr} = $self->start_rules_plugin_code($ruletype, $priority);
-    $self->{evalstr2} = '';
+    $self->{evalstr2} = $self->{main}->{conf}->{normalize_charset} ? q{use utf8; no warnings 'utf8';} : '';
     # use %nopts for named parameter-passing; it's more friendly to future-proof
     # subclassing, since new parameters can be added without breaking third-party
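
For illustration only, here is a standalone sketch of one possible reason the patch above helps. It is not taken from the SpamAssassin source. It assumes that with normalize_charset enabled the body reaches the rule as decoded Perl characters, and it simulates the two ways a UTF-8 literal in a rule can be compiled: as raw bytes (no 'use utf8') or as characters (with 'use utf8'). The variable names are hypothetical, and the Cyrillic text is written with \x{...} escapes so the script's own encoding does not matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# "Привет" spelled with code point escapes (U+041F U+0440 U+0438 U+0432 U+0435 U+0442).
my $word = "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}";

# ASSUMPTION: normalize_charset hands the rule a body of decoded characters.
# Simulate a KOI8-R message ("Привет, мир!") being decoded into Perl characters.
my $koi8_bytes = encode('KOI8-R', "$word, \x{43c}\x{438}\x{440}!");
my $body       = decode('KOI8-R', $koi8_bytes);

# A UTF-8 literal compiled WITHOUT 'use utf8': twelve raw bytes, each of which
# Perl treats as a separate Latin-1 character when matching.
my $pattern_as_bytes = encode('UTF-8', $word);

# The same literal compiled WITH 'use utf8': six Cyrillic characters.
my $pattern_as_chars = decode('UTF-8', $pattern_as_bytes);

my $hit_bytes = ($body =~ /\Q$pattern_as_bytes\E/) ? 1 : 0;
my $hit_chars = ($body =~ /\Q$pattern_as_chars\E/) ? 1 : 0;

print "byte pattern matched: $hit_bytes\n";
print "char pattern matched: $hit_chars\n";
```

Run on its own, this should report 0 for the byte pattern and 1 for the character pattern, which is consistent with TEST_RU only firing once the generated test code is compiled under 'use utf8'. Whether SpamAssassin's internals behave exactly this way is an assumption here, not something the sketch proves.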