https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7091
Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #1 from Mark Martinec <[email protected]> ---
There's been a bunch of related tickets regarding the UTF-8 encoding
and normalization in order to make it useful for 3.4.1 release, please see:

- [Bug 7126] Incorrect character set detections by normalize_charset
- [Bug 7144] To normalize_charset or not to normalize_charset, that is
  the question
- [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
  will give garbage when decoding entities
- [Bug 7130] Bayes tokenization mangles/chops many UTF-8 words with accented,
  Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding
- [Bug 7141] Bayes truncates ('skip') long tokens on bytes, should it count
  characters instead?
- [Bug 7135] Bayes tokenizer 'arbitrarily' breaks multibyte CJK utf-8
  characters into digrams instead of breaking on UTF-8 character boundaries

In short: with normalize_charset enabled, rules working on decoded text
(not 'raw' rules) will see text transcoded to UTF-8, i.e. UTF-8 octets
(not Unicode perl characters).

That means that regexp in rules can now be written with a text editor
in UTF-8 locale, no need to encode bytes as hex or octal sequences.
Just keep in mind that these are octets (not perl characters),
so rules like these are fine:

  body    CRAZY_EURO      /€uro/
  header  SUBJ_CREDIT_FR  Subject =~ /crédit/

The /\xE2\x82\xACuro/ and /€uro/ are equivalent (assuming the encoding
of a *.cf file is in UTF-8, which is common).

The (u|µ|ù|ú|û|ü) is fine too, but a character class like [uµùúûü]
is *not* (remember: these Latin characters are UTF-8 encoded multi-octet
sequences, not perl characters).

The decision to stay with octets and not to switch (yet) to Unicode
perl characters was for speed and compatibility with existing code
and old versions of perl, and with existing rules.
This may change in the future: a transition to Unicode characters would
further simplify writing rules and code, at the expense of some loss in
speed and of requiring newer versions of perl with useful Unicode support,
e.g. 5.12 or later.

> /ą/ needs to be written as /\x{104}/

The /ą/ is fine (assuming a UTF-8 encoded .cf file), but /\x{104}/ is not.

Also, the -CI option must not be passed to perl: UTF-8 rule files must be
read as bytes and stay as bytes, not converted to Unicode - at least for
the time being.

Closing.
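
For illustration only, the octet-versus-character point above can be
reproduced outside SpamAssassin. The sketch below uses Python's re module
on bytes as a stand-in for perl matching on octets; that substitution is
an assumption, and the helper name octets() and all variable names are
hypothetical. It is not SpamAssassin code and does not read any *.cf file.

```python
import re


def octets(text: str) -> bytes:
    """Encode text as UTF-8 octets, the form the rules are described as seeing."""
    return text.encode("utf-8")


# A literal typed in a UTF-8 editor is the same octet string as the
# hex-escaped form, which is why /€uro/ and /\xE2\x82\xACuro/ are equivalent.
euro_same = octets("€uro") == b"\xE2\x82\xACuro"

# Alternation compares whole multi-octet sequences, so it behaves as intended.
alternation = re.compile(octets("(u|µ|ù|ú|û|ü)"))
alt_matches_u_umlaut = alternation.fullmatch(octets("ü")) is not None
alt_matches_e_acute = alternation.search(octets("é")) is not None

# A bracketed class over octets is a set of single octets. It cannot consume
# the two-octet "ü" as one unit, and it matches the lead octet 0xC3 that
# "é" happens to share with "ù", "ú", "û" and "ü".
char_class = re.compile(octets("[uµùúûü]"))
class_fullmatches_u_umlaut = char_class.fullmatch(octets("ü")) is not None
class_matches_e_acute = char_class.search(octets("é")) is not None

# "ą" (U+0104) is two octets in UTF-8, so no single octet can stand for it.
a_ogonek = octets("ą")

print(euro_same, alt_matches_u_umlaut, alt_matches_e_acute)
print(class_fullmatches_u_umlaut, class_matches_e_acute, len(a_ogonek))
```

Under these assumptions the alternation matches "ü" and ignores "é",
while the bracketed class fails to match "ü" as a whole and produces
a false hit on "é".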
