https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7091

Mark Martinec <[email protected]> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #1 from Mark Martinec <[email protected]> ---
There have been a number of related tickets on UTF-8 encoding and
normalization, aimed at making it useful for the 3.4.1 release; please see:

- [Bug 7126] Incorrect character set detections by normalize_charset

- [Bug 7144] To normalize_charset or not to normalize_charset, that is
  the question

- [Bug 7133] Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8
  will give garbage when decoding entities

- [Bug 7130] Bayes tokenization mangles/chops many UTF-8 words with
  accented, Cyrillic etc. letters - inappropriately assuming ISO-8859 encoding

- [Bug 7141] Bayes truncates ('skip') long tokens on bytes, should it count
  characters instead?

- [Bug 7135] Bayes tokenizer 'arbitrarily' breaks multibyte CJK utf-8
  characters into digrams instead of breaking on UTF-8 character boundaries


In short: with normalize_charset enabled, rules working on decoded text
(not 'raw' rules) will see text transcoded to UTF-8, i.e. UTF-8 octets
(not Unicode perl characters). That means regexps in rules can now be
typed directly in a text editor running in a UTF-8 locale, with no need
to encode bytes as hex or octal sequences. Just keep in mind that these
are octets (not perl characters), so rules like these are fine:

  body CRAZY_EURO /€uro/
  header SUBJ_CREDIT_FR Subject =~ /crédit/

  The /\xE2\x82\xACuro/  and  /€uro/  are equivalent
  (assuming the encoding of a *.cf file is in UTF-8, which is common)
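
The equivalence can be checked outside SpamAssassin. The following is
only a rough sketch in Python, using byte strings as a stand-in for
the UTF-8 octets that rules see; the sample body text is made up for
illustration and none of this is SpamAssassin code.

```python
import re

# The pattern as typed in a UTF-8 editor, i.e. the bytes stored in a
# UTF-8 encoded *.cf file, and the same pattern spelled with hex escapes.
typed = "€uro".encode("utf-8")
escaped = b"\xE2\x82\xACuro"

# Hypothetical message body, already transcoded to UTF-8 octets.
body = "Win 1000 €uro today".encode("utf-8")

same_bytes = typed == escaped
hit_typed = re.search(typed, body) is not None
hit_escaped = re.search(escaped, body) is not None

print(same_bytes, hit_typed, hit_escaped)   # True True True
```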

The alternation (u|µ|ù|ú|û|ü) is fine too, but a character class like
[uµùúûü] is *not*: a character class matches a single octet, and these
Latin characters are UTF-8 encoded multi-octet sequences, not perl
characters.
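
To illustrate why the character class misbehaves, here is a small
sketch in Python, where byte-string patterns behave much like regexps
applied to octets. The patterns and sample words are invented for the
example; this is an analogy, not how SpamAssassin itself evaluates rules.

```python
import re

# Both patterns are encoded to UTF-8 octets, as they would be in a
# UTF-8 *.cf file that is read as bytes.
alternation = re.compile("f(?:u|µ|ù|ú|û|ü)n".encode("utf-8"))
char_class = re.compile("f[uµùúûü]n".encode("utf-8"))

text = "fün".encode("utf-8")          # octets: 66 C3 BC 6E

# Each alternative is a complete octet sequence, so "ü" (C3 BC) matches.
alt_hit = alternation.search(text) is not None

# The class is really the set of single octets {75, C2, B5, C3, B9, BA,
# BB, BC}; it consumes only C3 and then fails because BC is not "n".
class_hit = char_class.search(text) is not None

# The class can also hit unrelated letters: "é" is C3 A9, and C3 is in
# the set.
stray = re.compile("[uµùúûü]".encode("utf-8"))
stray_hit = stray.search("é".encode("utf-8")) is not None

print(alt_hit, class_hit, stray_hit)  # True False True
```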

The decision to stay with octets and not to switch (yet) to Unicode
perl characters was for speed and compatibility with existing code
and old versions of perl, and with existing rules. This may change
in future: transition to Unicode characters would further simplify
writing rules and code, at the expense of some loss in speed and
requiring newer versions of perl with useful Unicode support, e.g.
5.12 or later.


> /ą/ needs to be written as /\x{104}/

The /ą/ is fine (assuming a UTF-8 encoded .cf file), but /\x{104}/ is not.

Also, -CI must not be used as an option to perl: UTF-8 rule files must
be read as bytes and stay as bytes, not converted to Unicode - at least
for the time being.

Closing.
