--As of June 11, 2014 4:25:31 AM +0200, Karsten Bräckelmann is alleged to
have said:
On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
--As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged
to have said:
> Worse, enabling charset normalization completely breaks UTF-8 chars
> in the regex. At least in my ad-hoc --cf command line testing.
--As for the rest, it is mine.
This sounds like something where `use feature 'unicode_strings'` might
have an effect
Possibly.
Enabling normalization is probably setting the internal utf8
flag on incoming text, which could change the semantics of the regex
matching.
Nope. *digging into code*
This option mainly affects rendered textual parts and headers, treating
them with Encode::Detect. More complex than just setting an internal
flag. What exactly made the ad-hoc regex rules fail is beyond the scope
of tonight's code-diving.
Right. And as a side-effect, Encode::Detect (as documented in Encode) is
probably setting the utf8 flag on the Perl string.
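A minimal sketch of that hypothesis, for illustration only. It uses the core Encode module's decode() as a stand-in for whatever Encode::Detect does inside SpamAssassin (an assumption, not a trace of the real code path), and $rule is a hypothetical pattern written as raw UTF-8 octets, the way a rule file might contain it:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 octets for "Bräckelmann", as they might arrive in a message.
my $octets = "Br\xC3\xA4ckelmann";

# Decoding turns the octets into characters and sets the internal utf8 flag.
my $decoded = decode('UTF-8', $octets);

# Hypothetical rule pattern, written as the two UTF-8 octets of "ä".
my $rule = qr/Br\xC3\xA4ckelmann/;

# Summarize flag state, length and match result for one string.
sub report {
    my ($label, $text) = @_;
    return sprintf "%s: utf8=%d length=%d match=%d",
        $label, (utf8::is_utf8($text) ? 1 : 0), length($text),
        ($text =~ $rule ? 1 : 0);
}

print report('octets ', $octets),  "\n";   # utf8=0 length=12 match=1
print report('decoded', $decoded), "\n";   # utf8=1 length=11 match=0
```

The octet-oriented pattern matches the undecoded text but not the decoded text, because after decoding the "ä" is a single character (U+00E4) rather than the two octets the pattern spells out.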
Note I mean internal to *perl*, not one of the modules or code. The utf8
flag affects what semantics perl uses when it compares strings, including
in regexes.
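To illustrate the point about the flag and regex semantics, here is a small sketch. It assumes a Perl with no `unicode_strings` feature, `use locale`, or explicit regex modifiers in scope; is_word_char() is a helper made up for this example:

```perl
use strict;
use warnings;

# e-acute (U+00E9) stored as a single byte; the utf8 flag is off.
my $flag_off = "\xE9";

# The same character, upgraded to Perl's internal UTF-8 form (flag on).
my $flag_on = $flag_off;
utf8::upgrade($flag_on);

# Returns 1 if the whole string is a single \w character, else 0.
sub is_word_char { return $_[0] =~ /\A\w\z/ ? 1 : 0 }

printf "flag off: utf8=%d word=%d\n",
    (utf8::is_utf8($flag_off) ? 1 : 0), is_word_char($flag_off);  # utf8=0 word=0
printf "flag on:  utf8=%d word=%d\n",
    (utf8::is_utf8($flag_on) ? 1 : 0), is_word_char($flag_on);    # utf8=1 word=1

# The two strings still compare equal; only the matching rules differ.
print "equal: ", ($flag_off eq $flag_on ? "yes" : "no"), "\n";    # equal: yes
```

The two strings hold the same character and compare equal, yet `\w` only matches the one with the flag set. That is the kind of semantic change being described here.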
If that's the case, it raises the question of whether we want SpamAssassin
to require Perl 5.12 (which includes that feature) - the current base
version is 5.8.1. Unicode support has been evolving in Perl; 5.8
supports it generally, but there were bugs. I think 5.12 got most of
them, but I'm not sure. (And of course it's not the current version of
Perl.)
The normalize_charset option requires Perl 5.8.5.
All the ad-hoc rule testing in this thread has been done with SA 3.3.2
on Perl 5.14.2 (debian 7.5). So this is not an issue of requiring a more
recent Perl version.
`use feature 'unicode_strings'`, as a feature, only tangentially cares
about what version of Perl you are running. Yes, you need a new enough
version to use it, but since features are not enabled by default, any
effect they might have doesn't occur unless they are requested.
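A short sketch of the opt-in behaviour. Both subroutine names are hypothetical. The feature itself exists from Perl 5.12, but as far as I know it only extends to regex matching from 5.14 on, so this assumes 5.14 or later. It also avoids a file-level `use v5.12` (or higher), since that would switch the feature on for the whole file:

```perl
use strict;
use warnings;

# e-acute (U+00E9) as a single byte; the utf8 flag is off.
my $e_acute = "\xE9";

# Compiled without the feature: native (non-Unicode) rules apply
# to a string that does not have the utf8 flag set.
sub match_default {
    return $_[0] =~ /\A\w\z/ ? 1 : 0;
}

# Compiled with the feature: Unicode rules apply regardless of the flag.
sub match_with_feature {
    use feature 'unicode_strings';   # lexically scoped to this sub body
    return $_[0] =~ /\A\w\z/ ? 1 : 0;
}

print "default:      ", match_default($e_acute), "\n";
print "with feature: ", match_with_feature($e_acute), "\n";
```

On a 5.14 or later Perl the first call should report no match and the second a match, for the same input string. The feature is lexically scoped, so it only affects code compiled inside the block that requests it.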
While of course something to potentially improve on itself, the topic of
charset normalization is just a by-product explaining the original
issue: Header rules and string encoding, with a grain of charset
encoding salt.
True. I was just thinking aloud as it were, and wondering if an
explanation could be found for breaking UTF-8 strings in the regex.
Daniel T. Staal
---------------------------------------------------------------
This email copyright the author. Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes. This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------