--As of June 11, 2014 4:25:31 AM +0200, Karsten Bräckelmann is alleged to have said:

On Tue, 2014-06-10 at 21:22 -0400, Daniel Staal wrote:
--As of June 11, 2014 2:45:25 AM +0200, Karsten Bräckelmann is alleged
to have said:
>     Worse, enabling charset normalization completely breaks UTF-8 chars
>     in the regex. At least in my ad-hoc --cf command line testing.

--As for the rest, it is mine.

This sounds like something where `use feature 'unicode_strings'` might
have an effect.

Possibly.

Enabling normalization is probably setting the internal utf8
flag on incoming text, which could change the semantics of the regex
matching.

Nope. *digging into code*

This option mainly affects rendered textual parts and headers, treating
them with Encode::Detect. More complex than just setting an internal
flag. What exactly made the ad-hoc regex rules fail is beyond the scope
of tonight's code-diving.

Right. And as a side-effect, Encode::Detect (as documented in Encode) is probably setting the utf8 flag on the Perl string.

Note I mean internal to *perl*, not one of the modules or code. The utf8 flag affects what semantics perl uses when it compares strings, including in regexes.
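
A minimal sketch of that effect, as a toy example that is not taken from SpamAssassin. It assumes a non-EBCDIC platform, no `use locale`, and no `unicode_strings` feature in scope. The variable names are made up for illustration. The same character is stored once as a plain byte and once with the utf8 flag set; the two strings compare equal, yet `\w` treats them differently. A decoder that returns flagged strings would be expected to have the same effect as the `utf8::upgrade` call used here.

```perl
use strict;
use warnings;

# U+00E4 (a-umlaut) as a single native byte; the internal utf8 flag is off.
my $bytes = "\xE4";

# Same character, but stored internally as UTF-8 with the flag switched on.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# The two strings are still equal as far as `eq` is concerned.
my $same = ($bytes eq $upgraded) ? 1 : 0;

# Without the flag, perl applies legacy byte rules: 0xE4 is not a word char.
my $match_bytes = ($bytes =~ /\w/) ? 1 : 0;

# With the flag, perl applies Unicode rules: U+00E4 is a letter.
my $match_upgraded = ($upgraded =~ /\w/) ? 1 : 0;

printf "equal=%d bytes=%d upgraded=%d\n",
    $same, $match_bytes, $match_upgraded;
```

So a rule's regex can stop (or start) matching purely because of how the subject string is stored internally, even though its characters have not changed.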

If that's the case, it raises the question of whether we want SpamAssassin
to require Perl 5.12 (which includes that feature) - the current base
version is 5.8.1. Unicode support has been evolving in Perl; 5.8
supports it generally, but there were bugs. I think 5.12 got most of
them, but I'm not sure. (And of course it's not the current version of
Perl.)

The normalize_charset option requires Perl 5.8.5.

All the ad-hoc rule testing in this thread has been done with SA 3.3.2
on Perl 5.14.2 (debian 7.5). So this is not an issue of requiring a more
recent Perl version.

`use feature 'unicode_strings'`, as a feature, only tangentially cares about what version of Perl you are running. Yes, you need a new enough version to use it, but since features are not enabled by default, any effect they might have doesn't occur unless they are requested.
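
A small sketch of that opt-in behaviour, again a toy example and not SpamAssassin code. It assumes Perl 5.14 or later: the feature exists in 5.12, but as far as I know it only started to cover regex matching in 5.14. The feature is lexically scoped, so the same byte string should match `\w` only where the feature has been requested.

```perl
use strict;
use warnings;

# U+00E4 as a single native byte, utf8 flag off.
my $s = "\xE4";

# Regex compiled outside the feature's scope: legacy byte semantics.
my $outside = ($s =~ /\w/) ? 1 : 0;

my $inside;
{
    # Lexically scoped: only code inside this block is affected.
    use feature 'unicode_strings';
    $inside = ($s =~ /\w/) ? 1 : 0;
}

# Back outside the block, the feature is no longer in effect.
my $after = ($s =~ /\w/) ? 1 : 0;

print "outside=$outside inside=$inside after=$after\n";
```

On 5.14.2 I would expect this to report a match only for the regex inside the block, which is why simply running a newer Perl changes nothing for existing code.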

While of course something to potentially improve on itself, the topic of
charset normalization is just a by-product explaining the original
issue: Header rules and string encoding, with a grain of charset
encoding salt.

True. I was just thinking aloud as it were, and wondering if an explanation could be found for breaking UTF-8 strings in the regex.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------
