[Bug 7091] UTF-8 characters don't work in rules

bugzilla-daemon Tue, 24 Feb 2015 18:08:21 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7091


--- Comment #5 from Mark Martinec <[email protected]> ---
> normalize_charset is a local configuration option - it can be disabled.
> 
> A rule written for use when normalize_charset is enabled will generally be
> simpler than one that needs to directly deal with multiple encodings. Is
> there a way to write rule alternatives such that one will be used when the
> normalize_charset option is enabled and the other when it is not? I'm
> wondering if there is something similar to rule variants using or not using
> a perl-5.10-ism switched by
>   "if can(Mail::SpamAssassin::Conf::perl_min_version_5010000)"
> Is there no way to intelligently choose between different rules based on
> such configuration options? That kinda leaves us with unwelcome
> alternatives: write for one mode and ignore the other (which will probably
> be broken if you write to normalized text or inefficient and complex if you
> write to non-normalized) or write two rules (which will be double the work
> to scan - not recommended at all).
> Do we need a "can(Mail::SpamAssassin::Conf::normalize_enabled)" or some such?

I understand your concern and it is mine too. A can(...normalize_enabled)
test might serve as a temporary workaround. But rules are parsed in the
order of config file names, so one would have to make sure that the
normalize_charset setting is seen before any rule that depends on the
can() - probably impractical.
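
For illustration only, the kind of conditional discussed above might look
roughly like the sketch below. Note that normalize_enabled is hypothetical
(it is only the name floated in the quoted comment, not an existing method),
the rule name is made up, and the sketch assumes that normalized text reaches
'body' rules as UTF-8 octets. It would also only behave as intended if the
normalize_charset setting had already been parsed when this file is read.

```
# HYPOTHETICAL: normalize_enabled does not exist; name taken from the comment above
if can(Mail::SpamAssassin::Conf::normalize_enabled)
  # normalized text: only the UTF-8 octets for "e acute" need matching (assumption)
  body      EXAMPLE_CAFE  /caf\xC3\xA9/i
else
  # non-normalized text: must cope with both UTF-8 and ISO-8859-1 octets
  body      EXAMPLE_CAFE  /caf(?:\xC3\xA9|\xE9)/i
endif
describe    EXAMPLE_CAFE  Illustrative rule only
score       EXAMPLE_CAFE  0.001
```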

I'm toying with the idea of letting Unicode propagate further, closer to
rules and plugins, which would give each of them the choice to encode to
UTF-8 octets or to work in Unicode characters. A notable piece of
information is that utf8::encode() is *very* quick when it has nothing to
do, e.g. when its argument is a string with the utf8 flag turned on (i.e.
it is already in Unicode characters). But that's probably not something
that could go into 3.4.1, as it opens a can of worms.

There is another option for rules that want to deal with the
original message encoding: use 'rawbody' instead of 'body' rules.
Both 'rawbody' and 'body' do the MIME and QP/Base64 decoding.
That is where 'rawbody' stops, while 'body' goes on to do the optional
normalization, followed by HTML parsing, stripping of whitespace
and chopping into paragraphs.
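
A minimal sketch of the difference, with made-up rule names and a made-up
pattern: the 'rawbody' rule is written against the octets of the original
charset (ISO-8859-1 is assumed here), while the 'body' rule is written
against the text after the optional normalization (UTF-8 octets are assumed).

```
# 'rawbody': MIME and QP/Base64 decoded, charset left as sent (ISO-8859-1 assumed)
rawbody   EXAMPLE_RAW_CAFE   /caf\xE9/i
describe  EXAMPLE_RAW_CAFE   Illustrative rule on the original encoding
score     EXAMPLE_RAW_CAFE   0.001

# 'body': additionally normalized (if enabled), HTML parsed, whitespace stripped
body      EXAMPLE_BODY_CAFE  /caf\xC3\xA9/i
describe  EXAMPLE_BODY_CAFE  Illustrative rule on the normalized text
score     EXAMPLE_BODY_CAFE  0.001
```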

-- 
You are receiving this mail because:
You are the assignee for the bug.
