https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7091
--- Comment #5 from Mark Martinec <[email protected]> ---
> normalize_charset is a local configuration option - it can be disabled.
>
> A rule written for use when normalize_charset is enabled will generally be
> simpler than one that needs to directly deal with multiple encodings. Is
> there a way to write rule alternatives such that one will be used when the
> normalize_charset option is enabled and the other when it is not? I'm
> wondering if there is something similar to rule variants using or not using
> a perl-5.10-ism switched by
> "if can(Mail::SpamAssassin::Conf::perl_min_version_5010000)"

> Is there no way to intelligently choose between different rules based on
> such configuration options? That kinda leaves us with unwelcome
> alternatives: write for one mode and ignore the other (which will probably
> be broken if you write to normalized text or inefficient and complex if you
> write to non-normalized) or write two rules (which will be double the work
> to scan - not recommended at all).

> Do we need a "can(Mail::SpamAssassin::Conf::normalize_enabled)" or some such?

I understand your concern, and I share it.

A can(...normalize_enabled) test might serve as a temporary workaround.
However, since rules are parsed in the order of config file names, one
would have to make sure that the normalize_charset setting is seen before
any rule that depends on the can() test, which is probably impractical.

I'm toying with the idea of letting Unicode propagate further, closer to
rules and plugins, which would give each of them the choice to either
encode to UTF-8 octets or work in Unicode characters. Worth noting is that
utf8::encode() is *very* quick when it has nothing to do, e.g. when its
argument is a string with the utf8 flag turned on (i.e. it is in Unicode
characters). But that's probably not something that could go into 3.4.1,
as it's opening a can of worms.

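For reference, the existing version-gated pattern mentioned in the quoted
text can be sketched as below. This is only an illustration: the rule name
and the patterns are made up, and it assumes the config parser accepts the
negated form !can(...). Only the can() argument is taken from the
discussion above. No normalize_enabled feature test exists at this point,
so it is not shown.

```
# Hypothetical rule name and patterns, for illustration only.
if can(Mail::SpamAssassin::Conf::perl_min_version_5010000)
  # Variant using a perl 5.10+ regexp feature (possessive quantifier)
  body  __EXAMPLE_VARIANT  /\bcheap\s++meds\b/i
endif
if !can(Mail::SpamAssassin::Conf::perl_min_version_5010000)
  # Fallback variant for older perls (assumes !can(...) is accepted)
  body  __EXAMPLE_VARIANT  /\bcheap\s+meds\b/i
endif
```

Only one of the two definitions is compiled, so the scan cost is that of a
single rule. A gate on normalize_charset would need the same shape, with
the file-ordering caveat described above.
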
There is another option for rules that want to deal with the original
message encoding: use 'rawbody' rules instead of 'body' rules. Both
'rawbody' and 'body' perform MIME and QP/Base64 decoding. 'rawbody' stops
there, while 'body' goes on to apply the optional normalization, followed
by HTML parsing, stripping of whitespace, and chopping into paragraphs.
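
A minimal sketch of the difference, assuming a message whose text part is
in Windows-1251 and contains the word "привет", and assuming that 'body'
rules receive UTF-8 octets when normalize_charset is enabled. The rule
names are hypothetical.

```
# Hypothetical rule names; patterns are illustrative only.

# 'rawbody': text after MIME and QP/Base64 decoding, still in the original
# charset. Matches the Windows-1251 octets regardless of normalize_charset.
rawbody  __EXAMPLE_RAW_CP1251  /\xEF\xF0\xE8\xE2\xE5\xF2/

# 'body': additionally normalized (if enabled), HTML-parsed, whitespace
# stripped, and split into paragraphs. With normalize_charset enabled the
# same word is assumed to arrive as UTF-8 octets.
body     __EXAMPLE_BODY_UTF8   /\xD0\xBF\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82/
```

The 'rawbody' variant does not depend on the local normalize_charset
setting, but it also does not benefit from HTML parsing or whitespace
stripping.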
