https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656
Bug ID: 7656
Summary: UTF8 rules, normalize_charset etc overhaul
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Hardware: All
OS: All
Status: NEW
Severity: blocker
Priority: P2
Component: Libraries
Assignee: [email protected]
Reporter: [email protected]
Target Milestone: Undefined
There are a few related bugs, but I'm creating a new one to oversee this work.
I don't think we should release 4.0.0 before all UTF8-related functionality
works adequately and is documented properly.
I ran a few tests with a message that contains either latin1- or utf8-encoded
text (or simple HTML without any encoding declarations). Each was tested in
three variants: Content-Type (ct) missing, specified as utf8, or specified as
latin1.
body RULE_LATIN1 /päivää/
body RULE_UTF8 /pÀivÀÀ/
TEXT/PLAIN                   normalize_charset 0   normalize_charset 1
utf8 message, no ct          RULE_UTF8             RULE_UTF8
utf8 message, utf8 ct        RULE_UTF8             RULE_UTF8
utf8 message, latin1 ct      RULE_UTF8             RULE_UTF8
latin1 message, no ct        RULE_LATIN1           <no hits>
latin1 message, utf8 ct      RULE_LATIN1           <no hits>
latin1 message, latin1 ct    RULE_LATIN1           RULE_UTF8

TEXT/HTML                    normalize_charset 0   normalize_charset 1
utf8 message, no ct          RULE_UTF8             RULE_UTF8
utf8 message, utf8 ct        RULE_UTF8             RULE_UTF8
utf8 message, latin1 ct      RULE_UTF8             RULE_UTF8
latin1 message, no ct        RULE_UTF8             <no hits>
latin1 message, utf8 ct      RULE_UTF8             <no hits>
latin1 message, latin1 ct    RULE_UTF8             RULE_UTF8
- With normalize_charset 1, a latin1 message doesn't hit either rule unless it
contains Content-Type..ISO-8859-1 ??
- The HTML parser apparently assumes everything is UTF8. Only the UTF8 rule
ever matches?
One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match
umlauts (diacritics?) in UTF8 messages, since each of them obviously takes up
two bytes there, not one.
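
The two-byte problem is easy to reproduce outside SpamAssassin. The Python
sketch below is only an illustration, not SpamAssassin code: it assumes that a
body rule is matched against the raw bytes of the message, which is what the
results above suggest, and the names (wildcard_rule, hits) are made up for the
example.

```python
import re

word = "päivää"
latin1_body = word.encode("latin-1")  # b'p\xe4iv\xe4\xe4', 6 bytes
utf8_body = word.encode("utf-8")      # b'p\xc3\xa4iv\xc3\xa4\xc3\xa4', 9 bytes

# Byte-level equivalent of "body RULE_FOO /p.iv../": each "." is one byte.
wildcard_rule = re.compile(rb"p.iv..")

# A byte-level rule that covers both encodings has to spell out both forms
# of the umlaut explicitly.
both_rule = re.compile(rb"p(?:\xe4|\xc3\xa4)iv(?:\xe4|\xc3\xa4){2}")


def hits(pattern, body):
    """Return True if the compiled byte pattern matches anywhere in body."""
    return pattern.search(body) is not None


latin1_hit = hits(wildcard_rule, latin1_body)  # one byte per umlaut
utf8_hit = hits(wildcard_rule, utf8_body)      # two bytes per umlaut
print("latin1:", latin1_hit, "utf8:", utf8_hit)
print("both_rule:", hits(both_rule, latin1_body), hits(both_rule, utf8_body))
```

The wildcard matches the latin1 bytes but not the utf8 bytes, because the
single "." after "p" consumes only the first byte of the two-byte umlaut.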
Let's not even get into other things yet, like sa-compile (bug 7645), textcat,
etc., which all expect some correct encoding to work.
Unless people want to use multiple rules to match non-utf8 and utf8 messages,
perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8
internally and match them against a utf8-upgraded body. In that case the two
rules above would actually be duplicates and would work on any message.
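
A rough sketch of that idea, in Python rather than SpamAssassin's Perl and
purely as an illustration: both the rule source and the body are decoded to
character strings before matching. The helper names (to_text, rule_hits) and
the "valid UTF-8, else Latin-1" fallback are assumptions made for the example.
A real implementation would presumably prefer the declared charset, since some
Latin-1 byte sequences also happen to be valid UTF-8.

```python
import re


def to_text(raw: bytes) -> str:
    """Decode as UTF-8 when the bytes are valid UTF-8, else as Latin-1.

    This fallback is an assumption for the sketch, not SpamAssassin behavior.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")


def rule_hits(rule_source: bytes, body: bytes) -> bool:
    """Match a rule against a body after upgrading both to characters."""
    pattern = re.compile(to_text(rule_source))
    return pattern.search(to_text(body)) is not None


rules = {
    "RULE_LATIN1": "päivää".encode("latin-1"),
    "RULE_UTF8": "päivää".encode("utf-8"),
}
bodies = {
    "latin1 message": "Hyvää päivää".encode("latin-1"),
    "utf8 message": "Hyvää päivää".encode("utf-8"),
}

# Every rule is tried against every body, whatever their original encodings.
results = {
    (rule_name, body_name): rule_hits(rule, body)
    for rule_name, rule in rules.items()
    for body_name, body in bodies.items()
}
for key, hit in results.items():
    print(key, hit)
```

Under these assumptions both rules hit both messages, so they become
duplicates, and a wildcard such as /p.iv../ matches as well because each
umlaut is a single character after decoding.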