[Bug 7656] New: UTF8 rules, normalize_charset etc overhaul

bugzilla-daemon Sat, 17 Nov 2018 03:14:32 -0800

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7656


            Bug ID: 7656
           Summary: UTF8 rules, normalize_charset etc overhaul
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Hardware: All
                OS: All
            Status: NEW
          Severity: blocker
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]
  Target Milestone: Undefined

There are a few related bugs, but I'm creating a new one to oversee this.

I don't think we should release 4.0.0 before all UTF8 related functionality
works adequately and is documented properly.

I ran a few tests with a message that contains either latin1 or utf8 encoded
text (as plain text, or as simple html without any encoding declarations). Each
was tested in three variants: Content-Type ("ct") missing, utf8, or latin1.

# "päivää" as ISO-8859-1 bytes
body RULE_LATIN1 /p\xE4iv\xE4\xE4/
# "päivää" as UTF-8 bytes
body RULE_UTF8 /p\xC3\xA4iv\xC3\xA4\xC3\xA4/

TEXT/PLAIN (rule hit with normalize_charset 0 / normalize_charset 1)
utf8 message, no ct       RULE_UTF8   / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8   / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8   / RULE_UTF8
latin1 message, no ct     RULE_LATIN1 / <no hits>
latin1 message, utf8 ct   RULE_LATIN1 / <no hits>
latin1 message, latin1 ct RULE_LATIN1 / RULE_UTF8

TEXT/HTML (rule hit with normalize_charset 0 / normalize_charset 1)
utf8 message, no ct       RULE_UTF8 / RULE_UTF8
utf8 message, utf8 ct     RULE_UTF8 / RULE_UTF8
utf8 message, latin1 ct   RULE_UTF8 / RULE_UTF8
latin1 message, no ct     RULE_UTF8 / <no hits>
latin1 message, utf8 ct   RULE_UTF8 / <no hits>
latin1 message, latin1 ct RULE_UTF8 / RULE_UTF8

- With normalize_charset 1, a latin1 message doesn't hit either rule unless it
declares Content-Type charset ISO-8859-1??

- The html parser apparently assumes everything is UTF8. It only matches UTF8
rules?

One can't even use simple workarounds such as "body RULE_FOO /p.iv../" to match
umlauts (diacritics) in UTF8 messages, because each one is encoded as two bytes
and "." matches only one.
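
The byte-counting problem can be reproduced outside SpamAssassin. The sketch below uses Python rather than SpamAssassin's Perl, purely for illustration; the names `byte_rule`, `char_rule` and `hits` are hypothetical and the snippet only assumes that rules are matched against raw bytes when no decoding takes place.

```python
import re

word = "päivää"
utf8_body = word.encode("utf-8")      # each umlaut becomes two bytes
latin1_body = word.encode("latin-1")  # each umlaut becomes one byte

# The wildcard workaround, applied to raw bytes.
byte_rule = re.compile(rb"p.iv..")

def hits(pattern, data):
    """Return True if the pattern matches anywhere in data."""
    return pattern.search(data) is not None

latin1_hit = hits(byte_rule, latin1_body)  # "." covers a one-byte umlaut
utf8_hit = hits(byte_rule, utf8_body)      # "." covers only half an umlaut

# The same pattern matched against decoded text, where "." is one character.
char_rule = re.compile(r"p.iv..")
decoded_hit = hits(char_rule, utf8_body.decode("utf-8"))

print(latin1_hit, utf8_hit, decoded_hit)
```

The wildcard rule only behaves as intended once the body has been decoded to characters, which is the same precondition the rest of this report is about.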

Let's not even get into other things yet, like sa-compile (bug 7645), textcat
etc., which all expect a correct encoding in order to work.

Unless people want to use multiple rules to match non-utf8 and utf8 messages,
perhaps the only sane solution would be to "upgrade" all non-utf8 rules to utf8
internally and do the matching against a utf8 upgraded body. In that case the
two rules above would actually be duplicates and would work on any message.
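
A minimal sketch of that idea, written in Python for illustration and not taken from SpamAssassin's code: both the rule source and the message body are decoded to text before matching. The helper names `to_text` and `upgrade_rule` are hypothetical, and the decoding order (strict UTF-8 first, then the declared charset, then Latin-1) is an assumed heuristic, not a description of what normalize_charset does.

```python
import re

def to_text(raw, declared=None):
    """Decode bytes to text: strict UTF-8, then declared charset, then Latin-1."""
    for charset in ("utf-8", declared):
        if charset is None:
            continue
        try:
            return raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    # Latin-1 maps every byte to a code point, so this cannot fail.
    return raw.decode("latin-1")

def upgrade_rule(rule_source):
    """Compile a rule written in either encoding as a character-level pattern."""
    return re.compile(to_text(rule_source))

rule_latin1 = upgrade_rule("päivää".encode("latin-1"))
rule_utf8 = upgrade_rule("päivää".encode("utf-8"))

# Six cases from the tables above: two body encodings, three Content-Type variants.
results = []
for body_charset in ("utf-8", "latin-1"):
    raw_body = "Hyvää päivää".encode(body_charset)
    for ct in (None, "utf-8", "iso-8859-1"):
        text = to_text(raw_body, ct)
        results.append(bool(rule_latin1.search(text)) and bool(rule_utf8.search(text)))

print(results)
```

Under these assumptions the two rules compile to the same pattern, so they are duplicates, and both hit in every combination. Trying strict UTF-8 before the declared charset is what makes the "utf8 message, latin1 ct" case work; it relies on Latin-1 text with high bytes rarely being valid UTF-8.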

