[Bug 7133] New: Revisiting Bug 4046 - HTML::Parser: Parsing of undecoded UTF-8 will give garbage when decoding entities

bugzilla-daemon Thu, 05 Feb 2015 18:54:55 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7133


            Bug ID: 7133
           Summary: Revisiting Bug 4046 - HTML::Parser: Parsing of
                    undecoded UTF-8 will give garbage when decoding
                    entities
           Product: Spamassassin
           Version: 3.4.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]

Back in 2004, Bug 4046 was opened, noting that the HTML::Parser module
issues the warning:
  Parsing of undecoded UTF-8 will give garbage when decoding entities
when processing certain text/html mail parts.

The solution proposed there right away by Sebastian Jaenicke (the
bug submitter) was to turn on HTML::Parser's utf8_mode(), which is
mostly the right thing to do (unless we want to decode all HTML
parts into Unicode first, i.e. into perl characters).

Unfortunately, that solution was rejected in 2005, mainly because
HTML::Parser's utf8_mode requires perl 5.8 (5.7.?), which was
not considered widely deployed ten years ago.

So the chosen solution was just to mask the warning so that it
does not show up, effectively hiding the actual problem, which was
not considered serious enough, and/or possibly not fully understood.

====

Fast-forward ten years. An analysis of why some (but not all) Bayes
tokens (words) obtained from text/html mail parts look like an
unrecognizable 8-bit mess (as if encoded into UTF-8 twice in a row)
shows that the culprit is incorrect use of HTML::Parser, which
would indeed rightfully produce a warning, had it not been swept
under the carpet.

Consider the following text:

  Ne traîne pas, adieu - Et tâche d'être heureux

which, encoded with
  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

could look like this in a mail message:

  From: [email protected]
  To: [email protected]
  Subject: test
  Date: Wed, 4 Feb 2015 00:01:56 +0100
  Message-ID: <[email protected]>
  Mime-Version: 1.0
  Content-Transfer-Encoding: quoted-printable
  Content-Type: text/html; charset="utf-8"

  <html><body><p>
  Ne tra=C3=AEne pas, adieu - Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

Note that the non-ASCII characters are encoded as quoted-printable
UTF-8, except for the â character in tâche, which is represented by
the &acirc; HTML entity (a-circumflex). Both representations are
perfectly legal and equivalent in such a text/html MIME part.


Unfortunately this text (after QP decoding and HTML decoding)
ends up as:
  Ne traîne pas, adieu - Et t�che d'être heureux

which is (as octets):
  Ne tra<C3><AE>ne pas, adieu - Et t<E2>che d'<C3><AA>tre heureux

and this is then given to rules and to bayes tokenization. The
HTML::Parser does rightfully issue a warning (which is suppressed
by Mail::SpamAssassin::HTML), and the result is clearly wrong:
characters are encoded as UTF-8, except for the &acirc; entity,
which is encoded as ISO-8859-1 (i.e. Latin-1).
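
The mixed result above can be reproduced at the byte level. The
following is a minimal Python sketch, not SpamAssassin or HTML::Parser
code. It assumes a simplified model of the faulty path: the parser is
handed undecoded octets and substitutes a single Latin-1 byte for an
entity whose code point is below 256. The helper is_valid_utf8 is a
hypothetical name introduced only for this illustration.

```python
import quopri

# Body line from the sample message, still quoted-printable encoded.
qp_line = b"Ne tra=C3=AEne pas, adieu - Et t&acirc;che d'=C3=AAtre heureux"

# Step 1: QP decoding yields raw UTF-8 octets; no character decoding yet.
octets = quopri.decodestring(qp_line)

# Step 2 (assumed model of the faulty path): the entity is replaced by the
# single Latin-1 byte for U+00E2, while all surrounding octets stay as-is.
mixed = octets.replace(b"&acirc;", "\u00e2".encode("latin-1"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(mixed)                  # UTF-8 pairs plus one lone 0xE2 byte
print(is_valid_utf8(octets))  # input octets before entity decoding
print(is_valid_utf8(mixed))   # output after the byte-level substitution
```

The lone 0xE2 byte is a UTF-8 lead byte without continuation bytes, so
the string as a whole is neither valid UTF-8 nor sensible Latin-1.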


And it gets worse. Replacing the '-' with the HTML entity &mdash;,
like this:

  <html><body><p>
  Ne tra=C3=AEne pas, adieu &mdash; Et t&acirc;che d'=C3=AAtre heureux
  </body></html>

yields:
  Ne traÃ®ne pas, adieu — Et tâche d'Ãªtre heureux

which is (as octets):
  Ne tra<C3><83><C2><AE>ne pas, adieu <E2><80><94> Et t<C3><A2>che d'<C3><83><C2><AA>tre heureux


Note that just by replacing one dash with an entity (leaving
everything else unchanged), the entire encoding turns into a mojibake
scramble: characters represented by HTML entities are correct, but
each octet of every other UTF-8 octet pair is incorrectly assumed to
be a Latin-1 character and is encoded into UTF-8 again, resulting in
four octets!

====

So what is going on here? As soon as some HTML entity (like
&trade; or &euro; or &scaron; or &circ; or &tilde; ...)
which has no representation in Latin-1 appears in a UTF-8 text,
the entire text is upgraded into perl characters (i.e. Unicode,
utf8 flag on), and during this process existing bytes with codes
between 128 and 255 are considered to be in Latin-1, thus resulting
in doubly-encoded UTF-8 mumbo jumbo.
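
The upgrade step can be modelled in a few lines. This is a Python
sketch of the mechanism only, under the assumption that the upgrade
behaves like "read every existing byte as a Latin-1 character". The
order of operations inside perl and HTML::Parser differs, but the
resulting octets are the same as in the &mdash; example above.

```python
import html
import quopri

# Body line from the second sample, with the '-' replaced by &mdash;.
qp_line = b"Ne tra=C3=AEne pas, adieu &mdash; Et t&acirc;che d'=C3=AAtre heureux"
octets = quopri.decodestring(qp_line)

# Assumed model of the upgrade: each byte becomes the character with the
# same code point, i.e. bytes 128..255 are interpreted as Latin-1.
upgraded = octets.decode("latin-1")

# Entities are now decoded into real characters (U+2014, U+00E2).
text = html.unescape(upgraded)

# Encoding the character string back to UTF-8 for output encodes the
# former UTF-8 octets a second time: C3 AE becomes C3 83 C2 AE.
result = text.encode("utf-8")

print(result.hex(" "))
```

Both characters that came from entities survive intact, while each
character that was already UTF-8 in the input grows from two octets
to four.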

It also means that rules and other plugins receive such text
flagged as perl characters (utf8 flag on), so some rules may misfire
or take longer to evaluate. Just adding one  &trade;  entity ruins
the encoding of the entire HTML text!
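
For comparison, here is a Python sketch of the alternative mentioned
earlier: decoding the HTML part into Unicode characters first and only
then decoding entities. It is not the utf8_mode() approach (that is an
HTML::Parser feature with no counterpart here), and it assumes the
declared charset="utf-8" of the sample message is correct.

```python
import html
import quopri

qp_line = b"Ne tra=C3=AEne pas, adieu &mdash; Et t&acirc;che d'=C3=AAtre heureux"
octets = quopri.decodestring(qp_line)

# Decode using the charset declared in Content-Type before touching
# entities, so that all text is in one consistent representation.
chars = octets.decode("utf-8")

# Entity decoding now operates on characters, not on raw octets.
text = html.unescape(chars)

# Encode once, at the very end, if octets are needed downstream.
result = text.encode("utf-8")

print(text)
```

With this ordering, literal UTF-8 characters and characters written as
entities end up in the same representation, and each non-ASCII Latin
letter occupies two octets in the final UTF-8 output.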
