Hello,

https://lynx.invisible-island.net/lynx_help/body.html#ASSUME_CHARSET

"ASSUME_CHARSET changes the handling of documents which do not explicitly 
specify a charset."


So, now I've got a document which
does specify a charset, and which
is... *wrong* (Microsoft, again?).

My use case / environment is:
- mail MUA: Mutt 2.2.14 (2025-02-20)
- mailcap entry
   text/html; lynx -assume_charset=%{charset} -display_charset=utf-8 -collapse_br_tags -dump %s; nametemplate=andi_%s.html; copiousoutput

I received an email with the following parts:

  I     1 <no description>                      [multipa/alternativ, 7bit, 30K]
  I     2 ├─><no description>                 [text/plain, quoted, utf-8, 2.9K]
  I     3 ├─><no description>                   [text/html, quoted, utf-8, 19K]
  I     4 └─><no description>              [text/calendar, base64, utf-8, 6.8K]
  A     5 image001.png                                 [image/png, base64, 12K]

So, as one can see, the text/html part is labelled quoted-printable / utf-8
(in the mail's MIME multipart structure, I'd assume).

HOWEVER, the HTML document itself has:
- iso-8859-1 encoding declaration
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Generator" content="Microsoft Word 15 (filtered medium)">
- UTF-8 code units (as can be seen from the
  transport-induced quoted-printable encoding; =C3=BC is UTF-8 for "ü"):
  <p class=3D"MsoNormal" style=3D"margin-bottom:0cm;line-height:normal"><span=
   style=3D"color:#003C74;mso-fareast-language:DE">vielen Dank f=C3=BCr Ihre
  https://stackoverflow.com/questions/23009232/php-translating-c3bc-to-%C3%BC
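
To make the mismatch concrete, here is a minimal Python sketch (not part of
my actual setup; the variable names are illustrative only) that undoes the
quoted-printable transfer encoding of the fragment above and then decodes
the resulting bytes both ways:

```python
import quopri

# Fragment as it appears in the raw mail body (quoted-printable).
raw = b"vielen Dank f=C3=BCr Ihre"

# Undo the transfer encoding to get the bytes that end up in the HTML file.
body = quopri.decodestring(raw)

# Decoding per the MIME part's attribution (utf-8).
as_utf8 = body.decode("utf-8")

# Decoding per the document's own <meta> declaration (iso-8859-1).
as_latin1 = body.decode("iso-8859-1")

print(as_utf8)
print(as_latin1)
```

The first line printed is the intended text; the second shows the two-byte
sequence C3 BC read as two separate iso-8859-1 characters.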

This document will be either
directly processed by
mutt (i.e. lynx - via above mailcap entry), or
it can be saved to disk (mutt 'v', then 's'ave entry).
In both cases, the MIME structure's utf-8 attribution will be ignored.
Thus, the file will be processed with its
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
encoding declaration.

So, given a plain
lynx -dump test.mre.html
, one will receive nice
"UTF-8 code units interpreted as iso-8859-1" mojibake (while terminal locale
is properly configured to UTF-8, of course...).

Ditto with
links -dump test.mre.html
(ELinks 0.18.0)


So, we have a structuring/encapsulation situation where:
- the HTML document body/content is UTF-8-based
  (as can be verified via
  iconv -f utf-8 -t utf-8 <file>)
- the document (the authoritative container scope unit) declares
  iso-8859-1 encoding for its body/content
- "outer-scope" MIME multi-part attribution is utf-8


When this mail hit my inbox, I thought:
"oh well, will take some time to get this encoding mojibake annoyance treated".

It took me well over 30 minutes of
trying to resolve this via configuration workarounds... in vain.
(I actually had to resort to
a much more coarse resolution:
disable mutt's
auto_view text/html
i.e. downgrade(?) from HTML processing to text processing).


"ASSUME_CHARSET changes the handling of documents which do not explicitly 
specify a charset."

I tried to bend various lynx config items or cmdline options, yet
nothing worked to get this encoding trainwreck bent properly.
(and I'm talking about a "live" setup within mutt, thus
dirtily tweaking an existing HTML file is
pretty much out of the question).
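
For illustration only, the kind of override being looked for would amount
to something like the Python sketch below. This is NOT an existing lynx
option; override_declared_charset is a hypothetical helper, and it assumes
a filter step that rewrites the first <meta ... charset=...> declaration in
a copy of the data before rendering. Whether such a filter could be wired
into a live mutt/mailcap setup is untested here.

```python
import re

# First <meta ...> tag carrying a charset=... token; group 1 is everything
# up to the charset value, group 2 is the declared value itself.
META_CHARSET_RE = re.compile(
    rb'(<meta\b[^>]*?charset\s*=\s*["\']?)([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def override_declared_charset(html: bytes, charset: str) -> bytes:
    """Replace the first declared meta charset with `charset`; body bytes stay untouched."""
    replacement = charset.encode("ascii")
    return META_CHARSET_RE.sub(lambda m: m.group(1) + replacement, html, count=1)


original = (
    b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">\n'
    b'<meta name="Generator" content="Microsoft Word 15 (filtered medium)">\n'
    b"<p>vielen Dank f\xc3\xbcr Ihre</p>\n"
)
fixed = override_declared_charset(original, "utf-8")
print(fixed.decode("utf-8"))
```

Only the declaration changes; a document without any meta charset is
returned as-is.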


Current lynx.cfg says:
# Lynx normally translates characters from a document's charset to display
# charset, using ASSUME_CHARSET value (see below) if the document's charset
# is not specified explicitly.  Raw (CJK) mode is OFF for this case.
# When the document charset is specified explicitly, that charset
# overrides any assumption like ASSUME_CHARSET or raw (CJK) mode.



So, BUG:
The ASSUME_CHARSET docs seem to be *insufficient*:
they only say that it covers the case where no charset is specified.
Yet here we do have a charset declaration, and it's *wrong*.

So, there seems to be no mention of
which last-ditch efforts (override / workaround / bending / tweaking)
one would then be able to resort to.
If such functionality is not provided, then
the docs should clearly say so (and also: why! - "XXX JUSTIFICATION").


Though one could still ask the hard question of
whether one should even be providing last-ditch bending possibilities for
b0rken input (Microsoft-originating data).
(since this could be judged as
not encouraging broken software / mis-configuration
to get fixed in a timely manner).


Shortened (MRE) file available on request (avoiding publication of private 
content).

Thank you!!

Greetings

Andreas Mohr

-- 
GNU/Linux. It's not the software that's free, it's you.
