Hello,
https://lynx.invisible-island.net/lynx_help/body.html#ASSUME_CHARSET
"ASSUME_CHARSET changes the handling of documents which do not explicitly
specify a charset."
So, now I've got a document which
does specify a charset, and that charset
is... *wrong* (Microsoft, again?).
My use case / environment is:
- mail MUA: Mutt 2.2.14 (2025-02-20)
- mailcap entry
text/html; lynx -assume_charset=%{charset} -display_charset=utf-8 -collapse_br_tags -dump %s; nametemplate=andi_%s.html; copiousoutput
I received an email with the following parts:
I 1 <no description> [multipa/alternativ, 7bit, 30K]
I 2 ├─><no description> [text/plain, quoted, utf-8, 2.9K]
I 3 ├─><no description> [text/html, quoted, utf-8, 19K]
I 4 └─><no description> [text/calendar, base64, utf-8, 6.8K]
A 5 image001.png [image/png, base64, 12K]
So, as one can see, the text/html part is labelled quoted-printable / utf-8
(in the mail's MIME multipart headers, I'd assume).
HOWEVER, the HTML document itself has:
- iso-8859-1 encoding declaration
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
- UTF-8 code units (as can be seen from the
quoted-printable transfer encoding, where "f=C3=BCr" decodes to "für"):
<p class=3D"MsoNormal" style=3D"margin-bottom:0cm;line-height:normal"><span=
style=3D"color:#003C74;mso-fareast-language:DE">vielen Dank f=C3=BCr Ihre
https://stackoverflow.com/questions/23009232/php-translating-c3bc-to-%C3%BC
This document will be either
directly processed by
mutt (i.e. lynx - via above mailcap entry), or
it can be saved to disk (mutt 'v', then 's'ave entry).
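In both cases, the MIME structure's utf-8 attribution ends up being ignored
(with the mailcap entry it is passed via -assume_charset, which only applies to
documents that do not declare a charset themselves).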
Thus, the file will be processed with its
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
encoding declaration.
So, running a plain
lynx -dump test.mre.html
yields nice
"UTF-8 code units interpreted as iso-8859-1" mojibake (even though the terminal
locale is properly configured for UTF-8, of course...).
Ditto with
links -dump test.mre.html
(ELinks 0.18.0)
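The effect can be reproduced without any browser. A minimal sketch, assuming an iconv binary is available (the helper name mojibake is made up for illustration and is not part of lynx or mutt): it reads UTF-8 bytes as if they were iso-8859-1 and re-encodes the result for a UTF-8 terminal.

```shell
# Simulate the mis-decoding: treat incoming bytes as ISO-8859-1 and
# re-encode them as UTF-8 for display on a UTF-8 terminal.
# (mojibake is a hypothetical helper name.)
mojibake() {
    iconv -f ISO-8859-1 -t UTF-8
}

# "für" as UTF-8 bytes: f, 0xC3 0xBC, r
printf 'f\303\274r\n' | mojibake
```

On a UTF-8 terminal this shows "fÃ¼r", since 0xC3 and 0xBC are each taken as one iso-8859-1 character.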
So, we have a structuring/encapsulation issue where:
- the HTML document body/content is UTF-8-based
(as can be verified via
iconv -f utf-8 -t utf-8 <file>)
- the document (the authoritative container scope unit) declares
iso-8859-1 encoding for its body/content
- "outer-scope" MIME multi-part attribution is utf-8
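The first two points can be checked mechanically. A rough sketch, assuming POSIX sed and iconv; the function names are hypothetical, and the pattern only recognises a lowercase, unquoted charset=... value such as the one Word emits:

```shell
# Succeed if the file's bytes form valid UTF-8
# (the same iconv round trip as in the list above).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Print the first charset named in a lowercase, unquoted charset=... value.
# Assumption: this is a crude text match, not an HTML parser.
declared_charset() {
    sed -n 's/.*charset=\([A-Za-z0-9._:-]\{1,\}\).*/\1/p' "$1" | head -n 1
}

# Report the declaration and the actual byte-level finding side by side.
charset_report() {
    if is_valid_utf8 "$1"; then bytes="valid UTF-8"; else bytes="not valid UTF-8"; fi
    printf 'declared: %s, bytes: %s\n' "$(declared_charset "$1")" "$bytes"
}
```

Calling charset_report on the saved HTML part should then show the iso-8859-1 declaration next to a "valid UTF-8" verdict. Note that a pure-ASCII file also passes the UTF-8 check, so the verdict only means something when the file contains non-ASCII bytes.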
When this mail hit my inbox, I thought:
"oh well, will take some time to get this encoding mojibake annoyance treated".
I spent well over 30 minutes
trying to work around this via configuration... in vain.
(I actually had to resort to
a much more coarse resolution:
disable mutt's
auto_view text/html
i.e. downgrade(?) from HTML processing to text processing).
"ASSUME_CHARSET changes the handling of documents which do not explicitly
specify a charset."
I tried to bend various lynx config items or cmdline options, yet
nothing worked to get this encoding trainwreck bent properly.
(and I'm talking about a "live" setup within mutt, thus
dirtily tweaking an existing HTML file is
pretty much out of the question).
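One direction that avoids touching files on disk would be a filter stage in front of lynx. A minimal, untested sketch, assuming that lynx's -stdin and -force_html options can be combined with -dump, and that the declaration uses a lowercase, unquoted charset=... value; override_meta_charset is a hypothetical name, and the match is crude enough to also rewrite any "charset=..." text in the body:

```shell
# override_meta_charset CHARSET
# Copy stdin to stdout, replacing the value of every lowercase, unquoted
# charset=... occurrence with CHARSET.
# Assumptions: CHARSET contains no "/" or "&"; this is a plain text
# substitution, not an HTML-aware rewrite.
override_meta_charset() {
    sed "s/charset=[A-Za-z0-9._:-]\{1,\}/charset=$1/g"
}

# Possible use (not executed here):
#   override_meta_charset utf-8 < test.mre.html |
#       lynx -stdin -force_html -display_charset=utf-8 -dump
```

Hooking this into mailcap (for example via a wrapper script that receives %{charset} and %s) is left open here; it remains a workaround for broken input, not an answer to the documentation question.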
Current lynx.cfg says:
# Lynx normally translates characters from a document's charset to display
# charset, using ASSUME_CHARSET value (see below) if the document's charset
# is not specified explicitly. Raw (CJK) mode is OFF for this case.
# When the document charset is specified explicitly, that charset
# overrides any assumption like ASSUME_CHARSET or raw (CJK) mode.
So, BUG:
The ASSUME_CHARSET docs seem to be *insufficient*:
they only cover the case where no charset is specified.
Yet here we do have a charset declaration, and it's *wrong*.
There seems to be no mention of
which last-ditch efforts (override / workaround / bending / tweaking)
one would then be able to resort to.
If such functionality is not provided, then
the docs should clearly say so (and also: why! - "XXX JUSTIFICATION").
Though one could still ask the hard question of
whether one should even be providing last-ditch bending possibilities for
b0rken input (Microsoft-originating data),
since providing them could be judged as
not encouraging broken software / mis-configuration to
get fixed in a timely manner.
Shortened (MRE) file available on request (avoiding publication of private
content).
Thank you!!
Greetings
Andreas Mohr
--
GNU/Linux. It's not the software that's free, it's you.