Hello Kai, hello Kyle, hello Christian,

On Friday, August 3, 2007 at 15:37:38 +0200, Kai Grossjohann wrote:

> I (fairly) often get messages with no charset specified, or with the
> wrong charset specified, so I do Ctrl-E on them and edit the charset
> parameter to windows-1252

As Kyle said, an $assumed_charset and a set of charset-hooks will solve
most such problems, say 95%. The rare residual problem mails can then
bearably be handled manually via <edit-type>. I'll just add some
comments to the discussion:

 -1) The first argument of a charset-hook is a regexp. To avoid
annoying false positive matches, I advise you to *always* write the
strictest possible regexp for your goal. This leads to "charset-hook
^iso-8859-1$ windows-1252" and the like. Without the anchors, Latin-9
(iso-8859-15) would also be matched, and would be aliased to CP-1252.
This would break Euro signs, and is definitely unwanted.

 -2) Drop "charset-hook windows-1251 windows-1252". This aliases one
charset to an entirely different and incompatible one. It may well fix
a mislabelling in a few mails, but it will also break all mails that
are properly labelled 1251. Definitely unwanted. Such deep
mislabellings need a solution other than static charset-hooks. Perhaps
dynamic charset-hooks declared inside folder- or message-hooks (and
unhooked by default). Or, if they are rare enough, just the occasional
manual <edit-type>.

 -3) Drop "charset-hook ^us-ascii$ utf-8". This may well fix the wrong
(or missing) label in a few MIME mails (those containing UTF-8), but
will break the majority of such mails (those really containing
CP-1252). What you want is a generic fix "charset-hook ^us-ascii$
cp1252" for the majority, and additionally either some dynamic
charset-hook or a manual <edit-type> (as in point #2 above) to fix the
UTF-8 corner case.

 -4) Kyle: You listed "charset-hook none windows-1252". I don't recall
ever having seen a charset=none label. Does it really happen?

 -5) Most people should not set $charset, but set LANG and let $charset
automatically inherit the right value.

 -6) "utf8" does not exist.
Sometimes iconv knows it as an alias, but not on all platforms. It must
be spelled "utf-8", with the dash. We should really consider populating
Mutt's internal list of aliases in charset.c:mutt_canonical_charset().

 -7) $assumed_charset takes a list of charsets, right. For raw headers,
Mutt scans the list and takes the first charset in which the header is
fully valid. However for bodies, Mutt takes... item #1 in the list,
period. There is *no* charset auto-sensing for bodies.

 -8) Those charset auto-sensing lists (like $assumed_charset,
$file_charset/$attach_charset, or Vim's fileencodings) could list utf-8
first, then Latin-1 or similar, and nothing after that. The reason is
that any string is *always* valid Latin-1 (yes, even if it contains
bytes between 128 and 159), so nothing further will ever be checked.
The same applies to nearly all 256-character charsets in place of
Latin-1 (CP-125*, ISO-8859-*, CP-85*, KOI-8*, etc.). A few exceptions
do exist (example: byte 213 is invalid in CP-857), but they don't
really invalidate this rule. By contrast, UTF-8 strings are much more
specific: UTF-8 uses any bytes, but only in specific sequences. This
means that a Latin-1 text has a fairly low risk of being wrongly sensed
as UTF-8, and that if some text is valid UTF-8, then it very probably
really is UTF-8.

 -9) There is no point in listing a subset together with its superset
in $assumed_charset. The superset alone suffices.

 -10) Due to points #7 and #8, the optimal generic $assumed_charset for
westerners (i.e. for all Latin-1 centered languages) is the
mono-charset

    $assumed_charset=windows-1252

Appending anything is (practically) useless. Prepending "utf-8" would
be good for headers, but would harm bodies.

 -11) MIME mails with a "Content-Type:" header but without a charset
label are by default treated by Mutt *nearly* as if the label was
"us-ascii". However this is a borderline case, and it is affected
either by $assumed_charset *or* by a "charset-hook ^us-ascii$
something".
If both exist, $assumed_charset=blah wins over "charset-hook
^us-ascii$". But then a "charset-hook ^blah$" would have the last word
and win. There are some more subtleties I won't try to explain, but for
such mails <edit-type> (and the attachments menu) shows charset=blah,
provided $assumed_charset was "blah" when the folder was last loaded.
Runtime changes to $assumed_charset are ignored (until the next
reload).

 -12) Christian gave the state-of-the-art generic set of hooks for
westerners. People on platforms where iconv knows EUC-JP-MS (i.e. *not*
unpatched libiconv) can just add this one:

| charset-hook ^euc-jp$ euc-jp-ms

Iconv must know the target charset. Otherwise such a charset-hook is
worse than nothing.


Bye!	Alain.
-- 
Mutt muttrc tip to send mails in best adapted first necessary and
sufficient charset (version for East Europe Latin-2/CP-852/CP-1250
terminal users):
set send_charset="us-ascii:iso-8859-1:iso-8859-15:windows-1252:iso-8859-2:windows-1250:utf-8"
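
Illustration for point 8: a minimal Python sketch of "take the first charset in which the text is fully valid". This is not Mutt's code; the function name first_valid_charset and the sample strings are made up for the example, and Python's strict decoders stand in for iconv. It shows why utf-8 belongs first in the list, and why anything listed after Latin-1 is never reached.

```python
from typing import Optional, Sequence


def first_valid_charset(data: bytes, candidates: Sequence[str]) -> Optional[str]:
    """Return the first candidate charset in which `data` decodes cleanly."""
    for name in candidates:
        try:
            data.decode(name)  # strict decoding: raises on invalid sequences
        except UnicodeDecodeError:
            continue
        return name
    return None


# Hypothetical sample inputs (assumptions for the demo only).
latin1_text = "déjà vu".encode("latin-1")   # bytes 0xE9 and 0xE0, not valid UTF-8
utf8_text = "déjà vu".encode("utf-8")       # proper multi-byte sequences
c1_bytes = bytes(range(128, 160))           # the 128..159 range mentioned above

order = ["utf-8", "latin-1", "koi8-r"]      # koi8-r is never reached

for sample in (latin1_text, utf8_text, c1_bytes):
    print(first_valid_charset(sample, order))

# With Latin-1 first, even genuine UTF-8 is "valid Latin-1" and gets mis-sensed.
print(first_valid_charset(utf8_text, ["latin-1", "utf-8"]))
```

The first loop senses the Latin-1 and 128..159 samples as latin-1 and the UTF-8 sample as utf-8. The last call returns latin-1 for real UTF-8 text, which is the reason for putting the catch-all charset last and alone.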