New submission from era <[email protected]>:
The email.charset module should contain common informal character-set
identifiers even if they are not formally specified in an IANA RFC.
From a quick grep of a pile of recent email, I find the following:
46 "cp-850"
6 "windows-874"
For scale, the same collection contained around 10,000 messages with "utf-8"
and 2,000 with "iso-8859-1". Still, the fact that there are multiple
occurrences in a spool of recent messages indicates that they are fairly common.
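For anyone who wants to repeat the count on their own spool, a rough sketch follows. It is not the exact grep used above; the regular expression and the helper name count_charsets are assumptions made for illustration, and it only looks at charset= parameters in raw message text.

```python
import re
from collections import Counter

# Assumed pattern: matches charset=label or charset="label" anywhere in the text.
CHARSET_RE = re.compile(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?', re.IGNORECASE)

def count_charsets(text):
    """Count charset labels (lowercased) found in raw message text."""
    return Counter(m.group(1).lower() for m in CHARSET_RE.finditer(text))

if __name__ == "__main__":
    sample = (
        'Content-Type: text/plain; charset="cp-850"\n'
        'Content-Type: text/plain; charset=UTF-8\n'
        'Content-Type: text/html; charset="utf-8"\n'
    )
    # Print labels from most to least frequent.
    for label, n in count_charsets(sample).most_common():
        print(n, label)
```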
Currently, the email module fails with a traceback if you attempt to parse a
message whose character set is not known to Python. This cannot be prevented
in the general case, but making the module more robust against encodings that
are reasonably prevalent in the wild would definitely be desirable.
For what it's worth, "cp-850" is apparently an alias for IBM code page 850,
which is defined with the name "cp850" in RFC 1345. "windows-874" is an
official designation, detailed in
https://www.iana.org/assignments/charset-reg/windows-874, and is apparently
equivalent to the Python codec "cp874".
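As an illustration of the mapping described above, here is a minimal sketch of a lookup that translates the two informal labels to Python codec names before asking the codecs module. The table INFORMAL_ALIASES and the helper resolve_codec are hypothetical names, not part of the email package, and this is not a proposed patch.

```python
import codecs

# Hypothetical table: informal label -> Python codec name (from the report above).
INFORMAL_ALIASES = {
    "cp-850": "cp850",
    "windows-874": "cp874",
}

def resolve_codec(label):
    """Return the canonical Python codec name for a label, or None if unknown."""
    key = label.strip().lower()
    name = INFORMAL_ALIASES.get(key, key)
    try:
        return codecs.lookup(name).name
    except LookupError:
        # Unknown to Python: let the caller decide on a fallback.
        return None

if __name__ == "__main__":
    for label in ("cp-850", "Windows-874", "no-such-charset-xyz"):
        print(label, "->", resolve_codec(label))
```

Returning None instead of raising leaves the fallback policy (for example, decoding with replacement characters) to the caller.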
----------
components: email
messages: 323870
nosy: barry, era, r.david.murray
priority: normal
severity: normal
status: open
title: email.charset: common IANA labels missing
versions: Python 3.6
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue34460>
_______________________________________