[issue45120] Windows cp encodings "UNDEFINED" entries update

Eryk Sun Thu, 16 Sep 2021 15:11:56 -0700

Eryk Sun <eryk...@gmail.com> added the comment:

> in CP1252, bytes \x81 \x8d \x8f \x90 \x9d map to "UNDEFINED", 
> whereas in bestfit1252, they map to \u0081 \u008d \u008f 
> \u0090 \u009d respectively


This is the normal mapping in Windows, not a best-fit encoding. Within Windows, 
you can access the native encoding via codecs.code_page_encode() and 
codecs.code_page_decode(). For example:

    >>> codecs.code_page_encode(1252, '\x81\x8d\x8f\x90\x9d')[0]
    b'\x81\x8d\x8f\x90\x9d'

    >>> codecs.code_page_decode(1252, b'\x81\x8d\x8f\x90\x9d')[0]
    '\x81\x8d\x8f\x90\x9d'
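
For comparison, here is a minimal sketch of how Python's own "cp1252" codec treats the same five bytes on any platform. The helper name python_cp1252_rejects is made up for illustration and is not part of any library.

```python
# The five byte values quoted above as "UNDEFINED" in CP1252.
UNDEFINED = b"\x81\x8d\x8f\x90\x9d"


def python_cp1252_rejects(byte_value):
    """Return True if Python's cp1252 codec refuses this value both ways.

    Hypothetical helper for illustration only.
    """
    try:
        bytes([byte_value]).decode("cp1252")
        return False  # the byte decodes, so it is mapped
    except UnicodeDecodeError:
        pass
    try:
        chr(byte_value).encode("cp1252")
        return False  # the same-numbered code point encodes
    except UnicodeEncodeError:
        return True


rejected = [python_cp1252_rejects(b) for b in UNDEFINED]
print(rejected)  # [True, True, True, True, True]
```

Unlike the code_page_* functions, this does not go through the Windows API, so it behaves the same on every platform.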

WinAPI WideCharToMultiByte() uses a best-fit encoding unless the flag 
WC_NO_BEST_FIT_CHARS is passed. For example, with code page 1252, Greek "α" is 
best-fit encoded as Latin b"a". code_page_encode() uses the native best-fit 
encoding when the "replace" error handler is specified. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')[0]
    b'a'
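
As a rough sketch of the difference, Python's own "cp1252" codec has no best-fit table, so the "replace" handler substitutes "?" instead. The hasattr() guard is my assumption about how to keep the snippet runnable off Windows, where code_page_encode() is not available.

```python
import codecs

text = "α"

# Python's charmap-based cp1252 codec: no best-fit, "replace" gives "?".
portable = text.encode("cp1252", "replace")
print(portable)  # b'?'

# With the default "strict" handler the same codec raises instead.
try:
    text.encode("cp1252")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# code_page_encode() only exists on Windows builds of Python.
if hasattr(codecs, "code_page_encode"):
    native = codecs.code_page_encode(1252, text, "replace")[0]
    print(native)
```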

Regarding Python's encodings, if you need a specific mapping to match Windows, 
I think this should be discussed on a case-by-case basis. I see no benefit to 
supporting a mapping such as "\x81" <-> b"\x81" in code page 1252. That it's 
not mapped in Python is possibly a small benefit, since to some extent it 
helps to catch a mismatched encoding. For example, code page 1251 (Cyrillic) 
maps byte b"\x81" to "Ѓ" (i.e. "\u0403").
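
A small sketch of that kind of mismatch, assuming text that was written as code page 1251 and is then read back as 1252. The "latin-1" line is only there as a contrast, since that codec maps every byte and therefore cannot flag the error.

```python
# "Ѓ" (U+0403) is the single byte 0x81 in code page 1251.
data = "Ѓ".encode("cp1251")
print(data)  # b'\x81'

# Python's cp1252 codec leaves 0x81 unmapped, so the wrong guess is caught.
try:
    data.decode("cp1252")
    mismatch_caught = False
except UnicodeDecodeError:
    mismatch_caught = True
print(mismatch_caught)  # True

# A codec that maps every byte decodes silently to a C1 control character.
silent = data.decode("latin-1")
print(repr(silent))  # '\x81'
```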

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45120>
_______________________________________