Eryk Sun <[email protected]> added the comment:
Rafael, I was discussing code_page_decode() and code_page_encode() both as an
alternative for compatibility with other programs and as a way to explore how
MultiByteToWideChar() and WideCharToMultiByte() work, in particular to explain
best-fit mappings, which do not roundtrip. Best fit applies only to encoding:
MultiByteToWideChar() does not exhibit "best fit" behavior. I don't even know
what that would mean in the context of decoding.
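As a rough sketch of what "do not roundtrip" means, the helper below encodes
with the 'replace' handler and decodes the result again. The helper name
best_fit_roundtrips() is made up for illustration, and I'm assuming the
Windows-only codecs.code_page_encode() and codecs.code_page_decode() functions
return (result, consumed) tuples; on other platforms the helper just returns
None.

```python
import codecs


def best_fit_roundtrips(text, code_page=1252):
    """Return True/False on Windows, or None if the code-page codec is missing.

    Hypothetical helper: encode with 'replace' (which allows best-fit
    mappings on Windows), decode the bytes again, and compare.
    """
    encode = getattr(codecs, "code_page_encode", None)
    decode = getattr(codecs, "code_page_decode", None)
    if encode is None or decode is None:
        return None  # not running on Windows
    data, _ = encode(code_page, text, "replace")
    decoded, _ = decode(code_page, data, "strict", True)
    return decoded == text


print(best_fit_roundtrips("abc"))
print(best_fit_roundtrips("\u03b1"))
```

On Windows I would expect the second call to report False for code page 1252,
because the best-fit b'a' decodes back to "a" rather than "α".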
With the exception of one change to code page 1255, the definitions that you're
looking to add are just for the C1 controls and private use area codes, which
are not meaningful. Windows uses these arbitrary definitions to be able to
roundtrip between the system ANSI and Unicode APIs.
Note that Python's "mbcs" (i.e. "ansi") and "oem" encodings use the code-page
codec. For example:
>>> import _winapi
>>> _winapi.GetACP()
1252
>>> '\x81\x8d\x8f\x90\x9d'.encode('ansi')
b'\x81\x8d\x8f\x90\x9d'
Best-fit encoding of "α" in code page 1252 [1]:
>>> 'α'.encode('ansi', 'replace')
b'a'
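For comparison, here is a small platform-independent sketch that lists the
byte values Python's table-based "cp1252" codec refuses to decode. The helper
name undefined_bytes() is made up for illustration. As far as I know these are
the same five bytes that the 'ansi' example above passes through unchanged,
but the result depends on the Python version's mapping table.

```python
def undefined_bytes(encoding):
    """Return the byte values that `encoding` cannot decode on their own."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


print([hex(b) for b in undefined_bytes("cp1252")])
```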
In your PR, the change to code page 1255 to add b"\xca" <-> "\u05ba" is the
only change that I think is really worthwhile because the unicode.org data has
it wrong. You can get the proper character name for the comment using the
unicodedata module:
>>> import unicodedata
>>> print(unicodedata.name('\u05ba'))
HEBREW POINT HOLAM HASER FOR VAV
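If it helps, a mapping-table line with its comment could be generated along
these lines. This is only a sketch: mapping_comment() is a made-up helper, and
the exact spacing of the entries in the generated encodings files is an
assumption on my part.

```python
import unicodedata


def mapping_comment(byte_value, char):
    """Format one decoding-table entry with the character name as a comment."""
    # Fall back to "UNDEFINED" for code points that have no Unicode name.
    name = unicodedata.name(char, "UNDEFINED")
    return "{}   #  0x{:02X} -> {}".format(ascii(char), byte_value, name)


print(mapping_comment(0xCA, "\u05ba"))
```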
I'm +0 in favor of leaving the mappings undefined where Windows completes
legacy single-byte code pages by using C1 control codes and private use area
codes. It would have been fine if Python's code-page encodings had always been
based on the "WindowsBestFit" tables, but using only the decoding table
(MBTABLE), since that one is reasonable.
Ideally, I don't want anything to use the best-fit mappings in WCTABLE. I would
rather that the 'replace' handler for code_page_encode() used the replacement
character (U+FFFD) or system default character. But the world is not ideal; the
system ANSI API uses the WCTABLE best-fit encoding. Back in the day with Python
2.7, it was easy to demonstrate how insidious this is. For example, in 2.7.18:
>>> import os
>>> os.listdir(u'.')
[u'\u03b1']
>>> os.listdir('.')
['a']
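To illustrate the behavior I would prefer, Python's table-based "cp1252" codec
does not use WCTABLE, so its 'replace' handler substitutes "?" instead of a
best-fit "a". The helper name encode_without_best_fit() is made up for this
sketch.

```python
def encode_without_best_fit(text, encoding="cp1252"):
    """Encode with a table-based codec; unencodable characters become b'?'."""
    return text.encode(encoding, "replace")


print(encode_without_best_fit("\u03b1"))      # b'?'
print(encode_without_best_fit("caf\u00e9"))   # b'caf\xe9'
```

With this approach a file named "α" could never be silently confused with one
named "a".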
---
[1] https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit1252.txt
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue45120>
_______________________________________