Eryk Sun <[email protected]> added the comment:
> From Eryk's description it sounds like we should always add
> WC_NO_BEST_FIT_CHARS as an option to MultiByteToWideChar()
> in order to make sure it doesn't use best fit variants
> unless explicitly requested.
The concept of a "best fit" encoding is unrelated to decoding with
MultiByteToWideChar(). By default WideCharToMultiByte() best-fit encodes some
otherwise unmapped ordinals to characters in the code page that have similar
glyphs. This doesn't round trip (e.g. "α" -> b"a" -> "a"). The
WC_NO_BEST_FIT_CHARS flag prevents this behavior. code_page_encode() uses
WC_NO_BEST_FIT_CHARS for legacy encodings, unless the "replace" error handler
is used.
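To illustrate why best-fit encoding doesn't round trip, here is a minimal pure-Python sketch. It doesn't call WideCharToMultiByte(). BEST_FIT is a hypothetical one-entry stand-in for the Windows best-fit table, the two helper names are made up for this example, and code page 1252 as the target is an assumption.

```python
# Hypothetical one-entry stand-in for a Windows best-fit table; the real
# tables are much larger. Code page 1252 as the target is an assumption.
BEST_FIT = {"\u03b1": "a"}  # GREEK SMALL LETTER ALPHA -> LATIN SMALL LETTER A


def encode_best_fit(text, encoding="cp1252"):
    """Emulate the default behavior: swap in look-alike characters first."""
    substituted = "".join(BEST_FIT.get(ch, ch) for ch in text)
    return substituted.encode(encoding)


def encode_no_best_fit(text, encoding="cp1252"):
    """Emulate WC_NO_BEST_FIT_CHARS: an unmapped character is an error."""
    # Python's own strict codecs never best-fit, so this raises
    # UnicodeEncodeError for "\u03b1".
    return text.encode(encoding)


encoded = encode_best_fit("\u03b1")
round_tripped = encoded.decode("cp1252")
print(encoded, round_tripped == "\u03b1")
```

The encoded byte decodes back to a plain "a", so the original character is lost, whereas the strict variant fails instead of silently substituting.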
Windows maps every value in single-byte ANSI code pages to a Unicode ordinal,
which round trips between MultiByteToWideChar() and WideCharToMultiByte().
Unless otherwise defined, a value in the range 0x80-0x9F is mapped to the
corresponding ordinal in the C1 controls block. Otherwise, values that have no
legacy definition are mapped to the private use area (U+E000-U+F8FF).
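The C1 rule can be approximated on top of Python's own codecs with a custom error handler. This is only a sketch of the mapping described above, not what Windows does internally: "c1_fallback" is a hypothetical handler name, and it relies on Python's "cp1252" table leaving 0x81 undefined.

```python
import codecs


def c1_fallback(exc):
    """Map an undefined byte in 0x80-0x9F to the same C1 control ordinal."""
    if isinstance(exc, UnicodeDecodeError):
        byte = exc.object[exc.start]
        if 0x80 <= byte <= 0x9F:
            # Replace the byte with the matching ordinal and resume after it.
            return chr(byte), exc.start + 1
    # Anything else (e.g. 0xAA in cp1253) stays an error.
    raise exc


# "c1_fallback" is a hypothetical handler name chosen for this sketch.
codecs.register_error("c1_fallback", c1_fallback)

# 0x80 is defined in Python's cp1252 (EURO SIGN); 0x81 is not.
decoded = b"\x80\x81".decode("cp1252", "c1_fallback")
print(ascii(decoded))
```

Bytes outside 0x80-0x9F that have no definition are deliberately left as errors here, since Windows maps those to the private use area instead.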
There is no option to make MultiByteToWideChar() fail for byte values that map
to a C1 control code. But mappings to the private use area are strictly
invalid, and MultiByteToWideChar() will fail in these cases if the flag
MB_ERR_INVALID_CHARS is used. code_page_decode() always uses this flag, but to
reliably fail one needs to pass final=True, since the codec doesn't know it's a
single-byte encoding. For example:
>>> codecs.code_page_decode(1253, b'\xaa', 'strict')
('', 0)
>>> codecs.code_page_decode(1253, b'\xaa', 'strict', True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'cp1253' codec can't decode bytes in position 0--1:
No mapping for the Unicode character exists in the target code page.
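A small wrapper can make the final=True requirement harder to forget. This is a sketch: decode_code_page() is a hypothetical helper name, and the fallback to Python's own "cpNNNN" codec on non-Windows platforms is an assumption added so the example runs anywhere. Those tables differ from the Windows ones in places.

```python
import codecs


def decode_code_page(code_page, data):
    """Strictly decode a complete byte string with a Windows code page."""
    if hasattr(codecs, "code_page_decode"):
        # Windows only. final=True marks the input as complete, so a byte
        # that maps to the private use area fails instead of being left
        # unconsumed as a possibly incomplete sequence.
        text, _consumed = codecs.code_page_decode(
            code_page, data, "strict", True)
        return text
    # Assumption for other platforms: fall back to Python's own "cpNNNN"
    # codec, whose tables differ from Windows in places.
    return data.decode(f"cp{code_page}", "strict")


try:
    decode_code_page(1253, b"\xaa")
except UnicodeDecodeError as exc:
    print("rejected:", exc.reason)
```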
Here are the mappings to the private use area in the single-byte "ANSI" code
pages:
1255 Hebrew
    0xD9    U+F88D
    0xDA    U+F88E
    0xDB    U+F88F
    0xDC    U+F890
    0xDD    U+F891
    0xDE    U+F892
    0xDF    U+F893
    0xFB    U+F894
    0xFC    U+F895
    0xFF    U+F896
Note that 0xCA is defined as the Hebrew character U+05BA [1]. The definition is
missing in the unicode.org data and Python's "cp1255" encoding.
874 Thai
    0xDB    U+F8C1
    0xDC    U+F8C2
    0xDD    U+F8C3
    0xDE    U+F8C4
    0xFC    U+F8C5
    0xFD    U+F8C6
    0xFE    U+F8C7
    0xFF    U+F8C8
1253 Greek
    0xAA    U+F8F9
    0xD2    U+F8FA
    0xFF    U+F8FB
1257 Baltic
    0xA1    U+F8FC
    0xA5    U+F8FD
There's no way to get these private use area results from code_page_decode(),
but code_page_encode() allows them. For example:
>>> codecs.code_page_encode(1253, '\uf8f9')[0]
b'\xaa'
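For anyone who wants to work with these mappings programmatically, the tables above can be captured as plain data. This is a sketch: PUA_MAPPINGS, maps_to_private_use() and private_use_char() are hypothetical names, and the values are copied from the listing in this comment rather than queried from Windows.

```python
# Private use area mappings copied from the tables above:
# {code page: {byte value: Unicode ordinal}}.
PUA_MAPPINGS = {
    1255: {0xD9: 0xF88D, 0xDA: 0xF88E, 0xDB: 0xF88F, 0xDC: 0xF890,
           0xDD: 0xF891, 0xDE: 0xF892, 0xDF: 0xF893, 0xFB: 0xF894,
           0xFC: 0xF895, 0xFF: 0xF896},
    874: {0xDB: 0xF8C1, 0xDC: 0xF8C2, 0xDD: 0xF8C3, 0xDE: 0xF8C4,
          0xFC: 0xF8C5, 0xFD: 0xF8C6, 0xFE: 0xF8C7, 0xFF: 0xF8C8},
    1253: {0xAA: 0xF8F9, 0xD2: 0xF8FA, 0xFF: 0xF8FB},
    1257: {0xA1: 0xF8FC, 0xA5: 0xF8FD},
}


def maps_to_private_use(code_page, byte_value):
    """Return True if the byte is listed as mapping to the private use area."""
    return byte_value in PUA_MAPPINGS.get(code_page, {})


def private_use_char(code_page, byte_value):
    """Return the private use character for a byte, or None if not listed."""
    ordinal = PUA_MAPPINGS.get(code_page, {}).get(byte_value)
    return None if ordinal is None else chr(ordinal)


print(ascii(private_use_char(1253, 0xAA)))
```

On Windows, the result of private_use_char(1253, 0xAA) should be the same character that the code_page_encode() call above turns back into b'\xaa'.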
---
[1] https://en.wikipedia.org/wiki/Windows-1255
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue45120>
_______________________________________