Eryk Sun added the comment:
Thanks, Serhiy. When I looked at this previously, I mistakenly assumed that any
undefined codes would be decoded using the codepage's default Unicode
character. But for single-byte codepages, Windows instead maps undefined codes
in the range above 0x9F to the Private Use Area (PUA). For example, using
decode() from above:
    ERROR_NO_UNICODE_TRANSLATION = 0x0459
    codepages = 857, 864, 874, 1253, 1255, 1257
    for cp in codepages:
        undefined = []
        for i in range(256):
            b = bytes([i])
            try:
                decode(cp, b)
            except OSError as e:
                if e.winerror == ERROR_NO_UNICODE_TRANSLATION:
                    c = decode(cp, b, False)
                    undefined.append('{:02x}=>{:04x}'.format(ord(b), ord(c)))
        print(cp, *undefined, sep=', ')
Output:
    857, d5=>f8bb, e7=>f8bc, f2=>f8bd
    864, a6=>f8be, a7=>f8bf, ff=>f8c0
    874, db=>f8c1, dc=>f8c2, dd=>f8c3, de=>f8c4, fc=>f8c5, fd=>f8c6,
         fe=>f8c7, ff=>f8c8
    1253, aa=>f8f9, d2=>f8fa, ff=>f8fb
    1255, d9=>f88d, da=>f88e, db=>f88f, dc=>f890, dd=>f891, de=>f892,
          df=>f893, fb=>f894, fc=>f895, ff=>f896
    1257, a1=>f8fc, a5=>f8fd
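
The decode() helper comes from earlier in the thread and is not reproduced
in this message. For readers following along, here is a minimal sketch of what
such a helper might look like, assuming it wraps the Windows
MultiByteToWideChar function via ctypes. The parameter name `strict` and the
overall structure are assumptions made for illustration, not the original
definition; the only behavior taken from this message is that a false third
argument decodes without the MB_ERR_INVALID_CHARS flag.

```python
import ctypes
import sys

# Documented flag value for MultiByteToWideChar.
MB_ERR_INVALID_CHARS = 0x00000008

def decode(codepage, data, strict=True):
    """Hypothetical stand-in for the decode() helper used in the example.

    With strict=True the MB_ERR_INVALID_CHARS flag is passed, so undefined
    bytes raise OSError; with strict=False Windows picks its own mapping.
    """
    if not data:
        return ''
    if sys.platform != 'win32':
        raise NotImplementedError('MultiByteToWideChar requires Windows')
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    flags = MB_ERR_INVALID_CHARS if strict else 0
    # First call: ask how many UTF-16 code units the result needs.
    size = kernel32.MultiByteToWideChar(codepage, flags, data, len(data),
                                        None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    buf = ctypes.create_unicode_buffer(size)
    # Second call: perform the conversion into the buffer.
    size = kernel32.MultiByteToWideChar(codepage, flags, data, len(data),
                                        buf, size)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:size]
```

ctypes.WinError builds an OSError whose winerror attribute carries the
GetLastError value, which is what the loop above compares against
ERROR_NO_UNICODE_TRANSLATION.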
Do you think Python's 'replace' handler should prevent adding the
MB_ERR_INVALID_CHARS flag for PyUnicode_DecodeCodePageStateful? One benefit is
that the PUA code can be encoded back to the original byte value:
>>> codecs.code_page_encode(1257, '\uf8fd')
(b'\xa5', 1)
> cp932: 0xA0, 0xFD, 0xFE, 0xFF are errors instead of mapping to U+F8F0-U+F8F3.
Windows maps these byte values to PUA codes if the MB_ERR_INVALID_CHARS flag
isn't used:
>>> decode(932, b'\xa0\xfd\xfe\xff', False)
'\uf8f0\uf8f1\uf8f2\uf8f3'
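
Since the non-strict mapping lands in the Private Use Area, a caller could in
principle spot such bytes after decoding. The following is only a rough
sketch of that idea; find_pua is a hypothetical helper name, and the check
relies on the Unicode general category 'Co' (private use) rather than on
anything specific to Windows.

```python
import unicodedata

def find_pua(text):
    """Return (index, code point) pairs for private-use characters in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text)
            if unicodedata.category(ch) == 'Co']

# The string shown above for cp932 decoded without MB_ERR_INVALID_CHARS.
decoded = '\uf8f0\uf8f1\uf8f2\uf8f3'
print(['{:04x}'.format(cp) for _, cp in find_pua(decoded)])
# prints ['f8f0', 'f8f1', 'f8f2', 'f8f3']
```

This cannot tell an undefined byte apart from a codepage that deliberately
maps a byte to a private-use character, so it is a heuristic at best.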
----------
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue28712>
_______________________________________