Eryk Sun added the comment:

The ANSI and OEM codepages are conveniently supported on a Windows system as the encodings 'mbcs' and 'oem' (new in 3.6). The best-fit mapping is used by the 'replace' error handler (see the encode_code_page_flags function in Objects/unicodeobject.c). For other Windows codepages, while it's not as convenient, you can use codecs.code_page_encode. For example:

    >>> codecs.code_page_encode(1252, 'α', 'replace')
    (b'a', 1)

For decoding, MB_ERR_INVALID_CHARS has no effect on single-byte codepages because they map every byte. It only affects byte sequences that are invalid in multibyte codepages such as 932 and 65001. Without this flag, invalid sequences are silently decoded as the codepage's Unicode default character. This is usually "?", but for 932 it's Katakana middle dot (U+30FB), and for UTF-8 it's U+FFFD.

codecs.code_page_decode uses MB_ERR_INVALID_CHARS in almost all cases; the exception is UTF-7 (see the decode_code_page_flags function). So its 'replace' error handling is completely Python's own implementation. For example:

MultiByteToWideChar without MB_ERR_INVALID_CHARS:

    >>> decode(932, b'\xe05', strict=False)
    '\u30fb'

versus code_page_decode:

    >>> codecs.code_page_decode(932, b'\xe05', 'replace', True)
    ('\ufffd5', 2)

----------
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue28712>
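
The decode() helper used in the comparison above is not defined in this message. What follows is only a rough sketch of how such a helper might be written with ctypes, assuming it is a thin wrapper around MultiByteToWideChar. The names decode and compare, the strict parameter, and the early return for empty input are assumptions made for illustration; they are not part of Python or of the original message. The sketch only does real work on Windows; elsewhere compare() returns None.

```python
import codecs
import sys

# Value of the WinAPI flag MB_ERR_INVALID_CHARS.
MB_ERR_INVALID_CHARS = 0x00000008


def decode(code_page, data, strict=True):
    """Hypothetical helper: decode bytes by calling MultiByteToWideChar.

    With strict=False the flag is omitted, so Windows substitutes the
    codepage's default character for invalid sequences instead of failing.
    Windows only. Does not special-case codepages that require flags == 0.
    """
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    mb2wc = kernel32.MultiByteToWideChar
    mb2wc.argtypes = (wintypes.UINT, wintypes.DWORD,
                      ctypes.c_char_p, ctypes.c_int,
                      wintypes.LPWSTR, ctypes.c_int)
    mb2wc.restype = ctypes.c_int

    if not data:
        return ''
    flags = MB_ERR_INVALID_CHARS if strict else 0

    # First call: ask how many wide characters the result needs.
    size = mb2wc(code_page, flags, data, len(data), None, 0)
    if size == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Second call: perform the conversion into a buffer of that size.
    buf = ctypes.create_unicode_buffer(size)
    written = mb2wc(code_page, flags, data, len(data), buf, size)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())
    return buf[:written]


def compare(code_page, data):
    """Return (WinAPI result without the flag, code_page_decode result).

    Returns None when not running on Windows, because neither
    MultiByteToWideChar nor codecs.code_page_decode is available there.
    """
    if sys.platform != 'win32':
        return None
    lenient = decode(code_page, data, strict=False)
    text, _consumed = codecs.code_page_decode(code_page, data, 'replace', True)
    return lenient, text


if __name__ == '__main__':
    print(compare(932, b'\xe05'))
```

On a Windows system, compare(932, b'\xe05') would be expected to show the difference described above: the codepage's default character from the raw API call, against U+FFFD from Python's own 'replace' handling.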