On Tue, Feb 9, 2016 at 3:21 AM, Victor Stinner <victor.stin...@gmail.com> wrote:
> 2016-02-09 1:37 GMT+01:00 eryk sun <eryk...@gmail.com>:
>> For example, in codepage 932 (Japanese), it's an error if a lead byte
>> (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a
>> value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not
>> uncommon). In this case the ANSI API substitutes the default character
>> for Japanese, '・' (U+30FB, Katakana middle dot).
>>
>>     >>> locale.getpreferredencoding()
>>     'cp932'
>>     >>> open(b'\xe05', 'w').close()
>>     >>> os.listdir('.')
>>     ['・']
>>     >>> os.listdir(b'.')
>>     [b'\x81E']
>
> Hum, I'm not sure that I understand your example.

Say I create a sequence of files with the names "file_à[N].txt" encoded
in Latin-1, where N is 0-2. They all map to the same file in a Japanese
system locale:

    >>> open(b'file_\xe00.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> open(b'file_\xe01.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> open(b'file_\xe02.txt', 'w').close(); os.listdir('.')
    ['file_・.txt']
    >>> os.listdir(b'.')
    [b'file_\x81E.txt']

This isn't a problem with a single-byte codepage such as 1251. For
example, codepage 1251 doesn't map b"\x98" to any character, but
harmlessly maps it to "\x98" (SOS in the C1 Controls block).

Single-byte code pages still have the problem that when a filename is
created using the wide-character API, listing it as bytes may use
either an approximate mapping (e.g. "à" => "a" in 1251) or the
codepage default character (e.g. "\xd7" => "?" in 1251).
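
To make the collision easier to see without a Japanese-locale Windows
machine, here is a rough pure-Python model of the substitution rule
described above. It is only a sketch: `is_lead_byte` and
`ansi_decode_model` are hypothetical names made up for illustration, the
real conversion happens inside Windows, and the model assumes that an
invalid lead/trail pair is consumed as a unit and replaced by the default
character, which is what the listdir output above suggests.

```python
# Simplified model of how a cp932 ANSI byte name might be turned into
# a Unicode name. Not the actual Windows implementation.

DEFAULT_CHAR = '\u30fb'  # Katakana middle dot, the cp932 default character


def is_lead_byte(b):
    """Return True if b is in the cp932 lead-byte ranges quoted above."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def ansi_decode_model(name):
    """Decode a bytes filename, substituting DEFAULT_CHAR for bad pairs."""
    out = []
    i = 0
    while i < len(name):
        b = name[i]
        if is_lead_byte(b) and i + 1 < len(name):
            pair = name[i:i + 2]
            if name[i + 1] < 0x40:
                # Invalid trail byte (assumption: both bytes are consumed).
                out.append(DEFAULT_CHAR)
            else:
                try:
                    out.append(pair.decode('cp932'))
                except UnicodeDecodeError:
                    out.append(DEFAULT_CHAR)
            i += 2
        else:
            # Single-byte character, or a lone lead byte at the end.
            out.append(bytes([b]).decode('cp932', errors='replace'))
            i += 1
    return ''.join(out)


names = [b'file_\xe00.txt', b'file_\xe01.txt', b'file_\xe02.txt']
decoded = {ansi_decode_model(n) for n in names}
print(decoded)  # all three byte names collapse to one Unicode name
print([n.encode('cp932') for n in decoded])
```

In this model the three distinct byte names decode to a single Unicode
name, and encoding that name back to cp932 gives the b'\x81E' form seen
in the bytes listing.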