On Mon, Feb 8, 2016 at 2:41 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> Just to clarify -- what does it currently do for bytes? IIUC, Windows uses
> UTF-16, so can you pass in UTF-16 bytes? Or when using bytes is is assuming
> some Windows ANSI-compatible encoding? (and what does it return?)

UTF-16 is used in the [W]ide-character API. Bytes paths use the [A]NSI
codepage. For a single-byte codepage, the ANSI API rountrips, i.e. a
bytes path that's passed to CreateFileA matches the listing from
FindFirstFileA. But for a DBCS codepage arbitrary bytes paths do not
roundtrip. Invalid byte sequences map to the default character. Note
that an ASCII question mark is not always the default character. It
depends on the codepage.

For example, in codepage 932 (Japanese), it's an error if a lead byte
(i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a
value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not
uncommon). In this case the ANSI API substitutes the default character
for Japanese, '・' (U+30FB, Katakana middle dot).

    >>> locale.getpreferredencoding()
    'cp932'
    >>> open(b'\xe05', 'w').close()
    >>> os.listdir('.')
    ['・']
    >>> os.listdir(b'.')
    [b'\x81E']

All invalid sequences get mapped to '・', which roundtrips as
b'\x81\x45', so you can't reliably create and open files with
arbitrary bytes paths in this locale.
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com

Reply via email to