On Mar 10, 2020, at 13:18, Christopher Barker <[email protected]> wrote:
>
> Getting a bit OT, but I *think* this is the story:
>
> I've heard it argued, by folks that want to write Python software that uses
> bytes for filenames, that:
>
> A file path on a *nix system can be any string of bytes, except two special
> values:
>
> b'\x00' : null
> b'\x2f' : slash
>
> (consistent with this SO post, among many other sources:
> https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding)
>
> So any encoding will work, as long as those two values mean the right thing.
> Practically, null is always null, so that leaves the slash
No; there are plenty of encodings where a 0 byte doesn’t always mean NUL. And
in fact, that’s exactly the problem with UTF-16: every ASCII character in
UTF-16 is the same byte preceded or followed (depending on endianness) by a 0
byte. And you aren’t allowed to have arbitrary 0 bytes like that in your paths.
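To make that concrete, here is a minimal sketch (the path "/tmp/x" is just an illustrative value I picked, and I use the BOM-less "utf-16-le" codec so the output does not depend on the machine's byte order). It shows that every ASCII character picks up a 0 byte, and that CPython itself refuses to pass such a byte path to the OS:

```python
import os

# Encode an ordinary ASCII path as UTF-16 (little-endian, no BOM).
path = "/tmp/x"  # illustrative value, not from the thread
encoded = path.encode("utf-16-le")

# Every ASCII character becomes its ASCII byte followed by a 0 byte.
zero_count = encoded.count(0)
print(encoded, zero_count)

# CPython rejects byte paths with embedded 0 bytes before calling the OS.
try:
    os.stat(encoded)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Note that the rejection here comes from Python's own argument checking; a C caller gets no such error.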
> So any encoding that uses b'\x2f' for the slash would work.
Even besides the zero problem, it has to not only always use 0x2f for slash,
but also never use 0x2f for anything else. This was a problem for many earlier
East Asian encodings, where a slash is 0x2f, but some kanji character might be,
say, 0x93 0x2f, or some kana character is 0x2f after a mode shift, etc. In such
cases, every 0x2f byte gets treated as a path separator, even the ones that
don’t mean slash.
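A small sketch of the mode-shift case, assuming ISO-2022-JP as the example encoding (my choice for illustration; it is not named above) and Python's "iso2022_jp" codec. The hiragana letter KU encodes to a byte sequence that contains 0x2f, so anything that splits on slash bytes cuts the character in half:

```python
# HIRAGANA LETTER KU, a character that has nothing to do with slashes.
ku = "\u304f"
encoded = ku.encode("iso2022_jp")
print(encoded)

# A byte-oriented path parser sees a separator inside the character.
contains_slash_byte = b"/" in encoded
parts = encoded.split(b"/")
print(contains_slash_byte, len(parts))
```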
There are encodings that are not ASCII compatible that nevertheless guarantee
that 0x00 always means NUL and vice versa and that 0x2f always means slash and
vice versa, like Shift-JIS. Many of them will cause problems in the shell, file
manager GUIs, etc., but that’s a different part of the specification (and Unix
already allows you to have non-printable, etc. characters in file names, so
that problem is there even with ASCII). Many of them also aren’t usable for
pathnames on other platforms (e.g., Shift-JIS does guarantee that 0x2f always
means slash, but 0x5c doesn’t always mean backslash; it means yen or the second
half of various kanji, so you don’t want to use it for byte paths on Windows).
For Unix pathnames, though, they are usable. But again, UTF-16 is not one of
them.
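One rough way to check the Shift-JIS claims is to brute-force Python's "shift_jis" codec over the BMP. This is only a sketch: the helper name colliding_chars is made up for this example, and it tests what CPython's codec does, which may differ in detail from other Shift-JIS variants:

```python
def colliding_chars(byte_value, encoding="shift_jis"):
    """Return non-ASCII BMP characters whose encoding contains byte_value."""
    hits = []
    for code_point in range(0x80, 0x10000):
        ch = chr(code_point)
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this encoding
        if byte_value in encoded:
            hits.append(ch)
    return hits

# Characters other than "/" that produce a 0x2f byte (a problem on Unix).
slash_collisions = colliding_chars(0x2F)
# Characters other than "\" that produce a 0x5c byte (a problem on Windows).
backslash_collisions = colliding_chars(0x5C)
print(len(slash_collisions), len(backslash_collisions))
```

With this codec I would expect the first list to be empty and the second to include kanji such as U+8868, whose second byte is 0x5c.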
> Which seems to include, for instance, UTF-16:
>
> In [31]: "/".encode('utf-16')
>
> Out[31]: b'\xff\xfe/\x00'
In this case, you will get very lucky (or, maybe better, unlucky). This is
illegal, but in practice no API can detect that it’s illegal, because all of
the POSIX and libc functions and most third-party functions just take a
null-terminated string, meaning they silently truncate at the first 0 byte,
which comes right after the first Latin-1 character, and your string is exactly
one Latin-1 character long.
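The truncation can be simulated from Python with ctypes, whose c_char_p reads a buffer the way a NUL-terminated C API would. This is a sketch only: as_c_string is a name invented for the example, and "/tmp/x" is an illustrative path rather than anything from the thread:

```python
import ctypes

def as_c_string(raw):
    """Return the bytes a NUL-terminated C API would actually read."""
    return ctypes.c_char_p(raw).value

# The exact bytes from the quoted session: BOM, slash, 0 byte.
bom_slash = b"\xff\xfe/\x00"
seen_short = as_c_string(bom_slash)
print(seen_short)

# A longer path, BOM-less, is cut off after its first character.
longer = "/tmp/x".encode("utf-16-le")
seen_long = as_c_string(longer)
print(seen_long)
```

In the first case nothing is lost, which is why the mistake goes unnoticed; in the second, everything after the leading slash silently disappears.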
_______________________________________________
Message archived at
https://mail.python.org/archives/list/[email protected]/message/BAH5RBIKYRPSPRLNYJ7WFUWBGX5MZRKY/