[Python-ideas] Re: prefix/suffix for bytes (was: New explicit methods to trim strings)

Andrew Barnert via Python-ideas Tue, 10 Mar 2020 16:08:11 -0700

On Mar 10, 2020, at 13:18, Christopher Barker <[email protected]> wrote:
> 
> Getting a bit OT, but I *think* this is the story:
> 
> I've heard it argued, by folks that want to write Python software that uses 
> bytes for filenames, that:
> 
> A file path on a *nix system can be any string of bytes, except two special 
> values:
> 
> b'\x00'   : null
> b'\x2f'    : slash  
> 
> (consistent with this SO post, among many other sources: 
> https://unix.stackexchange.com/questions/39175/understanding-unix-file-name-encoding)
> 
> So any encoding will work, as long as those two values mean the right thing. 
> Practically, null is always null, so that leaves the slash


No; there are plenty of encodings where a 0 byte doesn’t always mean NUL. And 
in fact, that’s exactly the problem with UTF-16: every ASCII character in 
UTF-16 is the same byte preceded or followed (depending on endianness) by a 0 
byte. And you aren’t allowed to have arbitrary 0 bytes like that in your paths.
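
A quick way to see this from Python (only a sketch: the path "/tmp/x" and the helper names are made up for illustration, and the os.stat probe assumes CPython's behaviour of rejecting embedded NULs before calling into the OS):

```python
import os

def nul_count(text, encoding):
    """Count the 0x00 bytes produced by encoding text."""
    return text.encode(encoding).count(0)

def rejected_by_python(path):
    """True if os.stat refuses the byte path because of an embedded NUL."""
    try:
        os.stat(path)
    except ValueError:   # raised for an embedded null byte
        return True
    except OSError:      # e.g. the file simply does not exist
        return False
    return False

path = "/tmp/x"  # hypothetical path, it does not need to exist
for enc in ("utf-8", "utf-16-le", "utf-16-be"):
    raw = path.encode(enc)
    print(enc, raw, nul_count(path, enc), rejected_by_python(raw))
```

Every ASCII character in the UTF-16 forms drags a 0 byte along with it, so the encoded path never gets as far as the filesystem.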

> So any encoding that uses b'\x2f' for the slash would work.

Even besides the zero problem, it has to not only always use 0x2f for slash, 
but also never use 0x2f for anything else. This was a problem for many earlier 
East Asian encodings, where a slash is 0x2f, but some kanji character is also 
0x93 0x2f, or some kana character is 0x2f after a mode shift, etc. In such 
cases, every 0x2f byte gets treated as a path separator, even the ones that 
don’t mean slash.
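
As a rough illustration of the mode-shift case, here is a sketch using Python's iso2022_jp codec as a stand-in for that family of encodings (my choice of codec and of the character, not something from the thread). The name contains no slash character, yet the encoded bytes include 0x2f:

```python
def path_components(raw):
    """Split the way a byte-oriented path parser would: on every 0x2f byte,
    with no knowledge of the encoding, ignoring empty components."""
    return [part for part in raw.split(b"/") if part]

name = "\u304f"  # HIRAGANA LETTER KU; the string contains no "/" character
raw = name.encode("iso2022_jp")

print(raw)                   # escape sequences around a two-byte code
print(raw.count(0x2F))       # number of bytes a path parser reads as slash
print(path_components(raw))  # the pieces the single "file name" falls into
```

Anything that treats the result as a byte path sees a separator in the middle of what was meant to be one character.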

There are encodings that are not ASCII compatible that nevertheless guarantee 
that 0x00 always means NUL and vice versa and that 0x2f always means slash and 
vice-versa, like Shift-JIS. Many of them will cause problems in the shell, file 
manager GUIs, etc., but that’s a different part of the specification (and Unix 
already allows you to have non printable, etc. characters in file names, so 
that problem is there even with ASCII). Many of them also aren’t usable for 
pathnames on other platforms (e.g., Shift-JIS does guarantee that 0x2f always 
means slash, but 0x5c doesn’t always mean backslash; it means yen or the second 
half of various kanji, so you don’t want to use it for byte paths on Windows). 
But for Unix pathnames, they are usable. But again, UTF-16 is not one of them.
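
If you want to check claims like these for a particular codec, a brute-force scan over the BMP is enough. This is only a sketch against Python's shift_jis codec, which may differ in detail from other Shift-JIS variants; the helper name is made up:

```python
def chars_using_byte(encoding, byte, own_char):
    """Return BMP characters other than own_char whose encoding contains byte."""
    hits = []
    for cp in range(0x10000):
        ch = chr(cp)
        if ch == own_char:
            continue
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:  # not representable in this encoding
            continue
        if byte in raw:
            hits.append(ch)
    return hits

slash_clashes = chars_using_byte("shift_jis", 0x2F, "/")
nul_clashes = chars_using_byte("shift_jis", 0x00, "\x00")
backslash_clashes = chars_using_byte("shift_jis", 0x5C, "\\")

# The first two lists should be empty; the third should not be.
print(len(slash_clashes), len(nul_clashes), len(backslash_clashes))
```

The 0x5c list includes kanji such as U+8868, whose second byte is 0x5c, which is the Windows problem described above.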

> Which seems to include, for instance, UTF-16: 
> 
> In [31]: "/".encode('utf-16')
> Out[31]: b'\xff\xfe/\x00'

In this case, you will get very lucky—or, maybe better, unlucky. This is 
illegal, but in practice no API can detect that it’s illegal, because all of 
the POSIX and libc functions and most third-party functions just take a 
null-terminated string, meaning they silently truncate right after the first 
Latin-1 character, and your string is exactly one Latin-1 character long.
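
The truncation can be mimicked from Python with ctypes, whose string buffer .value stops at the first NUL the same way a char* consumer does. This is a sketch of the effect, not of what any particular libc function does internally; the helper name is made up:

```python
import ctypes

def as_c_string(raw):
    """Return what a NUL-terminated-string consumer would see of raw."""
    return ctypes.create_string_buffer(raw).value

# Explicit little-endian, no BOM: only the first character survives.
print(as_c_string("/".encode("utf-16-le")))
print(as_c_string("/etc/passwd".encode("utf-16-le")))

# The exact bytes from the quoted session, BOM included.
print(as_c_string(b"\xff\xfe/\x00"))
```

Without a BOM, any ASCII path collapses to its first byte. With the BOM from the quoted output, the two BOM bytes come along in front of the slash, so what the API sees is not even the root directory.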

